Server Monitoring With Mon

1. Introduction

This is a quick-and-dirty introduction to Mon, a server monitoring tool that can
be used to monitor any number of services running on any number of servers. Mon
is useful for a system administrator who needs to be notified as soon as a
server or network resource goes down, so that they can respond immediately and
keep downtime to a minimum.

Mon’s strategy is to be highly modular, thereby allowing you to write whatever
monitors (programs that check a service’s status) and alerts (programs that let
you know when a service has gone down or come up) tickle your fancy.

Don’t worry if you don’t want to write such scripts, or don’t know how, though,
because chances are that someone has already contributed the monitor or alert
script you’re looking for. For the purposes of this introduction, I will
assume we don’t need to write our own. Otherwise I’d need to spend more than 30
minutes writing this presentation! Besides, if you’re at that level already, you
can just read the mon manpage and figure it out for yourself.

Mon can also listen as a server daemon on a particular port, which allows other
computers running some form of the mon client to contact the mon server when
something running on them goes down, so that mon can do whatever it has to do to
let you know. This feature is called event trapping (or "traps" for short). It
is also beyond the scope of this presentation, but is not too difficult to
implement.

2. Getting Mon

The first thing you’ll need to do is download mon. You can find it at:

http://www.kernel.org/software/mon/

Once you have gotten the mon tarball, follow these steps:

  1. Untar it into /usr/local/lib/mon (trust me, it more or less assumes that
    you will be putting it there).
  2. Move the contents of the etc/ directory into /etc/mon.
  3. mkdir /var/state/mon to hold the mon state information.
  4. touch /var/state/mon/disabled
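
If you’d rather script those four steps, here is a minimal sketch. It assumes
the tarball is gzip-compressed and unpacks into a single top-level directory
(hence --strip-components=1). The install_mon function name and its optional
second argument (a staging root, handy for a dry run) are hypothetical names
made up for this example, not part of mon.

```shell
#!/bin/sh
# install_mon TARBALL [ROOT]
# Hypothetical helper: ROOT is an optional staging prefix (empty means the
# real filesystem root, which normally requires running as root).
install_mon() {
    tarball=$1
    root=${2:-}
    libdir="$root/usr/local/lib/mon"

    # Create the target directories, including the state directory (step 3).
    mkdir -p "$libdir" "$root/etc/mon" "$root/var/state/mon" || return 1

    # Step 1: unpack into /usr/local/lib/mon, dropping the top-level directory
    # (assumption: the tarball contains exactly one such directory).
    tar -xzf "$tarball" -C "$libdir" --strip-components=1 || return 1

    # Step 2: move the shipped configuration files into /etc/mon.
    if [ -d "$libdir/etc" ]; then
        mv "$libdir"/etc/* "$root/etc/mon/" || return 1
    fi

    # Step 4: create the (empty) disabled file in the state directory.
    touch "$root/var/state/mon/disabled"
}
```

To install for real, you would call it as root with just the tarball, for
example install_mon /tmp/mon.tar.gz (that path is only a placeholder).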

Now you’ll want to download all of the user-contributed programs.

Download the tarballs into /usr/local/lib/mon and untar them there, letting
them unpack into whatever directories they create. Once they’re untarred, and
while you’re still in /usr/local/lib/mon, move the resulting files as follows:

# mv monitors/*/* mon.d
# mv alerts/*/* alert.d

3. Configuring Mon

Now you’re ready to copy /etc/mon/example.cf to /etc/mon/mon.cf and edit
/etc/mon/mon.cf. This file is laid out as follows:

  1. Global options
  2. Hostgroup definitions (assigning names to sets of hosts)
  3. Watch definitions (defining what will be monitored for each host group)

Each watch definition consists of any number of service definitions. A service
definition defines one service type that you will be checking on the current
host group.

Each service definition consists of:

  1. Various service options, including the frequency with which to check, the
    monitor program to use for the check, and a description string for the
    service
  2. One or more period definitions that dictate how to behave if the monitors
    fail during various times of the day or week

Each of these period definitions consists of various options such as what alert
and upalert programs to use, and with what options, as well as options that
dictate how frequently to notify you if the service remains down, or how many
failures must occur before the alert is sent.
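
To make that nesting concrete, here is a bare-bones skeleton of the layout. The
hostgroup name web, the hosts www1 and www2, and the address admin@example.com
are placeholders made up for illustration; http.monitor and mail.alert come from
the contributed scripts. Treat it as a sketch of the structure rather than a
tested configuration.

```
# 1. Global options
basedir  = /usr/local/lib/mon
alertdir = alert.d
mondir   = mon.d

# 2. Hostgroup definition: a name followed by the hosts it contains
hostgroup web www1 www2

# 3. Watch definition for the "web" hostgroup
watch web
    # One service definition: options first, then the period definitions
    service http
        description Check the web servers
        interval 5m
        monitor http.monitor
        # Period definition for working hours
        period wd {Mon-Fri} hr {9am-5pm}
            alert mail.alert admin@example.com
            upalert mail.alert admin@example.com
            alertevery 1h
        # A second period with different behavior on weekends
        period wd {Sat-Sun}
            alert mail.alert admin@example.com
            alertafter 3
```

The indentation is only there for readability; it mirrors the
watch, service, period nesting described above.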

The cool thing is that instead of using /etc/mon/mon.cf, you can name the file
/etc/mon/mon.m4 (and make sure to start mon with the "-c /etc/mon/mon.m4"
option), and mon will run the file through m4 before processing the mon
directives in it. This is useful for m4 defines: rather than writing the same
email address, pager number, or time interval over and over throughout the
file, you define it once and use the defined name everywhere else. See the
included example.m4 for a good example of this.
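
As a rough sketch of the idea, a mon.m4 might start like the fragment below.
The macro names ADMIN_MAIL and REPEAT_EVERY are made up for this example, and
the included example.m4 may well lay things out differently. The watch itself
is the dns one from the example configuration later in this document.

```
define(`ADMIN_MAIL', `moshe@speedfactory.net')dnl
define(`REPEAT_EVERY', `30m')dnl

hostgroup dns 66.20.234.14 66.20.234.15

watch dns
    service dns
        description Check DNS services
        interval 1m
        monitor dns.monitor -zone speedfactory.net -master ns.speedfactory.net
        period wd {Sun-Sat}
            alert mail.alert ADMIN_MAIL
            upalert mail.alert ADMIN_MAIL
            alertevery REPEAT_EVERY
```

After m4 has run, every ADMIN_MAIL becomes the email address and REPEAT_EVERY
becomes 30m, so changing either one means editing a single line. The trailing
dnl just keeps each define from leaving an empty line behind in the output.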

4. User contributed scripts

There are so many user-contributed scripts that I figured I’d list them here so
you could see all of them.

The monitors:

asyncreboot.monitor

bootp.monitor

cpqhealth.monitor

dialin.monitor

dir_file_age.monitor

dns.monitor

file_change.monitor

flexlm.monitor

foundry-chassis.monitor

fping.monitor

freespace.monitor

ftp.monitor

hpnp.monitor

http.monitor

http_integrity.monitor

http_t.monitor

http_tp.monitor

http_tpp.monitor

https.monitor

icecast.monitor

imap.monitor

informix.monitor

informixdbspace.monitor

ipsec.monitor

ldap.monitor

lwp-http-post.monitor

mailloop.monitor

mon.monitor

msql-mysql.monitor

na_quota.monitor

netappfree.monitor

netsnmp-exec.monitor

netsnmp-freespace.monitor

netsnmp-proc.monitor

nntp.monitor

ntp.monitor

ntservice.monitor

phttp.monitor

ping.monitor

pop3.monitor

postgresql.monitor

printmib.monitor

process-full-command-line.monitor

process.monitor

radius.monitor

rd.monitor

reboot.monitor

remote.monitor

rpc.monitor

rptr.monitor

samba.monitor

seq.monitor

silkworm.monitor

smtp.monitor

smtp3.monitor

smtp_rt.monitor

snmp_interface.monitor

sqlconn.monitor

ssh.monitor

startremote.monitor

tcp.monitor

tcpch.monitor

telnet.monitor

traceroute.monitor

umn_mon.monitor

up_rtt.monitor

xedia-ipsec-tunnel.monitor

The alerts:

bugzilla.alert

file.alert

gnats.alert

hpov.alert

mail.alert

netpage.alert

qpage.alert

remote.alert

simplepage.alert

sms.alert

snapdelete.alert

snpp.alert

test.alert

trap.alert

winpopup.alert

Additionally, there is a CGI program in the cgi-bin package that creates a
status webpage the average Joe can grok (and can even use to modify mon’s
parameters while it’s running). There is also a GUI configuration utility in the
utils package. It uses Perl/Tk (yuck) and is not too good anyway, but it is
worth a try.

Anyway, I have about -3 minutes left to write this, so here comes the example
configuration, and you can probably find whatever else you need in the manpage!

5. Example configuration

#####################################
# Global Options
#
basedir         = /usr/local/lib/mon
alertdir        = alert.d
mondir          = mon.d
cfbasedir       = /usr/local/lib/mon/etc
dep_behavior    = m
dep_recur_limit = 10
dtlogging       = no
histlength      = 100
logdir          = /var/log/mon
maxprocs        = 20
pidfile         = /var/run/mon.pid
randstart       = 60s

#####################################
# Host Groups
#
hostgroup dns 66.20.234.14 66.20.234.15

hostgroup ftp ftp linux2 craq01 archer

hostgroup nntp news

hostgroup pop3 mail

hostgroup smtp mxqmail2 mxqmail1 mail craq01

#####################################
# Watch Definitions
#
watch dns
    service dns
        description Check DNS services
        interval 1m
        monitor dns.monitor -zone speedfactory.net -master ns.speedfactory.net
        period wd {Sun-Sat}
            alert mail.alert moshe@speedfactory.net
            upalert mail.alert moshe@speedfactory.net
            alertevery 30m

watch ftp
    service ftp
        description Check FTP servers
        interval 5m
        monitor ftp.monitor -p 21 -t 20
        period wd {Sun-Sat}
            alert mail.alert moshe@speedfactory.net
            upalert mail.alert moshe@speedfactory.net
            alertafter 2

watch nntp
    service nntp
        description Check that news server is up
        interval 1m
        monitor nntp.monitor -p 119
        period wd {Sun-Sat}
            alert mail.alert moshe@speedfactory.net
            upalert mail.alert moshe@speedfactory.net
            alertevery 30m

watch pop3
    service pop3
        description Check that the pop3 server is working
        interval 1m
        monitor pop3.monitor -p 110 -t 20
        period wd {Sun-Sat}
            alert mail.alert moshe@speedfactory.net
            upalert mail.alert moshe@speedfactory.net
            alertevery 30m

watch smtp
    service smtp
        description Check mail sending
        interval 1m
        monitor smtp.monitor -p 25 -t 20
        period wd {Sun-Sat}
            alert mail.alert moshe@speedfactory.net
            upalert mail.alert moshe@speedfactory.net
            alertevery 30m
            alertafter 2

watch routers
    service ping
        description Ping our routers
        interval 1m
        monitor fping.monitor
        period wd {Sun-Sat}
            alert mail.alert moshe@speedfactory.net
            upalert mail.alert moshe@speedfactory.net
            alertevery 15m
            alertafter 2
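
Since a configuration like this names quite a few monitor and alert scripts, it
can be worth checking that each of them actually landed in mon.d or alert.d
before starting mon. The following is only a sketch: check_mon_scripts is a
hypothetical helper, not something that ships with mon, and it assumes the
mondir and alertdir values used above (mon.d and alert.d under basedir) and the
simple "alert script args" form of the alert lines.

```shell
#!/bin/sh
# check_mon_scripts CONFIG BASEDIR
# Hypothetical helper (not part of mon): lists monitor and alert scripts that
# CONFIG refers to but that are missing, or not executable, under
# BASEDIR/mon.d and BASEDIR/alert.d. Returns 1 if anything is missing.
check_mon_scripts() {
    cf=$1
    base=$2
    missing=0
    # awk splits on whitespace, so the indentation of the config is ignored.
    for entry in $(awk '
        $1 == "monitor"                  { print "mon.d/" $2 }
        $1 == "alert" || $1 == "upalert" { print "alert.d/" $2 }
    ' "$cf" | sort -u); do
        if [ ! -x "$base/$entry" ]; then
            echo "missing or not executable: $base/$entry"
            missing=1
        fi
    done
    return $missing
}
```

With the layout used in this document, you would call it as
check_mon_scripts /etc/mon/mon.cf /usr/local/lib/mon and fix whatever it
reports before starting mon.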

6. Resources