
Nagios (NETS notes)

by Pete Siemsen and John Hernandez

Introduction

Nagios is an open-source program for monitoring networks. In March of 2006, NETS started using Nagios to monitor UCAR and FRGP networks. Before then, HP Openview was used. Nagios is simpler and cheaper than Openview.

For information about how the NCAR NOC monitors UCAR divisional machines, see the NCAB Host Monitoring Policy.

Some of what's described here was learned by reading Nagios System and Network Monitoring by Wolfgang Barth. There are other books on Nagios; I chose this one because it's recent.

To access Nagios

To access Nagios, go to http://nagman.scd.ucar.edu/nagios/

To access the backup Nagios server, go to http://fl-nagman.ucar.edu/nagios

Find out more detailed information about the relationship between nagman and fl-nagman.

Installation

On ~2007-07-01, the Nagios 2.4 tarball install was replaced with Nagios 2.6 from the Debian archive. As root:

apt-get install nagios2 nagios-plugins-basic nagios-plugins-standard nagios2-doc

Configure Apache

Apache2 is automatically installed as a dependency of nagios2. Edit /etc/nagios2/apache2.conf, and uncomment the references to Nagios 1.x.  This allows us to reach nagios at the same URL as before, rather than the Debian default of http://127.0.0.1/nagios2

Don't forget to reload the Apache config:

/etc/init.d/apache2 reload

Configure Web Authentication

Debian nagios2 is pre-configured to look for the htpasswd.users file in /etc/nagios2.

Add a user to the htpasswd.users file:

# htpasswd /etc/nagios2/htpasswd.users siemsen
New password:
Re-type new password:
Adding password for user siemsen

Configure Nagios

General

The bottom of http://nagman.scd.ucar.edu/nagios/docs/installing.html has a link to Configuring Nagios.

Of course, reading the HTML docs (Configuring Nagios) helps a lot. What follows are the first steps I took.

cgi.cfg - Configure the CGI behavior

Check that Nagios is running

Do not uncomment the line that defines the check_nagios program. It's supposed to let the CGIs check that the Nagios daemon is running. After some Googling, it seems that the program wasn't updated for Nagios version 2, so it won't work. Also, see my check_nagios notes.

Set

authorized_for_system_information=*
authorized_for_configuration_information=*
authorized_for_system_commands=siemsen
authorized_for_all_services=*
authorized_for_all_hosts=*
authorized_for_all_service_commands=*
authorized_for_all_host_commands=*

Context-sensitive help

If you set show_context_help to 1, Nagios will put a little question-mark on every CGI page. When the user clicks it, Nagios displays a window that says that context-sensitive help isn't available. Looks like this is a work in progress. To avoid annoying the user, I set it to 0.

nagios.cfg - the Main Configuration File

First, in the main configuration file named nagios.cfg, above the inclusion of the checkcommands.cfg file, there are comments suggesting that I use the checkcommands.cfg file that came with the plugins. But I couldn't find such a file. There's a command.cfg in the plugins distribution, but its syntax isn't right for this. Dunno what I'm supposed to do...

The verify command complained that all the commands defined in minimal.cfg were already defined. They were, in checkcommands.cfg. I commented out the reference to checkcommands.cfg in nagios.cfg, and then commented out the definitions of notify-by-email and host-notify-by-email in minimal.cfg, and the verify command passed!

To allow use of "external commands" from the CGI scripts (like "Disable notifications for this host", "Disable active checks of this host", etc.), you first have to set check_external_commands in nagios.cfg and then tell the daemon to re-read its configuration:

check_external_commands=1
killall -HUP /usr/sbin/nagios2

This isn't enough - you also have to set permissions on the (named pipe) file in /var/lib/nagios2/rw/. I mostly followed the directions in http://nagios.sourceforge.net/docs/2_0/commandfile.html, but had to add an extra command to set the protection on the file itself to get it to work:

<as root>
groupadd nagiocmd
usermod -G nagiocmd nagios
usermod -G nagiocmd www-data
chown nagios.nagiocmd /var/lib/nagios2/rw
chown nagios.nagiocmd /var/lib/nagios2/rw/nagios.cmd
chmod u+rwx /var/lib/nagios2/rw
chmod g+s /var/lib/nagios2/rw
chmod o-rwx /var/lib/nagios2/rw
ls -ald /var/lib/nagios2/rw
/etc/init.d/apache2 restart
/etc/init.d/nagios2 restart

ncar-commands.cfg - command definitions

By default, Nagios uses the plugin named check_ping to actually ping hosts. The plugin named check_icmp is supposed to be more efficient than check_ping, as described in the Nagios book on page 88.

The check_icmp distributed with Debian nagios2 works as a drop-in replacement, and we now use it.

ncar.d - the NCAR configuration directory

/etc/nagios2/ncar.d is our custom configuration directory.  It is referenced from nagios.cfg.  All *.cfg files placed here will be included in the nagios configuration.

ncar.cfg

I copied minimal.cfg to /etc/nagios2/ncar.d/ncar.cfg and changed the reference to it in nagios.cfg. Now I'm working with just NCAR-specific configuration commands. From here on in this section, the changes all refer to the ncar.cfg file. I began the development loop described above.

As of July 2007, /etc/nagios2/ncar.d contains:
hosts/ - auto-configuration scripts
ncar-ap.cfg - WLAN access points
ncar-bgp.cfg - BGP service
ncar.cfg - NCAR general network devices
ncar-command.cfg - custom commands (how to execute plugins)
ncar-contacts.cfg - contacts
ncar-env.cfg - environmental monitors (aka weathergeese)
ncar-hostgroups.cfg - hostgroups defined
ncar-security.cfg - security hosts
ncar-servicegroups.cfg - servicegroups defined
ncar-services.cfg - services defined
ncar-ups.cfg - UPS's
ncar-wan.cfg - FRGP, UPoP, BiSON
servers-*.cfg - autogenerated by hosts/populate-hostgroup.pl:
servers-acd.cfg
servers-cgd.cfg
servers-comet.cfg
servers-eol.cfg
servers-fanda.cfg
servers-globe.cfg
servers-hao.cfg
servers-joss.cfg
servers-mmm.cfg
servers-nets.cfg
servers-ral.cfg
servers-scd.cfg
servers-unidata.cfg
servers-vislab.cfg

This is a description of how to autogenerate server configs

generate-nagios-configs script

When Nagios raises an alarm, the UCAR NOC relies on web pages to explain what to do about it. There is a web page for each host that Nagios monitors, stored in the /var/www/noc directory on nagman. We monitor over 400 hosts, so there are more than 400 web pages.

The web page for each host describes:

Maintaining 400 web pages became a royal pain, so we wrote software to help. A bash script named /etc/nagios2/hosts/generate.sh creates new configs and web pages for all the hosts that Nagios monitors. When we make changes, we run the script and then restart Nagios as described below.

The bash script runs a Perl program named generate-nagios-configs.pl multiple times, once for each hostgroup. You can run generate-nagios-configs.pl for a single hostgroup, but we usually run generate.sh to create all the hostgroups at once. You can run "generate-nagios-configs.pl -?" to see the syntax used to run the program.

Unlike other programs maintained by NETS, generate.sh isn't run by a cron job. You must run it manually whenever you want to change the hosts that are monitored by Nagios or the web pages used by Nagios.

The program creates config files and web pages. It writes configs into a temporary directory. To make Nagios use the configs, you must manually copy them to the live Nagios config directory (/etc/nagios2/ncar.d). Then you run a Nagios command to verify that the configs are legal, and then you restart the Nagios process. This manual process protects Nagios from problems should the program generate bogus configs. The program also generates web pages, which it writes directly into the directory that holds the active Nagios web pages (/var/www/noc/).
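The copy, verify, restart sequence described above can be sketched as a small shell function. This is only an illustration, not a script that exists on nagman: the function name and the NAGIOS_VERIFY and NAGIOS_RELOAD variables are made up here, while the default paths and the verify and HUP commands are the ones used elsewhere in these notes.

```shell
#!/bin/sh
# Hypothetical helper (not an existing NETS script): install generated
# configs, verify them, and reload Nagios only if verification passes.

deploy_nagios_configs() {
    src_dir=$1                                   # temporary directory holding generated *.cfg files
    live_dir=${2:-/etc/nagios2/ncar.d}           # live Nagios config directory
    main_cfg=${3:-/etc/nagios2/nagios.cfg}       # main config file passed to the verifier
    # Both commands can be overridden through variables (an assumption of this sketch).
    verify_cmd=${NAGIOS_VERIFY:-/usr/sbin/nagios2 -v}
    reload_cmd=${NAGIOS_RELOAD:-killall -HUP /usr/sbin/nagios2}

    # Copy the generated configs into the live directory.
    cp "$src_dir"/*.cfg "$live_dir"/ || return 1

    # Reload only when the verifier accepts the whole configuration.
    if $verify_cmd "$main_cfg"; then
        $reload_cmd
    else
        echo "verification failed; Nagios was NOT reloaded" >&2
        return 1
    fi
}
```

If verification fails, the running daemon keeps the configuration it already has in memory, but the bad files are still sitting in the live directory and must be fixed before the next restart.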

Each time the program is run, it reads a file containing a list of hosts that are to be monitored by Nagios. There is one such file for each hostgroup in Nagios. The files are named "host-*.txt", where the asterisk is the hostgroup name. For example, the file named host-upop.txt contains all the hosts in the UPoP hostgroup.
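To illustrate the naming convention, here is a minimal sketch that derives the hostgroup names from the "host-*.txt" files in a directory. The function name list_hostgroups is made up for this example; how generate.sh really enumerates hostgroups isn't shown in these notes.

```shell
#!/bin/sh
# Hypothetical helper: print one hostgroup name per host-*.txt file.

list_hostgroups() {
    dir=${1:-.}
    for file in "$dir"/host-*.txt; do
        [ -f "$file" ] || continue      # no matches: the pattern stays literal, so skip it
        name=${file##*/}                # strip the directory part
        name=${name#host-}              # strip the "host-" prefix
        printf '%s\n' "${name%.txt}"    # strip the ".txt" suffix and print
    done
}
```

A loop like "for group in $(list_hostgroups /some/dir)" could then run the generator once per hostgroup, which is roughly what the text says generate.sh does.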

The program generates a web page for each host name found in the input files. Basically, each web page has the following sections:

  1. DOCTYPE header
  2. a warning comment to hopefully keep people from manually editing the page
  3. NETS standard header
  4. host description
  5. host severity
  6. instructions for what to do when the host goes down
  7. a link to the Nagios webpage for the hostgroup
  8. NETS standard trailer

The sections of the output web pages are automatically generated, but some can be overridden to provide more specific information. For example, there may be specific instructions for a given host. If there are no specific instructions for the host, the program looks for group-specific instructions. If it doesn't find those, it looks for a group-specific list of people to contact. This scheme makes it possible to use generic instructions for, say, all the access points, while allowing for very specific instructions for, say, sabae.

For a given host, if there is a "description" file, the program inserts its contents into the output file. For example, for the host named sabae listed in the host-security.txt, the program looks for a file named /var/www/noc/sabae-description.shtml to fill in the description section of the web page that it writes. If there is no file named /var/www/noc/sabae-description.shtml, the program puts nothing in the description section of the output web page. To fill in the instructions section of the web page, the program looks for a file named /var/www/noc/sabae-instructions.shtml, and failing that, for a file named /var/www/noc/Security-instructions.shtml, and failing that, for a file named /var/www/noc/Security-support-personnel.shtml. A file named "*-support-personnel.shtml" must exist for every hostgroup.
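The lookup order for the instructions section can be sketched in shell as follows. The real logic lives in the Perl program; the function name find_instructions_file is invented for this illustration, and only the file naming scheme comes from the description above.

```shell
#!/bin/sh
# Hypothetical illustration of the fallback order for the instructions section.

find_instructions_file() {
    host=$1                       # e.g. sabae
    group=$2                      # e.g. Security
    noc_dir=${3:-/var/www/noc}    # directory holding the *.shtml fragments

    # Try the most specific file first, then fall back to the group files.
    for candidate in \
        "$noc_dir/${host}-instructions.shtml" \
        "$noc_dir/${group}-instructions.shtml" \
        "$noc_dir/${group}-support-personnel.shtml"
    do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done

    # The support-personnel file is required for every hostgroup.
    echo "missing ${group}-support-personnel.shtml in $noc_dir" >&2
    return 1
}
```

For sabae in the Security hostgroup, this prints the path to sabae-instructions.shtml if that file exists, and otherwise falls back to the two Security files in turn.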

In the files that specify lists of hosts, there is one host name on each line. The program will use DNS to resolve the name into an IP address that is written to the Nagios config file. Following the host name is an optional severity for the host - 1, 2 or 3. If no severity is specified, the default is 3, the lowest severity. If there are two numbers following the host name, the first is the Solitary severity (when the host is down but its alternate is up) and the second is the Collective severity (when all the alternates are down at the same time). This scheme is meant to be used for cases where there are two paths to a host, like csu-router-bison-a and csu-router-bison-b. The program writes a boilerplate paragraph that explains the severity.
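As a sketch of the line format, the following shell function splits one line of a "host-*.txt" file into the host name and its severities, applying the default of 3 when no severity is given. The function name and the output format are invented for this example, and the severity values in the usage note are made up.

```shell
#!/bin/sh
# Hypothetical parser for one line of a host-*.txt file:
#   <hostname> [severity]
#   <hostname> [solitary-severity collective-severity]

parse_host_line() {
    # Split the line on whitespace into up to three fields.
    read -r host first second <<EOF
$1
EOF
    [ -n "$host" ] || return 1      # blank line: nothing to parse

    if [ -n "$second" ]; then
        # Two numbers: Solitary severity, then Collective severity.
        printf '%s solitary=%s collective=%s\n' "$host" "$first" "$second"
    else
        # Zero or one number: a simple severity, defaulting to 3 (the lowest).
        printf '%s severity=%s\n' "$host" "${first:-3}"
    fi
}
```

With this sketch, a line containing only "sabae" yields "sabae severity=3", and a line like "csu-router-bison-a 3 1" yields "csu-router-bison-a solitary=3 collective=1".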

A slightly different case is that of "high-availability" hosts, like "mscan". For these hosts, there are several low-severity hosts that collectively make up "mscan". As long as one of mscan1, mscan2, mscan3 or mscan4 is up, then "mscan" is up. For this case, we monitor each of mscan, mscan1, mscan2, etc. Mscan is severity 1 and the others are severity 3. The program does nothing special for these hosts. We use simple severities in the "host-*.txt" files, with description files that have the same contents, explaining how the aggregate works.

Hosts

To monitor a network node, you have to create a "host" entry.

Services

In general, Nagios is designed to check "services" like web servers, DNS servers, mail servers, etc. When a service doesn't respond, Nagios does a "host check" to see if the host itself is up. Host checks use ping, because pings are the most likely thing to work, since they're implemented by the machine's kernel.

At UCAR, we used to use Openview, which didn't have a concept of "services". It simply pinged hosts. Nagios isn't really built to do that. To make it do so, I use pings for both service checks and host checks. So Nagios does ping "service checks" and when pings fail, it does a ping "host check".

BGP services

From: John Hernandez
Date: Fri Sep 14 2007 - 16:53:04 MDT
To: ne
Subject: Nagios BGP service checks

Hey NE,

I implemented a new form of BGP service checks for Nagios. This should eliminate the problems we had with routers being unresponsive when faced with many simultaneous SNMP queries, and the Nagios event logs filling up with junk.

The method uses a new (to us) concept in Nagios called "passive service checks." A cron job (running as root) runs /usr/lib/nagios/plugins/passive_check_bgp.sh every 60 seconds. The program collects information from the routers by connecting to each router once (roughly speaking). The program submits the results to Nagios. Nagios does not initiate any checks itself, hence the term "passive."

Nagios is configured to continuously scan its "command file" for these passive BGP service updates. If BGP data is not refreshed within 120 seconds, Nagios will issue a stale data warning (service turns yellow). Other than this, the functionality is basically identical to what we had before. The NOC shouldn't notice any difference other than fewer transient false positives.

-John

So for example, to change the number of routes that Nagios expects the FRGP router to receive from the UCD, log on to nagman and edit /etc/nagios/bgp.d/frgp-gw-1-bgp.conf and change the "UCD" line. Then wait for 60 seconds for the cron job to run again, and Nagios should reflect your changes.
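For reference, a passive result reaches Nagios as a single PROCESS_SERVICE_CHECK_RESULT line written to the command file. The sketch below shows the general shape of such a submission. It is not the contents of passive_check_bgp.sh: the function name, the NAGIOS_CMD_FILE variable, and the "BGP" service description are assumptions, and the default path is the named pipe mentioned earlier in these notes.

```shell
#!/bin/sh
# Hypothetical sketch of submitting one passive service check result.
# Return codes follow the usual plugin convention:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN

submit_passive_result() {
    host=$1        # host name as defined in the Nagios config
    service=$2     # service description as defined in the Nagios config
    code=$3        # plugin-style return code (0-3)
    output=$4      # one line of plugin output
    # NAGIOS_CMD_FILE is an override invented for this sketch.
    cmd_file=${NAGIOS_CMD_FILE:-/var/lib/nagios2/rw/nagios.cmd}

    # External command format: [timestamp] COMMAND;arg1;arg2;...
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$host" "$service" "$code" "$output" >> "$cmd_file"
}
```

A call such as submit_passive_result frgp-gw-1 BGP 0 "all peers established" would mark that service OK, provided external commands are enabled and the service accepts passive checks.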

Chassis services

From: John Hernandez (jph@ucar.edu)
Date: Wed Oct 17 2007 - 15:53:40 MDT
To: ne@ucar.edu
Subject: Nagios "Chassis" service

Hey NE,

Nagios now checks the power supply status, fan status, and module status for all NCAR and FRGP Cisco 6500s. The service is called "Chassis". The plugin output includes the number of checks performed, chassis serial number, and if applicable, a list of any failed components.

-John

Interfaces services

From: John Hernandez (jph@ucar.edu)
Date: Wed Oct 31 2007 - 16:39:02 MDT
To: ne@ucar.edu
Subject: Yep, more Nagios updates

Hey NE,

A couple of new Nagios tweaks to report:

  1. The "Interfaces" service has been expanded to cover the fiber-capable modules (those with GBIC/SFP/Xenpack ports, including supervisors) on campus core switches. This should allow us to detect when half of an etherchannel uplink is down. For now, the modules to be checked are specified in the Nagios configuration file. We also talked about having it ignore ports that are in spanning-tree portfast instead. Any opinions one way or the other?
  2. I created some "Servicegroup" definitions. You can now click on Servicegroup Summary in the left pane to see a summary of BGP, Catalyst Chassis, DNS, HTTP, Interfaces, and OSPF service checks.
  3. For Pete - to keep things more manageable, I broke out the services configuration into its own file, ncar-services.cfg

-John

OSPF service

Dependencies

To make Nagios be intelligent about figuring out what's wrong, you should set the optional "parents" field in each host definition. You can use Nagios's "Status Map" to see a fairly useless drawing that shows what depends on what. I was initially confused about whether this should reflect Layer 2 dependencies or Layer 3 dependencies. The revelation is that Nagios doesn't care - it uses dependencies to be smart about what's down. It doesn't use them until a node is down, then it looks up the dependency tree to figure out what's really down. We implement dependencies at Layer 3. The generate-nagios-configs.pl script uses traceroute commands to figure out how to set the parents values.
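One way to derive a Layer 3 parent from a traceroute is to take the last responding hop before the target. The sketch below does that for "traceroute -n" style output read from standard input. It is a guess at the general idea, not the logic in generate-nagios-configs.pl, and the function name is invented.

```shell
#!/bin/sh
# Hypothetical illustration: read "traceroute -n" output on stdin and print
# the address of the last responding hop before the final one.

parent_from_traceroute() {
    awk '
        # Hop lines start with a hop number; skip hops that only timed out ("*").
        $1 ~ /^[0-9]+$/ && $2 != "*" { prev = last; last = $2 }
        # The last hop is the target itself, so the parent is the one before it.
        END { if (prev != "") print prev }
    '
}
```

Typical use would be something like "traceroute -n somehost | parent_from_traceroute". When the target is only one hop away, nothing is printed, since there is no intermediate router to name as a parent.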

Timing of host polls

We are currently configured to ping all hosts every 3 minutes.

Graphics - Status Map and NEXSM

useless Nagios Status Map

The graphics capability in Nagios is rudimentary compared to OpenView. Actually, it's useless because there are too many things on the map. If I can fix this, then it'll be worth visiting http://www.nagios.org/download/ to get nicer icons.

NEXSM - a better graphics map

NEXSM is a Java application that provides a graphical view of Nagios elements. To install it, I followed the instructions at the NEXSM Web site, and did this on fl-nagman:
  1. Installed Java 1.5 per the instructions at https://jdk-distros.dev.java.net/debian.html except that I installed sun-java5-sdk instead of sun-java5-jre.
  2. Installed PHP into our Apache per the instructions at http://www.debian-administration.org/articles/391 for Apache 2 and PHP5: The first did some installs, the second said libapache2-mod-php5 was already installed. I tested it by creating a web page that has PHP and looking at it with a browser, like so:
    cat >/var/www/nets/test.php
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
      <head>
        <title>NETS Web</title>
      </head>
      <body bgcolor="white">
    
    PHP is at version <?php phpinfo() ?>
    
    </body>
    </html>
    ^D
    
  3. Installed ant with
    apt-get install ant
  4. Installed colt by getting http://dsd.lbl.gov/~hoschek/colt-download/releases/colt-1.2.0.zip
    Put colt/lib/colt.jar into the ../nexsm-1.3/libs directory.
  5. Installed JUNG by going to http://jung.sourceforge.net/download.html
    Download the jung-1.7.6.jar into the ../nexsm-1.3/libs directory.
  6. Installed Jakarta Commons by going to http://jakarta.apache.org/commons/collections/
    Copy the file commons-collections-3.2.1.jar to the ../nexsm-1.3/libs directory
  7. From the INSTALL file in the NEXSM tarball, I chose option 2, which is supposed to be easier than option 1. I did:
    fl-nagman$ keytool -alias nexsm -genkey -keystore nexsm-keystore
    Enter keystore password:  ucarnagios
    What is your first and last name?
      [Unknown]:  Pete Siemsen
    What is the name of your organizational unit?
      [Unknown]:  Computational & Informational Systems Laboratary
    What is the name of your organization?
      [Unknown]:  National Center for Atmospheric Research
    What is the name of your City or Locality?
      [Unknown]:  Boulder
    What is the name of your State or Province?
      [Unknown]:  Colorado
    What is the two-letter country code for this unit?
      [Unknown]:  US
    Is CN=Pete Siemsen, OU=Computational & Informational Systems Laboratary, O=National Center for Atmospheric Research, L=Boulder, ST=Colorado, C=US correct?
    [no]:  yes
    
    Enter key password for <nexsm>
    	(RETURN if same as keystore password):  
    fl-nagman$
  8. I edited nexsm.properties and changed store.password to "ucarnagios". I also changed "collections.3.1" to "collections.3.2.1".
  9. Ran the build:
    $ ant sign-jar
    Buildfile: build.xml
    -pre-compile:
    compile:
        [javac] Compiling 118 source files to /home/siemsen/nexsm-1.3/build/classes
        [javac] Note: Some input files use or override a deprecated API.
        [javac] Note: Recompile with -Xlint:deprecation for details.
        [javac] Note: Some input files use unchecked or unsafe operations.
        [javac] Note: Recompile with -Xlint:unchecked for details.
    -post-compile:
        [copy] Copying 70 files to /home/siemsen/nexsm-1.3/build/classes
    consolidate:
        [unjar] Expanding: /home/siemsen/nexsm-1.3/libs/colt.jar into /home/siemsen/nexsm-1.3/build/classes
        [unjar] Expanding: /home/siemsen/nexsm-1.3/libs/commons-collections-3.2.1.jar into /home/siemsen/nexsm-1.3/build/classes
        [unjar] Expanding: /home/siemsen/nexsm-1.3/libs/jung-1.7.6.jar into /home/siemsen/nexsm-1.3/build/classes
        [unjar] Expanding: /home/siemsen/nexsm-1.3/libs/qdxml.jar into /home/siemsen/nexsm-1.3/build/classes
        [delete] Deleting directory /home/siemsen/nexsm-1.3/build/classes/META-INF
        [delete] Deleting directory /home/siemsen/nexsm-1.3/build/classes/samples
    jar:
        [jar] Building jar: /home/siemsen/nexsm-1.3/html/jar/nexsm.jar
    sign-jar:
    BUILD FAILED
    /home/siemsen/nexsm-1.3/build.xml:91: The <signjar> type doesn't support the "preservelastmodified" attribute.
    Total time: 12 seconds
    $
  10. So I edited build.xml and deleted the offending line. It worked! It created ../nexsm-1.3/html/jar
  11. cp -R html /var/www/nexsm
  12. su
    cd ~/nexsm-1.3
    cp -R etc/nagios/nexsm /etc/nagios
  13. mkdir /var/www/images
    mkdir /var/www/images/share
    I got some icons from nagiosexchange.
    cp icons/* /var/www/images/logos/
  14. The file /etc/mime.types already had a line for "application/x-java-jnlp-file", so I skipped that step in the nexsm INSTALL file.
  15. Restarted the web server with
    /etc/init.d/apache2 restart
  16. Used a web browser to visit http://fl-nagman.ucar.edu/nexsm/nexsm.php and it brought up NEXSM, which asked me for a username and password, but the main canvas was empty.

Configure email aliases

The /etc/nagios2/ncar.d/ncar.cfg config file has a contact named nagios-admin. To make email sent to nagios-admin be forwarded to siemsen@ucar.edu, I added a line that says "nagios-admin: siemsen@ucar.edu" to /etc/aliases, and then ran the newaliases command.

Location icons

To help group problems, it's useful to know where a host is. For example, if there are 5 hosts red and they are all in FL2, you might deduce that there is a power problem in FL2. Nagios has the ability to display small icons next to device names. For example, it displays a little red folder icon to indicate that there is a web page for a host, and it displays a little cloud icon to indicate that there are comments associated with the host. You can configure Nagios to display other icons to identify features of a host. For example, you might display a little penguin to indicate that the host runs Linux. We use this feature to display our own custom icons to indicate a host's location.

Icons for use in Nagios should be 40x40 pixels. I created them using Omnigraffle. I did New and set the type to "48x48". Then in the Canvas  Size inspector, I changed the sizes to 40pt x 40pt. Then I created a new text object with a value of "ML" and set it to Helvetica, Bold, 12. Then I positioned it and did File -> Export. I set the file name to "ml" and set the Format to "GIF bitmap image" (because it permits transparent backgrounds) and clicked "Transparent Background" and clicked Save. Then I changed things, trying to get readable icons. They are apparently resized by the web browser anyway, so it's hard to get pleasant effects.

To get Nagios to display the icons next to a host name, use an "icon_image" parameter in the host's hostextinfo object. Nagios stores icons in /usr/share/nagios2/htdocs/images/logos, so I created a new subdirectory there named "ncar". Then I added a line to the hostextinfo object for the host named netserver so it looked like this:

define hostextinfo {
host_name netserver
icon_image ncar/ml.png
notes_url /noc/netserver.shtml
}

The icons are:

Running Nagios

To start the Nagios daemon from scratch:

(as root)
/etc/init.d/nagios2 start

With the daemon running, to view the user interface, web to http://nagman.scd.ucar.edu/nagios/

To cause the nagios daemon to re-read its configuration files:

/usr/sbin/nagios2 -v /etc/nagios2/nagios.cfg
killall -HUP /usr/sbin/nagios2

Severities

The UCAR NOC uses "severity levels" to determine the importance of service problems. There are at least three places that the concept of "severity" is defined:

Luckily, these three definitions are similar!

Notification

Nagios WebUI plays a "horn" wav file when a system goes down.  This is specified in the cgi.cfg and is the primary method CPG relies on to know when they need to respond.

We use the Nagios notification system to send a Win popup message to the main CPG workstation when a machine recovers.

Event Log

Nagios rotates its own logs daily. The logs are located in /var/log/nagios2. The current log is nagios.log. Old logs go into the archives/ directory. These can be browsed from the web interface with the event log viewer.

Pros and cons of Nagios

SNMP

snmp4nagios

To check an SNMP variable with Nagios, you can use check_snmp or you can write your own plugins. The latter may be better, because the name of the plugin provides some documentation, and the plugin itself can check the range of values. If you use the latter choice, steal like mad from snmp4nagios.

as root
cd /usr/src
gunzip snmp4nagios-0.1.tar.gz
tar xf snmp4nagios-0.1.tar
rm snmp4nagios-0.1.tar
cd snmp4nagios-0.1
more INSTALL
mkdir -p /usr/local/nagios/perflog/rrd
mkdir -p /usr/local/nagios/perflog/img
mkdir -p /usr/local/nagios/libexec/snmp4nagios

At this point, you're ready to "make install", but it'll fail because it can't find rrd.h. You have to install rrdtool-devel before you can continue.

SNMP traps

SNMP traps can be handled with a Nagios "passive service check". Need to investigate this.

Jeff installed the Net-SNMP package on nagman, and sent me some notes about it. He suggested that I consider these:

VERY GOOD: http://www.samag.com/documents/s=9559/sam0503g/. I made a copy of it at SNMP-traps.html.

http://www.snmptt.org/

See syslog-nagman.shtml and net-snmp-nagman.shtml

Bugs

Some browsers "hang", which means they initially work fine, but after a while, when you try to leave a page, the browser doesn't go to the new page. Instead, it turns the cursor to an hourglass, and when the 60-second timeout happens, then the browser changes pages. This happens with Firefox and IE under Windows XP on BRONCO, the NOC Nagios machine. I'm trying Opera on that machine as of 2006-03-03, hope it helps.

Other links and resources

Monarch

Maintaining config files by hand can be a pain. Monarch is an open-source program that edits Nagios config files. It's available on SourceForge at http://sourceforge.net/project/showfiles.php?group_id=130574