Quis custodiet ipsos custodies-Nagios monitoring for Grid Control

I have a strange problem with my Grid Control Management Server in a Solaris 10 zone. When restarted, the OMS will serve requests fine for about 2 to 4 hours and then “hang”. Checking the Admin Server console I can see that there are stuck threads. The same information is also recorded in the logs.

NB: the really confusing part about Grid Control 11.1 is the use of Weblogic-you thought you knew where the Grid Control logs where? Forget about what you knew about 10.2 and enter a different dimension :)

So to be able to react quicker to a hang of the OMS (or EMGC_OMS1 to be more precise) I set up nagios to periodically poll the login page.

I’m using a VM with OEL 5.5 64bit to deploy nagios to, the requirements are very moderate. The install process is well documented in the quickstart guide-I’m using Fedora as a basis. OEL 5.5 doesn’t have nagios 3 RPMs available, so I decided to use the source downloaded from nagios.org. The tarballs you need are nagios-3.2.3.tar.gz and nagios-plugins-1.4.15.tar.gz at the time of this writing.If you haven’t got a development environment, build it:

  • # yum install httpd
  • # yum install php
  • # yum install gcc glibc glibc-common
  • # yum install gd gd-devel
  • # yum install openssl-devel

From then on it’s as simple as copy-pasting from the quickstart guide. The only problem I had with the check_http plugin was the lack of openssl-devel. I initially built the plugins without --with-openssl=/usr/include/openssl flag. After executing the configure command again the build didn’t work for check_http (undefined symbol foo), but that could be fixed with a “make clean; make”. I just realised that my wordpress theme seems to combine two dashes into 1 – there is nothing I can do about that, sorry (frustrating in the case of the configure command etc)

For the remainder of this article I assume you built nagios with these arguments to configure:

./configure --with-command-group=nagcmd

The plugins have been built with these options:

./configure --with-openssl=/usr/include/openssl --with-nagios-user=nagios --with-nagios-group=nagios

This will install nagios to /usr/local/nagios which is fine by me-you’d obviously choose a different prefix when configuring for a production nagios server. Start the nagios server as per the quickstart guide using “service nagios start”.

With nagios up and running you can connect to the dashboard: http://nagios.example.com/nagios

You authenticate yourself using the nagiosadmin account and the password you supplied earlier to the htpasswd command.

Great! Your nagios environment is up and running. Next you need to add the OMS to the configuration. First of all I opted to change the example configuration-three steps are to be performed:

  • Modify the contacts
  • Create a check_oms command
  • Add the OMS to the list of monitored targets

Again, I should note that this setup is for monitoring 2 OMS hosts only, nothing else. I’m saying this because the way I add the targets is not the most elegant one. If you intend to add more targets to the nagios setup you should opt for a better approach which is commented out in the nagios.cfg file.

Modifying contact information

I would like to be informed in case something goes wrong. Nagios offers a wealth of notification methods, I’m limiting myself to email. The file you’d like to modify with your favourite text editor is /usr/local/nagios/etc/objects/contacts.cfg

The most basic (but sufficient) way is to edit the nagiosadmin contact. Simply change the email address to your email address and save the file. NB: you may have to configure your local MTA and add a mail relay-ask your friendly sys-admin how to do so.

Create the check_oms command

Before we can define it as a target in nagios, we need to tell nagios how to monitor the OMS. Nagios comes with a basic set of plugins, amongst which the check_http seems the most suitable. It needs to be compiled with the openssl-devel package (see above) since the OMS logon requires the https protocol.

Open /usr/local/nagios/etc/objects/commands.cfg with your favourite text editor and add a command such as this one:

define command{
command_name    check_oms
command_line    $USER1$/check_http -H $HOSTALIAS$ -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon

Translated back to English this means that if the check_oms command is defined as a so called service check in nagios then the check_http script is called against the host defined by the host alias (we’ll define that in a minute) variable. Furthermore, if we receive http 302 codes (moved temporarily) I want the check to return a critical error instead of an OK. If my response time is > 5 seconds I want the service to emit a “warning” reply, and if it takes longer than 10 seconds than that’s critical. The remaining variables specify that I need to use SSL against port 7799 (default Grid Control port-change if yours is different) and the URL is /em/console/logon/logon. Don’t simply specify /em as the URL as that will silently redirect you to /em/console/logon/logon after a HTTP 302 message which doesn’t help in this case. You can run the command interactively on the nagios host. The check is in /usr/local/nagios/libexec; the “-v” option displays the HTTP traffic:

[root@nagios libexec]# ./check_http -H oms.example.com -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon -v
GET /em/console/logon/logon HTTP/1.1
User-Agent: check_http/v1.4.15 (nagios-plugins 1.4.15)
Connection: close
Host: oms.example.com:7799
https://oms.example.com:7799/em/console/logon/logon is 8671 characters
**** HEADER ****
Date: Mon, 28 Feb 2011 10:27:14 GMT
Server: Oracle-Application-Server-11g
Set-Cookie: JSESSIONID=tJ0yNr4CgGf4gyTPJR4kKTzL2WBg1SFLQvh0ytrpC3Kgv9xqkDsF!-2069537441; path=/em; HttpOnly
X-ORACLE-DMS-ECID: 00074S^kp_dF8Dd_Tdd9ic0000B4000DkI
X-Powered-By: Servlet/2.5 JSP/2.1
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Content-Language: en
**** CONTENT ****
[content skipped...]
HTTP OK: HTTP/1.1 200 OK - 8671 bytes in 0.067 second response time |time=0.067383s;5.000000;10.000000;0.000000 size=8671B;;;0
[root@nagios libexec]#

Right- HTTP 200 and sub-second response time: I’m happy.

Create a new monitored target

This is the final step to be completed. I started off by copying the localhost.cfg file to oms.cfg and edited it. Below is a sample file with all comments and other verbose information removed:

define host{
use               linux-server
host_name         OMS_pilot
alias             oms.example.com

define service{
use                     generic-service
host_name               OMS_pilot
service_description     HTTP
is_volatile             0
check_period            24x7
max_check_attempts      10
normal_check_interval   1
retry_check_interval    1
contact_groups          admins
notification_options    w,u,c,r
notification_interval   960
notification_period     workhours
check_command           check_oms

I’m also using the check_ping command but that’s the default and not shown here. How does nagios know what server to execute the check against? That’s back in the command definition. Remember the -H $HOSTALIAS$ directive? Upon execution of the check, the value of the host’s alias configuration variable will be passed to the check_oms command. You should therefore ensure that the nagios host can resolve that host name, and I’d recommend using the FQDN as well.

The service check will execute the check_oms command against the host every minute 24×7. In case the service is critical, it will notify the contact group admins (which you edited in step 1) and send email during work hours (09:00 – 17:00 by default, defined in timeperiods.cfg).

The final bit where everything is tied together is the nagios.cfg file: add the definition for your host as in this example:


Alternatively, if you would like to logically group your objects, you could create /usr/local/nagios/etc/servers and put all your server configuration files in there. Regardless what option you choose, the next step is to  reload the nagios service to reflect the current configuration.

(Ignore the warning-that’s a http 403 issue on another host …)

Happy monitoring!