I have been asked to debug a problem with another office’s Grid Control 10.2.0.5 setup. In a nutshell, users complained they could not log in, and those who were already connected were kicked out. Hmm, so much for a briefing.
When troubleshooting Grid Control it is important to remember that Oracle’s monitoring product is made up of three parts:
- The user interface and middle tier, the Oracle Management Server (OMS)
- The repository database
- An arbitrarily large number of agents deployed to the monitored targets (a quick status check for each tier is sketched below)
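When the briefing is as thin as this one, a quick sanity check of each tier helps narrow things down. A minimal sketch, assuming the standard 10.2 homes and that the repository has a TNS alias of EMREP (the alias is just a placeholder):

$ # Middle tier: is the Oracle Management Server itself up?
$ $OMS_HOME/bin/emctl status oms

$ # Repository: can we reach the database the OMS stores its data in?
$ sqlplus sysman@EMREP

$ # Agent (run on a monitored host): is it up and uploading?
$ $AGENT_HOME/bin/emctl status agent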
With the briefing in mind, I first checked the OMS for problems and sure enough, the OMS partition was 100% full at the operating system level. It must have been a runaway logfile, I thought, and checked $OMS_HOME. To my surprise there were thousands of core.<pid> files. Things got even stranger: the oldest core files dated from Sunday, when a lot of network changes had taken place. Coincidence? The Apache process’s error_log showed lots of SSL handshake problems between the OMS and its targets, which seemed to undermine the coincidence theory.
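For the record, confirming the full partition and dating the cores only takes a couple of commands; a sketch, assuming the cores land directly under $OMS_HOME:

$ # How full is the filesystem holding the OMS home?
$ df -h $OMS_HOME

$ # How many core files are there, and which one is the oldest?
$ ls $OMS_HOME/core.* | wc -l
$ ls -ltr $OMS_HOME/core.* | head -1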
However, let’s not be too quick; let’s be scientific about it. I checked the core dump to see which process created it:
$ file core.22473
core.22473: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'httpd'
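With thousands of cores lying around, it is also worth a quick check that they all come from the same executable rather than from a mix of processes. A small sketch, run from the directory holding the cores:

$ # Summarise which binaries produced the core files
$ for c in core.*; do file "$c"; done | awk -F"from " '{print $2}' | sort | uniq -c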
Yep, definitely an Apache core. With all those network problems from Sunday, I decided to get a clean slate and restart OPMN:
$ cd $OMS_HOME/opmn/bin
$ ./opmnctl stopall
After a few moments the processes had stopped gracefully. I cleared out the core files and started OPMN, firmly believing that this had solved the problem: new processes should not have trouble with the SSL handshake now that the network issues were resolved. But no, during the execution of “opmnctl startall” the core dumps returned. A closer look at them was required. So far I had only done this with RDBMS core dumps, but that shouldn’t make a difference here. I stopped OPMN and all its components again and cleared out all core dumps except one, which I moved to /tmp for analysis.
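For completeness, the clean-up amounted to something like this (the exact paths are from memory, and the core file name is simply the one I happened to keep):

$ cd $OMS_HOME
$ mv core.22473 /tmp/                                # keep one specimen for analysis
$ find . -maxdepth 1 -name 'core.*' -type f -delete  # remove the rest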
Core dumps can be inspected with a debugger, which on the Linux platform is most often gdb. Using “script” to record my session, I opened the core dump in gdb:
$ script analysis.trc
$ gdb $OMS_HOME/Apache/Apache/bin/httpd core.22473
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
warning: Can't read pathname for load map: Input/output error.
Reading symbols from /opt/oracle/10.2.0/oms10g/lib/libdms2.so...done.
Loaded symbols for /opt/oracle/10.2.0/oms10g/lib/libdms2.so
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /usr/lib/libdb.so.2...done.
Loaded symbols for /usr/lib/libdb.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /opt/oracle/10.2.0/oms10g/Apache/Apache/libexec/mod_onsint.so...done.
Loaded symbols for /opt/oracle/10.2.0/oms10g/Apache/Apache/libexec/mod_onsint.so
Reading symbols from /opt/oracle/10.2.0/oms10g/lib/libons.so...done.
Loaded symbols for /opt/oracle/10.2.0/oms10g/lib/libons.so
Reading symbols from /opt/oracle/10.2.0/oms10g/Apache/Apache/libexec/mod_mmap_static.so...done.
[lots of Apache modules later]
Reading symbols from /opt/oracle/10.2.0/oms10g/Apache/modplsql/bin/modplsql.so...done.
Loaded symbols for /opt/oracle/10.2.0/oms10g/Apache/modplsql/bin/modplsql.so
Core was generated by `/opt/oracle/10.2.0/oms10g/Apache/Apache/bin/httpd -d /opt/oracle/10.2.0/oms10g/'.
Program terminated with signal 25, File size limit exceeded.
[New process 22473]
[New process 22503]
[New process 22502]
#0  0xffffe405 in __kernel_vsyscall ()
(gdb)
At the prompt, enter “bt full” to get the full stack trace. If nothing else, you can upload this to Oracle Support and hope for the best.
(gdb) bt full
#0  0xffffe405 in __kernel_vsyscall ()
No symbol table info available.
#1  0x003de44b in write () from /lib/libpthread.so.0
No symbol table info available.
#2  0x556d4fdf in config_log_transaction () from /opt/oracle/10.2.0/oms10g/Apache/Apache/libexec/mod_log_config.so
No symbol table info available.
#3  0x556d5048 in multi_log_transaction () from /opt/oracle/10.2.0/oms10g/Apache/Apache/libexec/mod_log_config.so
No symbol table info available.
#4  0x08055ff4 in run_method ()
No symbol table info available.
#5  0x08056125 in ap_log_transaction ()
No symbol table info available.
#6  0x0806ca06 in ap_process_request ()
No symbol table info available.
#7  0x0806457a in child_main ()
No symbol table info available.
#8  0x080647fa in make_child ()
No symbol table info available.
#9  0x08064b05 in perform_idle_server_maintenance ()
No symbol table info available.
#10 0x0806514d in standalone_main ()
No symbol table info available.
#11 0x08065860 in main ()
No symbol table info available.
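If you would rather capture the backtrace without an interactive session, for instance to attach it to a service request, gdb can also be driven in batch mode; a minimal sketch (the output file name is my own choice):

$ # Capture the full backtrace non-interactively
$ gdb --batch -ex 'bt full' $OMS_HOME/Apache/Apache/bin/httpd /tmp/core.22473 > /tmp/httpd_bt.txt 2>&1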
This didn’t mean much to me at the time, beyond the fact that the process had died in a write() system call, issued from Apache’s logging module judging by the mod_log_config frames. I did some more digging around without result until my eye caught a line I had previously missed:
Program terminated with signal 25, File size limit exceeded.
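Signal 25 on Linux x86 is SIGXFSZ, delivered when a process tries to write past the maximum file size it is allowed. Two quick checks confirm the mapping and show what limit the current shell imposes (purely illustrative):

$ kill -l 25        # prints XFSZ, i.e. SIGXFSZ
$ ulimit -f         # file size limit for the shell, in 512-byte blocks ("unlimited" if not set)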
Aha, that’s a bit of a giveaway: some process is writing to a file that has hit its size limit. The Apache logs live in $OMS_HOME/Apache/Apache/logs, so which of them might be causing the problem? Again, Linux provides the answer:
[oracle@omshost logs]$ du -sk * | sort -n | tail
3144    ssl_engine_log.1260964800
3144    ssl_engine_log.1261008000
3144    ssl_engine_log.1261051200
3148    ssl_engine_log.1260187200
3148    ssl_engine_log.1260316800
3148    ssl_engine_log.1260360000
3148    ssl_engine_log.1260835200
3424    access_log.1271160000
300100  error_log
2099204 access_log
Those 2099204 KB translate into roughly 2 GB. Would Apache choke on a logfile that large? Most likely yes: the httpd binary is 32-bit, and a 32-bit process without large file support cannot write past the 2 GB boundary, so every attempt to append to the access_log would raise signal 25 (SIGXFSZ) and kill the child process. I moved the file out of the way and restarted OPMN, and this time all was well! So why didn’t I spot signal 25 straight away? I think when troubleshooting we often focus on the next step before fully digesting the output of the previous one. I also didn’t expect anything useful in the output about the libraries. For me the lessons learned are:
- Don’t overlook the obvious
- Don’t get too absorbed in what you are currently doing
- Keep your eyes open!
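As a follow-up, it is worth making sure the access_log never creeps towards the 2 GB boundary again. One option, and this is my own suggestion rather than anything from the original setup, is a small copy-truncate job run from cron; a sketch, with paths and the threshold as assumptions:

#!/bin/bash
# Hypothetical nightly cron job: copy and truncate the OMS Apache access_log
# before it can grow towards the 2 GB boundary. Assumes OMS_HOME is set.
LOG=$OMS_HOME/Apache/Apache/logs/access_log
MAX_KB=1048576                      # rotate once the log passes ~1 GB

size_kb=$(du -sk "$LOG" | awk '{print $1}')
if [ "$size_kb" -gt "$MAX_KB" ]; then
    cp "$LOG" "$LOG.$(date +%s)"    # keep a timestamped copy
    : > "$LOG"                      # truncate in place so httpd keeps its file handle
fi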