Martins Blog

Trying to explain complex things in simple terms

Thousands of core dumps in $OMS_HOME

Posted by Martin Bach on May 24, 2010

I have been asked to debug a problem with another office’s Grid Control 10.2.0.5 setup. In a nutshell, users complained they could not log in, and those who were previously connected were kicked out. Hmm, so much for a briefing.

When troubleshooting Grid Control it is important to remember that Oracle’s monitoring product is made up of 3 parts:

  1. The user interface, the Oracle Management Server (OMS)
  2. The repository database
  3. An arbitrarily large number of agents deployed to the monitored targets
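
If one of the three misbehaves, it pays to check each tier in turn. A minimal sketch of the status checks (10.2 command syntax; $OMS_HOME and $AGENT_HOME stand for the respective installation homes):

$ $OMS_HOME/bin/emctl status oms        # the management server itself
$ $OMS_HOME/opmn/bin/opmnctl status     # the OC4J and Apache processes behind it
$ $AGENT_HOME/bin/emctl status agent    # an agent, run on the monitored host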

With the briefing in mind, I first checked the OMS for problems and sure enough, the OMS partition was 100% full at the operating system level. Must have been a runaway logfile, I thought, and checked $OMS_HOME. To my surprise there were thousands of core.<pid> files. Things got even stranger: the oldest core files dated from Sunday, when lots of network changes had taken place. Coincidence? The Apache process’s error_log showed lots of handshake problems between OMS and targets, which seemed to undermine the idea of a mere coincidence.
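
A couple of commands tell that story quickly (a sketch; the mount point is an example):

$ df -h /opt/oracle                  # the OMS file system: 100% used
$ ls $OMS_HOME | grep -c '^core\.'   # count the core.<pid> files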

However, let’s not be too quick, but rather scientific about this. I checked one of the core dumps to see which process had created it:

$ file core.22473
core.22473: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'httpd'

Yep, definitely an Apache core. With all of Sunday’s network problems in mind I decided to get a clean slate and restart OPMN:

$ cd $OMS_HOME/opmn/bin
$ ./opmnctl stopall

After a few moments the processes had stopped gracefully. I cleared out the core files and started OPMN again, firmly believing that this had solved the problem: new processes would not have problems with the SSL handshake as the network problems had been resolved. But no, during the execution of “opmnctl startall” the core dumps returned. A closer look at the core dumps was required. So far I had only done this with RDBMS core dumps, but that shouldn’t make a difference here. I again stopped OPMN and all components with it, and cleared out all core dumps but one, which I moved to /tmp for analysis.
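
In case you wonder about the cleanup: with thousands of files, a plain “rm core.*” can run into the shell’s argument list limit, so something along these lines is safer (a sketch; the PID is an example):

$ cd $OMS_HOME
$ mv core.22473 /tmp                                  # keep one specimen for analysis
$ find . -maxdepth 1 -name 'core.*' -type f -delete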

Core dumps can be inspected with a debugger, which on the Linux platform is most often gdb. Using “script” to record my session, I opened the core dump in gdb:

$ script analysis.trc
$ gdb $OMS_HOME/Apache/Apache/bin/httpd core.22473
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /opt/oracle/10.2.0/oms10g/lib/libdms2.so...done.
Loaded symbols for /opt/oracle/10.2.0/oms10g/lib/libdms2.so
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /usr/lib/libdb.so.2...done.
Loaded symbols for /usr/lib/libdb.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /opt/oracle/10.2.0/oms10g/Apache/Apache/libexec/mod_onsint.so...done.
Loaded symbols for /opt/oracle/10.2.0/oms10g/Apache/Apache/libexec/mod_onsint.so
Reading symbols from /opt/oracle/10.2.0/oms10g/lib/libons.so...done.
Loaded symbols for /opt/oracle/10.2.0/oms10g/lib/libons.so
Reading symbols from /opt/oracle/10.2.0/oms10g/Apache/Apache/libexec/mod_mmap_static.so...done.
[lots of Apache modules later]
Reading symbols from /opt/oracle/10.2.0/oms10g/Apache/modplsql/bin/modplsql.so...done.
Loaded symbols for /opt/oracle/10.2.0/oms10g/Apache/modplsql/bin/modplsql.so
Core was generated by `/opt/oracle/10.2.0/oms10g/Apache/Apache/bin/httpd -d /opt/oracle/10.2.0/oms10g/'.
Program terminated with signal 25, File size limit exceeded.
[New process 22473]
[New process 22503]
[New process 22502]
#0  0xffffe405 in __kernel_vsyscall ()
(gdb)

At the prompt, enter “bt full” to get the stack trace. If nothing else, you can upload this to Oracle Support and hope for the best.

(gdb) bt full
#0  0xffffe405 in __kernel_vsyscall ()
No symbol table info available.
#1  0x003de44b in write () from /lib/libpthread.so.0
No symbol table info available.
#2  0x556d4fdf in config_log_transaction () from /opt/oracle/10.2.0/oms10g/Apache/Apache/libexec/mod_log_config.so
No symbol table info available.
#3  0x556d5048 in multi_log_transaction () from /opt/oracle/10.2.0/oms10g/Apache/Apache/libexec/mod_log_config.so
No symbol table info available.
#4  0x08055ff4 in run_method ()
No symbol table info available.
#5  0x08056125 in ap_log_transaction ()
No symbol table info available.
#6  0x0806ca06 in ap_process_request ()
No symbol table info available.
#7  0x0806457a in child_main ()
No symbol table info available.
#8  0x080647fa in make_child ()
No symbol table info available.
#9  0x08064b05 in perform_idle_server_maintenance ()
No symbol table info available.
#10 0x0806514d in standalone_main ()
No symbol table info available.
#11 0x08065860 in main ()
No symbol table info available.

This didn’t mean anything to me except that the process had probably issued some kind of system call and died. I did some more digging around without result, until my eye caught this line I had previously missed:

Program terminated with signal 25, File size limit exceeded.

Aha, that’s a bit of a giveaway. The Apache logs are in $OMS_HOME/Apache/Apache/logs. Which of them might be causing the problem? Again, Linux provides the answer:

[oracle@omshost logs]$ du -sk * | sort -n | tail
3144      ssl_engine_log.1260964800
3144      ssl_engine_log.1261008000
3144      ssl_engine_log.1261051200
3148      ssl_engine_log.1260187200
3148      ssl_engine_log.1260316800
3148      ssl_engine_log.1260360000
3148      ssl_engine_log.1260835200
3424      access_log.1271160000
300100    error_log
2099204   access_log

Those 2099204 KB translate into approximately 2GB. Would Apache choke on too large a logfile? I moved it out of the way and restarted OPMN. This time all was well! So why didn’t I spot signal 25 straight away? I think when troubleshooting we often focus on the next step before fully digesting the output of the previous one. Also, I didn’t expect anything useful in the output for the libraries. For me the lessons learned are:

  • Don’t overlook the obvious
  • Don’t get too distracted by the step you are currently on
  • Keep your eyes open!
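
In hindsight, signal 25 told the whole story: on Linux it is SIGXFSZ, sent when a process tries to write beyond its file size limit, and a 32-bit httpd built without large file support cannot write past 2^31 bytes. A quick shell check shows the access_log had just crossed that line (bash arithmetic on the du figures above):

$ kill -l 25                                # translate signal number 25 into its name
XFSZ
$ echo $(( 2099204 * 1024 )) $(( 2**31 ))   # access_log in bytes vs. the 2GB ceiling
2149584896 2147483648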

6 Responses to “Thousands of core dumps in $OMS_HOME”

  1. Martin,

    I’d have expected the access_log to be rotated with the Apache rotatelogs directive. That’s the default in almost all versions of OHS, so I’m very surprised to find that not the case in Grid Control. The file that I have found doesn’t get rotated is error_log. On busy OHS systems that one can easily grow fast enough that a 2GB logfile is at the top of my mental hit list of things to check when Apache fails :(.
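
    For reference, the rotation is normally wired up in httpd.conf along these lines (Apache 1.3 syntax; the path is taken from the post):

    CustomLog "|/opt/oracle/10.2.0/oms10g/Apache/Apache/bin/rotatelogs /opt/oracle/10.2.0/oms10g/Apache/Apache/logs/access_log 86400" common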

    cheers

    Niall

    • Martin said

      You are very right, this is a bit strange. I’ll dig deeper and post an update as soon as I have one.

  2. Hello Martin,

    I stumbled upon this problem a long time ago as well.

    There is also a MOS Note 419999.1 “Problem: Core Dumps: Grid Control OMS generates huge amount of core.xxx files causing the system to halt”.

    Best regards,
    Martin

  3. […] https://martincarstenbach.wordpress.com/2010/05/24/thousands-of-core-dumps-in-oms_home/ […]

  4. Thanks for the article Martin

  5. OHS will be running multi-threaded (hint: you see mentions of pthread in the stack trace). For multi-threaded processes, a simple “bt” or “bt full” will not show you the thread that crashed (unless it was the first thread, which is unlikely). If you believed what you saw when using a simple “bt” to diagnose Apache cores, you’d be forgiven for thinking most of the crashes were caused by the MPM or an APR routine, when it’s more likely to be something else (e.g. an extra module, such as a WebGate).

    What you need to do is a ‘thread apply all bt’ (you can try adding ‘full’ but that’s not useful without the symbol tables, which you won’t have for OHS).

    Because a core was dumped, you can easily identify the thread that caused the crash by looking for whichever thread was calling sig_coredump().

    ‘set pagination off’ is also useful.
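
    Putting it together (commands only; output omitted):

    (gdb) set pagination off
    (gdb) thread apply all bt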
