Martins Blog

Trying to explain complex things in simple terms

Troubleshooting Grid Infrastructure startup

Posted by Martin Bach on March 17, 2011

Today turned into an interesting story when one of my blades decided to reboot after an EXT3 journal error. The hard facts first:

  • Oracle Linux 5.5 with kernel 2.6.18-194.11.4.0.1.el5
  • Oracle 11.2.0.2 RAC
  • Bonded NICs for private and public networks.
  • Private bondN device defined on a VLAN
  • BL685-G6 with 128G RAM

I first noticed that the node had a problem when I tried to list the databases configured on the cluster and got the dreaded “cannot communicate with the CRSD”:

[oracle@node1.example.com] $ srvctl config database
PRCR-1119 : Failed to look up CRS resources of database type
PRCR-1115 : Failed to find entities of type resource that match filters (TYPE ==ora.database.type) and contain attributes DB_UNIQUE_NAME,ORACLE_HOME,VERSION
Cannot communicate with crsd

Not too great, especially since everything worked when I left yesterday. What could have gone wrong? An obvious candidate is a reboot, and fair enough, there had been one:

[grid@node1.example.com] $ uptime
09:09:22 up  2:40,  1 user,  load average: 1.47, 1.46, 1.42

The next step was to check if the local CRS stack was up, or better, to check what was down. Sometimes it’s only crsd which has a problem. In my case everything was down:

[grid@node1.example.com] $ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[grid@node1.example.com] $ crsctl check cluster -all
**************************************************************
node1:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
CRS-4404: The following nodes did not reply within the allotted time:
node2,node3, node4, node5, node6, node7, node8

The CRS-4404 message was slightly misleading; it made me assume that all cluster nodes were down after a cluster-wide reboot, and sometimes a single node reboot does trigger worse things. However, logging on to node 2 I saw that all nodes except the first one were fine.
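
As an aside, a quick way to see which nodes the surviving cluster members consider up is olsnodes, run from a healthy node (a small sketch; the output format varies slightly between releases):

[grid@node2.example.com] $ olsnodes -s
# prints one line per node with its status (Active/Inactive);
# the rebooted node would be the one shown as Inactive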

CRSD really needs CSSD to be up and running, and CSSD requires the OCR to be there. I wanted to know if the OCR was impacted in any way:

[grid@node1.example.com] $ ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
ORA-29701: unable to connect to Cluster Synchronization Service
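
Even with the whole stack down you can at least confirm where the OCR is expected to live. On Linux the pointer is kept in /etc/oracle/ocr.loc (a quick sketch; the diskgroup name in the comment is purely illustrative):

[root@node1 ~]# cat /etc/oracle/ocr.loc
# expect entries along the lines of
#   ocrconfig_loc=+OCRVOTE
#   local_only=FALSE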

Well it seemed that the OCR location was unavailable. I know that on this cluster the OCR is stored in ASM. Common reasons for the PROC-26 error are:

  • Unix admin upgrades the kernel but forgets to upgrade the ASMLib kernel module (common grief with ASMLib!)
  • Storage is not visible on the host, i.e. SAN connectivity broken/taken away (happens quite frequently with storage/sys admin unaware of ASM)
  • Permissions not set correctly on the block devices (not an issue when using ASMLib)

I checked ASMLib and it reported a working status:

[oracle@node1.example.com] $ /etc/init.d/oracleasm status
Checking if ASM is loaded: yes
Checking if /dev/oracleasm is mounted: yes

That was promising: /dev/oracleasm/ was populated and the matching kernel modules were loaded. /etc/init.d/oracleasm listdisks listed all my disks as well, so inaccessible physical storage (PROC-26) seemed rather unlikely now.
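
For completeness, these are the kind of checks I ran to convince myself ASMLib could still see its disks (the disk label DATA01 is made up for this example):

[root@node1 ~]# /etc/init.d/oracleasm listdisks
# prints one line per ASM disk label known to ASMLib
[root@node1 ~]# /etc/init.d/oracleasm querydisk DATA01
# confirms whether the label is marked as a valid ASM disk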

I could rule out permission problems since ASMLib was working fine, and I also ruled out the kernel-upgrade/missing-module problem by comparing the installed oracleasm RPM with the running kernel version: they matched. So maybe it was storage related after all?
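
The comparison itself is simple enough; something along these lines is all it takes to spot a mismatch between the running kernel and the ASMLib kernel module package (package names as used by Oracle's ASMLib RPMs):

[root@node1 ~]# uname -r
# running kernel, here 2.6.18-194.11.4.0.1.el5
[root@node1 ~]# rpm -qa | grep -i oracleasm
# the oracleasm-<kernelversion> package must match the output of uname -r;
# oracleasm-support and oracleasmlib are kernel independent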

Why did the node go down?

Good question, and usually one for the Unix administration team. Luckily I have a good contact right inside that team and could get the following excerpt from /var/log/messages around the time of the crash (6:31 this morning):

Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192116, count = 1
Mar 17 06:26:06 node1 kernel: Aborting journal on device dm-2.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192216, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192166, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192122, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192140, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192174, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 10 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_truncate: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_orphan_del: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_delete_inode: Journal has aborted
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 kernel: ext3_abort called.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Mar 17 06:26:06 node1 kernel: Remounting filesystem read-only
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:41076
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:41076
Mar 17 06:26:09 node1 kernel: SysRq : Resetting
Mar 17 06:31:15 node1 syslogd 1.4.1: restart.

So it looks as if a file system error triggered the reboot; I'm glad the box came back up ok on its own. The Grid Infrastructure alert log in $GRID_HOME/log/<hostname>/alert<hostname>.log didn't show anything specific to storage either. Normally you would see CSSD start counting the node down if it had lost contact with the voting disks (in this case the OCR and voting disks share the same diskgroup).
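
For reference, this is where and how I looked (11.2 pre-ADR log layout; substitute your own short host name):

[grid@node1.example.com] $ cd $GRID_HOME/log/$(hostname -s)
[grid@node1.example.com] $ grep -i voting alert$(hostname -s).log | tail
# a node that lost access to its voting files logs it here well before a reboot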

And why does Clusterware not start?

After some more investigation it seemed there was no underlying problem with the storage, so I tried to start the cluster manually, tailing the ocssd.log file for possible clues.

[root@node1 ~]# crsctl start cluster
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2674: Start of 'ora.cssd' on 'node1' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'node1'
CRS-2681: Clean of 'ora.cssd' on 'node1' succeeded
CRS-5804: Communication error with agent process
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'

… the command eventually failed. The ocssd.log file showed this:

...
2011-03-17 09:47:49.073: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996354 ms, node 0x2aaab008a260 { host 'node4', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-a7491097, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.084: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996364 ms, node 0x2aaab008a630 { host 'node6', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-2f6ece1c, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.113: [    CSSD][1113332032]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846440, LATS 10996434, lastSeqNo 30846437, uniqueness 1300108895, timestamp 1300355268/3605443434
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355257, LATS 10996434, lastSeqNo 31355254, uniqueness 1300344405, timestamp 1300355268/10388584
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372473, LATS 10996434, lastSeqNo 31372470, uniqueness 1297097908, timestamp 1300355268/3605182454
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384686, LATS 10996434, lastSeqNo 31384683, uniqueness 1297098093, timestamp 1300355268/3604696294
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388819, LATS 10996434, lastSeqNo 31388816, uniqueness 1297098327, timestamp 1300355268/3604712934
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612975, LATS 10996434, lastSeqNo 29612972, uniqueness 1297685443, timestamp 1300355268/3603054884
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203293, LATS 10996434, lastSeqNo 31203290, uniqueness 1297156000, timestamp 1300355268/3604855704
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 3, node33, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355258, LATS 10996434, lastSeqNo 31355255, uniqueness 1300344405, timestamp 1300355268/10388624
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372474, LATS 10996434, lastSeqNo 31372471, uniqueness 1297097908, timestamp 1300355268/3605182494
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384687, LATS 10996434, lastSeqNo 31384684, uniqueness 1297098093, timestamp 1300355268/3604696304
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388821, LATS 10996434, lastSeqNo 31388818, uniqueness 1297098327, timestamp 1300355268/3604713224
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612977, LATS 10996434, lastSeqNo 29612974, uniqueness 1297685443, timestamp 1300355268/3603055224
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846441, LATS 10996474, lastSeqNo 30846438, uniqueness 1300108895, timestamp 1300355269/3605443654
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355259, LATS 10996474, lastSeqNo 31355256, uniqueness 1300344405, timestamp 1300355268/10389264
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203294, LATS 10996474, lastSeqNo 31203291, uniqueness 1297156000, timestamp 1300355269/3604855914
2011-03-17 09:47:49.619: [    CSSD][1116485952]clssnmSendingThread: sending join msg to all nodes
...

The interesting bit is the “has a disk HB, but no network HB”, i.e. something must be wrong with the network configuration. I quickly checked the output of ifconfig and found that the entry for my private interconnect was missing. If you are unsure which interface the interconnect uses, it is defined in the GPnP profile:

<gpnp:Network-Profile>
 <gpnp:HostNetwork id="gen" HostName="*">
 <gpnp:Network id="net1" IP="17x.xx.x9.1x0" Adapter="bond1.251" Use="cluster_interconnect" />
 </gpnp:HostNetwork>
</gpnp:Network-Profile>
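
If you don't fancy hunting for the profile.xml on disk, gpnptool can dump the profile for you (a minimal sketch; run it as the grid software owner with the environment pointing at the grid home):

[grid@node1.example.com] $ $GRID_HOME/bin/gpnptool get
# prints the signed GPnP profile XML, including the gpnp:Network-Profile
# section quoted above with the cluster_interconnect adapter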

Now that’s a starting point! I tried to bring up bond1.251, but that failed:

[root@node1 network-scripts]# ifup bond1.251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
ERROR: could not add vlan 251 as bond1.251 on dev bond1

The “invalid argument” didn’t mean much to me, so I checked what /sbin/ifup actually was and ran it through bash with the “-x” option to trace which command produced the error:

[root@node1 network-scripts]# which ifup
/sbin/ifup
[root@node1 network-scripts]# view /sbin/ifup
# turned out it's a shell script! Let's run with debug output enabled
[root@node1 network-scripts]# bash -x /sbin/ifup bond1.251
+ unset WINDOW
...
+ MATCH='^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$'
+ [[ bond1.251 =~ ^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$ ]]
++ echo bond1.251
++ LC_ALL=C
++ sed 's/^[a-z0-9]*\.0*//'
+ VID=251
+ PHYSDEV=bond1
+ [[ bond1.251 =~ ^vlan[0-9]{1,4}? ]]
+ '[' -n 251 ']'
+ '[' '!' -d /proc/net/vlan ']'
+ test -z ''
+ VLAN_NAME_TYPE=DEV_PLUS_VID_NO_PAD
+ /sbin/vconfig set_name_type DEV_PLUS_VID_NO_PAD
+ is_available bond1
+ LC_ALL=
+ LANG=
+ ip -o link
+ grep -q bond1
+ '[' 0 = 1 ']'
+ return 0
+ check_device_down bond1
+ echo bond1
+ grep -q :
+ LC_ALL=C
+ ip -o link
+ grep -q 'bond1[:@].*,UP'
+ return 1
+ '[' '!' -f /proc/net/vlan/bond1.251 ']'
+ /sbin/vconfig add bond1 251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
+ /usr/bin/logger -p daemon.info -t ifup 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
+ echo 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
ERROR: could not add vlan 251 as bond1.251 on dev bond1
+ exit 1

Hmmm, so it seemed that the underlying interface bond1 was missing, which was true: the output of ifconfig didn’t show it as configured, and trying to start it manually with “ifup bond1” failed as well. It turned out that the ifcfg-bond1 file was missing and had to be recreated following the official Red Hat documentation. Network configuration files on Red Hat based systems live in /etc/sysconfig/network-scripts/ as ifcfg-<interfaceName>; a sketch of what the recreated pair of files might look like follows after the listing. With the recreated file in place, I was back in the running:

[root@node1 network-scripts]# ll *bond1*
-rw-r--r-- 1 root root 129 Mar 17 10:07 ifcfg-bond1
-rw-r--r-- 1 root root 168 May 19  2010 ifcfg-bond1.251
[root@node1 network-scripts]# ifup bond1
[root@node1 network-scripts]# ifup bond1.251
Added VLAN with VID == 251 to IF -:bond1:-
[root@node1 network-scripts]#
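
For the record, a minimal pair of configuration files for this kind of setup could look like the sketch below. The bonding options, IP address and netmask are illustrative and must match your own environment; on EL5 the bonding driver alias (alias bond1 bonding) also needs to be present in /etc/modprobe.conf:

# /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=1 miimon=100"    # example bonding parameters only

# /etc/sysconfig/network-scripts/ifcfg-bond1.251
DEVICE=bond1.251
VLAN=yes
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.251.10               # illustrative; use the real interconnect address
NETMASK=255.255.255.0               # illustrative netmask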

Now I could try to start the lower stack again:

CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.evmd' on 'node1'
CRS-2676: Start of 'ora.evmd' on 'node1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2679: Attempting to clean 'ora.asm' on 'node1'
CRS-2681: Clean of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'node1'
CRS-2676: Start of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded
[root@node1 network-scripts]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Brilliant, problem solved. This is actually the first time that an incorrect network configuration prevented a cluster I look after from starting. The best indication in this case is in the gipcd log file, but it didn’t occur to me to look at it because the initial errors pointed so clearly at storage.
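
Should you find yourself in a similar situation, the gipcd log lives next to the other daemon logs in the pre-ADR layout; something like this would have given the game away straight away (a sketch; adjust the host name as needed):

[grid@node1.example.com] $ grep -i "no valid interfaces" $GRID_HOME/log/$(hostname -s)/gipcd/gipcd.log | tail
# the "no valid interfaces found to node ..." messages seen in ocssd.log above
# also show up here while the cluster interconnect is unavailable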

8 Responses to “Troubleshooting Grid Infrastructure startup”

  1. Jay said

    Nice info and keep sharing such wonderful things about RAC

  2. Gaetano said

    Thanks for sharing this interesting information.
    Regards

  3. Damian said

    Thank you very much for the note. I had a very similar problem; I read this article and found the solution for my problem. In my case bond1 went down; when I tried “ifup bond1” it worked, the ASM instance started automatically, and then the node started up too.

    The thing is that it happened a few months ago the first time and again last night, so something is wrong with bond1, and I don’t know what it can be. I hope I can find the solution.

  4. Linda said

    I am having the same issue right now. How did you fix your network? You said you followed the documentation. Which one was that?

    • Martin Bach said

      Hi Linda,

      that’s the Red Hat documentation. In my case it wasn’t only the missing bond1, but also the VLAN configuration.

      Hope it helps,

      Martin

  5. Hi Martin, came across this post after having some ODA issues today. Nice post. The interesting bit (of course), after reading the post: did you ever find out why the network configuration files were missing, or were they damaged due to the EXT3 issues…?

    Grz

    Marco

    • Martin Bach said

      Hi Marco,

      I think the server had been reassigned to a different network segment with all sorts of weirdness that went with it. I assume a sys admin moved the file by accident; a lot of change went into the VLAN configuration.

      Martin

  6. Raj said

    Hi Martin
    Nice post indeed!!!
    I had one of the same symptoms as in the post, “has a disk HB, but no network HB”, and in my case the host would crash.
    The reason was different though: we had upgraded our kernel with some stringent security features and got hit by the issue explained by Riyaj here:
    http://orainternals.wordpress.com/tag/has-disk-hb/
    Thanks again for the post.
    regards
    —Raj
