Martins Blog

Trying to explain complex things in simple terms

Interesting GNS anomaly in 12.1.0.1.2 RAC

Posted by Martin Bach on March 13, 2014

I encountered an interesting anomaly with my 12.1.0.1.2 RAC cluster based on the Grid Naming Service (GNS). I have written about the setup here.

No need to go back to that post; instead, let me give you some background in case you are not familiar with the Grid Naming Service in RAC. Most RAC installations use a Domain Name System (DNS) server to resolve names to IP addresses. With GNS, name resolution for a subdomain is delegated from the main DNS server to GNS. In my case the DNS server (bind version 9) is configured on aux.example.com (192.168.100.2), and it delegates name resolution for *.gns.example.com (my cluster) to the Grid Naming Service. The GNS address in use is 192.168.100.37 and must be defined in the corporate DNS. The required IP addresses have to be supplied by a DHCP server in the same range as the public network. My public network is on 192.168.100/24; the private networks are on 192.168.101/24 and 192.168.102/24. Which component resolves what? Here is a summary:

  • The GNS VIP is resolved by DNS. It’s not in the subdomain according to the documentation. I am using gns-vip.example.com (192.168.100.37) for it.
  • The node public IP is resolved via GNS. Oracle uses $(hostname) to assign the name. The IP address is fixed.
  • The node VIPs are resolved via GNS. The assigned name is $(hostname)-vip and the addresses are assigned by DHCP.
  • The private interconnect interface is resolved by GNS, usually as $(hostname)-priv. Addresses are either static or assigned by DHCP.
  • The SCAN VIPs are resolved via GNS and assigned by DHCP. And this is exactly where my problem was.

The Problem

All of a sudden my Tomcat server had problems connecting to the database. The address I used in tnsnames.ora was rac12scan.gns.example.com and hadn’t changed at all. It simply didn’t point to a valid IP half of the time. When I pinged the SCAN I was surprised to see that it listed addresses that weren’t in the output of “srvctl config scan”.

And sure enough, an nslookup against the SCAN returned 7 (!) addresses. There was obviously a discrepancy between what Oracle thought the SCAN was (3 IP addresses) and what GNS reported. I tried various cures, including restarting the DNS and DHCP servers (even moving the lease file on the DHCP server to a temporary location), and variations of restarting GNS and the whole cluster (crsctl stop cluster -all; crsctl start cluster -all). Nothing helped.
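The comparison between what the resolver returns and what the cluster has configured can be scripted. The following is a minimal sketch, not part of the original investigation: the function names are made up for illustration, and the list of configured SCAN VIP addresses is assumed to be copied by hand from the output of “srvctl config scan”.

```python
import ipaddress
import socket
from typing import Callable, Dict, Iterable, List, Set


def resolve_ipv4(name: str) -> Set[str]:
    """Return the IPv4 addresses the system resolver reports for `name`."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (address, port) tuple.
    return {info[4][0] for info in infos}


def compare_scan(name: str,
                 configured: Iterable[str],
                 resolver: Callable[[str], Set[str]] = resolve_ipv4
                 ) -> Dict[str, List[str]]:
    """Compare resolver answers for the SCAN with the configured SCAN VIPs.

    `configured` is an assumption of this sketch: the addresses are
    expected to be taken manually from "srvctl config scan".
    """
    resolved = resolver(name)
    expected = set(configured)
    return {
        # Addresses handed out for the SCAN name without a SCAN VIP behind them
        "stale": sorted(resolved - expected, key=ipaddress.ip_address),
        # SCAN VIPs that exist but are not returned by name resolution
        "missing": sorted(expected - resolved, key=ipaddress.ip_address),
    }
```

Calling compare_scan("rac12scan.gns.example.com", [...]) with the three configured addresses would list any extra addresses under "stale". The resolver argument is injectable so the logic can be exercised without a live DNS or GNS.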

Troubleshooting

As you’d expect, name resolution only works while GNS is started. If you try to “nslookup rac12scan.gns.example.com” while GNS is down, you get the NXDOMAIN result code, which does not really help. The output of “srvctl status gns -list and -detail” is useful in troubleshooting the problem. I also ran “cluvfy comp gns” but that came back clean. Nevertheless, the system insisted that the SCAN was made up of 7 IP addresses, and only 4 of them had SCAN VIPs. No surprise I couldn’t connect half the time. Now the question is of course: why does the GNS daemon grab so many IP addresses? So I looked at the gnsd.log in $GRID_HOME/log/$(hostname -s)/gns to see what was happening. Here is an excerpt from the log file:

2014-03-11 16:37:30.305: [     GNS][1009862208]main::clsgnocrOpenKeyInternal: (:CLSGN01305:) #0 of 3: OCR error 4 (PROC-4: The cluster registry key to be operated on does not exist.) is not expected 32 - no retry will be done.
2014-03-11 16:37:30.305: [     GNS][1009862208]main::clsgnocrOpenKeyInternal: (:CLSGN01307:) Pass #0: open of "SYSTEM.GNS.SUBDOMAIN.gns|dexample|dcom.rac12gns-scan1-vip.SRV" failed with error 4 Last: PROC-4: The cluster registry key to be operated on does not exist. (4) - unable to retry.
2014-03-11 16:37:30.306: [     GNS][1009862208]main::clsgnocrOpenKeyInternal: (:CLSGN01305:) #0 of 3: OCR error 4 (PROC-4: The cluster registry key to be operated on does not exist.) is not expected 32 - no retry will be done.
2014-03-11 16:37:30.306: [     GNS][1009862208]main::clsgnocrOpenKeyInternal: (:CLSGN01307:) Pass #0: open of "SYSTEM.GNS.SUBDOMAIN.gns|dexample|dcom.rac12gns-scan1-vip.TXT" failed with error 4 Last: PROC-4: The cluster registry key to be operated on does not exist. (4) - unable to retry.
2014-03-11 16:37:30.306: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12node1-vip" Type: A 192.168.100.44 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.308: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12node2-vip" Type: A 192.168.100.39 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.308: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.34 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.36 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.35 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.40 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.42 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.41 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.31 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12gns-scan2-vip" Type: A 192.168.100.41 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.310: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12gns-scan2-vip" Type: A 192.168.100.34 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.310: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12gns-scan3-vip" Type: A 192.168.100.40 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.310: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12gns-scan3-vip" Type: A 192.168.100.38 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.310: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12gns-scan1-vip" Type: A 192.168.100.31 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.310: [   CLSNS][1009862208]Resolve::clsns_SetTraceLevel:trace level set to 1.
2014-03-11 16:37:30.310: [   CLSNS][1009862208]Resolve::clsns_SetTraceLevel:trace level set to 1.
2014-03-11 16:37:30.311: [   CLSNS][1009862208]Resolver #0::clsns_SetTraceLevel:trace level set to 1.
2014-03-11 16:37:30.311: [   CLSNS][1009862208]Resolver #0::clsns_SetTraceLevel:trace level set to 1.
2014-03-11 16:37:30.311: [   CLSNS][1009862208]Resolver #1::clsns_SetTraceLevel:trace level set to 1.

You can see the address assignment here, and it corresponds to the DHCP leases I saw in /var/log/messages on the DHCP server host. Notice that rac12scan grabs 7 IP addresses. This is surprising: the documentation normally states just 3 IP addresses for use with the SCAN. As you can further see, the SCAN VIPs resolve only to a subset of these. 192.168.100.36, for example, does not have a corresponding SCAN VIP, and neither do 192.168.100.35 and 192.168.100.42. Also, rac12gns-scan2-vip and rac12gns-scan3-vip each appear twice, with different addresses. So if rac12scan now resolves to 192.168.100.36 you will get a TNS error in the form “Destination host does not exist”. Annoying! But why does it do so? The PROC-4 messages just before the address registrations gave me a clue.

The information about GNS VIPs seems to be stored in the OCR. An ocrdump confirmed the findings. The question remains: why does Oracle use more than 3 IP addresses for the SCAN? I can only speculate that newly obtained DHCP addresses are added to the OCR while the existing ones are not cleared out or updated correctly.

In an effort to solve the problem I dropped the SCAN listeners and the SCAN VIPs and recreated them. But this seemed to confuse the cluster even more, and name resolution stopped working altogether. I tried both the subdomain delegation (nslookup rac12scan.gns.example.com) and querying GNS directly (nslookup rac12scan.gns.example.com - 192.168.100.37); neither resolved anything. A restart of Clusterware didn’t change the situation, so I decided to reboot the cluster nodes hoping to bring everything back to normal (yes, I should know better, this isn’t Windows). After waiting for the servers to come back up, there was no change. By now the system was so confused that it didn’t come up with a SCAN at all. Trying to restore the OCR failed, and I couldn’t even run a rootcrs.pl -deconfig -force in preparation for a reinstallation of Clusterware. So I trashed the system. I don’t believe there is a place for GNS in my lab after this experience.
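Spotting the mismatch by eye gets tedious with a longer log, so here is a minimal sketch of how the check could be automated. It assumes the clsgndhaRegisterAddresses line format shown in the excerpt; the function names and the "-scanN-vip" naming pattern are taken from this one cluster and are not a documented Oracle interface.

```python
import ipaddress
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Matches the address registrations as they appear in the gnsd.log excerpt.
REGISTER_RE = re.compile(
    r'clsgndhaRegisterAddresses: Name: "(?P<name>[^"]+)" '
    r'Type: A (?P<addr>\d+\.\d+\.\d+\.\d+)'
)


def registered_addresses(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect every A record registration, grouped by name."""
    names: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        match = REGISTER_RE.search(line)
        if match:
            names[match.group("name")].add(match.group("addr"))
    return names


def scan_without_vip(names: Dict[str, Set[str]],
                     scan_name: str = "rac12scan",
                     vip_pattern: str = r"-scan\d+-vip$") -> List[str]:
    """Return SCAN addresses that no SCAN VIP record points to."""
    vip_addresses: Set[str] = set()
    for name, addresses in names.items():
        if re.search(vip_pattern, name):
            vip_addresses |= addresses
    orphans = names.get(scan_name, set()) - vip_addresses
    return sorted(orphans, key=ipaddress.ip_address)
```

Fed the lines from the excerpt above (for example via open("gnsd.log")), scan_without_vip(registered_addresses(lines)) reports 192.168.100.35, 192.168.100.36 and 192.168.100.42, the three SCAN addresses with no SCAN VIP behind them.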

Thanks to Marcin Przepiorowski (@pioro) who helped during the investigation.

References:

Grid Infrastructure Installation Guide 12c:

  • 1.1.3 Oracle Grid Infrastructure Network Checklist
  • 5.5.1 About Oracle Grid Infrastructure Name Resolution Options
  • 5.11 Grid Naming Service Standard Cluster Configuration Example

One Response to “Interesting GNS anomaly in 12.1.0.1.2 RAC”

  1. Rick Wessman said

    From the evidence, I don’t think that GNS is the problem. When SCAN comes up, the agents register their addresses with GNS. It looks like they are not removing the old ones first.

    You are correct that GNS stores the record values in OCR. That way, if it fails, the new instance can re-register the current records without forcing all of the advertisers (the agents) to do it.

    If you still have the logs, I suggest contacting Oracle Support about it. This issue should be fixed.

    Thanks,
    Rick
