RAC service weirdness

I ran into a problem starting one of the 14 services registered to my database this weekend. It all happened after a restart of the servers following a Red Hat upgrade from 5.3 to 5.4. One of the servers had a file system corruption which required some extensive fsck’ing, so only 2 out of 3 nodes were started. This meant that some of the services started on an available node rather than the preferred one.

Not a biggie in this situation; after all, we are interested in restoring service to the users. After a couple of hours the first node eventually finished the file system check and came up. CRS automatically started the instance and all looked good. Since we were still in the downtime window, I decided against checking each of the 14 services to see if it was running on the correct node, and instead stopped them all and restarted them. The intention was to have them all come up on their preferred node.

The commands to do so are rather simple:

[oracle@nodeA ~]$ srvctl stop service -d prod
[oracle@nodeA ~]$ srvctl start service -d prod
PRKP-1030 : Failed to start the service someService.
CRS-0215: Could not start resource 'ora.prod.someService.cs'.
[oracle@nodeA ~]$

These commands can easily take a while to run, which I expected. What I didn’t expect was for the service someService to fail. Hmphh, so just try to start only this service then (remember, all cluster nodes were up!):

[oracle@nodeA ~]$ srvctl start service -d prod -s someService
CRS-0215: Could not start resource 'ora.prod.someService.cs'.

Now that’s strange. I checked the node’s crsd.log for clues:

2010-02-20 14:43:47.020: [  CRSRES][1492207936]0startRunnable: setting CLI values
2010-02-20 14:43:47.028: [  CRSRES][1492207936]0Attempting to start `ora.prod.someService.cs` on member `nodeA`
2010-02-20 14:43:47.126: [  CRSRES][1080162624]0startRunnable: setting CLI values
2010-02-20 14:43:47.131: [  CRSRES][1080162624]0Attempting to start `ora.prod.someService.prod1.srv` on member `nodeA`
2010-02-20 14:43:47.214: [  CRSAPP][1080162624]0StartResource error for ora.prod.someService.prod1.srv error code = 1
2010-02-20 14:43:47.284: [  CRSRES][1080162624]0Start of `ora.prod.someService.prod1.srv` on member `nodeA` failed.
2010-02-20 14:43:47.306: [  CRSRES][1080162624]0startRunnable: setting CLI values
2010-02-20 14:43:47.311: [  CRSRES][1080162624]0Attempting to start `ora.prod.someService.prod1.srv` on member `nodeA`
2010-02-20 14:43:47.382: [  CRSAPP][1080162624]0StartResource error for ora.prod.someService.prod1.srv error code = 1
2010-02-20 14:43:47.454: [  CRSRES][1080162624]0Start of `ora.prod.someService.prod1.srv` on member `nodeA` failed.
2010-02-20 14:43:47.461: [  CRSRES][1080162624]0nodeB : CRS-1019: Resource ora.prod.someService.prod1.srv (application) cannot run on nodeB
nodeC : CRS-1019: Resource ora.prod.someService.prod1.srv (application) cannot run on nodeC

2010-02-20 14:43:48.037: [  CRSAPP][1492207936]0StartResource error for ora.prod.someService.cs error code = 1
2010-02-20 14:43:48.126: [  CRSRES][1492207936]0Start of `ora.prod.someService.cs` on member `nodeA` failed.
2010-02-20 14:43:48.134: [  CRSRES][1492207936]0nodeB : CRS-1019: Resource ora.prod.someService.cs (application) cannot run on nodeB
nodeC : CRS-1019: Resource ora.prod.someService.cs (application) cannot run on nodeC
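
Since crsd.log is busy, it helps to cut it down to just the failure lines for the resource in question. The following is a minimal sketch, not an Oracle-supplied tool: the function name crs_start_failures is made up for illustration, and the patterns are simply the three kinds of failure messages visible in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print only the failure lines for one service
# from a crsd.log file.
# Usage: crs_start_failures <logfile> <service-name>
crs_start_failures() {
    logfile=$1
    service=$2
    # First keep lines that mention the service (fixed-string match),
    # then keep those that look like a failure: "StartResource error",
    # "... failed." or a CRS-1019 placement message.
    grep -F "$service" "$logfile" | grep -E 'StartResource error|failed\.|CRS-1019'
}
```

It would be called as crs_start_failures /path/to/crsd.log someService, where the log path depends on your CRS home and node name.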

This service can only run on node 1 by its definition, so it’s no surprise it didn’t start on nodeB or nodeC:

[oracle@nodeC crsd]$ srvctl config service -d prod -s someService
someService PREF: prod1 AVAIL:

So what’s the status of this service then, I wondered:

[oracle@nodeC crsd]$ srvctl status service -d prod -s someService
Service someService is not running on instance(s) prod1

However, the database thought the service was running:

select inst_id,name,blocked
from gv$active_services
where name = 'someService';

INST_ID  NAME                 BLO
-------- -------------------- ---
 1       someService          NO
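
The disagreement between the two views (srvctl says "not running", gv$active_services has a row) is easy to miss when checking many services by hand. Here is a rough sketch of how the comparison could be scripted. The function name service_view_mismatch is hypothetical, it expects you to feed it the srvctl status line and the row count from gv$active_services yourself, and the "is running" wording of the srvctl output is assumed to mirror the "is not running" message shown above.

```shell
#!/bin/sh
# Hypothetical helper: compare what srvctl reports with the number of
# rows gv$active_services returned for the same service.
# $1: one line of "srvctl status service" output
# $2: number of rows found in gv$active_services for the service
service_view_mismatch() {
    status_line=$1
    db_rows=$2
    # Check "is not running" first, so the second pattern only sees
    # the positive message.
    case $status_line in
        *"is not running"*) crs_running=no ;;
        *"is running"*)     crs_running=yes ;;
        *)                  echo "unknown"; return 1 ;;
    esac
    if [ "$crs_running" = no ] && [ "$db_rows" -gt 0 ]; then
        echo "mismatch: database has the service active, CRS does not"
    elif [ "$crs_running" = yes ] && [ "$db_rows" -eq 0 ]; then
        echo "mismatch: CRS has the service running, database does not"
    else
        echo "consistent"
    fi
}
```

With the status line and the single row from above, this reports the first kind of mismatch, which is exactly the situation described next.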

So CRS didn’t know that the service was started, but neither could it start it since it was already up. A chicken-and-egg problem of the classical sort. Fortunately, there is a way around this using PL/SQL (thank god!). In my case I connected to node 1 and stopped the service:

SQL> exec dbms_service.stop_service('someService')

PL/SQL procedure successfully completed.

You might have to do this on each instance for which gv$active_services returns a row, if there is more than one. Refer to the PL/SQL Packages and Types guide for more information about DBMS_SERVICE. Make sure to query gv$active_services again to ensure that the service is no longer running. Then you can start it through clusterware:

[oracle@nodeC crsd]$ srvctl start service -d prod -s someService

This should work now; you can verify the success in the node’s crsd.log:

2010-02-20 14:52:10.456: [  CRSRES][1527896384]0startRunnable: setting CLI values
2010-02-20 14:52:10.463: [  CRSRES][1527896384]0Attempting to start `ora.prod.someService.cs` on member `nodeA`
2010-02-20 14:52:10.561: [  CRSRES][1080162624]0startRunnable: setting CLI values
2010-02-20 14:52:10.566: [  CRSRES][1080162624]0Attempting to start `ora.prod.someService.prod1.srv` on member `nodeA`
2010-02-20 14:52:10.672: [  CRSRES][1080162624]0Start of `ora.prod.someService.prod1.srv` on member `nodeA` succeeded.
2010-02-20 14:52:11.478: [  CRSRES][1527896384]0Start of `ora.prod.someService.cs` on member `nodeA` succeeded.

“Succeeded” is the key word here. This was on 10.2.0.4.1 EE on RHEL 5.3 64-bit, with CRS 10.2.0.4 + CRS bundle patch #4.
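
For the multi-instance case mentioned earlier, where gv$active_services returns more than one row, the stop call has to be repeated per instance. The sketch below only generates the SQL*Plus statements so you can review them before running anything. It assumes the optional second argument of DBMS_SERVICE.STOP_SERVICE (the instance name), which I did not use in my own case, so check it against the PL/SQL Packages and Types guide for your release. The function name gen_stop_calls and the instance name prod2 in the usage line are purely illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print one DBMS_SERVICE.STOP_SERVICE call per
# instance name given. Nothing is executed against the database.
# Usage: gen_stop_calls <service> <instance> [<instance> ...]
gen_stop_calls() {
    service=$1
    shift
    # Assumption: the second argument of stop_service names the instance
    # on which the service should be stopped.
    for inst in "$@"; do
        printf "exec dbms_service.stop_service('%s', '%s')\n" "$service" "$inst"
    done
}
```

For example, gen_stop_calls someService prod1 prod2 prints two exec lines that can be pasted into a SQL*Plus session.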
