As part of a server move from one data centre to another I once again found myself working in the depths of Clusterware. This was a rather simple case though: the public IP addresses were the only part of the package to change. The one caveat was the recreation of the disk group I use for the OCR and the three copies of the voting file; I decided to rely on the backups I took before the server move.
Once the kit had been rewired in the new data centre, it was time to get active. The /etc/multipath.conf file had to be touched to add the new LUNs for my +OCR disk group. I have described the process in a number of articles, for example here:
https://martincarstenbach.wordpress.com/2011/01/14/adding-storage-dynamically-to-asm-on-linux/
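Once the new LUNs were visible to device-mapper-multipath, it is worth checking that ASMLib can see the disks again before touching the cluster at all. A minimal sanity check might look like this (a sketch only; the grep pattern simply matches the disk names shown further down, so adjust it to your naming convention):

[root@node1 ~]# multipath -ll                    # the new LUNs should show up here
[root@node1 ~]# oracleasm scandisks              # make ASMLib rescan the block devices
[root@node1 ~]# oracleasm listdisks | grep OCR   # the three OCR disks should be back
OCR0001
OCR0002
OCR0003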
A few facts before we start:
- Oracle Enterprise Linux 5.5 64bit
- device-mapper-multipath-0.4.7
- Grid Infrastructure 11.2.0.2.2 (actually it is Oracle Database SAP Bundle Patch 11.2.0.2.2)
- ASMLib
I have already described how to restore the OCR and voting files in 11.2.0.1 in “Pro Oracle Database RAC 11g on Linux”, but since then the procedure has changed slightly, so I thought I’d add this here. The emphasis is on “slightly”. In this blog post I’ll describe what you need to do if you lose the disk group containing the OCR and voting disks on a Linux system using ASMLib. Before the server move I recorded the location of the OCR/voting disk group:
SQL> select d.name, d.path, dg.name as dg_name
  2  from v$asm_disk d, v$asm_diskgroup dg
  3  where d.group_number = dg.group_number
  4  and dg.name = 'OCR'
  5  /

NAME       PATH                 DG_NAME
---------- -------------------- ----------
OCR0001    ORCL:OCR0001         OCR
OCR0002    ORCL:OCR0002         OCR
OCR0003    ORCL:OCR0003         OCR

SQL>
After the server had come back on the network, I first ensured everything was stopped:
[root@node1 cluster01]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'node1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'node1'
CRS-2673: Attempting to stop 'ora.crf' on 'node1'
CRS-2673: Attempting to stop 'ora.diskmon' on 'node1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'node1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'node1' succeeded
CRS-2677: Stop of 'ora.crf' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'node1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'node1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'node1'
CRS-2677: Stop of 'ora.diskmon' on 'node1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'node1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'node1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
The next step is to start the cluster in exclusive mode. In 11.2.0.1 it was enough to use crsctl start crs -excl; from 11.2.0.2 onwards you also have to add the -nocrs flag. If you don’t, crsd will try to start but can’t find a voting file, and everything hangs until Clusterware runs out of retries and the command fails. Here’s the example output with the correct command syntax:
[root@node1 cluster01]# crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'node1'
CRS-2676: Start of 'ora.mdnsd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'node1'
CRS-2676: Start of 'ora.gpnpd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2672: Attempting to start 'ora.gipcd' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2672: Attempting to start 'ora.diskmon' on 'node1'
CRS-2676: Start of 'ora.diskmon' on 'node1' succeeded
CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'node1'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
CRS-2676: Start of 'ora.drivers.acfs' on 'node1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'node1'
CRS-2676: Start of 'ora.asm' on 'node1' succeeded
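Before anything can be restored, the lost disk group has to exist again. The following is a minimal sketch of the re-creation, run against the ASM instance as SYSASM. Normal redundancy is an assumption on my part (three voting file copies need three failure groups, which normal redundancy with three disks gives you), and so is the compatible.asm attribute, so adapt this to your original disk group definition:

SQL> create diskgroup OCR normal redundancy
  2  disk 'ORCL:OCR0001', 'ORCL:OCR0002', 'ORCL:OCR0003'
  3  attribute 'compatible.asm' = '11.2';

Diskgroup created.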
As I said, I created the disk group with exactly the same name as the one lost (the sketch above shows the idea). This is very important, or the restore won’t work. Should you need to troubleshoot, the $GRID_HOME/log/`hostname`/client directory contains the relevant logs. The backups themselves live in the $GRID_HOME/cdata/<cluster name> directory. Ensure you are using the latest backup: the automatic backups are taken on the OCR master node, so check the backup directory on each cluster node to find the most recent one. Once the most recent backup has been located, restore it:
[root@node1 cluster01]# ocrconfig -restore backup00.ocr
This worked, as shown in the alert<hostname>.log file in $GRID_HOME/log/`hostname`/:
[/u01/crs/product/11.2.0.2/bin/oraagent.bin(32015)]CRS-5019:All OCR locations are on ASM disk groups [OCR], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/crs/product/11.2.0.2/log/node1/agent/ohasd/oraagent_oracle/oraagent_oracle.log".
2011-10-06 11:03:38.877
[client(1444)]CRS-1002:The OCR was restored from file backup00.ocr.
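If you want more reassurance than the alert log entry, ocrcheck should now report +OCR as the OCR location together with a clean integrity check. Heavily abbreviated (the sizes and ID are obviously environment specific), it looks something like this:

[root@node1 ~]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         ...
         Device/File Name         :       +OCR
                                    Device/File integrity check succeeded
         ...
         Cluster registry integrity check succeeded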
All right, that sorts the OCR out. Now it’s time to restore the voting disks:
[root@node1 cluster01]# crsctl query css votedisk
Located 0 voting disk(s).
[root@node1 cluster01]# crsctl replace votedisk +OCR
Successful addition of voting disk 361e36921dd64f89bfd63cdbade79651.
Successful addition of voting disk 58f769be54e74fbcbfc655afe290268d.
Successful addition of voting disk af6c1890bb594f72bf39ef626b8fcc8f.
Successfully replaced voting disk group with +OCR.
CRS-4266: Voting file(s) successfully replaced
[root@node1 cluster01]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   361e36921dd64f89bfd63cdbade79651 (ORCL:OCR0001) [OCR]
 2. ONLINE   58f769be54e74fbcbfc655afe290268d (ORCL:OCR0002) [OCR]
 3. ONLINE   af6c1890bb594f72bf39ef626b8fcc8f (ORCL:OCR0003) [OCR]
Located 3 voting disk(s).
This, too, is recorded, for example in the CSSD log file:
2011-10-06 11:04:59.169
[cssd(32100)]CRS-1605:CSSD voting file is online: ORCL:OCR0001; details in /u01/crs/product/11.2.0.2/log/node1/cssd/ocssd.log.
2011-10-06 11:04:59.169
[cssd(32100)]CRS-1605:CSSD voting file is online: ORCL:OCR0002; details in /u01/crs/product/11.2.0.2/log/node1/cssd/ocssd.log.
2011-10-06 11:04:59.169
[cssd(32100)]CRS-1605:CSSD voting file is online: ORCL:OCR0003; details in /u01/crs/product/11.2.0.2/log/node1/cssd/ocssd.log.
2011-10-06 11:04:59.170
[cssd(32100)]CRS-1626:A Configuration change request completed successfully
2011-10-06 11:04:59.179
[cssd(32100)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 .
Now all that remains is to get the cluster back into “normal” mode: stop it and start it again, as shown here:
[root@node1 cluster01]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'node1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'node1'
CRS-2673: Attempting to stop 'ora.asm' on 'node1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'node1'
CRS-2677: Stop of 'ora.asm' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'node1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'node1' succeeded
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'node1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'node1'
CRS-2677: Stop of 'ora.cssd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'node1'
CRS-2673: Attempting to stop 'ora.diskmon' on 'node1'
CRS-2677: Stop of 'ora.gipcd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'node1'
CRS-2677: Stop of 'ora.diskmon' on 'node1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'node1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'node1' has completed
CRS-4133: Oracle High Availability Services has been stopped.

[root@node1 cluster01]# crsctl start cluster
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Start failed, or completed with errors.

[root@node1 cluster01]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
Now all you need to do is wait and check the cluster status:
[root@node2 ~]# crsctl check cluster -all
**************************************************************
node1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
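The three daemon checks above are the quick test; if you also want to see that every registered resource (ASM, disk groups, listeners, VIPs, databases) has come back, crsctl stat res -t gives you the full picture. I’ll spare you the complete output; note that ora.OCR.dg below is simply the resource name Clusterware derives from the disk group name:

[root@node1 ~]# crsctl stat res -t               # status of all cluster resources
[root@node1 ~]# crsctl stat res ora.OCR.dg -t    # just the restored disk group resource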
All’s well that ends well: the procedure is much the same as it always was. Just remember the “-nocrs” flag.
Reference
How to restore ASM based OCR after complete loss of the CRS diskgroup on Linux/Unix systems [ID 1062983.1]
Hi Martin,
Nice!
Just curious, did you need the -f in the
crsctl stop crs -f,
when you are stopping after all the restoring is done?
jason.
Another option would be to put a copy of the OCR to another disk group (e.g. DATA) with ocrconfig and drop the old one. After that you might want to exchange the voting disks one after each other or replace them completely.
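For completeness, that alternative looks roughly like this, with +DATA standing in for whatever disk group you want to move to (a sketch only; ocrconfig needs CRS up and the target disk group mounted, and the commands are run as root):

[root@node1 ~]# ocrconfig -add +DATA             # add a second OCR location in +DATA
[root@node1 ~]# ocrconfig -delete +OCR           # then drop the old location
[root@node1 ~]# crsctl replace votedisk +DATA    # move the voting files across in one step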