This is a weird problem I ran in today. As part of an automation project the code deploys RAC One databases across a cluster, depending on the capacity available of the node. These are 128G RAM BL685c G6 currently but will be upgraded to G7 later.
Now, my problem was that after the weekend we couldn’t deploy any more RAC One databases, except for 1 node. DBCA simply created single instance databases instead. Newly created databases were properly registered in the OCR, and their build completed ok, but not as RAC One databases. Take for example this database:
$ srvctl config database -d MYDB Database unique name: MYDB Database name: MYDB Oracle home: /u01/app/oracle/product/220.127.116.11 Oracle user: oracle Spfile: +DATA/MYDB/spfileMYDB.ora Domain: example.com Start options: open Stop options: immediate Database role: PRIMARY Management policy: AUTOMATIC Server pools: MYDB Database instances: MYDB Disk Groups: DATA Mount point paths: Services: Type: SINGLE Database is administrator managed
How come? We are sure that we pass the RACOneNode flag to dbca, which can be found in the command line. Trying again I spotted these (alongside the sys and system passwords … you should change these as soon as DBCA completes!)
[rac]email@example.com $ ps -ef|grep MYDB oracle 14865 14854 65 11:22 ? 00:00:07 /u01/app/oracle/product/18.104.22.168/jdk/jre/bin/java -Doracle.installer.not_bootstrap=true -DORACLE_HOME=/u01/app/oracle/product/22.214.171.124 -DSET_LAF= -Dsun.java2d.font.DisableAlgorithmicStyles=true -Dice.pilots.html4.ignoreNonGenericFonts=true -DDISPLAY= -DJDBC_PROTOCOL=thin -mx128m -classpath ... oracle.sysman.assistants.dbca.Dbca -silent -createDatabase -templateName /u01/app/oracle/product/admin/templates/Default.dbc -gdbName MYDB.example.com -RACOneNode -RACOneNodeServiceName MYDB_APP.example.com -sid MYDB -sysPassword xxx -systemPassword xxx -emConfiguration NONE -totalMemory 4096 -storageType ASM -asmSysPassword xxx -diskGroupName DATA -initParams db_create_file_dest=+DATA,cpu_count=1 -nodelist node2 oracle 15415 5109 0 11:22 pts/0 00:00:00 grep MYDB
So why the problem? Looking at the dbca trace file I found these lines
[main] [ 2011-03-14 14:15:31.845 GMT ] [SQLEngine.initialize:363] Starting Reader Thread... [main] [ 2011-03-14 14:15:31.927 GMT ] [OracleHome.initOptions:1240] executing: startup nomount pfile='/u01/app/oracle/product/126.96.36.199/dbs/initDBUA0.ora' [main] [ 2011-03-14 14:15:55.417 GMT ] [SQLEngine.done:2167] Done called [main] [ 2011-03-14 14:15:55.418 GMT ] [OracleHome.initOptions:1247] ORA-00304: requested INSTANCE_NUMBER is busy oracle.sysman.assistants.util.sqlEngine.SQLFatalErrorException: ORA-00304: requested INSTANCE_NUMBER is busy at oracle.sysman.assistants.util.sqlEngine.SQLEngine.executeImpl(SQLEngine.java:1655) at oracle.sysman.assistants.util.sqlEngine.SQLEngine.executeSql(SQLEngine.java:1903) at oracle.sysman.assistants.util.OracleHome.initOptions(OracleHome.java:1241) at oracle.sysman.assistants.dbca.backend.SilentHost.initialize(SilentHost.java:179) at oracle.sysman.assistants.dbca.Dbca.execute(Dbca.java:116) at oracle.sysman.assistants.dbca.Dbca.main(Dbca.java:180) [main] [ 2011-03-14 14:15:55.420 GMT ] [OracleHome.initOptions:1250] executing: select parameter from v$option where value='TRUE' [main] [ 2011-03-14 14:15:55.420 GMT ] [SQLEngine.reInitialize:735] Reinitializing SQLEngine...
Interesting-the ORA-304 error sticks out. The DBCA logs are in $ORACLE_BASE/cfgtoollogs/dbca/dbName/ in 11.2 btw. Further down the logfile it then determines that the RAC option is not available. This isn’t true-and I checked on each node:
$ cd $ORACLE_HOME/rdbms/lib $ nm -r libknlopt.a | grep -c kcsm.o 1
That was identical on all nodes. So we definitely had RAC compiled into the oracle binary. I also compared the size and timestamp of all oracle binaries in $ORACLE_HOME only to find them identical. However dbca didn’t seem impressed with my contradiction and went on creating single instance databases. That now became a little inconvenient.
I then tried relocating one of the succesfully created RAC One databases to the nodes where we had problems building them, hoping to find out more about the problem. At this stage I was convinced there was a problem with semophores or other SysV IPC.
I ceratainly didn’t want to use the Windows Fix and reboot!
So to recap, we should be able to build RAC (One Node) databases as the option is compiled into the binary, and yet it doesn’t work. From the trace I gathered that Oracle builds an auxiliary instance first, and uses initDBUA0.ora in $ORACLE_HOME to start it. So where are it’s logs/where’s the diagnostic dest? Turns out it is in $ORACLE_HOME/log/ – simply set your ADR base to this location and use the familiar commands. And this finally give me a clue:
*** 2011-03-14 12:06:40.104 2011-03-14 12:06:40.104: [ CSSCLNT]clssgsGroupJoin: member in use group(0/DBDBUA0) kgxgnreg: error: status 14 kgxgnreg: error: member number 0 is already in use kjxgmjoin: can not join the group (DBDBUA0) with id 0 (inst 1) kjxgmjoin: kgxgn error 3
So somewhere else in the cluster had to be a DBUA0 instance that prevented my new instance from starting. A quick trawl through the process table on all nodes revealed that DBUA was active on node6. Shutting that down solved the problem!
DBCA is a nice tool to create databases, together with user definable templates it is really flexible. From a technical point of view it works as follows:
- For RAC and RAC One Node it tries to create an auxiliary instance, called DBUA0, as a cluster database. If DBUA0 is used on the same node, it will use DBUA1 etc.
- Next it will rename the database to what we assign on the command line
- It then performs a lot more actions which are not of relevance here.
In my case, one of these DBUA0s aux instances was still present on a different node in the cluster as a result of a crashed database creation. When subsequent calls to dbca created another auxiliary (cluster!) DBUA0 instance on a different node, it wasn’t aware that there was a DBUA0 already and LMON refused to create it. This is expected behaviour- instance names have to be unique across the cluster. The DBUA0 of node2 for example clashed with the one on node6.
Why did it work on p6 then I hear you ask? DBCA seems to have code logic to establish that DBUA0 on a node is in use, and uses DBUA1 next.
I got this update from Oracle Support who acknowledge this as a bug:
|Notes||16-Mar-2011 1:21:16 GMT+00:00 PM||Oracle Support|
Following bug is created for this particular issue.
Bug 11877668 – DBCA DOESN’T CREATE A RAC ONE DATABASE IN SILENT MODE