RAC One Node and Database Protection
Posted by Martin Bach on February 16, 2011
An email from fellow Oak Table Member James Morle about RAC One Node and failover got me thinking about the capabilities of the product.
I have written about RON (Rac One Node) in earlier posts, but haven’t really explored what happens with session failover during a database relocation.
Overview
So to clarify what happens in these two scenarios I have developed a simple test. Taking a RON database + a service I modified both to suit my test needs. Connected to the service I performed a database relocation to see what happens. Next I killed the instance (I wasn’t able to reboot the node) t o simulate what happens when the node crashes.
Setup
The setup used an existing database, “RON”. It also had a service defined already, but that needed tweaking. The database was defined as follows:
$ srvctl config database -d RON Database unique name: RON Database name: RON Oracle home: /u01/app/oracle/product/11.2.0.2 Oracle user: oracle Spfile: +DATA/RON/spfileRON.ora Domain: example.com Start options: open Stop options: immediate Database role: PRIMARY Management policy: AUTOMATIC Server pools: RON Database instances: Disk Groups: DATA Mount point paths: Services: RON_APP.example.com Type: RACOneNode Online relocation timeout: 30 Instance name prefix: RON Candidate servers: node1,node2 Database is administrator managed
The service was initially defined as follows:
srvctl config service -d RON Service name: RON_APP.example.com Service is enabled Server pool: RON Cardinality: 1 Disconnect: false Service role: PRIMARY Management policy: AUTOMATIC DTP transaction: false AQ HA notifications: false Failover type: NONE Failover method: NONE TAF failover retries: 0 TAF failover delay: 0 Connection Load Balancing Goal: LONG Runtime Load Balancing Goal: NONE TAF policy specification: BASIC Edition: Preferred instances: RON_1 Available instances: [rac]oracle@node1.example.com $
Some of these attributes require special attention, we have 3 categories to deal with: preferred instances for TAF, the TAF setup itself and the runtime load balancing (RLB) advisory.
Transparent Application Failover with “real” RAC only works if there are two preferred instances. As you can see the service has only one preferred instance. Not sure if that can be changed though…Let’s try:
srvctl modify service -s RON_APP -d RON -i RON_1,RON_2 PRKO-2007 : Invalid instance name: RON_2
I wasn’t surprised-RON is not a RAC database so it has only 1 active instance. When registering the RAC One Node database you don’t add instances as you would with a RAC database, instead you set the database type to RACONENODE (srvctl add database -d name -c RACONEONE … )
I recommend setting TAF properties on the service level-that way you don’t miss crucial parameters in your tnsnames.ora file. This is the preferred way of doing it at least since 11.2. Changing the service is straight forward:
srvctl modify service -d RON -s RON_APP.example.com \ > -P BASIC -e SESSION -m BASIC
This piece of code instructs the service to use a BASIC TAF policy, a failover type of SESSION and the BASIC failover method. These parameters were normally be configured in the CONNECT_DATA section of your TNSNames.ora file.
With these changes made, the service configuration has changed to the below:
$ srvctl config service -d RON -s RON_APP.example.com Service name: RON_APP.example.com Service is enabled Server pool: RON Cardinality: 1 Disconnect: false Service role: PRIMARY Management policy: AUTOMATIC DTP transaction: false AQ HA notifications: false Failover type: SESSION Failover method: BASIC TAF failover retries: 0 TAF failover delay: 0 Connection Load Balancing Goal: LONG Runtime Load Balancing Goal: NONE TAF policy specification: BASIC Edition: Preferred instances: RON_1 Available instances: [rac]oracle@node1.example.com $
I also wanted to change the defaults to a more suitable RLB configuration. Instead of a CLB goal of “LONG” I wanted to set it up for “SHORT”. And I needed emphasis on SERVICE_TIME as well (my intention was to run swingbench). The change and resulting service configuration are shown below:
$ srvctl modify service -d RON -s RON_APP.example.com \ > -j short -B service_time $ srvctl config service -d RON -s RON_APP.example.com Service name: RON_APP.example.com Service is enabled Server pool: RON Cardinality: 1 Disconnect: false Service role: PRIMARY Management policy: AUTOMATIC DTP transaction: false AQ HA notifications: false Failover type: SESSION Failover method: BASIC TAF failover retries: 0 TAF failover delay: 0 Connection Load Balancing Goal: SHORT Runtime Load Balancing Goal: SERVICE_TIME TAF policy specification: BASIC Edition: Preferred instances: RON_1 Available instances: [rac]oracle@node1.example.com $
My SCAN name was scan1.example.com, and I ensured that the local_listener was pointing to my none default port of 1821 and the remote_lister was using the EZConnect synatax (“scan1.example.com:1825). It is very important to set the local_listener parameter if you are not using the default port of 1521!
On my client system I defined a local TNS alias “RON” to connect to the RON database. Note that it doesn’t use any TAF parameters.
C:\oracle\product\11.2.0\client_1\network\admin>tnsping ron TNS Ping Utility for 32-bit Windows: Version 11.2.0.2.0 - Production on 15-FEB-2011 17:07:44 Copyright (c) 1997, 2010, Oracle. All rights reserved. Used parameter files: C:\oracle\product\11.2.0\client_1\network\admin\sqlnet.ora Used TNSNAMES adapter to resolve the alias Attempting to contact (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = scan1.example.com)(PORT = 1825) )) (CONNECT_DATA = (SERVICE_NAME = LRON_APP.uk.db.com))) OK (10 msec)
Database Relocation
Oracle’s promise is that you don’t lose your session during a database relocation. Let’s see if that is actually true. Using my setup I connected to the RON database (sorry for the broken formatting!):
C:\oracle\product\11.2.0\client_1\network\admin>sqlplus martin/test@ron SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 15 16:50:08 2011 Copyright (c) 1982, 2010, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Data Mining and Real Application Testing options SQL> select username,failover_method,failover_type,failed_over from gv$session where username='MARTIN'; USERNAME FAILOVER_M FAILOVER_TYPE FAI ------------------------------ ---------- ------------- --- MARTIN BASIC SESSION NO
OK, so TAF is working. At this time I started the relocate command:
[RON_1]oracle@node1.example.com $ srvctl relocate database -d RON -n node2 -w 1 -v Configuration updated to two instances Instance RON_2 started
I then repeatedly required my SQL statement above while the instance was relocating. The output is shown below.
SQL> select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN'; INST_ID USERNAME FAILOVER_M FAILOVER_TYPE FAI ---------- ------------------------------ ---------- ------------- --- 1 MARTIN NONE NONE NO 1 MARTIN BASIC SESSION NO 2 MARTIN NONE NONE NO SQL> select * from v$active_instances; INST_NUMBER INST_NAME ----------- ------------------------------------------------------------ 1 node1.example.com:RON_1 2 node2.example.com:RON_2 SQL> select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN'; INST_ID USERNAME FAILOVER_M FAILOVER_TYPE FAI ---------- ------------------------------ ---------- ------------- --- 1 MARTIN NONE NONE NO 1 MARTIN BASIC SESSION NO 2 MARTIN NONE NONE NO
As you can see the configuration change to 2 instances is reflected in the output of my query to v$active_instances. You also see the number of sessions increasing, pay attention to the inst_id colum: a new session is created on the second instance.
In my other session I saw the relocation completing:
Services relocated Waiting for 1 minutes for instance RON_1 to stop..... Instance RON_1 stopped Configuration updated to one instance
I required one more time to see what happened-would my session survive?
SQL> / select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN' * ERROR at line 1: ORA-25408: can not safely replay call SQL> / INST_ID USERNAME FAILOVER_M FAILOVER_TYPE FAI ---------- ------------------------------ ---------- ------------- --- 2 MARTIN BASIC SESSION YES SQL> SQL> select * from v$active_instances; INST_NUMBER INST_NAME ----------- ------------------------------------------------------------ 2 node2.example.com:RON_2
Well it did survice. Note the ORA-25408 error. That’s expected, since I’m using the SQL*Plus client I don’t have the opportunity to trap the error and replay my OCI call. You should capture this SQLException in Java or your preferred development environment. I have provided an example in chapter 11 of Pro Oracle Database 11g RAC on Linux.
Node failure
I couldn’t see how a session would survive in the case of a node failure… I said this in my email to James:
> I cannot see how TAF or FCF work in case of a node failure. In my
> tests I did for the book TAF only worked if there was a service with
> at least 2 preferred instances. And FCF requires a FAN aware
> connection pool which is rare to find. RAC one however only has only 1
> active node (unless you relocate it).
But better test before jumping to conclusions!
I could only kill an instance rather than the server which would have been a better test. I assumed Clusterware would try to restart the failed instance on the same node a few times and then relocate the resource if the stat was not successful.
I knew the database was now running on node 2 so I killed the SMON process. That sure results in an instance crash.
oracle@node2.example.com $ ps -ef | grep RON | grep smon oracle 14208 1 0 16:52 ? 00:00:00 ora_smon_RON_2 $ kill -9 14208
Just as you would expect, the session didn’s survive this (how could it? There is no active second instance!)
SQL> / select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN' * ERROR at line 1: ORA-03113: end-of-file on communication channel Process ID: 15649 Session ID: 183 Serial number: 3
And just as expected, Clusterware restarted the failed instance the second it detected the failure. The node’s alert log showed this:
System State dumped to trace file /u01/app/oracle/product/admin/RON/admin/diag/rdbms/RON/RON_2/trace/RON_2_diag_14174.trc
ORA-1092 : opitsk aborting process
2011-02-15 16:58:29.255000 +00:00
ORA-1092 : opitsk aborting process
License high water mark = 6
2011-02-15 16:58:32.569000 +00:00
Instance terminated by PMON, pid = 14164
USER (ospid: 16101): terminating the instance
Instance terminated by USER, pid = 16101
2011-02-15 16:58:35.286000 +00:00
Starting ORACLE instance (normal)
2011-02-15 16:58:36.477000 +00:00
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
2011-02-15 16:58:42.687000 +00:00
Private Interface ‘bond1:1′ configured from GPnP for use as a private interconnect.
[name='bond1:1',...
..., use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/11.2.0.2/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is enabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 – 64bit Production
Summary
That concludes my testing. RON works better than a “classic” active/passive cluster and allows sessions to stay connected when migrating the database to a different host. And it makes it easier to convert the database to a full RAC database (I have an example but it needs a bit of tidying up before posting). On the other hand, virtualisation technology has allowed us to do the same for quite some time now. Xen and OracleVM can relocate domUs, and vmotion is the commercial alternative. Whatever suits your needs.
Tweets that mention RAC One Node and Database Protection -- Topsy.com said
[...] This post was mentioned on Twitter by mathewbutler, Martin Bach. Martin Bach said: RAC One Node and Database Protection http://wp.me/pDczL-am [...]
ghassem said
Thank you so much
It’s excelent
Client-failover for dataguard switchover and failover « Bitbach’s Blog said
[...] before 10.2.0.4 and unless being started with the -F option. Martin Bach writes in an article (RAC One Node and Database Protection) that it is more or less the responsibility of the calling environment to handle just this [...]
bdrouvot said
In case of Rac one Node database crash and using a TAF service : If you wait the instance is up again you will be able to launch your sql without seeing the crash.
The thing is : The sql has to be launched when the instance is providing the TAF service enven if its crashed before.
So :
connect
launch the sql => ok
db crash
wait the db is up again
re-launch the sql =>ok
Bertrand
RAC One Node : Avoid automatic database relocation « bdt's oracle blog said
[...] For a good understanding of what RAC One Node is, you can have a look to Martin’s post. [...]
TAF is not working on Rac One Node in case of node failure ? Well it depends | bdt's oracle blog said
[…] Martin Bach did a good study of Rac One Node capability and especially of what happens with session failover during a database relocation or during a node failure into this blog post. […]