In a previous post I shared how I found out that RAC One Node databases are relocated on-line during patching and I promised a few more tests and sharing of implications. If you aren’t familiar with RAC One Node I recommend having a look at the official documentation: The Real Application Clusters Administration and Deployment Guide features it prominently in the introduction. One of the things I like to keep in mind when working with RAC One Node is this, quoting from section 1.3 in said admin and deployment guide:
Using the Single Client Access Name (SCAN) to connect to the database, clients can locate the service independently of the node on which it is running. Relocating an Oracle RAC One Node instance is therefore mostly transparent to the client, depending on the client connection. Oracle recommends to use either Application Continuity and Oracle Fast Application Notification or Transparent Application Failover to minimize the impact of a relocation on the client.
A little later, you can read this important additional piece of information in the “notes” section:
To prepare for all failure possibilities, you must add at least one Dynamic Database Service (Oracle Clusterware-managed database service) to an Oracle RAC One Node database.
If you created your RAC One database with the database creation assistant (dbca), you are already complying with that rule. In my case, my lab database is named RON with the mandatory service RON_SVC. I opted for an instance prefix of “DCB”.
What I’d like to try out for this blog post is what happens to an active workload on a RAC One database during patching.
I have successfully patched multi-node RAC systems, but that required the application to support this procedure. One of my favourite talks is named “Advanced RAC programming techniques” where I demonstrate the resilience of an application based on RAC to node failures by using Universal Connection Pool (UCP), JDBC and Fast Connection Failover (FCF). UCP is a Fast Application Notification (FAN) aware connection pool allowing my application to react to cluster events such as node up/node down. The idea is to mask instance failure from users.
If all of this sounds super-cryptic, I would like to suggest chapters 5 and 6 of the RAC administration and deployment guide for you. There you can read more about Workload Management with Dynamic Database Services and Application Continuity in RAC.
RAC One Node is different from multi-node RAC as it is only ever active on one node in normal operations. Online relocation, as described in the previous post, is supported by temporarily adding a second (destination) instance to the cluster, and moving all transactions across before terminating them after a time-out (default 30m). Once the time-out is reached (or all sessions made it across to the destination) the source instance is shut down and you are back to 1 active instance.
The online relocation does not care too much about the compatibility of the application with the process. If your application is written for RAC, most likely it will migrate quickly from source to destination instance. If it isn’t, well, the hard timeout might kick in and evict a few of your users. In a worse case your users need to re-connect to the database. Even worse still, you might have to restart the middle-tier system(s).
Sadly I haven’t seen too many applications capable of handling RAC events properly. One application that does is Swingbench, so I’ll stick with it. I configured it according to Dominic Giles’s blog post.
This post assumes that you are properly licensed to use all these features.
My setup hasn’t changed from previous post so I spare you the boredom of repeating it here. Jump over to the other post for details.
For this blogpost I need to ensure that my RAC One node database resides on the node I am about to patch. I have again followed the patch readme very carefully, I have made sure that I have (proven, working, tested) backups of the entire stack and all the rest of it…
My database is indeed actively running on the node I am about to patch:
[oracle@rac122sec2 ~]$ srvctl status database -db DCB -detail -verbose Instance DCB_1 is running on node rac122sec2 with online services RON_SVC. Instance status: Open. Instance DCB_1 is connected to ASM instance +ASM2 Online relocation: INACTIVE
Before I can benefit from Application Continuity, I need to make a few changes to my application service, RON_SVC. There are quite a few sources to choose from, I went with the JDBC Developer’s Guide. Here’s the modification:
[oracle@rac122sec2 ~]$ srvctl modify service -db DCB -service RON_SVC -commit_outcome true \ > -failoverretry 30 -failoverdelay 10 -failovertype transaction \ > -replay_init_time 1800 -retention 86400 -notification true [oracle@rac122sec2 ~]
Following the instructions on Dominic Giles’s blog, I also need to grant SOE the right to execute DBMS_APP_CONT.
Finally, I need to make changes to my Swingbench configuration file. The relevant part is shown here:
<SwingBenchConfiguration xmlns="http://www.dominicgiles.com/swingbench/config"> <Name>"Order Entry (PLSQL) V2"</Name> <Comment>Version 2 of the SOE Benchmark running in the database using PL/SQL</Comment> <Connection> <UserName>soe</UserName> <Password></Password> <ConnectString>(DESCRIPTION=(TRANSPORT_CONNECT_TIMEOUT=5) (RETRY_COUNT=3)(FAILOVER=ON) (ADDRESS=(PROTOCOL=TCP)(HOST=rac122sec-scan)(PORT=1521)) (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=RON_SVC)))</ConnectString> <DriverType>Oracle jdbc Driver</DriverType> <ConnectionPooling> <PooledInitialLimit>5</PooledInitialLimit> <PooledMinLimit>5</PooledMinLimit> <PooledMaxLimit>10</PooledMaxLimit> <PooledInactivityTimeOut>50</PooledInactivityTimeOut> <PooledConnectionWaitTimeout>45</PooledConnectionWaitTimeout> <PooledPropertyCheckInterval>10</PooledPropertyCheckInterval> <PooledAbandonedConnectionTimeout>120</PooledAbandonedConnectionTimeout> </ConnectionPooling> <Properties> <Property Key="StatementCaching">50</Property> <Property Key="FastFailover">true</Property> <Property Key="FetchSize">20</Property> <Property Key="OnsConfiguration">nodes=rac122sec1:6200,rac122sec2:6200</Property> <Property Key="FetchSize">20</Property> <Property Key="AppContinuityDriver">true</Property> </Properties> </Connection> <Load> <NumberOfUsers>5</NumberOfUsers> <MinDelay>0</MinDelay> <MaxDelay>0</MaxDelay> <InterMinDelay>50</InterMinDelay> <InterMaxDelay>500</InterMaxDelay> <QueryTimeout>120</QueryTimeout> <MaxTransactions>-1</MaxTransactions> <RunTime>0:0</RunTime> <LogonGroupCount>1</LogonGroupCount> <LogonDelay>1000</LogonDelay> <LogOutPostTransaction>false</LogOutPostTransaction> <WaitTillAllLogon>false</WaitTillAllLogon> <StatsCollectionStart>0:0</StatsCollectionStart> <StatsCollectionEnd>0:0</StatsCollectionEnd> <ConnectionRefresh>0</ConnectionRefresh> <TransactionList> ...
The connection string is actually on a single line, I have formatted it for readability in the above example. The main change from the standard configuration file is the use of connection pooling and setting the required properties for Application Continuity.
Once all the preparations are completed, it’s time to see how RAC One Node deals with an active workload undergoing an online relocation during patching. First of all I need to start the workload. I’d normally use charbench for this, but this time around opted for the GUI. It shows performance graphs over a 3 minute rolling window.
A few minutes after starting the benchmark I commenced patching. Soon thereafter, opatchauto told me that it was relocating the database.
OPatchauto session is initiated at Fri Jul 27 14:52:23 2018 ... Relocating RACOne home before patching on home /u01/app/oracle/product/188.8.131.52/dbhome_1
According to the session log, this happened at 14:54. And by the way, always keep the session log, it’s invaluable!
2018-07-27 14:54:07,914 INFO  com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - Executing command as oracle: /bin/sh -c 'ORACLE_HOME=/u01/app/oracle/product/184.108.40.206/dbhome_1 /u01/app/oracle/product/220.127.116.11/dbhome_1/bin/srvctl relocate database -d DCB -n rac122sec1 -w 30 -v' 2018-07-27 14:54:07,914 INFO  com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - COMMAND Looks like this: /bin/sh -c 'ORACLE_HOME=/u01/app/oracle/product/18.104.22.168/dbhome_1 /u01/app/oracle/product/22.214.171.124/dbhome_1/bin/srvctl relocate database -d DCB -n rac122sec1 -w 30 -v' 2018-07-27 14:55:31,736 INFO  com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - status: 0 2018-07-27 14:55:31,737 INFO  com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - Successfully executed the above command. 2018-07-27 14:55:31,737 INFO  com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - Output from the command: 2018-07-27 14:55:31,737 INFO  com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - Configuration updated to two instances Instance DCB_2 started Services relocated Waiting for up to 30 minutes for instance DCB_1 to stop ... Instance DCB_1 stopped Configuration updated to one instance 2018-07-27 14:55:31,737 INFO  com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - Command executed successfully.
You can see the same information by querying Clusterware, although there aren’t any timestamps associated with it:
[root@rac122sec1 ~]# srvctl status database -db DCB -detail -verbose Instance DCB_1 is running on node rac122sec2 with online services RON_SVC. Instance status: Open. Instance DCB_1 is connected to ASM instance +ASM2 Online relocation: ACTIVE Source instance: DCB_1 on rac122sec2 Destination instance: DCB_2 on rac122sec1 [root@rac122sec1 ~]#
Although the online relocation timeout was set to 30 minutes, use of modern coding techniques and connection pooling allowed for a much faster online relocation. As you can see in the log excerpt the entire relocation was completed 2018-07-27 14:55:31,737. Clusterware now tells me that my database runs on node 1:
[root@rac122sec1 ~]# date; srvctl status database -db DCB -detail -verbose Fri 27 Jul 14:55:38 2018 Instance DCB_2 is running on node rac122sec1 with online services RON_SVC. Instance status: Open. Instance DCB_2 is connected to ASM instance +ASM1 Online relocation: INACTIVE
While this is wicked, what are the implications for the application? Have a look at this print screen, taken a minute after the online relocation completed.
As you can see there was nothing unusual recorded (tab events to the left), and I couldn’t see a drop in the number of sessions connected. I noticed a slight blip in performance though but it recovered very soon thereafter.
During automatic patching of the Oracle stack opatchauto will perform an online relocation of a RAC One Node database if it is found running on the node currently undergoing patching. If your application is developed with RAC in mind – such as using a FAN-aware connection pool like UCP, and either supports Fast Connection Failover or Application Continuity, there is a good chance that patching the system does not affect your service.
While my test was successful, it is by no means representative of a production workload – my RAC One database has a SGA of 4 GB and there were 5 concurrent users – hardly what I see in the real world. However that shouldn’t deter you: if you are using RAC One Node I think it’s definitely worth a try implementing modern technology into the application.