This is a follow-up on yesterday’s post about services in the new 12c database architecture. After having worked out everything I needed to know about TAF and RAC 12c in CDBs I wanted to check how FCF works with PDBs today. While investigating I found out that the Runtime Load Balancing Advisory does not seem to work as expected in some cases. But I’m getting ahead of myself. First of all, here is my test case:
- Oracle Linux 6.4 x86-64
- Grid Infrastructure 12.1.0.1.2, i.e. January 2014 PSU applied
- RDBMS 12.1.0.1.2, likewise patched with the January PSU
- A CDB with just 1 PDB for this purpose, named DEMOPDB
- Service FCFSRV is used
Service setup
The service is specifically created to connect against the PDB:
[oracle@rac12node1 ~]$ srvctl add service -d rac12c -s FCFSRV -preferred RAC12C1,RAC12C2 -pdb DEMOPDB \
> -clbgoal short -rlbgoal SERVICE_TIME -notification true
The service has then been started and was running on both instances. The PDB was also opened read-write on all instances. I am using the same dequeue.pl script to dequeue FAN events as I already demonstrated in the RAC book. In normal operations both instances should be evenly loaded (more or less). Changes to the load on a RAC node should be communicated internally using Fast Application Notifications (FAN). In theory this allows connection pools to shift sessions from one instance to another and dynamically reflect changes in the cluster load. During my limited testing Oracle’s Universal Connection Pool (UCP) does this quite nicely, and I have written some articles about it. Search this site for UCP if you would like to read up on UCP first.
As you just read, the mechanism employed is based on FAN events and is explained in the Real Application Clusters Administration and Deployment Guide. The RTLB Advisory is perfectly suited for connection pools and, in my opinion, the right way forward for new development projects.
The service definition shown above sets the connection load balancing (CLB) goal to short connections, typical for an application based on a connection pool. In such an environment you look up the pool using JNDI, for example, grab a session, do something, and hand it back to the pool. I instructed the pool to optimise for service time, i.e. to get a connection from the instance offering the best response time (as opposed to throughput, the other option).
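To double-check the goals after the fact, the service configuration can be dumped. A quick sanity check, guarded so it is a no-op on machines without Grid Infrastructure; your output will of course depend on your environment:

```shell
# Print the FCFSRV service configuration, including the CLB and
# runtime load balancing goals set at creation time.
if command -v srvctl >/dev/null 2>&1; then
  srvctl config service -d rac12c -s FCFSRV
fi
```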
Testing the RTLB Advisory with a PDB
With the services started and the PDB open I was ready to burn some CPU. This is quite simply done with a SQL script that I execute lots of times. First, the SQL script (credit to Julian Dyke):
[oracle@rac12node2 burncpu]$ cat burn.sql
DECLARE
  n NUMBER := 0;
BEGIN
  FOR f IN 1..10000000 LOOP
    n := MOD (n,999999) + SQRT (f);
  END LOOP;
  -- DBMS_OUTPUT.PUT_LINE ('Res = '||TO_CHAR (n,'999999.99'));
END;
/
exit
This script is called by my (certainly not perfect!) shell script:
[oracle@rac12node2 burncpu]$ cat burn.sh
#!/bin/bash

RUNS=${1:-10}
SERVICE=${2:-FCFSRV}

echo starting $RUNS runs to burn some CPU using service $SERVICE

START_TIME=$SECONDS

for i in $(seq 1 $RUNS) ; do
  sqlplus -s martin/secret@rac12scan.gns.example.com/${SERVICE} @burn &
done

wait

ELAPSED_TIME=$(($SECONDS - $START_TIME))
echo CPU burnt in $ELAPSED_TIME seconds
Time to connect!
SQL> select name,cdb from v$database;

NAME      CDB
--------- ---
RAC12C    YES

SQL> @/home/oracle/lspdbs

   INST_ID NAME                           OPEN_MODE
---------- ------------------------------ ----------
         1 DEMOPDB                        READ WRITE
         2 DEMOPDB                        READ WRITE
Now I am going to generate a lot of CPU load on the system, which should have an effect. In a different session I dequeue the FAN events. You may have noticed that I requested "-notification true" in the service definition; this is needed for this example (and for .NET clients, if memory serves me right). Here is the result of me creating some load:
top - 14:27:42 up 4 days, 23:04,  3 users,  load average: 7.30, 2.52, 2.22
Tasks: 317 total,  16 running, 301 sleeping,   0 stopped,   0 zombie
Cpu(s): 98.2%us,  1.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.2%hi,  0.0%si,  0.2%st
Mem:   8059892k total,  6277936k used,  1781956k free,    34372k buffers
Swap:   524280k total,     6144k used,   518136k free,  4143280k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7994 oracle    20   0 3379m  31m  28m R 13.3  0.4   0:04.35 oracle_7994_rac
 7996 oracle    20   0 3379m  34m  31m R 13.3  0.4   0:04.35 oracle_7996_rac
 7998 oracle    20   0 3379m  39m  35m R 13.3  0.5   0:04.36 oracle_7998_rac
 8006 oracle    20   0 3378m  28m  26m R 13.3  0.4   0:04.34 oracle_8006_rac
 8008 oracle    20   0 3378m  28m  26m R 13.3  0.4   0:04.33 oracle_8008_rac
 8010 oracle    20   0 3379m  29m  26m R 13.3  0.4   0:04.34 oracle_8010_rac
 8018 oracle    20   0 3378m  28m  26m R 13.3  0.4   0:04.34 oracle_8018_rac
 8004 oracle    20   0 3378m  29m  26m R 12.6  0.4   0:04.02 oracle_8004_rac
 8074 oracle    20   0 3378m  28m  25m R 12.6  0.4   0:03.64 oracle_8074_rac
 8012 oracle    20   0 3379m  32m  29m R 12.3  0.4   0:04.03 oracle_8012_rac
 8016 oracle    20   0 3378m  28m  26m R 12.3  0.4   0:04.03 oracle_8016_rac
 8000 oracle    20   0 3378m  28m  26m R 11.9  0.4   0:04.03 oracle_8000_rac
 8002 oracle    20   0 3379m  29m  26m R 11.9  0.4   0:04.03 oracle_8002_rac
 8014 oracle    20   0 3378m  29m  26m R 11.9  0.4   0:04.03 oracle_8014_rac
 8072 oracle    20   0 3378m  28m  25m R 11.9  0.4   0:03.63 oracle_8072_rac
 4365 root      RT   0  789m 101m  71m S  1.3  1.3 103:57.23 osysmond.bin
This isn’t too bad for a virtualised system with a dual-core “processor”. The other instance is more or less idle.
top - 14:27:59 up 4 days, 23:05,  4 users,  load average: 0.23, 0.45, 1.08
Mem:   8059892k total,  5994680k used,  2065212k free,   177168k buffers
Swap:   524280k total,     4024k used,   520256k free,  4347016k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9626 oracle    20   0 3381m 165m 159m S  9.3  2.1   0:04.64 ora_ppa7_rac12c
 3813 root      RT   0  789m 101m  71m S  1.7  1.3  84:00.64 osysmond.bin
 1938 oracle    20   0  546m  37m  14m S  1.0  0.5 101:26.07 gipcd.bin
 3710 oracle    -2   0 1367m  16m  14m S  1.0  0.2 153:56.63 asm_vktm_+asm2
Now the big question is: does this have any effect? Going by my script and v$servicemetric it doesn’t. Here are the FAN events for the duration, reformatted for readability:
VERSION=1.0 database=RAC12C service=FCFSRV { {instance=RAC12C2 percent=50 flag=UNKNOWN aff=FALSE} {instance=RAC12C1 percent=50 flag=UNKNOWN aff=FALSE} } timestamp=2014-02-18 14:30:36
VERSION=1.0 database=RAC12C service=FCFSRV { {instance=RAC12C2 percent=50 flag=UNKNOWN aff=FALSE} {instance=RAC12C1 percent=50 flag=UNKNOWN aff=FALSE} } timestamp=2014-02-18 14:31:06
VERSION=1.0 database=RAC12C service=FCFSRV { {instance=RAC12C2 percent=50 flag=UNKNOWN aff=FALSE} {instance=RAC12C1 percent=50 flag=UNKNOWN aff=FALSE} } timestamp=2014-02-18 14:31:36
VERSION=1.0 database=RAC12C service=FCFSRV { {instance=RAC12C2 percent=50 flag=UNKNOWN aff=FALSE} {instance=RAC12C1 percent=50 flag=UNKNOWN aff=FALSE} } timestamp=2014-02-18 14:32:06
As you can see, the percentage split between the instances stays at 50:50 the whole time, despite the big difference in load on the nodes. The FAN events are documented in the same chapter of the RAC Administration and Deployment Guide, where the percent figure is described as "the percentage of work requests to send to this database instance". This observation leads me to suspect that something is not quite right with the figures above.
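For reference, this is roughly what a FAN-aware pool (or my dequeue.pl script) has to do with the payload: pull the instance/percent pairs out of the event. A throwaway sketch, using the first event above as sample input:

```shell
#!/bin/bash
# Sample FAN RLB event payload, copied verbatim from the output above.
event='VERSION=1.0 database=RAC12C service=FCFSRV { {instance=RAC12C2 percent=50 flag=UNKNOWN aff=FALSE} {instance=RAC12C1 percent=50 flag=UNKNOWN aff=FALSE} } timestamp=2014-02-18 14:30:36'

# Extract each "instance=<name> percent=<n>" pair and print it as "<name>:<n>"
parse_fan_percentages() {
  echo "$1" | grep -oE 'instance=[^ ]+ percent=[0-9]+' | \
    awk -F'[= ]' '{print $2 ":" $4}'
}

parse_fan_percentages "$event"
```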
UPDATE
After retesting on a different system I came to the conclusion that the above is an edge case. As you saw in the shell script, I am connecting to a service, and the service in turn is available on both instances. Yet all my sessions ended up on just one node, which shouldn't be the case. I double-checked my DNS round-robin resolution for the SCAN but couldn't find anything wrong with it. Unfortunately I no longer had access to the original system to repeat my testing, so I created a new 12.1.0.1.2 cluster:
12:46:44 SQL> select xmltransform(dbms_qopatch.get_opatch_list(), dbms_qopatch.GET_OPATCH_XSLT()) from dual;

XMLTRANSFORM(DBMS_QOPATCH.GET_OPATCH_LIST(),DBMS_QOPATCH.GET_OPATCH_XSLT())
--------------------------------------------------------------------------------
Patch Details:

Patch(sqlpatch) 17552800:   applied on 2014-03-17T11:01:42-04:00
Unique Patch ID: 16974186
   Patch Description: Database Patch Set Update : 12.1.0.1.2 (17552800)
   Created on : 3 Jan 2014, 03:21:20 hrs PST8PDT
   Files Touched:

Patch 17077442:   applied on 2014-03-17T11:01:25-04:00
Unique Patch ID: 16881794
   Patch Description: Oracle Clusterware Patch Set Update 12.1.0.1.1
   Created on : 12 Oct 2013, 06:33:53 hrs US/Central
   Files Touched:
I created two test cases. In case 1 I connect to just one instance to confirm the observation made above. In the second test I try to evenly connect to both instances. The PDB in this case is ACDEMOPDB and the service name is PDB_FCF:
srvctl add service -d RAC12C -s PDB_FCF -preferred RAC12C1,RAC12C2 -failovertype transaction \
> -pdb ACDEMOPDB -clbgoal short -rlbgoal service_time -notification true -commit_outcome true
I needed to change my connection information for this to work; see below for each example.
Case 1: I am connecting to just 1 instance:
FANTEST_SI =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac12scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = PDB_FCF)
      (INSTANCE_NAME = RAC12C2)
    )
  )
In my test case I executed the burn.sql script in this way:
for i in $(seq 1 10); do $ORACLE_HOME/bin/sqlplus martin/secret@fantest_si @burn & done
The result matches the earlier test, where only one instance was hammered: the load is not reflected in the FAN load balancing events.
Case 2: I am connecting to all instances served by the service
Same setup, but this time I connect to all instances:
FANTEST_ALL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac12scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = PDB_FCF)
    )
  )
The invocation of the load test is the same except for the connect identifier:
for i in $(seq 1 10); do $ORACLE_HOME/bin/sqlplus martin/secret@fantest_all @burn & done
And finally I got the much-anticipated FAN events! For some reason 6 sessions ended up on node 2 and only 4 on node 1:
SQL> select count(inst_id),inst_id from gv$session
  2  where con_id = 3 and username = 'MARTIN' group by inst_id;

COUNT(INST_ID)    INST_ID
-------------- ----------
             4          1
             6          2
Examples of the FAN events recorded are:
VERSION=1.0 database=RAC12C service=PDB_FCF { {instance=RAC12C2 percent=45 flag=GOOD aff=FALSE} {instance=RAC12C1 percent=55 flag=GOOD aff=FALSE} } timestamp=2014-03-22 12:28:53
VERSION=1.0 database=RAC12C service=PDB_FCF { {instance=RAC12C2 percent=41 flag=GOOD aff=FALSE} {instance=RAC12C1 percent=59 flag=GOOD aff=FALSE} } timestamp=2014-03-22 12:29:53
So the lesson learnt (and apologies for the confusion!) is that FAN load balancing events seem to require you to connect via the service, without specifying an instance. They also seem to behave slightly differently if you connect via the service but still end up on just one instance for whatever reason.
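To make the percent figure tangible, here is a minimal sketch (not UCP's actual algorithm, just the proportional idea) of how a pool could split a batch of new work requests according to the advisory. With the 45/55 event shown above, 20 new requests split 9/11; with the stuck 50/50 advisory from the PDB test, the split never moves off half and half:

```shell
#!/bin/bash
# Split a batch of work requests between two instances according to the
# advisory's percent figure. Usage: route_requests <total> <percent-for-first>
route_requests() {
  local total=$1 pct=$2
  # rounded proportional share for the first instance, remainder to the second
  local first=$(( (total * pct + 50) / 100 ))
  echo "$first $(( total - first ))"
}

route_requests 20 45    # the healthy 45:55 split -> prints "9 11"
route_requests 10 50    # the stuck 50:50 advisory -> prints "5 5"
```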
Repeating the test on a Non-CDB
Now, just to prove that this does actually work, I created a non-CDB named NCDB and repeated the test. To cut a long story short, here are my results:
VERSION=1.0 database=NCDB service=FCFSRV { {instance=NCDB2 percent=50 flag=GOOD aff=TRUE} {instance=NCDB1 percent=50 flag=UNKNOWN aff=FALSE} } timestamp=2014-02-18 13:55:09
VERSION=1.0 database=NCDB service=FCFSRV { {instance=NCDB2 percent=55 flag=GOOD aff=TRUE} {instance=NCDB1 percent=45 flag=GOOD aff=FALSE} } timestamp=2014-02-18 13:55:39
VERSION=1.0 database=NCDB service=FCFSRV { {instance=NCDB2 percent=71 flag=GOOD aff=TRUE} {instance=NCDB1 percent=29 flag=GOOD aff=FALSE} } timestamp=2014-02-18 13:56:09
VERSION=1.0 database=NCDB service=FCFSRV { {instance=NCDB2 percent=78 flag=GOOD aff=TRUE} {instance=NCDB1 percent=22 flag=GOOD aff=FALSE} } timestamp=2014-02-18 13:56:39
VERSION=1.0 database=NCDB service=FCFSRV { {instance=NCDB2 percent=64 flag=GOOD aff=TRUE} {instance=NCDB1 percent=36 flag=UNKNOWN aff=FALSE} } timestamp=2014-02-18 13:57:09
VERSION=1.0 database=NCDB service=FCFSRV { {instance=NCDB2 percent=57 flag=GOOD aff=TRUE} {instance=NCDB1 percent=43 flag=UNKNOWN aff=FALSE} } timestamp=2014-02-18 13:57:39
VERSION=1.0 database=NCDB service=FCFSRV { {instance=NCDB2 percent=54 flag=GOOD aff=TRUE} {instance=NCDB1 percent=46 flag=UNKNOWN aff=FALSE} } timestamp=2014-02-18 13:58:09
^Cexiting...
This looks like another case to raise with Oracle Support. In the output above you can see how the system would have sent sessions to instance 2 instead of instance 1, where I generated all the load.
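To quantify the swing: NCDB2's advisory percentages over the seven events are 50, 55, 71, 78, 64, 57 and 54, so the share of work directed away from the loaded node peaked at 78% before settling back. A quick throwaway sketch to pull out the extremes:

```shell
#!/bin/bash
# NCDB2's percent values, copied from the events above in order
samples="50 55 71 78 64 57 54"

# Print the smallest and largest value of a space-separated list
min_max() {
  printf '%s\n' $1 | sort -n | sed -n '1p;$p' | paste -sd' ' -
}

min_max "$samples"    # prints "50 78"
```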
Responses
Is your v$servicemetric query against the pluggable or container database?
Just repeated the test: no difference between CDB$ROOT and DEMOPDB, which is what I hoped for. The FAN events I dequeued are what your connection pool would receive, so there is definitely a problem there.