In the very lengthy previous post about the MAA connect string I wanted to explain the use of the MAA connection string as promoted by Oracle. I deliberately kept the first part simple: both primary and standby cluster were up, and although the database was operating in the primary role on what I called standby cluster (again it’s probably not a good idea to include the intended role in the infrastructure names) there was no penalty establishing a connection.
As pointed out by readers of the blog entry that is of course only one part of the story (you may have guessed by the TNS alias MAA_TEST1 … there are more to come). When you define the connection string this way, you cater for the situation where something goes wrong. Let’s try a few more scenarios. Here is the current Data Guard situation:
DGMGRL> show configuration Configuration - martin Protection Mode: MaxPerformance Members: CDB - Primary database Error: ORA-16810: multiple errors or warnings detected for the database STDBY - Physical standby database Error: ORA-12543: TNS:destination host unreachable Fast-Start Failover: DISABLED Configuration Status: ERROR (status updated 71 seconds ago) DGMGRL> show database 'CDB' Database - CDB Role: PRIMARY Intended State: TRANSPORT-ON Instance(s): CDB1 Error: ORA-16737: the redo transport service for standby database "STDBY" has an error CDB2 Error: ORA-16737: the redo transport service for standby database "STDBY" has an error Database Status: ERROR
As you can see CDB is operating in primary role, and redo transport is broken. This simple reason for broken redo transport is the fact that the standby cluster in its entirety is down. For the sake of completeness I have repeated the MAA connection string here:
MAA_TEST1 = (DESCRIPTION_LIST= (LOAD_BALANCE=off)(FAILOVER=on) (DESCRIPTION= (CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3) (ADDRESS_LIST= (LOAD_BALANCE=on) (ADDRESS=(PROTOCOL=TCP)(HOST=rac12pri-scan)(PORT=1521))) (CONNECT_DATA=(SERVICE_NAME=rootsrv))) (DESCRIPTION= (CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3) (ADDRESS_LIST= (LOAD_BALANCE=on) (ADDRESS=(PROTOCOL=TCP)(HOST= rac12sby-scan)(PORT=1521))) (CONNECT_DATA=(SERVICE_NAME=rootsrv))) )
Scenario 1: shut down of the standby cluster, database in primary role on primary cluster
This does not seem to be much of a problem. Using the same test as in part 1 I get no noteworthy penalty at all:
[oracle@lab tns]$ time sqlplus system/secret@maa_test1 <<EOF select sys_context('userenv','instance_name') from dual; exit EOF SQL*Plus: Release 12.1.0.2.0 Production on Thu Apr 23 10:11:53 2015 Copyright (c) 1982, 2014, Oracle. All rights reserved. Last Successful login time: Thu Apr 23 2015 09:59:48 +01:00 Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options SQL> SYS_CONTEXT('USERENV','INSTANCE_NAME') -------------------------------------------------------------------------------- CDB2 SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options real 0m0.246s user 0m0.020s sys 0m0.025s [oracle@lab tns]$ time sqlplus system/secret@maa_test1 <<EOF select sys_context('userenv','instance_name') from dual; exit EOF SQL*Plus: Release 12.1.0.2.0 Production on Thu Apr 23 10:12:12 2015 Copyright (c) 1982, 2014, Oracle. All rights reserved. Last Successful login time: Thu Apr 23 2015 10:11:52 +01:00 Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options SQL> SYS_CONTEXT('USERENV','INSTANCE_NAME') -------------------------------------------------------------------------------- CDB1 SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options real 0m0.171s user 0m0.018s sys 0m0.018s [oracle@lab tns]$
Trying from a 12.1.0.1.0 client on a different machine:
[oracle@oracledev ~]$ time sqlplus system/secret@maa_test1 <<EOF > select sys_context('userenv','instance_name') from dual; > exit > EOF SQL*Plus: Release 12.1.0.1.0 Production on Thu Apr 23 10:13:50 2015 Copyright (c) 1982, 2013, Oracle. All rights reserved. Last Successful login time: Thu Apr 23 2015 10:12:11 +01:00 Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options SQL> SYS_CONTEXT('USERENV','INSTANCE_NAME') -------------------------------------------------------------------------------- CDB1 SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options real 0m0.427s user 0m0.012s sys 0m0.027s [oracle@oracledev ~]$ time sqlplus system/secret@maa_test1 <<EOF select sys_context('userenv','instance_name') from dual; exit EOF SQL*Plus: Release 12.1.0.1.0 Production on Thu Apr 23 10:13:52 2015 Copyright (c) 1982, 2013, Oracle. All rights reserved. Last Successful login time: Thu Apr 23 2015 10:13:50 +01:00 Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options SQL> SYS_CONTEXT('USERENV','INSTANCE_NAME') -------------------------------------------------------------------------------- CDB2 SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options real 0m0.158s user 0m0.012s sys 0m0.016s
That seems ok, but can be easily explained by the (LOAD_BALANCE=off)(FAILOVER=on) in the DESCRIPTION_LIST. But what if we do it the other way around? Let’s start by reversing the roles:
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production Copyright (c) 2000, 2013, Oracle. All rights reserved. Welcome to DGMGRL, type "help" for information. Connected as SYSDBA. DGMGRL> switchover to 'STDBY' Performing switchover NOW, please wait... New primary database "STDBY" is opening... Oracle Clusterware is restarting database "CDB" ... Switchover succeeded, new primary is "STDBY" DGMGRL> show configuration Configuration - martin Protection Mode: MaxPerformance Members: STDBY - Primary database CDB - Physical standby database Fast-Start Failover: DISABLED Configuration Status: SUCCESS (status updated 48 seconds ago)
Scenario 2: black-out of the primary cluster, database in primary role on standby cluster
The next is the worst case: a date centre has gone down, and none of the hosts in the other cluster (rac12pri) is reachable. To simulate this I simply shut down rac12pri1 and rac12pri2, that’s as dead as it gets. The SCAN will still resolve in DNS, but there is neither a SCAN VIP nor a SCAN LISTENER anywhere to answer on the primary cluster.
As you saw in the output leading to this paragraph the primary database is started on rac12sby, nodes rac12sby1 and rac12sby2. In the current format, there is – for the first time – a connection penalty:
preventde [oracle@rac12sby1 ~]$ time sqlplus system/secret@maa_test1 <<EOF > select sys_context('userenv','instance_name') from dual; > exit; > EOF SQL*Plus: Release 12.1.0.2.0 Production on Tue Apr 21 07:17:36 2015 Copyright (c) 1982, 2014, Oracle. All rights reserved. Last Successful login time: Mon Apr 20 2015 10:39:29 +01:00 Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options SQL> SYS_CONTEXT('USERENV','INSTANCE_NAME') -------------------------------------------------------------------------------- STDBY2 SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options real 0m35.770s user 0m0.024s sys 0m0.043s [oracle@rac12sby1 ~]$ time sqlplus system/secret@maa_test1 <<EOF select sys_context('userenv','instance_name') from dual; exit; EOF SQL*Plus: Release 12.1.0.2.0 Production on Tue Apr 21 07:18:39 2015 Copyright (c) 1982, 2014, Oracle. All rights reserved. Last Successful login time: Tue Apr 21 2015 07:18:09 +01:00 Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options SQL> SYS_CONTEXT('USERENV','INSTANCE_NAME') -------------------------------------------------------------------------------- STDBY2 SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options real 0m36.219s user 0m0.021s sys 0m0.022s
During testing with 12c these ranged between 30 seconds and 38. Here is proof that I truly cannot connect to the primary cluster:
[oracle@lab tns]$ ping rac12pri-scan PING rac12pri-scan.example.com (192.168.100.111) 56(84) bytes of data. From lab.example.com (192.168.100.1) icmp_seq=1 Destination Host Unreachable From lab.example.com (192.168.100.1) icmp_seq=2 Destination Host Unreachable From lab.example.com (192.168.100.1) icmp_seq=3 Destination Host Unreachable From lab.example.com (192.168.100.1) icmp_seq=4 Destination Host Unreachable ^C --- rac12pri-scan.example.com ping statistics --- 5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4001ms [oracle@rac12sby1 ~]$ time telnet rac12pri-scan 1521 Trying 192.168.100.112... telnet: connect to address 192.168.100.112: No route to host Trying 192.168.100.113... telnet: connect to address 192.168.100.113: No route to host Trying 192.168.100.111... telnet: connect to address 192.168.100.111: No route to host real 0m9.022s user 0m0.001s sys 0m0.003s
I found it interesting that the telnet command took almost no time at all to complete, yet trying to connect using SQL_Net it seems to take forever. Why could that be? Trace!
Getting to the bottom of this
Tracing can help sometimes, so I tried that. I try and keep my working TNS configuration separate from the production entries in $ORACLE_HOME. Using ~/tns I defined sqlnet.ora to read:
- diag_adr_enabled = off
- trace_level_client = support
- trace_directory_client = /home/oracle/tns
- log_directory_client = /home/oracle/tns
- trace_unique_client = true
And connected again. I noticed time increases whenever nsc2addr was invoked:
[oracle@lab tns]$ grep "nsc2addr.*DESC" cli_6630.trc | nl 1 (1947452928) [23-APR-2015 11:07:45:376] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.112)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 2 (1947452928) [23-APR-2015 11:07:48:383] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.111)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 3 (1947452928) [23-APR-2015 11:07:51:390] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.113)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 4 (1947452928) [23-APR-2015 11:07:54:396] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.113)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 5 (1947452928) [23-APR-2015 11:07:57:402] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.111)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 6 (1947452928) [23-APR-2015 11:08:00:408] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.112)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 7 (1947452928) [23-APR-2015 11:08:03:414] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.113)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 8 (1947452928) [23-APR-2015 11:08:06:421] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.111)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 9 (1947452928) [23-APR-2015 11:08:09:427] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.112)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 10 (1947452928) [23-APR-2015 11:08:12:433] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.111)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 11 (1947452928) [23-APR-2015 11:08:15:439] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.112)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 12 (1947452928) [23-APR-2015 11:08:18:445] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.113)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 13 (1947452928) [23-APR-2015 11:08:21:452] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.120)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
Sorry for the lengthy listing but that way it shows best! The primary SCAN is set to 111,112 and 113 with the standby SCAN on 119,120, and 121.
What you can see clearly by looking at the timestamps is that the transport_connect_timout works perfectly well-a new address is tried every 3 seconds. What surprises me is that each SCAN VIP is tried 4 times before the second SCAN is tried (lines 1-12). I thought I had that limited to 3 retries … Line 13 is the first reference to the second SCAN, and as soon as that is referenced a connection is established.
That can’t be it, can it?
Hmm maybe Oracle went a bit too far by adding 3 retries to every single SCAN VIP. The whole process can be sped up by trying every SCAN VIP only once before failing over to the second SCAN. This can be done with the following connection string:
MAA_TEST2 = (DESCRIPTION_LIST= (LOAD_BALANCE=off)(FAILOVER=on) (CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3) (DESCRIPTION= (ADDRESS_LIST= (LOAD_BALANCE=on) (ADDRESS=(PROTOCOL=TCP)(HOST=rac12pri-scan)(PORT=1521))) (CONNECT_DATA=(SERVICE_NAME=rootsrv))) (DESCRIPTION= (ADDRESS_LIST= (LOAD_BALANCE=on) (ADDRESS=(PROTOCOL=TCP)(HOST= rac12sby-scan)(PORT=1521))) (CONNECT_DATA=(SERVICE_NAME=rootsrv))) )
Trying this works a lot better from a timing perspective:
[oracle@lab tns]$ time sqlplus system/secret@maa_test2<<EOF exit EOF SQL*Plus: Release 12.1.0.2.0 Production on Thu Apr 23 11:35:17 2015 Copyright (c) 1982, 2014, Oracle. All rights reserved. Last Successful login time: Thu Apr 23 2015 11:08:21 +01:00 Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Advanced Analytics and Real Application Testing options real 0m9.829s user 0m0.041s sys 0m0.022s [oracle@lab tns]$
Visible in the SQL*Net trace:
[oracle@lab tns]$ grep "nsc2addr.*DESC" cli_7305.trc | nl 1 (3474875904) [23-APR-2015 11:35:17:935] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.113)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 2 (3474875904) [23-APR-2015 11:35:20:940] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.112)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 3 (3474875904) [23-APR-2015 11:35:23:948] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.111)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle)))) 4 (3474875904) [23-APR-2015 11:35:26:956] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.119)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
That was the solution to the “slow” connectivity that the person who asked me the question in Paris shared, but I haven’t been able to find proof as to why it works better: this is it. Now the question remains why the MAA string in the Application Continuity white paper is more complex? I think what’s failing here is the CONNECT_TIMEOUT. According to tnsnames.ora reference in 12c the purpose of the parameter is “To specify the timeout duration in seconds for a client to establish an Oracle Net connection to an Oracle database”. You could read this so that after CONNECT_TIMEOUT is reached the failover to the next SCAN occurs. However it appears – by looking at the trace – that the connection timeout is similar to transport timeout and the lower of both is used.
There is a high probability that the syntax is in the document there for a reason, although it escapes me right now. I guess the penalty when creating new connections with connection pools doesn’t really matter that much, because you create the pool (where you “pay” the penalty) only once, and subsequently pull connections from there, which should be very fast.
Nice post Martin.
Should the MAA_TEST2 example connection string read:
(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=1)
Rather than:
(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)
Thanks
Dave
Hi Dave,
Contrary to what I first thought it’s not the case. I have just tested again (because I wasn’t sure anymore) and realised again that the move of the CONNECT_TIMEOUT outside the DESCRIPTION made the difference. The time needed to establish the connection is still roughly 9 seconds. Net*8 tries all 3 SCAN VIPs on the first cluster (3 x 3 seconds = 9 seconds) until the connection is established. That matches the trace output as well.
Hi Martin,
Thanks so much for your blog entry.
Do you happen to know what these numbers are in the first column of the sqlnet trace? I don’t mean the line numbers; I mean the numbers in parenthesis such as (1947452928) and (3474875904). The reason I’m asking is because a core file is getting dumped in our system whenever there are 2 different numbers in parenthesis in the sqlnet trace file. When there is only 1 number in the trace, we don’t get a core. My gut feeling is that these are thread IDs. Ideas?
Regards, Emily
Hi Emily,
I’m afraid I don’t know. But they must be some sort of identifier. If I find the time I’ll look at this in more detail.
Sorry!