I may have said it before, but I consider presenting and teaching a great way to expand one's knowledge: first of all, it requires me to really understand a subject; secondly, when presenting, I get lots of interesting questions that can turn into blog posts like this one.
Lately I have been asked about the impact of synchronous log shipping to a physical standby database. I was sure there was an effect to be observed, most likely depending on the network latency between the systems, but I didn't have any evidence I could pull out of the hat to back up my thoughts. So what better way to find out than to try! I also read that some of the events have changed in 12c and wanted to make them visible. My environment is based on the 2-node RAC primary/2-node RAC standby configuration I wrote about in my previous posts.
Since their initial setup I have upgraded both clusters to 12.1.0.2.170117 for Clusterware and RDBMS.
I am planning to share my findings in two posts: the first one (the one you are reading) presents the baseline; the second one to follow shows how performance can be impacted by poor network performance. I guess the second one will be a bit longer …
Background
Data Guard allows the administrator to place the databases in a Data Guard configuration into one of three protection modes: Maximum Performance, Maximum Availability, and Maximum Protection. All three modes are covered in the official documentation set and on MOS. Without saying it very scientifically:
- Maximum Performance uses asynchronous log shipping and is the default. The performance of the standby database (or rather lack thereof) should not impact the performance of the primary. This comes at a cost: it is not guaranteed that redo has actually been written to the standby when the commit command returns.
- Maximum Protection allows you to enforce a Zero Data Loss (ZDL) strategy, but that, too, comes at a cost: the primary will shut down if it can't transmit redo information to the standby. That is how you prevent data loss!
- Maximum Availability is the middle ground and I’ll focus on that protection mode here.
Make sure to read the documentation and understand the implications of each protection mode when assessing the most suitable mode for your environment!
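For reference, raising the protection level of a Broker-managed configuration such as this one boils down to two steps: switch redo transport to synchronous mode first, then raise the protection mode. Here is a minimal sketch of the DGMGRL commands, using my lab's database names; make sure you understand the implications before running anything like this in your environment:

DGMGRL> edit database 'NCDBA' set property LogXptMode='SYNC';
DGMGRL> edit database 'NCDBB' set property LogXptMode='SYNC';
DGMGRL> edit configuration set protection mode as MaxAvailability;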
Background: New Events
Before going into detail I should probably point out that there are some new Data Guard events. Hurray! More events and more meaningful names, great. Just one caveat: have a look at the output from my 12c database:
select name, wait_class from v$event_name
where name in (
  'ARCH Remote Write', 'ASYNC Remote Write', 'Redo Transport Attach',
  'Redo Transport Close', 'Redo Transport Detach', 'Redo Transport Open',
  'Redo Transport Ping', 'Redo Transport Slave Shutdown', 'Redo Transport Slave Startup',
  'Redo Writer Remote Sync Complete', 'Redo Writer Remote Sync Notify',
  'Remote SYNC Ping', 'SYNC Remote Write');

NAME                                               WAIT_CLASS
-------------------------------------------------- ----------------------------------------
ASYNC Remote Write                                 Other
SYNC Remote Write                                  Other
ARCH Remote Write                                  Other
Redo Transport Attach                              Other
Redo Transport Detach                              Other
Redo Transport Open                                Other
Redo Transport Close                               Other
Redo Transport Ping                                Other
Remote SYNC Ping                                   Other
Redo Transport Slave Startup                       Other
Redo Transport Slave Shutdown                      Other
Redo Writer Remote Sync Notify                     Other
Redo Writer Remote Sync Complete                   Other

13 rows selected.
Compare this to the pre-12c events (taken from an 11.2.0.3 database, since they are mostly gone in 12c):
select name, wait_class from v$event_name
where name in (
  'ARCH wait on ATTACH', 'ARCH wait on SENDREQ', 'ARCH wait on DETACH',
  'LNS wait on ATTACH', 'LNS wait on SENDREQ', 'LNS wait on DETACH',
  'LGWR wait on LNS', 'LNS wait on LGWR', 'LGWR-LNS wait on channel');

NAME                                               WAIT_CLASS
-------------------------------------------------- --------------------
ARCH wait on ATTACH                                Network
LNS wait on ATTACH                                 Network
LNS wait on SENDREQ                                Network
LNS wait on DETACH                                 Network
ARCH wait on SENDREQ                               Network
ARCH wait on DETACH                                Network
LNS wait on LGWR                                   Network
LGWR wait on LNS                                   Network
LGWR-LNS wait on channel                           Other

9 rows selected.
For the sake of completeness, here are the events that remain in 12c:
select name, wait_class from v$event_name
where name in (
  'ARCH wait on ATTACH', 'ARCH wait on SENDREQ', 'ARCH wait on DETACH',
  'LNS wait on ATTACH', 'LNS wait on SENDREQ', 'LNS wait on DETACH',
  'LGWR wait on LNS', 'LNS wait on LGWR', 'LGWR-LNS wait on channel');

NAME                           WAIT_CLASS
------------------------------ ------------------------------
LNS wait on LGWR               Network
LGWR wait on LNS               Network
LGWR-LNS wait on channel       Other
Bugger. The new events are all in the "Other" wait class. My initial attempts at filtering information from v$event_histogram based on events with the "Network" class therefore had little success. I guess friends of the OEM performance pages might also find that interesting. If you wonder where I pulled these events from: I used the white paper named "Best Practices for Synchronous Redo Transport" (see references) as a source.
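The practical workaround is to filter v$event_histogram by event name rather than wait class. A minimal sketch of such a query, assuming you are after the new SYNC-related transport events (extend the event list as needed):

-- sketch: wait time histogram for two of the new redo transport events
select event, wait_time_milli, wait_count
from v$event_histogram
where event in ('SYNC Remote Write', 'Redo Transport Ping')
order by event, wait_time_milli;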
Testing: The Environment
As you read in the introduction, this environment is based on a two-node RAC primary shipping to a two-node RAC standby:
DGMGRL> show configuration verbose

Configuration - ractest

  Protection Mode: MaxPerformance
  Members:
  NCDBA - Primary database
    NCDBB - Physical standby database

  Properties:
    FastStartFailoverThreshold      = '30'
    OperationTimeout                = '30'
    TraceLevel                      = 'USER'
    FastStartFailoverLagLimit       = '30'
    CommunicationTimeout            = '180'
    ObserverReconnect               = '0'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'
    ObserverOverride                = 'FALSE'
    ExternalDestination1            = ''
    ExternalDestination2            = ''
    PrimaryLostWriteAction          = 'CONTINUE'

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS

DGMGRL> show database 'NCDBA';

Database - NCDBA

  Role:               PRIMARY
  Intended State:     TRANSPORT-ON
  Instance(s):
    NCDBA1
    NCDBA2

Database Status:
SUCCESS

DGMGRL> show database 'NCDBB'

Database - NCDBB

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          0 seconds (computed 0 seconds ago)
  Average Apply Rate: 177.00 KByte/s
  Real Time Query:    OFF
  Instance(s):
    NCDBB1
    NCDBB2 (apply instance)

Database Status:
SUCCESS

DGMGRL> show database 'NCDBA' LogXptMode
  LogXptMode = 'async'
DGMGRL> show database 'NCDBB' LogXptMode
  LogXptMode = 'async'
The configuration is currently set to "Maximum Performance"; the databases ship redo asynchronously.
All systems are patched to the January 2017 Proactive Bundle Patch:
[oracle@rac12pri1 ~]$ $ORACLE_HOME/OPatch/opatch lspatches
24917972;Database PSU 12.1.0.2.170117, Oracle JavaVM Component (JAN2017)
24732082;Database Patch Set Update : 12.1.0.2.170117 (24732082)
24828633;OCW Patch Set Update : 12.1.0.2.170117 (24828633)

OPatch succeeded.
Testing Part I: The Baseline
I am using Swingbench's Order Entry benchmark to generate a little bit of load. For the avoidance of doubt: I am not interested in a comparison between asynchronous and synchronous log shipping. What I would like to see is the effect of the difference in network latency, in the form of hugely increased waits. In my first test I am running with the defaults of my VM (the NIC uses virtio drivers), with no changes to the SDU or the send/receive buffers.
$ date
Fri Feb 17 13:58:58 UTC 2017

$ ./charbench -cs //rac12pri-scan/swingbench_both \
> -uc 20 -r results_20_maxperf.xml -rt 00:10
Author  :        Dominic Giles
Version :        2.5.0.971

Results will be written to results_20_maxperf.xml.
Hit Return to Terminate Run...

Time            Users   TPM     TPS

1:59:55 PM      20      3852    255
On the apply side this translates to the following figures (the average apply rate is misleading and should be ignored: it covers everything applied since MRP0 was started):
DGMGRL> show database 'NCDBB'

Database - NCDBB

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          0 seconds (computed 0 seconds ago)
  Average Apply Rate: 713.00 KByte/s
  Real Time Query:    OFF
  Instance(s):
    NCDBB1
    NCDBB2 (apply instance)

Database Status:
SUCCESS
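If you prefer SQL over the Broker interface, the same lag information should be available from v$dataguard_stats on the standby. A minimal sketch (output not shown here):

-- sketch: transport and apply lag as seen on the standby database
select name, value, time_computed
from v$dataguard_stats
where name in ('transport lag', 'apply lag');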
There is no transport lag, and no apply lag either. When checking for waits on background processes (using Tanel Poder's ashtop, which requires you to have the Diagnostics Pack license as it accesses v$active_session_history) I noticed the following:
SQL> @ashtop program,event,wait_class,session_state session_type='BACKGROUND' "to_date('17.02.2017 09:02','dd.mm.yyyy hh24:mi')" "to_date('17.02.2017 09:07','dd.mm.yyyy hh24:mi')"

    Total
  Seconds     AAS %This   PROGRAM                                          EVENT                                    WAIT_CLASS      SESSION FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
--------- ------- ------- ------------------------------------------------ ---------------------------------------- --------------- ------- ------------------- ------------------- -----------------
       53      .2   29% | oracle@rac12pri2 (LGWR)                          log file parallel write                  System I/O      WAITING 2017-02-17 09:03:42 2017-02-17 09:06:53                 1
       23      .1   13% | oracle@rac12pri2 (LG00)                          log file parallel write                  System I/O      WAITING 2017-02-17 09:02:03 2017-02-17 09:03:26                 1
       19      .1   10% | oracle@rac12pri2 (DBW0)                                                                                   ON CPU  2017-02-17 09:02:07 2017-02-17 09:06:39                 1
       12      .0    7% | oracle@rac12pri2 (LGWR)                          target log write size                    Other           WAITING 2017-02-17 09:02:03 2017-02-17 09:03:27                 1
        9      .0    5% | oracle@rac12pri1 (LMS1)                                                                                   ON CPU  2017-02-17 09:04:37 2017-02-17 09:06:58                 1
        8      .0    4% | oracle@rac12pri2 (TT00)                                                                                   ON CPU  2017-02-17 09:02:15 2017-02-17 09:06:56                 1
        6      .0    3% | oracle@rac12pri1 (LG00)                          log file parallel write                  System I/O      WAITING 2017-02-17 09:03:08 2017-02-17 09:06:44                 1
        6      .0    3% | oracle@rac12pri2 (LGWR)                                                                                   ON CPU  2017-02-17 09:04:05 2017-02-17 09:06:41                 1
        5      .0    3% | oracle@rac12pri2 (LG00)                                                                                   ON CPU  2017-02-17 09:02:02 2017-02-17 09:03:27                 1
        5      .0    3% | oracle@rac12pri2 (LMS0)                                                                                   ON CPU  2017-02-17 09:02:07 2017-02-17 09:06:14                 1
        5      .0    3% | oracle@rac12pri2 (LMS1)                                                                                   ON CPU  2017-02-17 09:02:24 2017-02-17 09:06:28                 1
        4      .0    2% | oracle@rac12pri1 (LMS0)                                                                                   ON CPU  2017-02-17 09:04:21 2017-02-17 09:06:05                 1
        3      .0    2% | oracle@rac12pri1 (LGWR)                          LGWR all worker groups                   Other           WAITING 2017-02-17 09:03:08 2017-02-17 09:06:13                 1
        2      .0    1% | oracle@rac12pri1 (DBW0)                          db file parallel write                   System I/O      WAITING 2017-02-17 09:03:08 2017-02-17 09:03:42                 1
        2      .0    1% | oracle@rac12pri2 (CKPT)                                                                                   ON CPU  2017-02-17 09:05:10 2017-02-17 09:05:13                 1
        2      .0    1% | oracle@rac12pri2 (TT00)                          log file sequential read                 System I/O      WAITING 2017-02-17 09:03:17 2017-02-17 09:03:46                 1
        1      .0    1% | oracle@rac12pri1 (CJQ0)                                                                                   ON CPU  2017-02-17 09:04:44 2017-02-17 09:04:44                 1
        1      .0    1% | oracle@rac12pri1 (DBW0)                                                                                   ON CPU  2017-02-17 09:03:05 2017-02-17 09:03:05                 1
        1      .0    1% | oracle@rac12pri1 (LG00)                                                                                   ON CPU  2017-02-17 09:02:02 2017-02-17 09:02:02                 1
        1      .0    1% | oracle@rac12pri1 (LGWR)                          target log write size                    Other           WAITING 2017-02-17 09:05:47 2017-02-17 09:05:47                 1
The time difference between charbench and the database can be explained by time zones: my load generator is set to use UTC whereas the database is on EST; the result is a 5-hour time difference. This is a minor oversight on my part and has nothing to do with the location of the machines. In fact, they are all virtualised.
Summary Part I
In this part I tried to lay the foundation for the next one, where I will use my lab environment to simulate artificially high network latency. I have to admit again that my setup is somewhat over-simplified: using Data Guard with asynchronous log shipping and Maximum Performance mode is not something I'd like to support for anything more important than my toy databases in the lab.
References
http://www.oracle.com/technetwork/database/availability/sync-2437177.pdf