Data Guard transport lag in OEM 12c

I have come across this phenomenon a couple of times now so I thought it was worth writing up.

Consider a scenario where you get an alert because your standby database has an apply lag. The alert is generated by OEM and when you log in and check, there is indeed an apply lag. Even worse, the apply lag increases with every refresh of the page! I tagged this as an 11.2 problem but it’s not specific to that version.

Here is a screenshot of this misery:

[Screenshot: apply lag in OEM]

Now there are of course a number of possible causes:

  • There is a lag
  • You are not using Real Time Apply

The first one is easy to check: look at the redo generation rate on the primary database to see whether it has changed. Maybe you are currently loading lots of data? Maybe a batch job has been kicked off that churns through a lot of data… the possibilities are nearly endless.
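
If you want to put a number on it, one (admittedly rough) way is to compare the amount of archived redo per hour on the primary with a “normal” day. The query below is a minimal sketch and assumes dest_id 1 is your local archive destination:

SQL> select trunc(completion_time,'HH24') as sample_hour,
  2  round(sum(blocks*block_size)/1024/1024) as redo_mb
  3  from v$archived_log
  4  where dest_id = 1
  5  group by trunc(completion_time,'HH24')
  6  order by 1;

A sudden jump in redo volume compared to the same time yesterday is a good hint that the standby is simply catching up.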

Another, more subtle explanation could be that you are not using Real Time Apply. How can you check? In the broker command line interface, for example:

DGMGRL> show configuration

Configuration - test

  Protection Mode: MaxPerformance
  Databases:
    pri - Primary database
      Warning: ORA-16789: standby redo logs not configured

    sby - Physical standby database
      Warning: ORA-16789: standby redo logs not configured

Fast-Start Failover: DISABLED

Configuration Status:
WARNING

The warnings about missing standby redo logs show that you cannot possibly be using Real Time Apply (it requires standby redo logs). The other place to check is the database itself:

SQL> select dest_id,status,database_mode,recovery_mode
  2  from v$archive_dest_status
  3  where status <> 'INACTIVE';

   DEST_ID STATUS    DATABASE_MODE   RECOVERY_MODE
---------- --------- --------------- -----------------------
         1 VALID     MOUNTED-STANDBY MANAGED
        32 VALID     UNKNOWN         IDLE

Did you notice the dest_id of 32? That’s a bit of an unusual one (more on that later), since you can only set log_archive_dest_n where n ranges from 1 to 31.

So managed recovery is indeed active, but without Real Time Apply: the RECOVERY_MODE column shows MANAGED rather than MANAGED REAL TIME APPLY. This is also reflected in the broker’s database status:

DGMGRL> show database verbose sby

Database - sby

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   28 seconds
  Apply Lag:       28 seconds
  Real Time Query: OFF
  Instance(s):
    sby

A few moments later, when you query the database again, the lag has increased:

DGMGRL> show database verbose sby

Database - sby

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   3 minutes 22 seconds
  Apply Lag:       3 minutes 22 seconds
  Real Time Query: OFF
  Instance(s):
    sby

This is to be expected: the primary is still happily processing user requests. The cure is to add standby redo logs, as suggested in so many places and described in the Data Guard documentation. After the successful addition of SRLs the lag should disappear.
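
For completeness, here is a minimal sketch of what adding the SRLs could look like. The 52428800 bytes (50M) size, the number of groups and the use of OMF are assumptions: size the standby redo logs like your online redo logs and create one more group per thread than you have online log groups, first on the primary and then on the standby.

SQL> alter database add standby logfile thread 1 size 52428800;
SQL> alter database add standby logfile thread 1 size 52428800;
SQL> alter database add standby logfile thread 1 size 52428800;
SQL> alter database add standby logfile thread 1 size 52428800;

On the standby you may have to stop redo apply first (edit database 'sby' set state='APPLY-OFF' in DGMGRL) and temporarily set standby_file_management to MANUAL before the ADD statements are accepted; revert both afterwards. Once the standby redo logs are in place, a restart of managed recovery using the broker will show something along these lines on the standby: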

2014-01-30 14:35:18.353000 +00:00
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE  THROUGH ALL SWITCHOVER DISCONNECT  USING CURRENT LOGFILE
Attempt to start background Managed Standby Recovery process (sby)
MRP0 started with pid=24, OS id=4854
MRP0: Background Managed Standby Recovery process started (sby)
2014-01-30 14:35:23.406000 +00:00
 started logmerger process
Managed Standby Recovery starting Real Time Apply
...
2014-01-30 14:37:12.595000 +00:00
Media Recovery Waiting for thread 1 sequence 20 (in transit)
2014-01-30 14:37:13.691000 +00:00
Recovery of Online Redo Log: Thread 1 Group 5 Seq 20 Reading mem 0
  Mem# 0: +DATA/sby/onlinelog/group_5.266.838218819

Two important bits of information are shown here: the message “Managed Standby Recovery starting Real Time Apply” and the fact that recovery is reading from the standby redo log. Sure enough, after the database is in sync with its primary and uses the log, the lag is gone:

DGMGRL> show database verbose sby

Database - sby

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   0 seconds
  Apply Lag:       0 seconds
  Real Time Query: OFF
  Instance(s):
    sby

And also in the OEM view:

[Screenshot: OEM showing the lag back at zero]
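
If you prefer to double-check outside the broker and OEM, a quick sketch of what to run on the standby:

SQL> select name, value, time_computed
  2  from v$dataguard_stats
  3  where name in ('transport lag','apply lag');

SQL> select group#, thread#, sequence#, status
  2  from v$standby_log;

With Real Time Apply working you would expect one standby redo log group per thread to show a status of ACTIVE, and both lag values to stay at or very close to zero.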

Slight Variation

I have also seen a variation of this problem in OEM where the transport lag was near zero and therefore hardly visible due to the scale of the graph. The apply lag nevertheless kept growing because the primary was busy and the current online redo log had not been shipped to the standby yet – again, before standby redo logs were implemented. In the OEM view you saw the apply lag climb until the next log switch on the primary, when it dropped to zero for a brief moment before increasing again.
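
You can easily reproduce that saw-tooth pattern: force a log switch on the primary and watch the apply lag collapse for a moment before it starts climbing again. A sketch, to be run on the primary:

SQL> alter system archive log current;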

Summary

Real Time Apply is a very useful feature, especially when used together with the Maximum Availability protection mode. The real risk of not using standby redo logs – and implicitly not using Real Time Apply – is that you can lose data, since the contents of the current online redo log on the primary have not been copied across. If you need to activate your standby you will be some transactions short of the primary. The larger the online redo log, the larger the potential gap.
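
If you want to take it one step further and combine Real Time Apply with Maximum Availability, the broker changes are small. A sketch using the database names from this post (pri and sby), assuming standby redo logs are already in place on both sides and that SYNC transport is feasible over your network:

DGMGRL> edit database 'pri' set property LogXptMode='SYNC';
DGMGRL> edit database 'sby' set property LogXptMode='SYNC';
DGMGRL> edit configuration set protection mode as maxavailability;
DGMGRL> show configuration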

9 thoughts on “Data Guard transport lag in OEM 12c”

  1. Stefan Koehler

    Hi Martin,
    I may misunderstand your summary, but Real Time Apply has no impact on data loss at all. The primary database can use LGWR SYNC mode while your standby runs in “old-fashioned recovery mode”, which basically means that every transaction is safe in the standby redo log file (in sync with the primary) but not applied until the log is archived. The main benefit of Real Time Apply is that your switchover/failover is much faster.

    Regards
    Stefan

    1. Martin Bach Post author

      Hi Stefan!

      You are correct. When writing the summary I continued the example I started in scenario 2 (not using standby redo logs). Since the broker always uses Real Time Apply when it finds suitable standby redo logs, not using Real Time Apply to me implied not having SRLs. I have updated the post in the hope of clarifying things. Thanks for pointing this out.

      Martin

  2. Adrian Angelov

    Thanks for the useful post, Martin!

    I have always wondered why a MAXAVAILABILITY configuration is considered a ‘no data loss’ solution … I mean the case where you have only one standby database (which, from my experience, is pretty much the standard).
    Let’s say you’ve configured the primary and the standby properly (standby control files, a Data Guard Broker MAXAVAILABILITY configuration (LGWR SYNC), etc.). This guarantees that in case of a primary failure all committed data is on the standby – in this case all is good: there is no data loss and failover to the standby is perfectly OK.

    But the problem is that if you have a network interruption just before the primary db/host/data center failure, and this interruption is still present at the moment the primary db/host/data center fails, it is almost 100% guaranteed that you’ll have data loss if you activate the standby. This is because once the network problem occurred the primary automatically switched from MAXAVAILABILITY to MAXPERFORMANCE mode, and there will be committed transactions on the primary db that are not present on the standby. If you don’t have access to the primary db alert log, you won’t be able to verify whether the MAXAVAIL -> MAXPERF switch occurred, and activating the standby database might lead to loss of data. A really bad thing to think about during such stressful moments. And what about: ‘Boss, we’re in a “no data loss” mode but …’

    There is no doubt that two MAXAVAILABILITY standby databases in different data centers are better than one, and will prevent the MAXAVAIL -> MAXPERF switch in the case where network connectivity to one of the standby databases fails just prior to the primary db failure. The cost of doubling the standby databases (plus licenses, hardware, another data center) is much bigger, but after all we are supposed to provide a ‘no data loss’ solution, and that was the plan when we moved to MAXAVAILABILITY in the first place.

    I’m curious what your thoughts on the above are, and what other workarounds might help from a pure DBA perspective (ignoring the network availability capabilities).

    Thank you

    1. Martin Bach Post author

      Hi Adrian!

      Thanks for your thoughtful comment, with many great points! Let me try and answer your question, beginning with the “no data loss” attribute of the Maximum Availability mode that you mention. I don’t think that Maximum Availability is a “no data loss” configuration; the 11.2 Data Guard Concepts and Administration guide states:

      “Maximum Availability

      This protection mode provides the highest level of data protection that is possible without compromising the availability of a primary database.”

      It is my understanding that Maximum Availability acts just like Maximum Protection as long as the network is OK (LGWR SYNC etc.), except that “If the primary database cannot write its redo stream to at least one synchronized standby database, it operates as if it were in maximum performance mode to preserve primary database availability until it is again able to write its redo stream to a synchronized standby database.” (again a literal quote from the documentation: http://docs.oracle.com/cd/E11882_01/server.112/e25608/protection.htm#SBYDB4743)

      So where in Maximum Protection “the primary database will shut down, rather than continue processing transactions, if it cannot write its redo stream to at least one synchronized standby database”, in Maximum Availability it carries on, with the risks you describe. You may get lucky, though, and flush the redo to the standby, which has been possible since 11.2 (ALTER SYSTEM FLUSH REDO).
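
      As a sketch, and assuming the failed primary can still be mounted and the DB_UNIQUE_NAME of the standby is sby as in the post above, flushing the remaining redo before a failover would look something like this:

      SQL> startup mount
      SQL> alter system flush redo to sby;

      If the statement completes, the standby has received everything and the failover can be a zero data loss one.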

      In my career I have observed that most sites use just one standby database, as you describe. Ideally you would have a second standby to protect against the failure of the WAN link. On the other hand, I personally think that Maximum Protection is a bit too drastic: shutting down the primary database is an outage, which may not have been necessary if the network problem was only a short glitch. It all depends on how important the data is, I guess. Let me know if that answers your question,

      Martin

      1. Adrian Angelov

        Thanks Martin,

        The snippets from the documentation are really important, and they should be presented properly so that management is aware of such catastrophic possibilities even when running in MAXAVAILABILITY.

        I also think that MAXPROTECTION is too drastic and not often used in the real world (uptime requirements get higher and higher).

        I’m not sure if it is possible at all, but if Oracle were able to implement ‘better’ logging of such primary db DG Broker mode changes (MAXAVAIL -> MAXPERF), recorded on the standby site, this would give additional information to take into consideration when a failover is to be executed. It seems tough to me, since the standby would somehow have to determine whether the primary had made a mode switch without having access to the primary. Maybe some kind of pinging of the primary, done by the standby: if the primary is not reachable for a specified amount of time, the standby logs in its alert log file that the primary may have switched to MAXPERF because of network connectivity issues (things get even more complicated in the case of more than one standby).
        With this information in the standby alert log, the DBA will at least know that there were network connectivity issues before the primary failure which might have triggered a DG Broker mode switch, and she/he will know whether the failover might involve data loss. Where data loss is possible, it might be decided to accept some more downtime in order to get the primary db hard disk drives to the standby site (if at all possible; it depends on the type of disaster) and recover the remaining redo there, so that there is no data loss for sure.
        I mean, the DBA will have two options in case of network connectivity issues just before the failure:
        – less downtime; the possibility of data loss after the failover is present
        – lengthy downtime; no data loss if the primary db redo is made available on the standby site

        Maybe the above is total nonsense and too paranoid, but every time I had to do database failovers I had this bad feeling of uncertainty that some information might have been lost.

        Thank you, and sorry for the lengthy comments (I’ll dive deep into the Data Guard Broker details and commands so I learn how to express things in one sentence only ;) ).

        Thanks again

    2. Stefan Koehler

      Hi Adrian,
      Just in addition to Martin’s reply.

      > But the problem is that if you have a network interruption just before the primary db/host/data center failure …

      Please do not forget the “NetTimeout”. From an application point of view there is no data loss, as the commit hangs for NetTimeout seconds (http://docs.oracle.com/cd/E11882_01/server.112/e40771/dbpropref.htm#DGBKR980) in case of a network interruption. So basically the transaction is not committed at all, as the session is still waiting for the commit to return when the primary db/host/data center goes down. Of course this scenario only works if the network interruption and the subsequent primary outage happen within NetTimeout seconds. The application has to catch the TNS error (after the primary outage) and know that the data was not committed yet.
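
      For reference, a sketch of checking and adjusting that property through the broker (30 seconds is the default; pri is the database name from the post above):

      DGMGRL> show database 'pri' 'NetTimeout';
      DGMGRL> edit database 'pri' set property NetTimeout=30;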

      In my experience there is another reason for running a MAXAVAILABILITY configuration (no matter which transfer mode is used), related to daily business operations: such a configuration gives you additional protection against human error. For example, you are not able to accidentally disable the standby database in MAXAVAILABILITY mode if you only have one.

      Regards
      Stefan

      1. Adrian Angelov

        Hi Stefan,

        Thanks for the response. I fully agree with you, and I really appreciate the existence of the MAXAVAILABILITY option and its usefulness.

        I just wanted to emphasize the case where one has only one standby in MAXAVAILABILITY and is affected by the following sequence of events:

        – a network issue, followed by
        – NetTimeout expiring, followed by
        – an automatic MAXAVAILABILITY to MAXPERFORMANCE reconfiguration on the primary db, followed by
        – a primary db/host/data center failure and the need to fail over to the DR site, followed by
        – activation of the standby as the new primary db,

        and the fact that in this case data loss will most probably occur.

        I wanted to know the opinion of Martin and the other readers of this blog, since this case is something that has bothered me for some time.

        Thank you

        Regards,
        Adrian Angelov

  3. Uwe Hesse

    The point is simply that with Maximum Availability, zero data loss is guaranteed if only one component fails.

    The components being Primary, Network, and Standby: no two of those may fail at the same time.
    The safeguard against data loss in case multiple components fail is having multiple standby databases.
    Only then is it suitable to increase to Maximum Protection, because we don’t want to lose primary functionality if the standby is down.

    You knew that already, but just to put it succinctly :-)

    Kind regards
    Uwe Hesse

  4. Pingback: Data Guard apply lag gap troubleshooting
