How important is a Disaster Recovery site for you?
Posted by Martin Bach on April 17, 2014
I regularly read threads on the oracle-l mailing list, and occasionally feel very tempted to reply to one. Just recently I saw one that I liked a lot. It is specifically about using an Oracle Database Appliance (ODA) as a Disaster Recovery (DR) solution for an Exadata system. The Exadata configuration was not specified, I assume it was a smaller (eighth rack/quarter rack) configuration.
There were lots of arguments pro and against that Exadata->ODA architecture, and that leads to a broader question: how important is DR for your organisation? This blog post is about my personal experience, and probably strongly influenced by where I live in work (Europe), yours might be different.
About the original discussion
A number of replies stated that using an ODA as a DR solution is technically possible and acceptable if you don’t really plan to use it (and you have emails where you pointed out the shortcomings of that architecture).
The use of non-Exadata as a DR solution for Exadata is flawed in many points, and here is why:
Exadata, especially since 188.8.131.52.0, will make very elegant choices when it comes to caching data in the Smart Flash Cache. You will see very decent response times. You will also notice that smart scans will benefit from intelligent caching. This works so well that I had to rewrite some of my demos.
Smart scans and Smart Flash Cache do not exist outside the Exadata platform. Outside of Exadata you get no smart IO either, and no direct access to data compressed with Hybrid Columnar Compression.
The counter argument was that you might not need these anyway because these features are not used in your Exadata production system. But even if you don’t use HCC and smart scans in Exadata (I might ask you though-why?) you will feel the lack of the Smart Flash Cache, your ~1ms response times for many single block IOs might drop to ~6-8 ms most likely.
Others said if you have HCC compressed data then it has to be decompressed first before use (time/CPU intensive) and you might not have the space. If you have a 10x compression on a table that 100G table becomes 1TB. In the context of the ODA you might not even be able to decompress as disk space is limited compared to Exadata (even without HCC)
The broader discussion
The original poster had a very specific question about using an ODA (or other non-Exadata system) as a DR solution for an Exadata primary.
Abstracting from the original question I immediately thought about using non-identical hardware for DR. A common case I found in my career follows this pattern: the production system can’t cope with the workload anymore, so you get new kit. But since getting new kit is expensive, the budget owners (not DBAs!) decide to reuse the old production servers for DR.
I have seen this numerous times and always made sure I have an email trail where I said quite openly that this is not a Good Idea ™ and was overruled. I am a very cautious person.
So what would happen if you had to invoke DR? I haven’t seen this happen often (management often doesn’t hesitates to take that decision), but let’s assume you invoke DR on your old production servers. And guess what: the new hardware was so powerful that far less care was taken to ensure that code performs well-after all, the new hardware can deal with it. Except that the old hardware couldn’t in the first place, and that was the reason it was phased out. Oooops. DR that is inoperable is not really a solution.
How important is DR for you
This all boils down to the question “how important is DR for you”? I gathered that some users feel DR is just a tick in the box, and it isn’t really ever considered to be invoked.
In my opinion (and I formed that many years ago) DR is the only way to ensure business continuity for all but the smallest databases. When working with large data sets in the TB range it becomes quite unmanageable to fully restore that data and still meet the Recovery Time Objective (RTO). Modern technology gives us the opportunity to have an RTO of nearly 0 and critical applications surely need to meet this. In which case you can rule out a restore straight away.
But even if you had an hour to restore, it might be difficult to meet that target, depending on your backup strategy. Disk backups are great there but if the disks with your backups are gone, then you need to fall back to another restore method. Thirty minutes later it can become clear that at the current rate the 1 hour SLA can’t be met.
So you need to invoke DR, and the DR solution-again in my opinion-should perform just like production.
I should once more point out that the opinions listed here are mine. During every customer engagement I advocated the use of Data Guard or equivalent replication technology. I don’t believe a database restore is a viable option due to the time constraints around database operations, and sleep better knowing I have a working DR solution for my important systems.