Martins Blog

Trying to explain complex things in simple terms

Get a feel for enterprise block level replication using drbd

Posted by Martin Bach on July 2, 2012

I didn’t really have a lot of exposure to block-level replication on the storage level before an engagement in the banking industry. I’m an Oracle DBA, and I always thought: why would I want to use anything but Oracle technology for replicating my data from one data centre to another? I need to be in control! I want to see what’s happening. Why would I prefer storage replication over Data Guard?

For a great many sites Data Guard is indeed all you need, especially if you don’t have a storage array with a replication option. But many large enterprises have historically used large storage area networks with many enterprise features, including block level replication from array to array. Each vendor markets it under its own name, and all major storage vendors have such an option. At the risk of speaking too generally, all of these block level replication features allow you to somehow copy data from array A in data centre A to array B in data centre B. The data centres are usually geographically dispersed so as to avoid the impact of catastrophes. The storage replication happens without any DBA intervention or even visibility, harking back to the 90s mantra of “the storage administrator does storage, the system administrator does the OS and the database administrator works on the database”. I have written about this in the context of Exadata before.

Taking the responsibility for replication away from the DBA can sound attractive to DBA managers: if it all goes wrong (and that includes human error more than technical problems, at least with the technology I was working with) then it’s not their fault. What this point of view misses, though, is that it is ultimately the DBA’s responsibility to get the system back, regardless of how long that takes. And I’d rather use OEM or another monitoring solution and proactively prevent the problem before it happens. Nothing is worse than going to DR and then finding out that a LUN hasn’t been replicated and the volume group cannot be mounted: time to restore a backup! But as Noons points out in the comment below, this comes down to how well you know and use your technology.

There are actually good arguments for the use of storage replication! As you will see below you can mirror an Oracle home and databases to remote hosts, but unlike with DRBD you do this on the storage array. The mapping of LUNs to a host is not as static in real life as shown here: DRBD really mirrors between hosts, not between arrays. Mirroring a Xen domain, for example, allows you to continuously keep a copy of it on a different host and, if some simple prerequisites are met, start it up without too many problems. That could include the whole stack!

Why this post?

The reason for this post was simple: I want to experiment with how storage replication works, especially when it comes to Oracle. Since I can’t buy myself a VMAX or HDS 9000 for use in the house, I have to improvise. And since I am seriously in love with Linux, there is always a project at hand that makes such improvising possible. The project this time is DRBD:

The Distributed Replicated Block Device (DRBD) is a software-based, shared-nothing, replicated storage solution mirroring the content of block devices (hard disks, partitions, logical volumes etc.) between hosts.

The architecture is nicely visualised on the project’s website.

As you can see there, it uses a kernel module to send data with destination “local disk” over the wire to a remote server, where it is written to disk as well. Sounds like some enterprise software to me, at least in principle. And again, I wouldn’t use this in production for an Oracle workload.

What I want

I would like to test a common scenario where the Oracle binaries as well as the databases of a Linux host are replicated to a host in the DR data centre. For this purpose, I have two virtual machines running on Xen 4.1.2 with kernel 3.1.10-1.9-xen on my OpenSuSE 12.1 lab server. The domUs are Oracle Linux 6.2 with kernel 2.6.32-300.3.1.el6uek.x86_64, named ol62drbd1 and ol62drbd2.

Both domUs have two NICs. The primary node, ol62drbd1, has the public address 192.168.99.72 and a storage network address of 192.168.100.72. The standby node, ol62drbd2, uses 192.168.99.73 and 192.168.100.73 respectively. As per my standard build I have a 20G “LUN” for an Oracle volume group. To be a little bit more practical, the volume group is split into 2x10G logical volumes: orabin_lv and oradata_lv. If not replicated, they are mounted to /u01/app and /u01/oradata respectively.
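For reference, this is a minimal sketch of how such a volume layout could be created on a domU. The underlying device name /dev/xvdb is an assumption; adjust it (and the sizes, if the VG is slightly smaller than 2x10G) to your own storage presentation:

# sketch only: create the volume group and the two logical volumes
# /dev/xvdb is an assumed device name for the 20G "LUN"
pvcreate /dev/xvdb
vgcreate oraclevg /dev/xvdb
lvcreate -L 10G -n orabinlv oraclevg
lvcreate -L 10G -n oradatalv oraclevg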

Get DRBD to play with Oracle Linux and UEK1

To start with: the DRBD documentation is excellent. Sadly there was no RPM for Oracle Linux with kernel UEK, so I had to build the software from source. (It seems that kernel UEK2 has experimental support for DRBD!) I say sadly, but of course it was not much of a problem. My system had the Oracle preinstall RPM installed already, which includes the main development tools as dependencies. The only additional RPMs I had to install were flex and kernel-uek-devel for the kernel headers.
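A sketch of adding those two build dependencies, assuming yum is pointed at the usual Oracle Linux channels:

# the preinstall RPM already pulled in gcc, make and friends;
# only the lexer and the headers for the running UEK kernel were missing
yum install flex kernel-uek-devel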

I also had to ensure that the ifcfg-eth1 scripts on both domUs had the ONBOOT flag set to yes. For some reason Oracle Linux allows you to configure devices but then doesn’t automatically enable them when the system boots.
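For illustration, a minimal sketch of /etc/sysconfig/network-scripts/ifcfg-eth1 on the primary node. The address is the storage network address mentioned above; the /24 netmask is an assumption:

DEVICE=eth1
BOOTPROTO=none
IPADDR=192.168.100.72
NETMASK=255.255.255.0
ONBOOT=yes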

Following the excellent documentation I downloaded the source for drbd-8.4.1 to /usr/src and unpacked it. The configure script has a lot of options, which are well documented. In my case I went for these options:

$ cd /usr/src/drbd-8.4.1
$ ./configure --localstatedir=/var --sysconfdir=/etc --with-km=yes --with-heartbeat=no --with-pacemaker=no

Although DRBD has been part of the mainline kernel since 2.6.33, that doesn’t mean it is available in Oracle Linux with kernel UEK: that’s why you have to build the kernel module using the “--with-km” option. I didn’t need support for pacemaker or heartbeat, so those were left out of scope.

Create the userland utilities with “make && make install”, and the kernel module with a

$ cd drbd && make clean all

That’s it: you have built drbd. Obviously you’d repeat this on the other node. I tried building the RPM but that failed since Oracle Linux is not a supported configuration in drbd…
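The freshly built module still has to end up where modprobe can find it. This is a sketch of one way to do that manually; the target path is an assumption, and the drbd/ sub-directory of the source tree should offer a make install target that achieves the same:

# sketch: copy the module into the tree of the running kernel and
# refresh the module dependency information
cd /usr/src/drbd-8.4.1/drbd
install -m 644 drbd.ko /lib/modules/$(uname -r)/kernel/drivers/block/
depmod -a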

Configuring the nodes

That’s actually quite simple, again thanks to the excellent documentation. What I learned was that you define just one resource! But I’m getting ahead of myself. The drbd configuration has to be identical on both nodes. The main configuration file is /etc/drbd.conf:

include "drbd.d/global_common.conf";
include "drbd.d/*.res";

As you can see it references files in /etc/drbd.d, as is customary in Linux. The global_common.conf file has literally been taken from the documentation:

global {
  usage-count yes;
}
common {
  net {
    protocol C;
  }
}

This very simple file in essence sets the replication mode to synchronous (protocol C). The actual resource we want to replicate is defined in r0.res:

resource r0 {
  volume 0 {
    device    /dev/drbd1;
    disk      /dev/oraclevg/orabinlv;
    meta-disk internal;
  }
  volume 1 {
    device    /dev/drbd2;
    disk      /dev/oraclevg/oradatalv;
    meta-disk internal;
  }
  on ol62drbd1.localdomain {
    address   192.168.100.72:7789;
  }
  on ol62drbd2.localdomain {
    address   192.168.100.73:7789;
  }
}

Also quite readable: I want to replicate two volumes, orabin_lv and oradata_lv, and map them to /dev/drbd1 and /dev/drbd2 respectively. Once the configuration is ready on the first node, scp it across to the second node.
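A sketch of that copy step, assuming root access from the first node to the second:

# keep the configuration identical on both nodes
scp -p /etc/drbd.conf ol62drbd2:/etc/
scp -p /etc/drbd.d/global_common.conf /etc/drbd.d/r0.res ol62drbd2:/etc/drbd.d/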

Initialise the resource for first time use

Do this on both nodes!

[root@ol62drbd2 drbd.d]# drbdadm create-md r0
md_offset 10737414144
al_offset 10737381376
bm_offset 10737053696
Found some data
 ==> This might destroy existing data! <==
Do you want to proceed?
[need to type 'yes' to confirm] yes
Writing meta data...
initializing activity log
NOT initializing bitmap
New drbd meta data block successfully created.
md_offset 10733219840
al_offset 10733187072
bm_offset 10732859392
Found some data
 ==> This might destroy existing data! <==
Do you want to proceed?
[need to type 'yes' to confirm] yes
Writing meta data...
initializing activity log
NOT initializing bitmap
New drbd meta data block successfully created.

r0 in this command refers to the resource definition in /etc/drbd.d/r0.res by the way. You really have to be sure that you don’t have data on your underlying block devices! Since I only just created my logical volumes that was a given for me.

You can then “up” your resources on both cluster sides:

[root@ol62drbd1 drbd.d]#  drbdadm up r0

I obviously forgot to load the kernel module first, but that’s easily changed:

[root@ol62drbd1 drbd.d]# drbdadm up r0
Could not stat("/proc/drbd"): No such file or directory
do you need to load the module?
try: modprobe drbd
Command 'drbdsetup new-resource r0' terminated with exit code 20
drbdadm: new-minor r0: skipped due to earlier error
drbdadm: new-minor r0: skipped due to earlier error
[root@ol62drbd1 drbd.d]# modprobe drbd
[root@ol62drbd1 drbd.d]# echo $?
0
[root@ol62drbd1 drbd.d]# lsmod | grep drbd
drbd                  245440  0
libcrc32c               1220  1 drbd

Using the /proc/drbd file you can check the status:

[root@ol62drbd1 drbd.d]# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@ol62drbd1.localdomain, 2012-06-29 12:47:26
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:10485404
 2: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:10481308
[root@ol62drbd1 drbd.d]#

The devices are reported as Inconsistent at this stage, but that’s expected. I now have to tell DRBD which side is the primary, by force. This command must not be executed again once the system is initialised, and it is run on one node only.

[root@ol62drbd1 drbd.d]# drbdadm primary --force r0
[root@ol62drbd1 drbd.d]# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@ol62drbd1.localdomain, 2012-06-29 12:47:26
 1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:244632 nr:0 dw:0 dr:250264 al:0 bm:14 lo:0 pe:4 ua:20 ap:0 ep:1 wo:f oos:10241692
        [....................] sync'ed:  2.4% (10000/10236)M
        finish: 0:06:59 speed: 24,368 (24,368) K/sec
 2: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:135936 nr:0 dw:0 dr:140696 al:0 bm:8 lo:0 pe:0 ua:16 ap:0 ep:1 wo:f oos:10345372
        [....................] sync'ed:  1.4% (10100/10232)M
        finish: 0:12:40 speed: 13,592 (13,592) K/sec
[root@ol62drbd1 drbd.d]#

All right, it’s syncing! Actually it’s copying empty tracks to empty tracks. There are ways to speed this up which I didn’t explore.
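One knob the DRBD 8.4 documentation offers for this is the resync-rate disk option. A sketch of how it could be added to global_common.conf follows; the 40M value is an assumption and should be tuned to the replication link:

common {
  disk {
    resync-rate 40M;   # assumption: allow resync to use roughly 40 MB/s
  }
  net {
    protocol C;
  }
}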

Eventually this process will finish, and the output is as shown:

[root@ol62drbd1 drbd.d]# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@ol62drbd1.localdomain, 2012-06-29 12:47:26
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:10485404 nr:0 dw:0 dr:10486088 al:0 bm:640 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:10481308 nr:0 dw:0 dr:10481992 al:0 bm:640 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
[root@ol62drbd1 drbd.d]#

Now we are up to date! Ready to create some data.

Create mount points

DRBD will create a few new block devices, /dev/drbd1 and /dev/drbd2 as defined in /etc/drbd.d/r0.res. It also creates symbolic links in /dev/drbd:

[root@ol62drbd1 drbd.d]# ls -lR /dev/drbd
/dev/drbd:
total 0
drwxr-xr-x. 3 root root 60 Jun 30 22:16 by-disk
drwxr-xr-x. 3 root root 60 Jun 30 22:16 by-res
/dev/drbd/by-disk:
total 0
drwxr-xr-x. 2 root root 80 Jun 30 22:16 oraclevg
/dev/drbd/by-disk/oraclevg:
total 0
lrwxrwxrwx. 1 root root 14 Jun 30 22:17 orabinlv -> ../../../drbd1
lrwxrwxrwx. 1 root root 14 Jun 30 22:17 oradatalv -> ../../../drbd2
/dev/drbd/by-res:
total 0
drwxr-xr-x. 2 root root 80 Jun 30 22:16 r0
/dev/drbd/by-res/r0:
total 0
lrwxrwxrwx. 1 root root 14 Jun 30 22:17 0 -> ../../../drbd1
lrwxrwxrwx. 1 root root 14 Jun 30 22:17 1 -> ../../../drbd2
[root@ol62drbd1 drbd.d]#

I initially wanted to create logical volumes on top of /dev/drbd1 and /dev/drbd2 but that was classified as “advanced” in the documentation so I simply didn’t.

Instead I created an ext3 file system on /dev/drbd1. I initially created /dev/drbd2 as XFS, but that caused crashes and kernel panics which I didn’t investigate further, so it became ext3 as well. The updated /etc/fstab has the following new lines:

/dev/drbd1              /u01/app                ext3    defaults        1 0
/dev/drbd2              /u01/oradata            ext3    defaults        1 0

It is very important to set the sixth field to a “0”, as shown here. Otherwise you’d bump into system maintenance mode when booting since the devices aren’t ready for fsck!
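To round the step off, a minimal sketch of the file system creation and first mount on the primary node. The file systems only need to be created once; DRBD replicates the writes to the peer:

# on the primary only: create the file systems on the DRBD devices
mkfs.ext3 /dev/drbd1
mkfs.ext3 /dev/drbd2
# create the mount points (needed on both nodes) and mount via the fstab entries
mkdir -p /u01/app /u01/oradata
mount /u01/app
mount /u01/oradata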

Add drbd service to boot

To ensure that the kernel module is loaded at boot you should register the /etc/init.d/drbd script as a service. Nowadays that’s easy!

# chkconfig --add drbd
# chkconfig drbd on

Note: this might not be the best way to do this on Oracle Linux 6 since it uses upstart, but I didn’t check whether chkconfig has been amended accordingly; possibly not.

Now when your system reboots it sadly hasn’t preserved the status of the block devices, i.e. you have two secondary devices. You then have to manually assess the situation and set one of them to primary. This is where pacemaker or heartbeat, or any other cluster manager, would step in.
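Without a cluster manager the manual sequence after a reboot could look like this sketch, run on whichever node is to become the primary:

# bring the resource up (if the init script hasn't done so already) and promote it
drbdadm up r0
drbdadm primary r0
# then mount the replicated file systems via their fstab entries
mount /u01/app
mount /u01/oradata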

The situation

The LUNs on my standby host are not read-writable, which again is quite similar to enterprise replication software:

[root@ol62drbd2 ~]# mount /dev/drbd1
mount: block device /dev/drbd1 is write-protected, mounting read-only
mount: Wrong medium type
[root@ol62drbd2 ~]#

I have since created a database and binaries on the primary node. The status is as follows:

[root@ol62drbd1 ~]# drbd-overview
  1:r0/0  Connected Primary/Secondary UpToDate/UpToDate C r----- /u01/app     ext3 9.9G 4.2G 5.2G 45%
  2:r0/1  Connected Primary/Secondary UpToDate/UpToDate C r----- /u01/oradata ext3 9.9G 1.8G 7.6G 20%

ol62drbd1 is the primary. Let’s switch over! First, gently; after all, it’s the first time I’m doing it.

[oracle@ol62drbd1 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.3.0 Production on Mon Jul 2 12:04:47 2012
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select host_name from v$instance;
HOST_NAME
----------------------------------------------------------------
ol62drbd1.localdomain
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL>
[oracle@ol62drbd1 ~]$ lsnrctl stop listener_drbd
LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 02-JUL-2012 12:05:56
Copyright (c) 1991, 2011, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=ol62drbd1.localdomain)(PORT=1523)))
The command completed successfully
[oracle@ol62drbd1 ~]$

Nothing running on my mount points:

[root@ol62drbd1 ~]# fuser -m /u01/app /u01/oradata

Prepare the switchover:

[root@ol62drbd1 ~]# umount /dev/drbd1
[root@ol62drbd1 ~]# umount /dev/drbd2

And switch! The listing below includes information from both hosts:

[root@ol62drbd1 ~]# drbdadm secondary r0
# on the now primary site
[root@ol62drbd2 ~]# drbdadm primary r0
[root@ol62drbd2 ~]# drbd-overview
  1:r0/0  Connected Primary/Secondary UpToDate/UpToDate C r-----
  2:r0/1  Connected Primary/Secondary UpToDate/UpToDate C r-----
[root@ol62drbd2 ~]# mount /u01/app
[root@ol62drbd2 ~]# mount /u01/oradata/

Resume operations

Now it’s time to start everything again. It turned out that I had to run /u01/app/oraInventory/orainstRoot.sh and /u01/app/oracle/product/11.2.0.3/root.sh on my standby host before starting anything. I also needed a new listener.ora file with the new hostname: all things that ought to be addressed in a proper setup. Once the listener was up I started the database. Cool!
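For illustration, a minimal sketch of what the adjusted listener.ora on the standby could look like. The listener name and port are taken from the output above; the exact file layout is an assumption:

LISTENER_DRBD =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = ol62drbd2.localdomain)(PORT = 1523))
    )
  )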

What to do next

This post is already too long, but there are things worth addressing:

  • The switchover isn’t seamless: the new host has a new IP address, requiring a change to DNS or the listener
  • The configuration is very basic and must be extended. With the current setup, for example, the drbd service will wait forever at boot until all nodes are started (see the sketch after this list)
  • Understand volume resizing
  • and many more
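One way to address the “waits forever” point would be the startup timeouts documented for DRBD; a sketch of how they could be added to global_common.conf, with the values being assumptions:

common {
  startup {
    wfc-timeout      120;   # seconds to wait for the peer at boot
    degr-wfc-timeout  60;   # shorter wait if the cluster was already degraded
  }
  net {
    protocol C;
  }
}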

Again, I would like to stress that such a configuration is not supported or suitable for most Oracle production databases. The above only serves as an example to make you understand what you can and cannot do with replication technologies.


4 Responses to “Get a feel for enterprise block level replication using drbd”

  1. Noons said

    Martin, modern storage-level replication applied to databases uses a concept called “consistency groups” for synchronous replication that ensures some if not most of the disaster scenarios you described will not happen without awareness of a problem being there.
    It is essential to learn the details of all that is possible nowadays in modern SAN h/w and s/w before claiming that it can or cannot be done or is “dangerous” or not.
    I do recall the horror stories of yore; let me assure you most of those scenarios are gone. It is a complex and wide field, believe me: I’ve been dabbling in it for the last 3 years and although I don’t profess to “know it all”, I do recognise it encompasses a lot more than I initially thought.
    And there is more than just synchronous replication possible. Asynchronous is a very viable and reliable method for things like s/w storage, archive logs, FRA, etc.
    Particularly if preceded by the caution of taking a snapshot on the remote site before effecting the replication, a la “before image”. And so on.
    A lot can be said about this subject, thanks for taking the step to talk about it.

  2. Kabbo said

    Hi Martin,

    As usual, I always read your very nice and up-to-date articles; bravo to you, and keep up the good job of informing your fellow Database Analysts like myself. I just noticed a very tiny “typo”, I believe. You might want to look at this:

    The domUs are Oracle Linux 6.2 with kernel 2.6.32-300.3.1.el6uek.x86_64, named ol62drbd1 and ol62drbd1. The second hostname should be ol62drbd2, as opposed to 1.

    thanks,
    Kabbo

  3. goran said

    Hi Martin,

    I absolutely agree with your statement “it is ultimately the DBA’s responsibility to get the system back” … and everybody expects that from us!

    Because of this, I am pretty resistant to using storage replication … my preferred option is Data Guard.

    I had a horror story a couple of years ago when our team was persuaded by the storage guys to use it for replicating a database from Frankfurt to Munich (Germany) … never again.

    Unless I have a very, very deep insight into the details of the underlying storage technology and have tested it thoroughly, I choose the safe way: Data Guard … in case I can decide which solution … if not, then good luck ;-)

    Further, from an organisational point of view it’s better to have the competence and responsibility for switch/failover in only one team rather than having it spread across different organisational units.

    regards,
    goran
