Deploying I/O intensive workloads in the cloud: mdadm (aka Software) RAID

The final part of my “avoiding pitfalls with Linux Logical Volume Manager” (LVM) series considers software RAID on Oracle Linux 8 as the basis for your LVM’s Physical Volume (PV). It’s still the very same VM.Standard.E4.Flex running Oracle 19.12.0 on top of Oracle Linux 8.4 with UEK6 (5.4.17-2102.203.6.el8uek.x86_64) I used for creating the earlier posts.

Previous articles in this series can be found here:

Storage Configuration

Rather than using LVM-RAID as in the previous article, the plan this time is to create a software RAID (pseudo-device) and use it as a Physical Volume. This is exactly what I have done before I learned about LVM RAID. Strictly speaking, it isn’t necessary to create a Volume Group on top of a RAID device as you can absolutely use such a device on its own. Having said that, growing a RAID 0 device doesn’t seem possible after my limited time studying the documentation. Speaking of which: you can read more about software RAID in Red Hat Linux 8 here.

In this post I’ll demonstrate how you could use a RAID 0 device for striping data across multiple disks. Please don’t implement the steps in this article unless software RAID is an approved solution in your organisation and you are aware of the implications. Kindly note this article does not concern itself with the durability of block devices in the cloud. In the cloud, you have a lot less control over the block devices you get, so make sure you have appropriate protection methods in place to guarantee your databases’ RTO and RPO. RAID 0 offers 0 protection from disk failure (it’s in the name ;), so as soon as you lose a disk from your software RAID, it’s game over.

Creating the RAID Device

The first step is to create the RAID device. For nostalgic reasons I named it /dev/md127, other sources name their devices /dev/md0. Not that it matters too much.

[opc@oracle-19c-fs ~]$ sudo mdadm --create /dev/md127 --level=0 \
> --raid-devices=2 /dev/oracleoci/oraclevdc1 /dev/oracleoci/oraclevdd1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md127 started.
[opc@oracle-19c-fs ~]$ 

As you can see from the output above mdadm created the device for me. If you wondered what the funny device names imply, have a look at an earlier post I wrote about device name persistence in OCI.

You can always use mdadm --detail to get all the interesting details from a RAID device:

[opc@oracle-19c-fs ~]$ sudo mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Fri Aug  6 14:15:12 2021
        Raid Level : raid0
        Array Size : 524019712 (499.74 GiB 536.60 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Fri Aug  6 14:15:12 2021
             State : clean 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 512K

Consistency Policy : none

              Name : oracle-19c-fs:127  (local to host oracle-19c-fs)
              UUID : 30dc8f99...
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
[opc@oracle-19c-fs ~]$  

This is looking good – both devices are available and no errors have occurred.

Creating oradata_vg

With the future PV available it’s time to create the Volume Group and the Logical Volumes (LV) for the database and Fast Recovery Area. I’m listing the steps here for later reference, although they are the same as in part 1 of this article.

[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 1) create the PV
[opc@oracle-19c-fs ~]$ sudo pvcreate /dev/md127
  Physical volume "/dev/md127" successfully created.

[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 2) create the VG
[opc@oracle-19c-fs ~]$ sudo vgcreate oradata_vg /dev/md127
  Volume group "oradata_vg" successfully created

[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 3) create the first LV
[opc@oracle-19c-fs ~]$ sudo lvcreate --extents 80%FREE --name oradata_lv oradata_vg 
  Logical Volume "oradata_lv" created

[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 4) create the second LV
[opc@oracle-19c-fs ~]$ sudo lvcreate --extents 100%FREE --name orareco_lv oradata_vg 
  Logical volume "orareco_lv" created.

The end result are 2 LVs in oradata_vg:

[opc@oracle-19c-fs ~]$ sudo lvs oradata_vg
  LV         VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  oradata_lv oradata_vg -wi-a----- 399.79g                                                    
  orareco_lv oradata_vg -wi-a----- <99.95g   

That’s it! The LVs require file systems before they can be mounted (not shown here).

Trying it out

After the final touches have been applied I restored the database and started the familiar Swingbench workload to see which disks are in use. Right before I did that I ensured I’m not multiplexing control files/online redo logs in the FRA for test purposes only. NOT multiplexing control files/online redo log members is probably a Bad Idea for serious Oracle deployments but ok for this scenario.

I am expecting to see both block devices making up /dev/md127 used. And sure enough, they are:

[opc@oracle-19c-fs ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs)   13/08/21        _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.23    0.01    0.35    0.57    0.01   98.83

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ...  %util
sda              2.99    0.96      0.08      0.04     0.03     0.26  ...   0.21
dm-0             2.78    0.62      0.07      0.03     0.00     0.00  ...   0.20
dm-1             0.06    0.58      0.00      0.01     0.00     0.00  ...   0.02
sdb              1.28    0.22      0.06      0.00     0.00     0.02  ...   0.13
dm-2             1.26    0.24      0.06      0.00     0.00     0.00  ...   0.13
sdc            753.52   26.38      8.37      5.64    30.91     0.29  ...   7.36
md127         1573.79   53.30     17.44     12.01     0.00     0.00  ...   0.00
sdd            758.09   26.57      8.42      5.64    31.29     0.05  ...   9.34
sde             20.53    0.00      5.11      0.00     0.00     0.00  ...   1.79
dm-3            20.51    0.00      5.11      0.00     0.00     0.00  ...   1.79
dm-4          1558.54   28.25     12.20      5.97     0.00     0.00  ...   6.56
dm-5             4.69    2.61      4.58      5.26     0.00     0.00  ...   4.15

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.08    0.00    5.32    9.48    0.13   80.99

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              0.00    3.40      0.00      0.03     0.00     0.60  ...  0.08
dm-0             0.00    2.60      0.00      0.02     0.00     0.00  ...  0.08
dm-1             0.00    1.40      0.00      0.01     0.00     0.00  ...  0.04
sdc           16865.80  284.60    140.04      2.39  1059.60     0.20 ...  92.60
md127         36008.00  564.20    281.33      4.76     0.00     0.00 ...   0.00
sdd           16978.80  279.40    141.11      2.34  1081.40     0.00 ...  99.96
dm-4          36007.80  563.00    281.33      4.73     0.00     0.00 ... 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.07    0.00    5.51   10.52    0.16   79.74

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sdb              0.00    0.80      0.00      0.01     0.00     0.20  ...  0.04
dm-2             0.00    1.00      0.00      0.01     0.00     0.00  ...  0.04
sdc           17709.80  317.80    142.87      2.51   577.40     0.00 ...  93.90
md127         36657.80  661.60    286.41      5.31     0.00     0.00 ...   0.00
sdd           17790.00  343.40    143.69      2.77   599.00     0.00 ...  99.94
dm-4          36657.80  660.20    286.41      5.28     0.00     0.00 ... 100.00

[opc@oracle-19c-fs ~]$ 

No surprises here! Except maybe that /dev/md127 was somewhat underutilised ;) I guess that’s an instrumentation bug/feature. /dev/dm-4 – showing 100% utilisation – belongs to oradata_lv:

[opc@oracle-19c-fs ~]$ ls -l /dev/mapper | egrep dm-4
lrwxrwxrwx. 1 root root       7 Aug 13 09:37 oradata_vg-oradata_lv -> ../dm-4

Extending oradata_vg

Just as with each previous example I’d like to see what happens when I run out of space and have to extend oradata_vg. For this to happen I need a couple more block devices. These have to match the existing ones in size and performance characteristics for the best result. No difference to LVM-RAID I covered in the earlier article.

I created /dev/md128 in the same way as I did for the original RAID device and created a Physical Volume from it. oradata_vg looked like this prior to its extension:

[opc@oracle-19c-fs ~]$ sudo vgs oradata_vg
  VG         #PV #LV #SN Attr   VSize   VFree
  oradata_vg   1   2   0 wz--n- 499.74g    0 

In the next step I extended the Volume Group but only after I ensured I have a proven, working backup of everything. Don’t ever make changes to the storage layer without a backup and a known, tested, proven way to recover from unforeseen issues!

[opc@oracle-19c-fs ~]$ sudo vgextend oradata_vg /dev/md128
  Volume group "oradata_vg" successfully extended
[opc@oracle-19c-fs ~]$ sudo vgs oradata_vg
  VG         #PV #LV #SN Attr   VSize   VFree  
  oradata_vg   2   2   0 wz--n- 999.48g 499.74g

The VG now shows 2 PVs and plenty of free space. So let’s add 80% of the free space to oradata_lv.

[opc@oracle-19c-fs ~]$ sudo lvresize --extents +80%FREE --resizefs /dev/mapper/oradata_vg-oradata_lv
  Size of logical volume oradata_vg/oradata_lv changed from 399.79 GiB (102347 extents) to <799.59 GiB (204695 extents).
  Logical volume oradata_vg/oradata_lv successfully resized.
meta-data=/dev/mapper/oradata_vg-oradata_lv isize=512    agcount=16, agsize=6550144 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=104802304, imaxpct=25
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=51173, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 104802304 to 209607680

The LV changes from its original size …

[opc@oracle-19c-fs ~]$ sudo lvs /dev/mapper/oradata_vg-oradata_lv
  LV         VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert                                                         
  oradata_lv oradata_vg -wi-ao---- 399.79g

to its new size:

[opc@oracle-19c-fs ~]$ sudo lvs /dev/mapper/oradata_vg-oradata_lv
  LV         VG         Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  oradata_lv oradata_vg -wi-ao---- <799.59g                                                    

The same applies to the file system as well:

[opc@oracle-19c-fs ~]$ df -h /u01/oradata
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/oradata_vg-oradata_lv  800G   38G  762G   5% /u01/oradata

Does that change performance?

Based on my experience with LVM-RAID I did not expect a change in performance as my database wasn’t yet at a stage where it required the extra space yet. My assumption was confirmed by iostat:

[opc@oracle-19c-fs ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs)   13/08/21        _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.98    0.01    1.44    2.35    0.03   95.18

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              2.32    0.99      0.06      0.03     0.02     0.27  ...  0.17
dm-0             2.16    0.61      0.06      0.03     0.00     0.00  ...  0.16
dm-1             0.05    0.62      0.00      0.01     0.00     0.00  ...  0.02
sdb              0.99    0.20      0.05      0.00     0.00     0.02  ...  0.11
dm-2             0.98    0.22      0.04      0.00     0.00     0.00  ...  0.11
sdc           4538.44   73.12     38.69      4.78   190.85     0.23  ... 26.27
md127         9485.50  147.14     78.09     10.13     0.00     0.00  ...  0.00
sdd           4562.89   73.73     38.90      4.79   193.25     0.04  ... 29.88
sde             15.87    0.00      3.95      0.00     0.00     0.00  ...  1.39
dm-3            15.86    0.00      3.95      0.00     0.00     0.00  ...  1.39
dm-4          9473.71  127.63     74.04      5.46     0.00     0.00  ... 27.74
dm-5             3.63    2.02      3.54      4.07     0.00     0.00  ...  3.21
sdf              0.07    0.00      0.00      0.00     0.00     0.01  ...  0.01
sdg              0.08    0.00      0.00      0.00     0.00     0.01  ...  0.00
md128            0.06    0.02      0.00      0.00     0.00     0.00  ...  0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.96    0.00    5.44    8.52    0.08   82.00

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sdc           17652.60  306.80    141.15      2.52   414.40     0.00 ...  88.78
md127         36265.40  608.00    283.35      5.01     0.00     0.00 ...   0.00
sdd           17783.60  301.20    142.17      2.43   411.60     0.00 ... 100.00
dm-4          36267.40  607.00    283.37      4.95     0.00     0.00 ... 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.20    0.00    5.45    8.82    0.14   81.38

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              0.00    1.20      0.00      0.01     0.00     0.00  ...  0.04
dm-0             0.00    1.00      0.00      0.01     0.00     0.00  ...  0.04
dm-1             0.00    0.20      0.00      0.00     0.00     0.00  ...  0.02
sdc           18145.40  332.20    143.99      2.55   284.40     0.00 ...  92.22
md127         36865.20  650.20    288.04      5.00     0.00     0.00 ...   0.00
sdd           18161.20  318.00    144.14      2.45   285.20     0.00 ...  99.98
dm-4          36863.20  649.00    288.02      4.99     0.00     0.00 ...  99.98

[opc@oracle-19c-fs ~]$ 

As long as there aren’t any database files in the “extended” part of the LV, there won’t be a change in performance. As soon as your database spills over to the “new” disks, you should see a benefit from the newly added /dev/dm128.

Summary

Just as LVM-RAID does, using software RAID allows you to benefit from striping data across multiple devices. The iostat output is quite clear about the benefit, just look at the figures for /dev/sdc, /dev/sdd and how they accumulate in /dev/md127.

Using software RAID doesn’t come without a risk, it’s entirely possible to lose a block device and thus the RAID device. It’s imperative you protect against this scenario in a way that matches your database’s RTO and RPO.

My main problem with the solution as detailed in this post is the lack of a re-balance feature you get with Oracle’s Automatic Storage Management (ASM). It’s still possible to have I/O hotspots after a storage space expansion.

Advertisement