Deploying I/O intensive workloads in the cloud: LVM RAID

I recently blogged about a potential pitfall when deploying the Oracle database on LVM (Logical Volume Manager) with its default allocation policy. I promised a few more posts detailing how to potentially mitigate the effect of linear allocation in LVM. This post was written using the same setup as the previous article: an Oracle 19.12.0 database deployed to Oracle Linux 8.4 with UEK6 on a VM.Standard.E4.Flex cloud system.

If you found this article via a search engine, note that there are a few more posts covering this topic on this blog.

LVM RAID

In this post I’ll demonstrate how you could use LVM RAID level 0. Please don’t implement the steps in this article unless software (LVM-)RAID is an approved solution in your organisation and you are aware of the implications. Please note this article does not concern itself with the durability of block devices in the cloud. In the cloud, you have a lot less control over the block devices you get, so make sure you have appropriate protection methods in place to guarantee your databases’ RTO and RPO.

I found a hint in the SUSE Linux Enterprise Server 15 documentation recommending the use of software RAID over LVM RAID. I'll leave that here as I don't have sufficient information to confirm or refute the statement. I didn't find a comparable warning in the Red Hat 8 documentation.

Implementing LVM RAID 0

The basics of LVM RAID levels are described in lvmraid(7):

lvm(8) RAID is a way to create a Logical Volume (LV) that uses multiple physical devices to improve performance or tolerate device failures. In LVM, the physical devices are Physical Volumes (PVs) in a single Volume Group (VG).

man 7 lvmraid

This is interesting; I hadn't really been aware of this (not exactly new) development. Previously I would create a software RAID pseudo-device first and use it as a Physical Volume in my LVM configuration. So instead of using a block device's partition as a PV, I used the device created by mdadm (/dev/md0 for example). Let's try the new way!
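Before doing so, here's a quick sketch of the classic mdadm-based approach for comparison (the device names, chunk size and VG layout are assumptions for illustration, not my previous configuration):

# classic approach: create a software RAID 0 array first ...
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=1024 /dev/sdc1 /dev/sde1
# ... then use the resulting md device as the (single) Physical Volume
sudo pvcreate /dev/md0
sudo vgcreate oradata_vg /dev/md0
sudo lvcreate --extents 100%FREE --name oradata_lv oradata_vg

With LVM RAID the extra mdadm layer isn't required any more.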

There were no changes required to oradata_vg on my Oracle Linux 8.4 system. The Logical Volume, however, was created differently. After struggling with the exact syntax for a bit I ended up with this command:

[opc@oracle-19c-fs ~]$ sudo lvcreate --type raid0 --extents 511998 --name oradata_lv \
> --stripesize 1m oradata_vg

Note that RAID 0 offers exactly 0 protection against disk failure. You need to ensure you have other means in place to guarantee your database's RTO and RPO! It took me a little while to get the syntax for LVM RAID 0 right. The optional parameter --stripesize "specifies the Size of each stripe in kilobytes. This is the amount of data that is written to one device before moving to the next." I'm unsure if 1 MB is the right value; I probably need to experiment with this a little more.
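If you want to verify what lvcreate actually produced, the standard segment reporting fields should show the stripe count and stripe size (a quick sanity check rather than part of the original setup):

# display the number of stripes and the stripe size of the new LV
sudo lvs -o +stripes,stripe_size oradata_vg/oradata_lv
# the segment view presents the same information per segment
sudo lvs --segments oradata_vg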

In the next step I created the XFS file system on top of the oradata_lv and mounted the new file system in /u01/oradata for use with the database.
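I haven't shown these commands; a minimal sketch, assuming the device-mapper path of the new LV and that oracle:oinstall should own the mount point, looks like this:

# create the XFS file system on the striped LV
sudo mkfs.xfs /dev/mapper/oradata_vg-oradata_lv
# mount it where the database expects its data files
sudo mkdir -p /u01/oradata
sudo mount /dev/mapper/oradata_vg-oradata_lv /u01/oradata
# ownership of oracle:oinstall is an assumption; adjust for your environment
sudo chown oracle:oinstall /u01/oradata

Don't forget a matching /etc/fstab entry so the file system is mounted again after a reboot.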

The output of my lvs command changed quite a bit from what it was before:

[opc@oracle-19c-fs ~]$ sudo lvs --all --options name,copy_percent,devices,attr oradata_vg
  LV                    Cpy%Sync Devices                                       Attr      
  oradata_lv                     oradata_lv_rimage_0(0),oradata_lv_rimage_1(0) rwi-aor---
  [oradata_lv_rimage_0]          /dev/sdc1(0)                                  iwi-aor---
  [oradata_lv_rimage_1]          /dev/sde1(0)                                  iwi-aor---
[opc@oracle-19c-fs ~]$

The above output is specific to LVM RAID 0; higher RAID levels feature *_rmeta sub-volumes in addition to the *_rimage sub-volumes shown above. Since I'm not planning on converting from RAID 0 to a higher RAID level I don't need to concern myself with metadata sub-volumes in this configuration. See lvmraid(7) for a more thorough description of LVM Sub-Volumes.

Since RAID 0 doesn’t offer any protection from disk failure it doesn’t have to wait for any synchronisation to be completed before making the volume available.
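Should you opt for a redundant RAID level such as raid1 or raid5 instead, you could watch the initial synchronisation using the standard reporting fields, along these lines (a sketch; not applicable to RAID 0):

# show synchronisation progress and the current RAID sync action
sudo lvs --all --options name,sync_percent,raid_sync_action oradata_vg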

Disk Performance with LVM RAID 0

After I finished the restore of my database to the newly created LVM RAID 0 mount point I ran the same Swingbench workload as before, still using the ridiculously small SGA to force physical I/O. As in the previous article, the aim wasn't to see what the configuration is capable of; I wanted to find out more about disk utilisation.

This time iostat showed multiple busy devices:

[opc@oracle-19c-fs ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs)   06/08/21        _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.74    0.00    1.14    5.43    0.01   90.68

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ...  %util
sda              0.27    0.85      0.01      0.02     0.00     0.49  ...   0.05
dm-0             0.27    0.77      0.01      0.01     0.00     0.00  ...   0.04
dm-1             0.00    0.57      0.00      0.01     0.00     0.00  ...   0.01
sdb              0.11    0.11      0.00      0.00     0.00     0.02  ...   0.02
sdc            993.21   14.94     14.18      0.28     0.00     0.02  ...  15.28
dm-2             0.11    0.13      0.00      0.00     0.00     0.00  ...   0.02
sdd              0.25    4.95      0.24      0.35     0.00     0.01  ...   1.63
dm-3             0.25    4.95      0.24      0.35     0.00     0.00  ...   1.63
sde           1013.79  424.90     15.25      3.79     0.00     0.04  ...  25.97
dm-4           991.89   14.54     14.16      0.27     0.00     0.00  ...  15.12
dm-5           992.43   14.65     14.19      0.27     0.00     0.00  ...  15.13
dm-6          1984.31   29.19     28.35      0.54     0.00     0.00  ...  15.25

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.30    0.00    2.93   29.68    0.03   66.06

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sdc           7210.60  119.80     56.23      0.90     0.00     0.00  ... 99.60
sdd              0.00   24.60      0.00      0.10     0.00     0.00  ...  7.60
dm-3             0.00   24.60      0.00      0.10     0.00     0.00  ...  7.60
sde           7204.80  102.60     56.20      0.82     0.00     0.00  ... 99.74
dm-4          7209.20  119.60     56.22      0.90     0.00     0.00  ... 99.60
dm-5          7205.40  102.60     56.21      0.82     0.00     0.00  ... 99.76
dm-6          14414.60  222.20    112.43      1.72     0.00     0.00 ... 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.11    0.00    2.92   27.86    0.01   67.10

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sdc           6771.60  103.60     52.81      0.62     0.00     0.00  ... 99.82
sdd              0.00   61.80      0.00      0.22     0.00     0.00  ... 18.02
dm-3             0.00   62.00      0.00      0.22     0.00     0.00  ... 18.02
sde           6806.20   45.80     53.09      0.49     0.00     0.00  ... 99.94
dm-4          6771.40  103.60     52.80      0.62     0.00     0.00  ... 99.82
dm-5          6806.00   45.80     53.09      0.49     0.00     0.00  ... 99.94
dm-6          13577.40  149.40    105.89      1.10     0.00     0.00 ... 100.00

In the above output, /dev/sdc1 and /dev/sde1 are part of oradata_vg, hosting the database. I still didn't multiplex control files and online redo logs, to ensure all I/O is reported against oradata_vg. At the risk of repeating myself: not multiplexing control file/online redo log members might not be a good idea for serious Oracle deployments.

But what about /dev/dm-{4,5,6}? Why are there suddenly so many Device-Mapper devices in the above iostat output?

[opc@oracle-19c-fs ~]$ ls -l /dev/mapper | grep dm-[4-6]
lrwxrwxrwx. 1 root root       7 Aug  6 08:17 oradata_vg-oradata_lv -> ../dm-6
lrwxrwxrwx. 1 root root       7 Aug  6 08:15 oradata_vg-oradata_lv_rimage_0 -> ../dm-4
lrwxrwxrwx. 1 root root       7 Aug  6 08:15 oradata_vg-oradata_lv_rimage_1 -> ../dm-5
[opc@oracle-19c-fs ~]$ 

These match the previous output of the lvs command: Device-Mapper devices 4, 5 and 6 all belong to oradata_vg. From the iostat output it should be apparent that more than one block device is used by the database; striping seems to be working fine.
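If you prefer a tree view of the stacking, lsblk shows the rimage sub-volumes sandwiched between the partitions and the top-level Logical Volume (output not shown here):

# visualise how partitions, rimage sub-volumes and the top-level LV relate
lsblk /dev/sdc /dev/sde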

What happens to performance when you extend the VG?

Assuming you run out of storage on your Volume Group, what next? With linear allocation it's a no-brainer: ensure the presence of a backup, then add another Physical Volume to the Volume Group and resize the Logical Volume and file system; the extra capacity is available immediately.
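A sketch of that procedure, assuming /dev/sdf1 is the freshly attached and partitioned block device, might look like this:

# add the new device to the Volume Group ...
sudo pvcreate /dev/sdf1
sudo vgextend oradata_vg /dev/sdf1
# ... and grow the Logical Volume plus the file system in one step
sudo lvextend --extents +100%FREE --resizefs /dev/mapper/oradata_vg-oradata_lv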

With LVM RAID 0 the story is a little different. According to the Red Hat 8 documentation it is possible to run lvresize on a striped LV provided you add enough Physical Volumes to the Volume Group to accommodate the same number of stripes as originally present. On my system I originally used 2 block devices = 2 stripes in oradata_vg. Adding two more of the same size and performance characteristics allowed me to resize the Logical Volume, after ensuring I had a proven and tested backup of all data depending on oradata_vg:

[opc@oracle-19c-fs ~]$ sudo lvresize --extents +461996 --resizefs /dev/mapper/oradata_vg-oradata_lv
  Using stripesize of last segment 1.00 MiB                                 
  Size of logical volume oradata_vg/oradata_lv changed from 2.14 TiB (561998 extents) to <3.91 TiB (1023994 extents)
  Logical volume oradata_vg/oradata_lv successfully resized.
meta-data=/dev/mapper/oradata_vg-oradata_lv isize=512    agcount=33, agsize=16382976 blks          
         =                       sectsz=4096  attr=2, projid32bit=1                                   
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1                                 
data     =                       bsize=4096   blocks=524285952, imaxpct=5                          
         =                       sunit=1024   swidth=2048 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1                              
log      =internal log           bsize=4096   blocks=255999, version=2                             
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 524285952 to 575485952
[opc@oracle-19c-fs ~]$
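To confirm where the new extents were allocated, you could look at the device layout again, for example:

# list the devices and segment sizes backing oradata_lv after the resize
sudo lvs --all --options name,devices,seg_size oradata_vg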

It really has to be the same number of additional PVs; otherwise you get the following error:

[opc@oracle-19c-fs ~]$ sudo vgdisplay oradata_vg | grep Free               
  Free  PE / Size       255999 / <1000.00 GiB

[opc@oracle-19c-fs ~]$ sudo lvresize --extents +255998 --resizefs /dev/mapper/oradata_vg-oradata_lv
  Using stripesize of last segment 1.00 MiB
  Insufficient suitable allocatable extents for logical volume oradata_lv: 255998 more required
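In other words, two Physical Volumes matching the existing stripes have to be present before the Logical Volume can grow. A sketch of adding them, assuming /dev/sdf1 and /dev/sdg1 are the two new, identically sized partitions:

# add two matching devices to satisfy the two-stripe layout
sudo pvcreate /dev/sdf1 /dev/sdg1
sudo vgextend oradata_vg /dev/sdf1 /dev/sdg1

With both PVs in place the lvresize command shown earlier completes successfully.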

Even though I was able to add the additional space (see above), it doesn't appear to make a difference to performance:

[opc@oracle-19c-fs ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs)   06/08/21        _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.72    0.00    1.19    6.03    0.01   90.05

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              0.27    0.85      0.01      0.02     0.00     0.48  ...  0.05
dm-0             0.26    0.76      0.01      0.01     0.00     0.00  ...  0.04
dm-1             0.00    0.57      0.00      0.01     0.00     0.00  ...  0.01
sdb              0.11    0.11      0.00      0.00     0.00     0.02  ...  0.02
sdc           1137.78   17.06     15.29      0.29     0.00     0.02  ... 17.33
dm-2             0.11    0.13      0.00      0.00     0.00     0.00  ...  0.02
sdd              0.24    5.62      0.24      0.34     0.00     0.01  ...  1.84
dm-3             0.24    5.63      0.24      0.34     0.00     0.00  ...  1.84
sde           1157.81  417.01     16.34      3.72     0.00     0.04  ... 27.76
dm-4          1136.49   16.67     15.27      0.28     0.00     0.00  ... 17.18
dm-5          1136.97   16.76     15.31      0.29     0.00     0.00  ... 17.19
dm-6          2273.46   33.42     30.58      0.57     0.00     0.00  ... 17.31
sdf              0.00    0.00      0.00      0.00     0.00     0.00  ...  0.00
sdg              0.00    0.00      0.00      0.00     0.00     0.00  ...  0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.11    0.00    3.18   31.19    0.01   64.51

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              0.00    2.40      0.00      0.02     0.00     0.40  ...  0.04
dm-0             0.00    2.60      0.00      0.01     0.00     0.00  ...  0.04
dm-1             0.00    0.20      0.00      0.00     0.00     0.00  ...  0.02
sdc           7545.40   32.40     58.83      0.28     0.00     0.00  ... 99.92
sdd              0.00   14.40      0.00      0.06     0.00     0.00  ...  4.16
dm-3             0.00   14.40      0.00      0.06     0.00     0.00  ...  4.16
sde           7519.80   52.60     58.65      0.47     0.00     0.00  ... 99.76
dm-4          7545.20   32.40     58.83      0.28     0.00     0.00  ... 99.90
dm-5          7519.80   52.60     58.65      0.47     0.00     0.00  ... 99.76
dm-6          15065.00   85.00    117.48      0.75     0.00     0.00 ... 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.62    0.00    3.06   30.02    0.01   65.29

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sdb              0.00    0.60      0.00      0.00     0.00     0.00  ...  0.02
sdc           7192.20  124.00     56.07      0.82     0.00     0.00  ... 99.78
dm-2             0.00    0.60      0.00      0.00     0.00     0.00  ...  0.02
sdd              0.00   46.60      0.00      0.17     0.00     0.00  ... 13.50
dm-3             0.00   46.60      0.00      0.17     0.00     0.00  ... 13.46
sde           7184.40   79.60     56.03      0.70     0.00     0.00  ... 99.78
dm-4          7193.60  124.00     56.08      0.82     0.00     0.00  ... 99.78
dm-5          7183.60   79.60     56.03      0.70     0.00     0.00  ... 99.78
dm-6          14377.20  203.60    112.11      1.51     0.00     0.00 ... 100.00

[opc@oracle-19c-fs ~]$ 

As you can see, only those disks that were originally part of the Volume Group see any I/O; the newly added devices (/dev/sdf and /dev/sdg) remain idle. Unlike with Oracle's Automatic Storage Management there is no automatic rebalancing of data.

Summary

LVM RAID 0 is an exciting feature, offering striping in LVM in a different way than was previously possible. Compared to the linear allocation model demonstrated in the previous article it allows proper striping across the disks in the Logical Volume. It should be noted though that RAID 0 (striping) does not offer any data protection: failure of a single device in the RAID means all data is lost, immediately. Alternatives need to be in place to ensure your database's RTO and RPO can be met.

Extending the capacity of an LVM RAID 0 Logical Volume is possible, provided you add the same number of devices as there are stripes (with the same size and performance characteristics) to the VG before executing the lvresize command.

The final article in this series cuts LVM out of the equation and focuses purely on Software RAID 0 and how it can be used in Oracle Linux 8.x and before.
