The final part of my “avoiding pitfalls with Linux Logical Volume Manager” (LVM) series considers software RAID on Oracle Linux 8 as the basis for your LVM’s Physical Volume (PV). It’s still the very same VM.Standard.E4.Flex running Oracle 19.12.0 on top of Oracle Linux 8.4 with UEK6 (5.4.17-2102.203.6.el8uek.x86_64) I used for creating the earlier posts.
Previous articles in this series can be found here:
- Deploying I/O intensive workloads in the cloud: don’t fall for the LVM trap
- Deploying I/O intensive workloads in the cloud: LVM RAID
- Deploying I/O intensive workloads in the cloud: mdadm (aka Software) RAID
Storage Configuration
Rather than using LVM-RAID as in the previous article, the plan this time is to create a software RAID (pseudo-device) and use it as a Physical Volume. This is exactly what I did before I learned about LVM RAID. Strictly speaking, it isn’t necessary to create a Volume Group on top of a RAID device as you can absolutely use such a device on its own. Having said that, growing a RAID 0 device doesn’t seem possible, at least based on my limited time studying the documentation. Speaking of which: you can read more about software RAID in Red Hat Enterprise Linux 8 here.
In this post I’ll demonstrate how you could use a RAID 0 device for striping data across multiple disks. Please don’t implement the steps in this article unless software RAID is an approved solution in your organisation and you are aware of the implications. Kindly note this article does not concern itself with the durability of block devices in the cloud. In the cloud, you have a lot less control over the block devices you get, so make sure you have appropriate protection methods in place to guarantee your databases’ RTO and RPO. RAID 0 offers 0 protection from disk failure (it’s in the name ;), so as soon as you lose a disk from your software RAID, it’s game over.
Creating the RAID Device
The first step is to create the RAID device. For nostalgic reasons I named it /dev/md127; other sources name their devices /dev/md0. Not that it matters too much.
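The devices passed to mdadm below are partitions rather than whole disks. In case you are following along and haven’t partitioned your block volumes yet, a minimal sketch could look like the following (an assumption, not what was necessarily run here: GPT labels and a single partition spanning each disk, using the whole-disk names from the OCI consistent device naming scheme):

# one GPT partition covering each of the two block volumes
sudo parted -s /dev/oracleoci/oraclevdc mklabel gpt mkpart primary 1MiB 100%
sudo parted -s /dev/oracleoci/oraclevdd mklabel gpt mkpart primary 1MiB 100%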
[opc@oracle-19c-fs ~]$ sudo mdadm --create /dev/md127 --level=0 \
> --raid-devices=2 /dev/oracleoci/oraclevdc1 /dev/oracleoci/oraclevdd1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md127 started.
[opc@oracle-19c-fs ~]$
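With version 1.2 metadata the array configuration lives in a superblock on the member devices, so the array is normally auto-assembled at boot. If you prefer an explicit entry in the configuration file as well, a common approach is the following sketch (double-check against your distribution’s documentation before relying on it):

# record the array definition in mdadm.conf
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf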
As you can see from the output above, mdadm created the device for me. If you wondered what the funny device names imply, have a look at an earlier post I wrote about device name persistence in OCI.
You can always use mdadm --detail to get all the interesting details from a RAID device:
[opc@oracle-19c-fs ~]$ sudo mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Fri Aug  6 14:15:12 2021
        Raid Level : raid0
        Array Size : 524019712 (499.74 GiB 536.60 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Fri Aug  6 14:15:12 2021
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 512K

Consistency Policy : none

              Name : oracle-19c-fs:127  (local to host oracle-19c-fs)
              UUID : 30dc8f99...
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
[opc@oracle-19c-fs ~]$
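Another quick way to check the state of all md devices on the system is /proc/mdstat, which lists the active arrays along with their member devices and chunk size:

# summary view of all software RAID devices
cat /proc/mdstat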
This is looking good – both devices are available and no errors have occurred.
Creating oradata_vg
With the future PV available it’s time to create the Volume Group and the Logical Volumes (LV) for the database and Fast Recovery Area. I’m listing the steps here for later reference, although they are the same as in part 1 of this article.
[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 1) create the PV
[opc@oracle-19c-fs ~]$ sudo pvcreate /dev/md127
  Physical volume "/dev/md127" successfully created.
[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 2) create the VG
[opc@oracle-19c-fs ~]$ sudo vgcreate oradata_vg /dev/md127
  Volume group "oradata_vg" successfully created
[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 3) create the first LV
[opc@oracle-19c-fs ~]$ sudo lvcreate --extents 80%FREE --name oradata_lv oradata_vg
  Logical Volume "oradata_lv" created
[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 4) create the second LV
[opc@oracle-19c-fs ~]$ sudo lvcreate --extents 100%FREE --name orareco_lv oradata_vg
  Logical volume "orareco_lv" created.
The end result is two LVs in oradata_vg:
[opc@oracle-19c-fs ~]$ sudo lvs oradata_vg
  LV         VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  oradata_lv oradata_vg -wi-a----- 399.79g
  orareco_lv oradata_vg -wi-a----- <99.95g
That’s it! The LVs require file systems before they can be mounted (not shown here).
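For completeness, a rough sketch of what that step could look like (not what was actually run: XFS and the /u01/oradata mount point are taken from later in this post, while /u01/fra for the Fast Recovery Area is purely a made-up example):

# create file systems on both LVs
sudo mkfs.xfs /dev/mapper/oradata_vg-oradata_lv
sudo mkfs.xfs /dev/mapper/oradata_vg-orareco_lv

# create mount points and mount the new file systems
sudo mkdir -p /u01/oradata /u01/fra
sudo mount /dev/mapper/oradata_vg-oradata_lv /u01/oradata
sudo mount /dev/mapper/oradata_vg-orareco_lv /u01/fra

Don’t forget matching /etc/fstab entries if the mounts are supposed to survive a reboot.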
Trying it out
After the final touches had been applied I restored the database and started the familiar Swingbench workload to see which disks are in use. Right before doing so I made sure, for test purposes only, that I’m not multiplexing control files/online redo logs into the FRA. NOT multiplexing control files/online redo log members is probably a Bad Idea for serious Oracle deployments, but it is fine for this scenario.
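If you want to double-check where your control files and online redo log members live, a quick query against the data dictionary does the trick. A sketch, run as SYSDBA (your paths will obviously differ):

# list control file and online redo log locations
sqlplus / as sysdba <<'EOF'
select name from v$controlfile;
select member from v$logfile;
EOF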
I expected to see both block devices making up /dev/md127 in use, and sure enough, they are:
[opc@oracle-19c-fs ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs)   13/08/21   _x86_64_   (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.23    0.01    0.35    0.57    0.01   98.83

Device         r/s       w/s     rMB/s   wMB/s   rrqm/s   wrqm/s  ...  %util
sda            2.99      0.96     0.08    0.04     0.03     0.26  ...   0.21
dm-0           2.78      0.62     0.07    0.03     0.00     0.00  ...   0.20
dm-1           0.06      0.58     0.00    0.01     0.00     0.00  ...   0.02
sdb            1.28      0.22     0.06    0.00     0.00     0.02  ...   0.13
dm-2           1.26      0.24     0.06    0.00     0.00     0.00  ...   0.13
sdc          753.52     26.38     8.37    5.64    30.91     0.29  ...   7.36
md127       1573.79     53.30    17.44   12.01     0.00     0.00  ...   0.00
sdd          758.09     26.57     8.42    5.64    31.29     0.05  ...   9.34
sde           20.53      0.00     5.11    0.00     0.00     0.00  ...   1.79
dm-3          20.51      0.00     5.11    0.00     0.00     0.00  ...   1.79
dm-4        1558.54     28.25    12.20    5.97     0.00     0.00  ...   6.56
dm-5           4.69      2.61     4.58    5.26     0.00     0.00  ...   4.15

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.08    0.00    5.32    9.48    0.13   80.99

Device         r/s       w/s     rMB/s   wMB/s   rrqm/s   wrqm/s  ...  %util
sda            0.00      3.40     0.00    0.03     0.00     0.60  ...   0.08
dm-0           0.00      2.60     0.00    0.02     0.00     0.00  ...   0.08
dm-1           0.00      1.40     0.00    0.01     0.00     0.00  ...   0.04
sdc        16865.80    284.60   140.04    2.39  1059.60     0.20  ...  92.60
md127      36008.00    564.20   281.33    4.76     0.00     0.00  ...   0.00
sdd        16978.80    279.40   141.11    2.34  1081.40     0.00  ...  99.96
dm-4       36007.80    563.00   281.33    4.73     0.00     0.00  ... 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.07    0.00    5.51   10.52    0.16   79.74

Device         r/s       w/s     rMB/s   wMB/s   rrqm/s   wrqm/s  ...  %util
sdb            0.00      0.80     0.00    0.01     0.00     0.20  ...   0.04
dm-2           0.00      1.00     0.00    0.01     0.00     0.00  ...   0.04
sdc        17709.80    317.80   142.87    2.51   577.40     0.00  ...  93.90
md127      36657.80    661.60   286.41    5.31     0.00     0.00  ...   0.00
sdd        17790.00    343.40   143.69    2.77   599.00     0.00  ...  99.94
dm-4       36657.80    660.20   286.41    5.28     0.00     0.00  ... 100.00

[opc@oracle-19c-fs ~]$
No surprises here! Except maybe that /dev/md127 was somewhat underutilised ;) I guess that’s an instrumentation bug/feature. /dev/dm-4 – showing 100% utilisation – belongs to oradata_lv:
[opc@oracle-19c-fs ~]$ ls -l /dev/mapper | egrep dm-4
lrwxrwxrwx. 1 root root 7 Aug 13 09:37 oradata_vg-oradata_lv -> ../dm-4
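lsblk is another convenient way to see the whole stack at a glance, from the partitions through the md device up to the logical volumes (illustrative; your device names may well differ):

# show the partition -> md127 -> LVM stacking for both member disks
lsblk /dev/sdc /dev/sdd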
Extending oradata_vg
Just as with each previous example I’d like to see what happens when I run out of space and have to extend oradata_vg. For this to happen I need a couple more block devices. These have to match the existing ones in size and performance characteristics for the best result. No difference to the LVM-RAID setup I covered in the earlier article.
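Creating the second array and its Physical Volume follows exactly the same pattern as before. A sketch, assuming the two new block volumes were partitioned and show up as /dev/oracleoci/oraclevde1 and /dev/oracleoci/oraclevdf1 (these names are assumptions, check your own system):

# create a second RAID 0 device from the two new partitions
sudo mdadm --create /dev/md128 --level=0 \
    --raid-devices=2 /dev/oracleoci/oraclevde1 /dev/oracleoci/oraclevdf1

# turn the new array into a Physical Volume
sudo pvcreate /dev/md128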
I created /dev/md128 in the same way as the original RAID device (along the lines of the sketch above) and created a Physical Volume from it. This is what oradata_vg looked like prior to its extension:
[opc@oracle-19c-fs ~]$ sudo vgs oradata_vg
  VG         #PV #LV #SN Attr   VSize   VFree
  oradata_vg   1   2   0 wz--n- 499.74g    0
In the next step I extended the Volume Group, but only after ensuring I had a proven, working backup of everything. Don’t ever make changes to the storage layer without a backup and a known, tested, proven way to recover from unforeseen issues!
[opc@oracle-19c-fs ~]$ sudo vgextend oradata_vg /dev/md128
  Volume group "oradata_vg" successfully extended
[opc@oracle-19c-fs ~]$ sudo vgs oradata_vg
  VG         #PV #LV #SN Attr   VSize   VFree
  oradata_vg   2   2   0 wz--n- 999.48g 499.74g
The VG now shows 2 PVs and plenty of free space. So let’s add 80% of the free space to oradata_lv.
[opc@oracle-19c-fs ~]$ sudo lvresize --extents +80%FREE --resizefs /dev/mapper/oradata_vg-oradata_lv
  Size of logical volume oradata_vg/oradata_lv changed from 399.79 GiB (102347 extents) to <799.59 GiB (204695 extents).
  Logical volume oradata_vg/oradata_lv successfully resized.
meta-data=/dev/mapper/oradata_vg-oradata_lv isize=512    agcount=16, agsize=6550144 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=104802304, imaxpct=25
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=51173, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 104802304 to 209607680
The LV changes from its original size …
[opc@oracle-19c-fs ~]$ sudo lvs /dev/mapper/oradata_vg-oradata_lv
  LV         VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  oradata_lv oradata_vg -wi-ao---- 399.79g
to its new size:
[opc@oracle-19c-fs ~]$ sudo lvs /dev/mapper/oradata_vg-oradata_lv
  LV         VG         Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  oradata_lv oradata_vg -wi-ao---- <799.59g
The same applies to the file system as well:
[opc@oracle-19c-fs ~]$ df -h /u01/oradata
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/oradata_vg-oradata_lv  800G   38G  762G   5% /u01/oradata
Does that change performance?
Based on my experience with LVM-RAID I did not expect a change in performance, as my database wasn’t yet at a stage where it required the extra space. My assumption was confirmed by iostat:
[opc@oracle-19c-fs ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs)   13/08/21   _x86_64_   (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.98    0.01    1.44    2.35    0.03   95.18

Device         r/s       w/s     rMB/s   wMB/s   rrqm/s   wrqm/s  ...  %util
sda            2.32      0.99     0.06    0.03     0.02     0.27  ...   0.17
dm-0           2.16      0.61     0.06    0.03     0.00     0.00  ...   0.16
dm-1           0.05      0.62     0.00    0.01     0.00     0.00  ...   0.02
sdb            0.99      0.20     0.05    0.00     0.00     0.02  ...   0.11
dm-2           0.98      0.22     0.04    0.00     0.00     0.00  ...   0.11
sdc         4538.44     73.12    38.69    4.78   190.85     0.23  ...  26.27
md127       9485.50    147.14    78.09   10.13     0.00     0.00  ...   0.00
sdd         4562.89     73.73    38.90    4.79   193.25     0.04  ...  29.88
sde           15.87      0.00     3.95    0.00     0.00     0.00  ...   1.39
dm-3          15.86      0.00     3.95    0.00     0.00     0.00  ...   1.39
dm-4        9473.71    127.63    74.04    5.46     0.00     0.00  ...  27.74
dm-5           3.63      2.02     3.54    4.07     0.00     0.00  ...   3.21
sdf            0.07      0.00     0.00    0.00     0.00     0.01  ...   0.01
sdg            0.08      0.00     0.00    0.00     0.00     0.01  ...   0.00
md128          0.06      0.02     0.00    0.00     0.00     0.00  ...   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.96    0.00    5.44    8.52    0.08   82.00

Device         r/s       w/s     rMB/s   wMB/s   rrqm/s   wrqm/s  ...  %util
sdc        17652.60    306.80   141.15    2.52   414.40     0.00  ...  88.78
md127      36265.40    608.00   283.35    5.01     0.00     0.00  ...   0.00
sdd        17783.60    301.20   142.17    2.43   411.60     0.00  ... 100.00
dm-4       36267.40    607.00   283.37    4.95     0.00     0.00  ... 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.20    0.00    5.45    8.82    0.14   81.38

Device         r/s       w/s     rMB/s   wMB/s   rrqm/s   wrqm/s  ...  %util
sda            0.00      1.20     0.00    0.01     0.00     0.00  ...   0.04
dm-0           0.00      1.00     0.00    0.01     0.00     0.00  ...   0.04
dm-1           0.00      0.20     0.00    0.00     0.00     0.00  ...   0.02
sdc        18145.40    332.20   143.99    2.55   284.40     0.00  ...  92.22
md127      36865.20    650.20   288.04    5.00     0.00     0.00  ...   0.00
sdd        18161.20    318.00   144.14    2.45   285.20     0.00  ...  99.98
dm-4       36863.20    649.00   288.02    4.99     0.00     0.00  ...  99.98

[opc@oracle-19c-fs ~]$
As long as there aren’t any database files in the “extended” part of the LV, there won’t be a change in performance. As soon as your database spills over to the “new” disks, you should see a benefit from the newly added /dev/md128.
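If you are curious which Physical Volumes the logical volume’s extents actually sit on after the resize, LVM can show you the segment-to-device mapping:

# segment layout of the LVs and free space per PV
sudo lvs --segments -o lv_name,seg_size,devices oradata_vg
sudo pvs -o pv_name,pv_size,pv_free /dev/md127 /dev/md128

Any segment still mapped exclusively to /dev/md127 explains why the new disks remain idle for now.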
Summary
Just as LVM-RAID does, using software RAID allows you to benefit from striping data across multiple devices. The iostat output is quite clear about the benefit: just look at the figures for /dev/sdc and /dev/sdd and how they accumulate in /dev/md127.
Using software RAID doesn’t come without risk: it’s entirely possible to lose a block device and thus the entire RAID device. It’s imperative you protect against this scenario in a way that matches your database’s RTO and RPO.
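One small mitigation worth mentioning: the mdadm package ships with a monitoring service that can notify you when an array reports a problem. A sketch (the mail address is a placeholder; and remember that with RAID 0 there is nothing to rebuild, so this only buys you earlier notification):

# tell mdadm where to send alerts and start the monitor
echo "MAILADDR dba-alerts@example.com" | sudo tee -a /etc/mdadm.conf
sudo systemctl enable --now mdmonitor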
My main problem with the solution as detailed in this post is the lack of a re-balance feature you get with Oracle’s Automatic Storage Management (ASM). It’s still possible to have I/O hotspots after a storage space expansion.