My lab server has two SSDs: one connected via SATA 2, the other via SATA 3. I expected the SATA 3-connected device to perform at least as well as, if not better than, the one on the "old" interface. I ran SLOB on both devices to find out whether there was a difference. To my great surprise the SATA 2-connected SSD performed a lot better than the SATA 3 device, as shown in the AWR report! Initially I was not entirely sure why, since the FIO results for both devices are roughly equal. You will see the reason as you read through this post. In summary: use XFS for concurrent writes. Or maybe ASM.
Let’s investigate
Let’s do a little I/O investigation because a) it’s cool and b) you can.
This is Oracle Linux 6.5 with UEK 3. A word of warning though: ext4 has been, let's say, a little flaky. I first found out about that in 2011 (https://martincarstenbach.wordpress.com/2011/11/04/an-interesting-problem-with-ext4-on-oracle-linux-5-5/) and Oracle has since published a MOS note: ORA-1578 ORA-353 ORA-19599 Corrupt blocks with zeros when filesystemio_options=SETALL on ext4 file system using Linux (Doc ID 1487957.1). I am on a reasonably recent version, so the observations in this post are probably not related to that note. The device behind /dev/sda uses SATA 2, and the file system on it, /u01/app/oracle/oradata, is formatted with XFS. The other SSD, /dev/sde, uses EXT4 but is connected via SATA 3 provided by a PCIe v2 x4 card. I wish I had PCIe v3, but my Opteron 6200 system doesn't support native SATA 6G.
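If you want to double-check a similar setup on your own machine, the file system types and the negotiated SATA link speeds are quick to verify from the shell. A small sketch; the paths and devices are the ones used in this post, the commands themselves are my addition and not part of the original tests:

# file system type behind each mount point
df -T /u01/app/oracle/oradata /m/kvm/oracle

# negotiated link speed per ATA port: 3.0 Gbps = SATA 2, 6.0 Gbps = SATA 3
dmesg | grep -i 'SATA link up'

# per-disk view, if smartmontools is installed
smartctl -i /dev/sda | grep -i sata
smartctl -i /dev/sde | grep -i sata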
SLOB
I created a 12c non-CDB database for SLOB with the IOPS tablespace on my SATA 2 SSD, the one formatted with XFS. Every test was repeated four times. Should you want to follow along, here is my slob.conf:
[/m/kvm/oracle]$ cat ~/SLOB2/SLOB/slob.conf
UPDATE_PCT=20
RUN_TIME=120
WORK_LOOP=0
SCALE=10000
WORK_UNIT=256
REDO_STRESS=HEAVY
LOAD_PARALLEL_DEGREE=8
SHARED_DATA_MODULUS=0

# Settings for SQL*Net connectivity:
#ADMIN_SQLNET_SERVICE=slob
#SQLNET_SERVICE_BASE=slob
#SQLNET_SERVICE_MAX=2
#SYSDBA_PASSWD="change_on_install"

export UPDATE_PCT RUN_TIME WORK_LOOP SCALE WORK_UNIT LOAD_PARALLEL_DEGREE REDO_STRESS SHARED_DATA_MODULUS

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
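For anyone following along: the schema creation described in the next paragraph boils down to a single call to SLOB's setup.sh with the tablespace name and the number of users. A minimal sketch based on the SLOB 2 syntax, not copied from the post:

cd ~/SLOB2/SLOB
# create and load 128 SLOB schemas in the IOPS tablespace (the tablespace must already exist)
./setup.sh IOPS 128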
Notice the UPDATE_PCT: it will become important later. The SLOB configuration was created with a call to ./setup.sh, using IOPS as the target tablespace and 128 users, which is the default. My database uses the initialisation file shown below, which is more or less the same as the default one (and my buffer cache is indeed only 16 MB):
[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
[/m/kvm/oracle]$ cat /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/initSLOB12.ora
#set the following to paths that make sense on your system:
db_create_file_dest = '/u01/app/oracle/oradata'
control_files='/u01/app/oracle/oradata/SLOB12/controlfile/o1_mf_b52b4lvk_.ctl'

db_name = SLOB12
compatible = 12.1.0.2.0
undo_management = auto
db_block_size = 8192
db_files = 2000
processes = 500
shared_pool_size = 1G
db_cache_size = 10M     # Will force maximum physical I/O
#db_cache_size = 15G    # Sufficient for cached runs with reference scale (10,000 rows) and 128 sessions
filesystemio_options = setall
parallel_max_servers = 0
log_buffer = 134217728
cpu_count = 2
pga_aggregate_target = 1G
remote_login_passwordfile = exclusive
workarea_size_policy = auto
recyclebin = off

# The following are needed to force db file sequential read. Omit for direct path short scans.
# If you would like to investigate a blend of short scans mixed with single-block random I/O
# then comment out these three parameters :
_db_block_prefetch_limit = 0
_db_block_prefetch_quota = 0
_db_file_noncontig_mblock_read_count = 0

# Uncomment the following if suffering library cache contention (at larger numbers of sessions) or high log file sync waits
# at CPU-saturated levels.
#_cursor_obsolete_threshold=130
#_high_priority_processes=LGWR

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
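Since filesystemio_options=setall is doing a lot of the heavy lifting here, it is worth confirming that the data files really are opened with asynchronous I/O. One way to check, shown as a quick sketch (my own query, not part of the original test):

sqlplus -s / as sysdba <<'EOF'
set pagesize 100 linesize 120
col filetype_name format a20
-- ASYNCH_IO shows ASYNC_ON when the file is opened for asynchronous I/O
select filetype_name, asynch_io, count(*) as files
  from v$iostat_file
 group by filetype_name, asynch_io
 order by filetype_name;
EOF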
After the fourth execution of ./runit.sh 16 I moved the IOPS data file from /dev/sda to /dev/sde and ran another 4 tests.
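The post doesn't show how the data file was relocated; in 12c the most convenient way is the online move. Something along these lines would do, with placeholder file names rather than the real ones:

sqlplus / as sysdba <<'EOF'
-- online in 12c; the source name below is a placeholder for the OMF name of the IOPS data file
alter database move datafile '/u01/app/oracle/oradata/SLOB12/datafile/o1_mf_iops_placeholder_.dbf'
  to '/m/kvm/oracle/SLOB12/iops_01.dbf';
EOF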
Observations
I ran an AWR difference report to compare the performance of the XFS and EXT4 SSDs. The most interesting bit is right at the top. Apologies for the very wide output; that's the way the text report lays it out.
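For reference, the compare-period report below comes from the AWR "compare periods" script that ships with the database. Generating it looks roughly like this; the script prompts for the report type and the two snapshot ranges:

sqlplus / as sysdba
SQL> @?/rdbms/admin/awrddrpt.sql
-- answer the prompts: text format, then the begin/end snapshot ids
-- for the first and second period (41/42 and 49/50 in this case)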
WORKLOAD REPOSITORY COMPARE PERIOD REPORT

Snapshot Set    DB Id     Instance  Inst Num  Release     Cluster  Host          Std Block Size
------------  ----------  --------  --------  ----------  -------  ------------  --------------
First (1st)   1269760792  SLOB12           1  12.1.0.2.0  NO       ol62.localdo            8192
Second (2nd)  1269760792  SLOB12           1  12.1.0.2.0  NO       ol62.localdo            8192

Snapshot Set  Begin Snap Id  Begin Snap Time            End Snap Id  End Snap Time              Avg Active Users  Elapsed Time (min)  DB time (min)
------------  -------------  -------------------------  -----------  -------------------------  ----------------  ------------------  -------------
1st                      41  31-Oct-14 12:18:14 (Fri)            42  31-Oct-14 12:20:17 (Fri)               15.8                 2.0           32.1
2nd                      49  31-Oct-14 12:38:18 (Fri)            50  31-Oct-14 12:40:20 (Fri)               15.8                 2.0           32.3
                                                                                                ~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~
%Diff:                                                                                                      0.2%                0.0%           0.5%

Host Configuration Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                          1st        2nd       Diff   %Diff
-----------------------------------  ---------  ---------  ---------  ------
Number of CPUs:                              24         24          0     0.0
Number of CPU Cores:                         24         24          0     0.0
Number of CPU Sockets:                        2          2          0     0.0
Physical Memory:                       64413.7M   64413.7M         0M     0.0
Load at Start Snapshot:                    9.13       9.57        .44     4.8
Load at End Snapshot:                     15.38       15.6        .22     1.4
%User Time:                                1.24        .64        -.6   -48.4
%System Time:                               .75        .74       -.02    -1.3
%Idle Time:                               97.89       98.6        .71     0.7
%IO Wait Time:                            35.21       9.06     -26.15   -74.3

Cache Sizes
~~~~~~~~~~~
                           1st (M)   2nd (M)   Diff (M)  %Diff
----------------------  ----------  --------  ---------  ------
Memory Target
.....SGA Target
..........Buffer Cache        16.0      16.0        0.0     0.0
..........Shared Pool      1,024.0   1,024.0        0.0     0.0
..........Large Pool
..........Java Pool           16.0      16.0        0.0     0.0
..........Streams Pool
.....PGA Target            1,024.0   1,024.0        0.0     0.0
Log Buffer                   128.0     128.0        0.0     0.0
In-Memory Area

Workload Comparison
~~~~~~~~~~~~~~~~~~~
                              1st Per Sec   2nd Per Sec   %Diff    1st Per Txn   2nd Per Txn   %Diff
                             ------------  ------------  ------   ------------  ------------  ------
                  DB time:           15.8          15.8     0.2            2.5           4.4    78.7
                 CPU time:            0.5           0.3   -29.8            0.1           0.1    28.6
      Background CPU time:            0.1           0.1   -40.0            0.0           0.0     0.0
        Redo size (bytes):    8,721,293.5   4,652,147.0   -46.7    1,379,498.8   1,311,306.6    -4.9
    Logical read (blocks):       10,403.9       5,231.1   -49.7        1,645.6       1,474.5   -10.4
            Block changes:        5,216.5       2,640.2   -49.4          825.1         744.2    -9.8
   Physical read (blocks):        7,412.5       3,750.0   -49.4        1,172.5       1,057.0    -9.8
  Physical write (blocks):        3,369.0       1,729.0   -48.7          532.9         487.4    -8.5
         Read IO requests:        7,411.8       3,749.3   -49.4        1,172.4       1,056.8    -9.9
        Write IO requests:        3,018.0       1,538.2   -49.0          477.4         433.6    -9.2
             Read IO (MB):           57.9          29.3   -49.4            9.2           8.3    -9.8
            Write IO (MB):           26.3          13.5   -48.7            4.2           3.8    -8.4
             IM scan rows:            0.0           0.0     0.0            0.0           0.0     0.0
  Session Logical Read IM:
               User calls:            1.4           1.4    -1.4            0.2           0.4    77.3
             Parses (SQL):            5.9           6.3     7.5            0.9           1.8    91.4
        Hard parses (SQL):            0.1           0.3   107.1            0.0           0.1   300.0
        SQL Work Area (MB):           0.6           0.4   -25.9            0.1           0.1   -25.9
                   Logons:            0.1           0.1     0.0            0.0           0.0   100.0
           Executes (SQL):           38.1          23.7   -37.6            6.0           6.7    11.1
             Transactions:            6.3           3.5   -43.8

                                        First           Second     Diff
                              ---------------  ---------------  ------
  % Blocks changed per Read:             50.1             50.5     0.3
           Recursive Call %:             98.1             97.6    -0.6
 Rollback per transaction %:              0.0              0.0     0.0
               Rows per Sort:            68.3             35.4   -32.9
 Avg DB time per Call (sec):             11.3             11.5     0.2

...
I copied the header from the report to show you that nothing really changed between the test executions, and that I’m going to do a lot of I/O since the buffer cache is only 16 MB.
What is striking is that the physical reads and writes are roughly half of what I saw during the first batch of executions. The culprit is found quickly: in the scenario I chose, SLOB spends almost all of its time on single-block reads, reported as 'db file sequential read':
Top Timed Events   First DB/Inst: SLOB12/SLOB12 Snaps: 41-42 (Elapsed time: 122.27 sec  DB time: 1925.83 sec),  Second DB/Inst: SLOB12/SLOB12 Snaps: 49-50 (Elapsed time: 122.614 sec  DB time: 1935.33 sec)
-> Events with a "-" did not make the Top list in this set of snapshots, but are displayed for comparison purposes

                                              1st                                                                                                  2nd
------------------------------------------------------------------------------------------------   ------------------------------------------------------------------------------------------------
Event                           Wait Class         Waits      Time(s)   Avg Time(ms)   %DB time     Event                           Wait Class         Waits      Time(s)   Avg Time(ms)   %DB time
------------------------------  -------------  ------------  ---------  -------------  ----------   ------------------------------  -------------  ------------  ---------  -------------  ----------
db file sequential read         User I/O            906,150    1,899.1            2.1        98.6   db file sequential read         User I/O            459,637    1,920.9            4.2        99.3
db file parallel write          System I/O          116,318       60.0            0.5         3.1   db file async I/O submit        System I/O            2,975       97.2           32.7         5.0
CPU time                        N/A                                57.7            N/A         3.0   CPU time                        N/A                                40.2            N/A         2.1
log file parallel write         System I/O            3,485       47.4           13.6         2.5   log file parallel write         System I/O            1,881        3.2            1.7         0.2
db file async I/O submit        System I/O            7,425        3.7            0.5         0.2   db file parallel write          System I/O            4,670        0.9            0.2         0.0
...
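As an aside, you don't have to wait for an AWR snapshot to see these averages; v$system_event shows the same information while the test is running. A quick sketch, my own query rather than part of the original post:

sqlplus -s / as sysdba <<'EOF'
col event format a30
-- average wait per event in milliseconds, computed from the cumulative microsecond counter
select event, total_waits,
       round(time_waited_micro / total_waits / 1000, 2) as avg_wait_ms
  from v$system_event
 where event in ('db file sequential read', 'db file parallel write', 'log file parallel write');
EOF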
The I/O times for the second interval (the EXT4 case) are twice what they are in the first. But why? At first I thought the PCIe card added extra latency, and I came up with a few other ideas that all remained just that: ideas. Ideas are nice, but proof is better. Maybe there is indeed something odd with the device? Or maybe it is something in the way Oracle uses it? To find out I turned to FIO, my favourite I/O benchmark when it comes to measuring raw performance. If the FIO results matched the Oracle results, then something really would be wrong with the SSD or the way it is addressed. I spent a bit of time compiling FIO and ran a set of tests.
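In case you want to build FIO yourself, the steps are short. Roughly what I mean by compiling it; the package name for the libaio headers is the one used on Oracle Linux 6, the rest is the standard FIO build:

# libaio headers are needed for the libaio ioengine used in the job files below
yum install -y libaio-devel
git clone https://github.com/axboe/fio.git
cd fio
./configure && make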
FIO
I was really quite curious to see what was happening during the execution of FIO: was the SATA 3 SSD really slower? So I started the FIO tests, using FIO 2.1.13. The test harness is the same for both devices; only the destination directory differs:
[~/fio-2.1.13]$ cat rand-read-sd*
[rand-read]
numjobs=4
rw=randread
directory=/u01/fio/data
size=5G
bs=8k
ioengine=libaio
buffered=0
iodepth=8

[rand-read]
numjobs=4
rw=randread
directory=/m/kvm/oracle
size=5G
bs=8k
ioengine=libaio
buffered=0
iodepth=8

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
[~/fio-2.1.13]$
The /u01 file system sits on /dev/sda, and uses XFS. /m/kvm/oracle is on /dev/sde, using EXT4. Here is the output from the tests:
[~/fio-2.1.13]$ ./fio rand-read-sde.fio
rand-read: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=8
...
fio-2.1.13
Starting 4 processes
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
Jobs: 3 (f=3): [r(3),_(1)] [99.2% done] [192.8MB/0KB/0KB /s] [24.7K/0/0 iops] [eta 00m:01s]
rand-read: (groupid=0, jobs=1): err= 0: pid=6124: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=44174KB/s, iops=5521, runt=118686msec
    slat (usec): min=4, max=109, avg=13.06, stdev= 6.14
    clat (usec): min=128, max=43082, avg=1434.23, stdev=586.65
     lat (usec): min=140, max=43094, avg=1447.47, stdev=586.41
    clat percentiles (usec):
     |  1.00th=[  516],  5.00th=[  708], 10.00th=[  820], 20.00th=[  988],
     | 30.00th=[ 1112], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1480],
     | 70.00th=[ 1608], 80.00th=[ 1816], 90.00th=[ 2160], 95.00th=[ 2480],
     | 99.00th=[ 3184], 99.50th=[ 3472], 99.90th=[ 4192], 99.95th=[ 4576],
     | 99.99th=[ 6752]
    bw (KB /s): min=27088, max=70928, per=24.99%, avg=44119.50, stdev=6314.88
    lat (usec) : 250=0.01%, 500=0.78%, 750=6.02%, 1000=14.12%
    lat (msec) : 2=65.54%, 4=13.37%, 10=0.15%, 20=0.01%, 50=0.01%
  cpu          : usr=1.89%, sys=11.05%, ctx=421189, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6125: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=44138KB/s, iops=5517, runt=118784msec
    slat (usec): min=4, max=108, avg=13.14, stdev= 6.14
    clat (usec): min=117, max=43252, avg=1435.31, stdev=587.66
     lat (usec): min=124, max=43264, avg=1448.64, stdev=587.40
    clat percentiles (usec):
     |  1.00th=[  502],  5.00th=[  700], 10.00th=[  820], 20.00th=[  988],
     | 30.00th=[ 1128], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1480],
     | 70.00th=[ 1624], 80.00th=[ 1816], 90.00th=[ 2160], 95.00th=[ 2480],
     | 99.00th=[ 3152], 99.50th=[ 3472], 99.90th=[ 4192], 99.95th=[ 4640],
     | 99.99th=[ 6688]
    bw (KB /s): min=27504, max=69760, per=24.91%, avg=43972.22, stdev=6169.69
    lat (usec) : 250=0.07%, 500=0.92%, 750=5.84%, 1000=13.94%
    lat (msec) : 2=65.68%, 4=13.40%, 10=0.16%, 20=0.01%, 50=0.01%
  cpu          : usr=1.89%, sys=11.10%, ctx=421379, majf=0, minf=46
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6126: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=44230KB/s, iops=5528, runt=118538msec
    slat (usec): min=4, max=120, avg=12.59, stdev= 5.61
    clat (usec): min=185, max=42871, avg=1433.09, stdev=581.75
     lat (usec): min=196, max=42881, avg=1445.85, stdev=581.45
    clat percentiles (usec):
     |  1.00th=[  532],  5.00th=[  716], 10.00th=[  828], 20.00th=[  988],
     | 30.00th=[ 1128], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1464],
     | 70.00th=[ 1608], 80.00th=[ 1800], 90.00th=[ 2128], 95.00th=[ 2480],
     | 99.00th=[ 3152], 99.50th=[ 3440], 99.90th=[ 4192], 99.95th=[ 4576],
     | 99.99th=[ 7008]
    bw (KB /s): min=27600, max=65952, per=25.01%, avg=44149.65, stdev=5945.98
    lat (usec) : 250=0.01%, 500=0.63%, 750=5.90%, 1000=14.17%
    lat (msec) : 2=65.89%, 4=13.25%, 10=0.15%, 20=0.01%, 50=0.01%
  cpu          : usr=1.74%, sys=11.20%, ctx=426219, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6127: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=47359KB/s, iops=5919, runt=110704msec
    slat (usec): min=4, max=114, avg=12.83, stdev= 5.90
    clat (usec): min=188, max=42988, avg=1337.06, stdev=571.06
     lat (usec): min=198, max=43000, avg=1350.07, stdev=570.63
    clat percentiles (usec):
     |  1.00th=[  446],  5.00th=[  620], 10.00th=[  748], 20.00th=[  900],
     | 30.00th=[ 1020], 40.00th=[ 1144], 50.00th=[ 1256], 60.00th=[ 1368],
     | 70.00th=[ 1512], 80.00th=[ 1704], 90.00th=[ 2040], 95.00th=[ 2352],
     | 99.00th=[ 3024], 99.50th=[ 3280], 99.90th=[ 3984], 99.95th=[ 4384],
     | 99.99th=[ 6624]
    bw (KB /s): min=33664, max=51008, per=26.83%, avg=47369.34, stdev=4051.34
    lat (usec) : 250=0.01%, 500=1.45%, 750=8.72%, 1000=17.95%
    lat (msec) : 2=61.12%, 4=10.65%, 10=0.09%, 20=0.01%, 50=0.01%
  cpu          : usr=1.98%, sys=11.60%, ctx=410433, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: io=20480MB, aggrb=176551KB/s, minb=44137KB/s, maxb=47359KB/s, mint=110704msec, maxt=118784msec

Disk stats (read/write):
  sde: ios=2619049/4, merge=0/1, ticks=3663514/6, in_queue=3662827, util=100.00%
The best tool to look at I/O without tracing block devices is collectl (in my opinion). I ran it with the extended disk statistics in parallel to the FIO test:
[root@ol62 queue]# collectl -sD -i 5 --iosize --dskfilt sde
waiting for 5 second sample...

# DISK STATISTICS (/sec)
#                   <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sde              0      0    0    0  125340    129  123 1021    1020   219  1749      8   99
sde              0      0    0    0  119608     58  118 1017    1017   156  1666      7   88
sde         146968      0  18K    8       0      0    0    0       8    31     1      0   89
sde         165435      0  20K    8       2      0    0    6       7    31     1      0   99
sde         187097      0  23K    8       0      0    0    0       8    31     1      0  100
sde         187113      0  23K    8       0      0    0    0       8    31     1      0   99
sde         157684      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186542      0  23K    8       1      0    0    4       7    31     1      0  100
sde         186721      0  23K    8       0      0    0    0       8    31     1      0   99
sde         158416      0  19K    8       0      0    0    0       8    31     1      0  100
sde         187168      0  23K    8       0      0    0    0       8    31     1      0   99
sde         186027      0  23K    8       0      0    0    0       8    31     1      0  100
sde         158366      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186028      0  23K    8       0      0    0    0       8    31     1      0   99
sde         186416      0  23K    8       0      0    0    0       8    31     1      0  100
sde         159102      0  19K    8       0      0    0    0       8    31     1      0  100
sde         186348      0  23K    8       0      0    0    0       8    31     1      0  100
sde         186859      0  23K    8       0      0    0    0       8    31     1      0   99
sde         153996      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186043      0  23K    8       0      0    0    0       8    31     1      0  100
sde         185348      0  23K    8       0      0    0    0       8    31     1      0   99
sde         158150      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186148      0  23K    8       0      0    0    0       8    31     1      0   99
sde         185900      0  23K    8       0      0    0    0       8    31     1      0   99
sde         159353      0  19K    8       0      0    0    0       8    25     1      0   99
sde         167065      0  20K    8       0      0    0    0       8    22     0      0   86
sde              0      0    0    0       0      0    0    0       0     0     0      0    0
This matches the FIO output, and indeed the device can do around 23k IOPS with an 8k IO size. What about the XFS mount?
[~/fio-2.1.13]$ ./fio rand-read-sda.fio
rand-read: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=8
...
fio-2.1.13
Starting 4 processes
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
Jobs: 4 (f=4): [r(4)] [100.0% done] [191.3MB/0KB/0KB /s] [24.5K/0/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=1): err= 0: pid=6329: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49087KB/s, iops=6135, runt=106809msec
    slat (usec): min=3, max=93, avg= 7.90, stdev= 6.40
    clat (usec): min=280, max=35113, avg=1294.26, stdev=152.91
     lat (usec): min=285, max=35130, avg=1302.35, stdev=152.60
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1192], 20.00th=[ 1240],
     | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1672], 99.95th=[ 2224],
     | 99.99th=[ 3024]
    bw (KB /s): min=46000, max=49536, per=25.02%, avg=49124.10, stdev=291.98
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.03%
    lat (msec) : 2=99.89%, 4=0.07%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.39%, sys=8.05%, ctx=505195, majf=0, minf=43
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6330: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49093KB/s, iops=6136, runt=106795msec
    slat (usec): min=3, max=114, avg= 7.55, stdev= 5.74
    clat (usec): min=263, max=35353, avg=1294.46, stdev=155.68
     lat (usec): min=271, max=35369, avg=1302.21, stdev=155.46
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1192], 20.00th=[ 1256],
     | 30.00th=[ 1272], 40.00th=[ 1288], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1672], 99.95th=[ 2224],
     | 99.99th=[ 3184]
    bw (KB /s): min=45728, max=49472, per=25.02%, avg=49123.20, stdev=300.30
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.02%
    lat (msec) : 2=99.89%, 4=0.07%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.42%, sys=7.88%, ctx=549312, majf=0, minf=46
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6331: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49080KB/s, iops=6134, runt=106824msec
    slat (usec): min=3, max=137, avg= 7.70, stdev= 5.87
    clat (usec): min=266, max=35102, avg=1294.64, stdev=147.64
     lat (usec): min=274, max=35121, avg=1302.52, stdev=147.39
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1208], 20.00th=[ 1256],
     | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 2064], 99.95th=[ 2256],
     | 99.99th=[ 3600]
    bw (KB /s): min=45888, max=49536, per=25.02%, avg=49110.62, stdev=295.81
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.04%
    lat (msec) : 2=99.84%, 4=0.10%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.32%, sys=8.07%, ctx=536841, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6332: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49080KB/s, iops=6134, runt=106824msec
    slat (usec): min=3, max=106, avg= 7.66, stdev= 5.93
    clat (usec): min=274, max=35268, avg=1294.67, stdev=157.16
     lat (usec): min=281, max=35286, avg=1302.53, stdev=156.91
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1208], 20.00th=[ 1256],
     | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1944], 99.95th=[ 2224],
     | 99.99th=[ 3536]
    bw (KB /s): min=45840, max=49536, per=25.02%, avg=49109.56, stdev=303.79
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.05%
    lat (msec) : 2=99.84%, 4=0.09%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.50%, sys=7.83%, ctx=533469, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: io=20480MB, aggrb=196318KB/s, minb=49079KB/s, maxb=49092KB/s, mint=106795msec, maxt=106824msec

Disk stats (read/write):
  sda: ios=2620077/60, merge=0/2, ticks=3348979/94, in_queue=3348631, util=100.00%

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
[~/fio-2.1.13]$
Here is some of the corresponding collectl output:
[root@ol62 queue]# collectl -sD -i 5 --iosize --dskfilt sda
waiting for 5 second sample...

# DISK STATISTICS (/sec)
#                   <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sda         196313      0  24K    8       5      0    1    9       8    31     1      0   99
sda         196681      0  24K    8       6      0    0   16       8    31     1      0  100
sda         196369      0  24K    8       7      0    1   12       8    31     1      0   99
sda         196217      0  24K    8       4      0    0   10       8    31     1      0  100
sda         196300      0  24K    8      14      0    2    6       8    31     1      0  100
sda         196699      0  24K    8       6      0    0   16       8    31     1      0   99
sda         196340      0  24K    8       3      0    0   16       8    31     1      0   99
sda         196350      0  24K    8       6      0    0   16       8    31     1      0  100
sda         196379      0  24K    8       6      0    0   16       8    31     1      0   99
sda         196560      0  24K    8       3      0    0   16       8    31     1      0   99
sda         196259      0  24K    8       8      0    1    8       8    31     1      0   99
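If collectl is not available on your system, iostat from the sysstat package gives comparable per-device figures (r/s, w/s, await, %util). A rough equivalent of the calls above, added here for convenience rather than used in the original tests:

# extended per-device statistics for both SSDs, sampled every 5 seconds, throughput in MB
iostat -xm sda sde 5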
Summary Number 1
So it would appear that the SSDs are more or less equal from FIO's point of view. If you followed closely (hurray if you did; I appreciate this is a long post!) you will have noticed the following:
- The SLOB test included updates (UPDATE_PCT=20). Furthermore, there is just one data file for the IOPS tablespace, so there is write concurrency on a single file.
- The FIO test was based on random reads only, no writes. Its 4 jobs also worked on 4 distinct files (a job that would mimic the SLOB mix more closely is sketched after the next paragraph).
I have tried to avoid the effect of file system buffering by setting buffered=0, and I am using asynchronous I/O as provided by libaio (just as Oracle does).
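To mimic the SLOB workload more closely, FIO would have to mix in roughly 20% random writes and force all jobs onto a single shared file. The sketch below shows what such a job could look like; the parameters are my own choice to mirror UPDATE_PCT=20, and I did not run this as part of the tests in this post:

# hypothetical job: 80/20 random read/write from 4 jobs against one shared 5G file
cat > rand-rw-shared.fio <<'EOF'
[rand-rw-shared]
numjobs=4
rw=randrw
rwmixread=80
filename=/m/kvm/oracle/fio-shared-file
size=5G
bs=8k
ioengine=libaio
buffered=0
iodepth=8
EOF

./fio rand-rw-shared.fio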
Theory Number 2
It would appear that XFS is better suited to concurrent writes. Let's put that theory into action: I reformatted the SATA 3 SSD with XFS and ran another round of tests. Feast your eyes on the comparison of the AWR reports, the EXT4 test execution versus XFS on /dev/sde:
WORKLOAD REPOSITORY COMPARE PERIOD REPORT

Snapshot Set    DB Id     Instance  Inst Num  Release     Cluster  Host          Std Block Size
------------  ----------  --------  --------  ----------  -------  ------------  --------------
First (1st)   1269760792  SLOB12           1  12.1.0.2.0  NO       ol62.localdo            8192
Second (2nd)  1269760792  SLOB12           1  12.1.0.2.0  NO       ol62.localdo            8192

Snapshot Set  Begin Snap Id  Begin Snap Time            End Snap Id  End Snap Time              Avg Active Users  Elapsed Time (min)  DB time (min)
------------  -------------  -------------------------  -----------  -------------------------  ----------------  ------------------  -------------
1st                      49  31-Oct-14 12:38:18 (Fri)            50  31-Oct-14 12:40:20 (Fri)               15.8                 2.0           32.3
2nd                      55  31-Oct-14 15:35:23 (Fri)            56  31-Oct-14 15:37:25 (Fri)               15.7                 2.0           32.1
                                                                                                ~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~
%Diff:                                                                                                      0.2%               -0.5%          -0.6%

Host Configuration Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                          1st        2nd       Diff   %Diff
-----------------------------------  ---------  ---------  ---------  ------
Number of CPUs:                              24         24          0     0.0
Number of CPU Cores:                         24         24          0     0.0
Number of CPU Sockets:                        2          2          0     0.0
Physical Memory:                       64413.7M   64413.7M         0M     0.0
Load at Start Snapshot:                    9.57       1.77       -7.8   -81.5
Load at End Snapshot:                      15.6      14.15      -1.45    -9.3
%User Time:                                 .64       2.07       1.43   223.4
%System Time:                               .74       1.48        .75   100.0
%Idle Time:                                98.6       96.4       -2.2    -2.2
%IO Wait Time:                             9.06      53.29      44.23   488.2

Cache Sizes
~~~~~~~~~~~
                           1st (M)   2nd (M)   Diff (M)  %Diff
----------------------  ----------  --------  ---------  ------
Memory Target
.....SGA Target
..........Buffer Cache        16.0      16.0        0.0     0.0
..........Shared Pool      1,024.0   1,024.0        0.0     0.0
..........Large Pool
..........Java Pool           16.0      16.0        0.0     0.0
..........Streams Pool
.....PGA Target            1,024.0   1,024.0        0.0     0.0
Log Buffer                   128.0     128.0        0.0     0.0
In-Memory Area

Workload Comparison
~~~~~~~~~~~~~~~~~~~
                              1st Per Sec   2nd Per Sec   %Diff    1st Per Txn   2nd Per Txn   %Diff
                             ------------  ------------  ------   ------------  ------------  ------
                  DB time:           15.8          15.8     0.2            4.4           1.5   -66.7
                 CPU time:            0.3           0.9   175.8            0.1           0.1     0.0
      Background CPU time:            0.1           0.1   133.3            0.0           0.0   -50.0
        Redo size (bytes):    4,652,147.0  15,186,494.1   226.4    1,311,306.6   1,426,136.0     8.8
    Logical read (blocks):        5,231.1      16,957.8   224.2        1,474.5       1,592.5     8.0
            Block changes:        2,640.2       8,922.1   237.9          744.2         837.9    12.6
   Physical read (blocks):        3,750.0      12,560.9   235.0        1,057.0       1,179.6    11.6
  Physical write (blocks):        1,729.0       5,855.0   238.6          487.4         549.8    12.8
         Read IO requests:        3,749.3      12,560.8   235.0        1,056.8       1,179.6    11.6
        Write IO requests:        1,538.2       5,219.5   239.3          433.6         490.2    13.0
             Read IO (MB):           29.3          98.1   234.9            8.3           9.2    11.6
            Write IO (MB):           13.5          45.7   238.6            3.8           4.3    12.9
             IM scan rows:            0.0           0.0     0.0            0.0           0.0     0.0
  Session Logical Read IM:
               User calls:            1.4           1.4     2.9            0.4           0.1   -66.7
             Parses (SQL):            6.3          10.3    62.7            1.8           1.0   -45.5
        Hard parses (SQL):            0.3           0.0   -86.2            0.1           0.0  -100.0
        SQL Work Area (MB):           0.4           0.2   -48.8            0.1           0.0   -48.8
                   Logons:            0.1           0.1     0.0            0.0           0.0   -50.0
           Executes (SQL):           23.7          58.0   144.3            6.7           5.4   -18.5
             Transactions:            3.5          10.6   200.0

                                        First           Second     Diff
                              ---------------  ---------------  ------
  % Blocks changed per Read:             50.5             52.6     2.1
           Recursive Call %:             97.6             99.0     1.5
 Rollback per transaction %:              0.0              0.0     0.0
               Rows per Sort:            35.4             49.6    14.2
 Avg DB time per Call (sec):             11.5             11.3    -0.3

Top Timed Events   First DB/Inst: SLOB12/SLOB12 Snaps: 49-50 (Elapsed time: 122.614 sec  DB time: 1935.33 sec),  Second DB/Inst: SLOB12/SLOB12 Snaps: 55-56 (Elapsed time: 121.705 sec  DB time: 1923.77 sec)
-> Events with a "-" did not make the Top list in this set of snapshots, but are displayed for comparison purposes

                                              1st                                                                                                  2nd
------------------------------------------------------------------------------------------------   ------------------------------------------------------------------------------------------------
Event                           Wait Class         Waits      Time(s)   Avg Time(ms)   %DB time     Event                           Wait Class         Waits      Time(s)   Avg Time(ms)   %DB time
------------------------------  -------------  ------------  ---------  -------------  ----------   ------------------------------  -------------  ------------  ---------  -------------  ----------
db file sequential read         User I/O            459,637    1,920.9            4.2        99.3   db file sequential read         User I/O          1,528,725    1,840.4            1.2        95.7
db file async I/O submit        System I/O            2,975       97.2           32.7         5.0   CPU time                        N/A                               110.3            N/A         5.7
CPU time                        N/A                                40.2            N/A         2.1   db file parallel write          System I/O          114,997       34.3            0.3         1.8
log file parallel write         System I/O            1,881        3.2            1.7         0.2   free buffer waits               Configuration         2,318       28.0           12.1         1.5
db file parallel write          System I/O            4,670        0.9            0.2         0.0   log file parallel write         System I/O            6,897       14.3            2.1         0.7
control file sequential read    System I/O              499        0.2            0.3         0.0   db file async I/O submit        System I/O           12,819        5.2            0.4         0.3
oracle thread bootstrap         Other                     7        0.1           18.1         0.0   buffer busy waits               Concurrency             407        2.7            6.7         0.1
Disk file operations I/O        User I/O                 77        0.1            0.9         0.0   Data file init write            User I/O                152        0.9            5.7         0.0
library cache: mutex X          Concurrency              22        0.1            2.7         0.0   read by other session           User I/O                 60        0.7           11.9         0.0
Data file init write            User I/O                  5        0.0            5.6         0.0   enq: HW - contention            Configuration            18        0.7           39.3         0.0
-read by other session          User I/O                 18        0.0            0.6         0.0   -control file sequential read   System I/O              841        0.5            0.5         0.0
-buffer busy waits              Concurrency              82        0.0            0.0         0.0   -oracle thread bootstrap        Other                     7        0.1           17.2         0.0
-                               N/A                    N/A          N/A            N/A              -Disk file operations I/O       User I/O                136        0.1            0.8         0.0
-                               N/A                    N/A          N/A            N/A              -library cache: mutex X         Concurrency              21        0.0            1.4         0.0
--------------------------------------------------------------------------------------------------------------------
Everything is just better: the I/O times, the redo size, the number of IOPS and so on. That's what I would have expected from the outset. Good to know; I'll no longer use EXT4 for Oracle databases. When I discussed this with @fritshoogland, he pointed me to a post by @kevinclosson that hadn't come to mind at the time:
Yes, File Systems Still Need To Support Concurrent Writes! Yet Another Look At XFS Versus EXT4.
What I’d find interesting now would be if you had time to reformat the drive with ASM and see how that AWR report compared to xfs!