Interesting observations executing SLOB2 with ext4 and xfs on SSD

My lab server has 2 SSDs: one is connected via SATA 2, the other via SATA 3. I'd expect the SATA 3-connected device to do at least as well as the one on the "old" interface, if not better. I ran SLOB on both devices to find out whether there was a difference. To my great surprise, the SATA 2-connected SSD performed a lot better than the SATA 3 device, as shown in the AWR reports! Initially I was not entirely sure why, since the FIO results for both devices are roughly equal. You will see why when reading this post. In summary: use XFS for anything that writes concurrently. Or maybe ASM.

Let’s investigate

Let’s do a little I/O investigation because a) it’s cool and b) you can.

This is Oracle Linux 6.5 with UEK 3. A word of warning though: ext4 has been, let's say, a little flaky. I first found out about that in 2011 (https://martincarstenbach.wordpress.com/2011/11/04/an-interesting-problem-with-ext4-on-oracle-linux-5-5/) and Oracle has since published a MOS note: ORA-1578 ORA-353 ORA-19599 Corrupt blocks with zeros when filesystemio_options=SETALL on ext4 file system using Linux (Doc ID 1487957.1). I am on a reasonably recent version, so the observations in this post are probably not related to the above notes. The device /dev/sda is connected via SATA 2, and the file system mounted on /u01/app/oracle/oradata is XFS-formatted. The other SSD, /dev/sde, is EXT4-formatted and connected via SATA 3 provided by a PCIe v2 x4 card. I wish I had PCIe v3, but my Opteron 6200 system doesn't support native SATA6G.

SLOB

I created a 12c non-CDB database for SLOB with the IOPS tablespace on my SATA 2 SSD, which is XFS-formatted. Each test was repeated four times. Should you want to follow along, here is my slob.conf:

[/m/kvm/oracle]$ cat ~/SLOB2/SLOB/slob.conf 

UPDATE_PCT=20
RUN_TIME=120
WORK_LOOP=0
SCALE=10000
WORK_UNIT=256
REDO_STRESS=HEAVY
LOAD_PARALLEL_DEGREE=8
SHARED_DATA_MODULUS=0

# Settings for SQL*Net connectivity:
#ADMIN_SQLNET_SERVICE=slob
#SQLNET_SERVICE_BASE=slob
#SQLNET_SERVICE_MAX=2
#SYSDBA_PASSWD="change_on_install"

export UPDATE_PCT RUN_TIME WORK_LOOP SCALE WORK_UNIT LOAD_PARALLEL_DEGREE REDO_STRESS SHARED_DATA_MODULUS

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]

Notice the UPDATE_PCT setting: it will become important later. The SLOB configuration was created using a call to ./setup.sh with IOPS as the target tablespace and 128 users, which is the default. My database uses the initialisation file shown below, which is more or less the same as the default (and my buffer cache is indeed 16M):

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
[/m/kvm/oracle]$ cat /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/initSLOB12.ora 

#set the following to paths that make sense on your system:
db_create_file_dest = '/u01/app/oracle/oradata'
control_files='/u01/app/oracle/oradata/SLOB12/controlfile/o1_mf_b52b4lvk_.ctl'

db_name = SLOB12
compatible = 12.1.0.2.0
undo_management = auto
db_block_size = 8192
db_files = 2000
processes = 500
shared_pool_size = 1G
db_cache_size = 10M     # Will force maximum physical I/O
#db_cache_size = 15G    # Sufficient for cached runs with reference scale (10,000 rows) and 128 sessions
filesystemio_options = setall
parallel_max_servers = 0
log_buffer = 134217728
cpu_count = 2
pga_aggregate_target = 1G
remote_login_passwordfile = exclusive
workarea_size_policy = auto
recyclebin = off

# The following are needed to force db file sequential read. Omit for direct path short scans.
# If you would like to investigate a blend of short scans mixed with single-block random I/O
# then comment out these three parameters :
_db_block_prefetch_limit = 0
_db_block_prefetch_quota = 0
_db_file_noncontig_mblock_read_count = 0

# Uncomment the following if suffering library cache contention (at larger numbers of sessions) or high log file sync waits
# at CPU-saturated levels.
#_cursor_obsolete_threshold=130
#_high_priority_processes=LGWR

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
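For completeness, the data load and test runs boil down to two calls. This is a sketch using the tablespace name and session counts mentioned in this post; the SLOB directory path matches the slob.conf listing above:

```shell
# Load the SLOB schemas into the IOPS tablespace with 128 users
# (the setup.sh default), then drive the workload with 16 sessions.
cd ~/SLOB2/SLOB
./setup.sh IOPS 128
./runit.sh 16
```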

After the fourth execution of ./runit.sh 16 I moved the IOPS data file from /dev/sda to /dev/sde and ran another 4 tests.
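In 12c the data file can be relocated online. A minimal sketch, assuming a hypothetical file name iops.dbf (the actual OMF file names on this system differ):

```shell
# 12c online datafile move; both paths shown here are hypothetical.
sqlplus / as sysdba <<'EOF'
ALTER DATABASE MOVE DATAFILE '/u01/app/oracle/oradata/SLOB12/datafile/iops.dbf'
  TO '/m/kvm/oracle/SLOB12/iops.dbf';
EOF
```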

Observations

I ran an AWR difference report to compare the performance of the XFS and EXT4 SSDs. The most interesting bit is right at the top. Apologies for the very wide output; that's the way the text report lays it out.

WORKLOAD REPOSITORY COMPARE PERIOD REPORT

Snapshot Set    DB Id    Instance     Inst Num Release     Cluster Host          Std Block Size
------------ ----------- ------------ -------- ----------- ------- ------------ ---------------
First (1st)   1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192
Second (2nd)  1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192

Snapshot Set  Begin Snap Id Begin Snap Time            End Snap Id End Snap Time                  Avg Active Users           Elapsed Time (min)            DB time (min)
------------ -------------- ------------------------- ------------ ------------------------- -------------------------- -------------------------- --------------------------
1st                      41 31-Oct-14 12:18:14 (Fri)           42 31-Oct-14 12:20:17 (Fri)                      15.8                       2.0                      32.1
2nd                      49 31-Oct-14 12:38:18 (Fri)           50 31-Oct-14 12:40:20 (Fri)                      15.8                       2.0                      32.3
                                                                                      ~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                                                      %Diff:                     0.2%                       0.0%                       0.5%

Host Configuration Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                     1st                  2nd                 Diff     %Diff
----------------------------------- -------------------- -------------------- -------------------- ---------
Number of CPUs:                                       24                   24                    0       0.0
Number of CPU Cores:                                  24                   24                    0       0.0
Number of CPU Sockets:                                 2                    2                    0       0.0
Physical Memory:                                64413.7M             64413.7M                   0M       0.0
Load at Start Snapshot:                             9.13                 9.57                  .44       4.8
Load at End Snapshot:                              15.38                 15.6                  .22       1.4
%User Time:                                         1.24                  .64                  -.6     -48.4
%System Time:                                        .75                  .74                 -.02      -1.3
%Idle Time:                                        97.89                 98.6                  .71       0.7
%IO Wait Time:                                     35.21                 9.06               -26.15     -74.3
Cache Sizes
~~~~~~~~~~~
                          1st (M)    2nd (M)   Diff (M)    %Diff
---------------------- ---------- ---------- ---------- --------
Memory Target
.....SGA Target
..........Buffer Cache       16.0       16.0        0.0      0.0
..........Shared Pool     1,024.0    1,024.0        0.0      0.0
..........Large Pool
..........Java Pool          16.0       16.0        0.0      0.0
..........Streams Pool
.....PGA Target           1,024.0    1,024.0        0.0      0.0
Log Buffer                  128.0      128.0        0.0      0.0
In-Memory Area

Workload Comparison
~~~~~~~~~~~~~~~~~~~                      1st Per Sec          2nd Per Sec      %Diff              1st Per Txn          2nd Per Txn      %Diff
                                     ---------------      ---------------     ------          ---------------      ---------------     ------
                       DB time:                 15.8                 15.8        0.2                      2.5                  4.4       78.7
                      CPU time:                  0.5                  0.3      -29.8                      0.1                  0.1       28.6
           Background CPU time:                  0.1                  0.1      -40.0                      0.0                  0.0        0.0
             Redo size (bytes):          8,721,293.5          4,652,147.0      -46.7              1,379,498.8          1,311,306.6       -4.9
         Logical read (blocks):             10,403.9              5,231.1      -49.7                  1,645.6              1,474.5      -10.4
                 Block changes:              5,216.5              2,640.2      -49.4                    825.1                744.2       -9.8
        Physical read (blocks):              7,412.5              3,750.0      -49.4                  1,172.5              1,057.0       -9.8
       Physical write (blocks):              3,369.0              1,729.0      -48.7                    532.9                487.4       -8.5
              Read IO requests:              7,411.8              3,749.3      -49.4                  1,172.4              1,056.8       -9.9
             Write IO requests:              3,018.0              1,538.2      -49.0                    477.4                433.6       -9.2
                  Read IO (MB):                 57.9                 29.3      -49.4                      9.2                  8.3       -9.8
                 Write IO (MB):                 26.3                 13.5      -48.7                      4.2                  3.8       -8.4
                  IM scan rows:                  0.0                  0.0        0.0                      0.0                  0.0        0.0
       Session Logical Read IM:
                    User calls:                  1.4                  1.4       -1.4                      0.2                  0.4       77.3
                  Parses (SQL):                  5.9                  6.3        7.5                      0.9                  1.8       91.4
             Hard parses (SQL):                  0.1                  0.3      107.1                      0.0                  0.1      300.0
            SQL Work Area (MB):                  0.6                  0.4      -25.9                      0.1                  0.1      -25.9
                        Logons:                  0.1                  0.1        0.0                      0.0                  0.0      100.0
                Executes (SQL):                 38.1                 23.7      -37.6                      6.0                  6.7       11.1
                  Transactions:                  6.3                  3.5      -43.8

                                               First               Second       Diff
                                     ---------------      ---------------     ------
     % Blocks changed per Read:                 50.1                 50.5        0.3
              Recursive Call %:                 98.1                 97.6       -0.6
    Rollback per transaction %:                  0.0                  0.0        0.0
                 Rows per Sort:                 68.3                 35.4      -32.9
     Avg DB time per Call (sec):                 11.3                 11.5        0.2
...

I copied the header from the report to show you that nothing really changed between the test executions, and that I'm going to do a lot of I/O since the buffer cache is only 16 MB.

What is striking is that the physical reads and writes are roughly half of what I saw during the first batch of executions. The culprit is found quickly: in the scenario I chose, SLOB primarily performs single-block reads, reported as "db file sequential read":

Top Timed Events   First DB/Inst: SLOB12/SLOB12 Snaps: 41-42 (Elapsed time: 122.27 sec  DB time: 1925.83 sec),  Second DB/Inst: SLOB12/SLOB12 Snaps: 49-50 (Elapsed time: 122.614 sec  DB time: 1935.33 sec)
-> Events with a "-" did not make the Top list in this set of snapshots, but are displayed for comparison purposes

                                               1st                                                                                                2nd
------------------------------------------------------------------------------------------------   ------------------------------------------------------------------------------------------------
Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time   Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time
------------------------------ ------------- ------------ ------------ ------------- -----------   ------------------------------ ------------- ------------ ------------ ------------- -----------
 db file sequential read       User I/O           906,150      1,899.1           2.1        98.6    db file sequential read       User I/O           459,637      1,920.9           4.2        99.3
 db file parallel write        System I/O         116,318         60.0           0.5         3.1    db file async I/O submit      System I/O           2,975         97.2          32.7         5.0
 CPU time                                             N/A         57.7           N/A         3.0    CPU time                                             N/A         40.2           N/A         2.1
 log file parallel write       System I/O           3,485         47.4          13.6         2.5    log file parallel write       System I/O           1,881          3.2           1.7         0.2
 db file async I/O submit      System I/O           7,425          3.7           0.5         0.2    db file parallel write        System I/O           4,670          0.9           0.2         0.0
...

The I/O times for the second interval (the EXT4 case) are twice what they are for the first. But why? At first I thought the PCIe card added extra latency, and I came up with a few other ideas that all remained just that: ideas. Ideas are nice, but proof is better. Maybe there is indeed something odd with the device? Or maybe there is something in the way Oracle uses it? To find out I used FIO, my favourite I/O benchmark when it comes to measuring raw performance. If the FIO benchmark results matched the Oracle results then there must indeed be something wrong with the SSD or the way it is addressed. I spent a bit of time compiling FIO and ran a set of tests.

FIO

I was really quite curious to see what was happening during the execution of FIO: was the SATA 3 SSD really slower? So I started the FIO tests, using FIO 2.1.13. Both devices use the same test harness; only the destination differs:

[~/fio-2.1.13]$ cat rand-read-sd*
[rand-read]
numjobs=4
rw=randread
directory=/u01/fio/data
size=5G
bs=8k
ioengine=libaio
buffered=0
iodepth=8

[rand-read]
numjobs=4
rw=randread
directory=/m/kvm/oracle
size=5G
bs=8k
ioengine=libaio
buffered=0
iodepth=8

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
[~/fio-2.1.13]$

The /u01 file system sits on /dev/sda, and uses XFS. /m/kvm/oracle is on /dev/sde, using EXT4. Here is the output from the tests:

[~/fio-2.1.13]$ ./fio rand-read-sde.fio
rand-read: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=8
...
fio-2.1.13
Starting 4 processes
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
Jobs: 3 (f=3): [r(3),_(1)] [99.2% done] [192.8MB/0KB/0KB /s] [24.7K/0/0 iops] [eta 00m:01s]
rand-read: (groupid=0, jobs=1): err= 0: pid=6124: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=44174KB/s, iops=5521, runt=118686msec
    slat (usec): min=4, max=109, avg=13.06, stdev= 6.14
    clat (usec): min=128, max=43082, avg=1434.23, stdev=586.65
     lat (usec): min=140, max=43094, avg=1447.47, stdev=586.41
    clat percentiles (usec):
     |  1.00th=[  516],  5.00th=[  708], 10.00th=[  820], 20.00th=[  988],
     | 30.00th=[ 1112], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1480],
     | 70.00th=[ 1608], 80.00th=[ 1816], 90.00th=[ 2160], 95.00th=[ 2480],
     | 99.00th=[ 3184], 99.50th=[ 3472], 99.90th=[ 4192], 99.95th=[ 4576],
     | 99.99th=[ 6752]
    bw (KB  /s): min=27088, max=70928, per=24.99%, avg=44119.50, stdev=6314.88
    lat (usec) : 250=0.01%, 500=0.78%, 750=6.02%, 1000=14.12%
    lat (msec) : 2=65.54%, 4=13.37%, 10=0.15%, 20=0.01%, 50=0.01%
  cpu          : usr=1.89%, sys=11.05%, ctx=421189, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6125: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=44138KB/s, iops=5517, runt=118784msec
    slat (usec): min=4, max=108, avg=13.14, stdev= 6.14
    clat (usec): min=117, max=43252, avg=1435.31, stdev=587.66
     lat (usec): min=124, max=43264, avg=1448.64, stdev=587.40
    clat percentiles (usec):
     |  1.00th=[  502],  5.00th=[  700], 10.00th=[  820], 20.00th=[  988],
     | 30.00th=[ 1128], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1480],
     | 70.00th=[ 1624], 80.00th=[ 1816], 90.00th=[ 2160], 95.00th=[ 2480],
     | 99.00th=[ 3152], 99.50th=[ 3472], 99.90th=[ 4192], 99.95th=[ 4640],
     | 99.99th=[ 6688]
    bw (KB  /s): min=27504, max=69760, per=24.91%, avg=43972.22, stdev=6169.69
    lat (usec) : 250=0.07%, 500=0.92%, 750=5.84%, 1000=13.94%
    lat (msec) : 2=65.68%, 4=13.40%, 10=0.16%, 20=0.01%, 50=0.01%
  cpu          : usr=1.89%, sys=11.10%, ctx=421379, majf=0, minf=46
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6126: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=44230KB/s, iops=5528, runt=118538msec
    slat (usec): min=4, max=120, avg=12.59, stdev= 5.61
    clat (usec): min=185, max=42871, avg=1433.09, stdev=581.75
     lat (usec): min=196, max=42881, avg=1445.85, stdev=581.45
    clat percentiles (usec):
     |  1.00th=[  532],  5.00th=[  716], 10.00th=[  828], 20.00th=[  988],
     | 30.00th=[ 1128], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1464],
     | 70.00th=[ 1608], 80.00th=[ 1800], 90.00th=[ 2128], 95.00th=[ 2480],
     | 99.00th=[ 3152], 99.50th=[ 3440], 99.90th=[ 4192], 99.95th=[ 4576],
     | 99.99th=[ 7008]
    bw (KB  /s): min=27600, max=65952, per=25.01%, avg=44149.65, stdev=5945.98
    lat (usec) : 250=0.01%, 500=0.63%, 750=5.90%, 1000=14.17%
    lat (msec) : 2=65.89%, 4=13.25%, 10=0.15%, 20=0.01%, 50=0.01%
  cpu          : usr=1.74%, sys=11.20%, ctx=426219, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6127: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=47359KB/s, iops=5919, runt=110704msec
    slat (usec): min=4, max=114, avg=12.83, stdev= 5.90
    clat (usec): min=188, max=42988, avg=1337.06, stdev=571.06
     lat (usec): min=198, max=43000, avg=1350.07, stdev=570.63
    clat percentiles (usec):
     |  1.00th=[  446],  5.00th=[  620], 10.00th=[  748], 20.00th=[  900],
     | 30.00th=[ 1020], 40.00th=[ 1144], 50.00th=[ 1256], 60.00th=[ 1368],
     | 70.00th=[ 1512], 80.00th=[ 1704], 90.00th=[ 2040], 95.00th=[ 2352],
     | 99.00th=[ 3024], 99.50th=[ 3280], 99.90th=[ 3984], 99.95th=[ 4384],
     | 99.99th=[ 6624]
    bw (KB  /s): min=33664, max=51008, per=26.83%, avg=47369.34, stdev=4051.34
    lat (usec) : 250=0.01%, 500=1.45%, 750=8.72%, 1000=17.95%
    lat (msec) : 2=61.12%, 4=10.65%, 10=0.09%, 20=0.01%, 50=0.01%
  cpu          : usr=1.98%, sys=11.60%, ctx=410433, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: io=20480MB, aggrb=176551KB/s, minb=44137KB/s, maxb=47359KB/s, mint=110704msec, maxt=118784msec

Disk stats (read/write):
  sde: ios=2619049/4, merge=0/1, ticks=3663514/6, in_queue=3662827, util=100.00%

The best tool for looking at I/O without tracing block devices is, in my opinion, collectl. I ran its extended disk statistics in parallel with the FIO test:

[root@ol62 queue]# collectl -sD -i 5 --iosize --dskfilt sde
waiting for 5 second sample...

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sde              0      0    0    0  125340    129  123 1021    1020   219  1749      8   99
sde              0      0    0    0  119608     58  118 1017    1017   156  1666      7   88
sde         146968      0  18K    8       0      0    0    0       8    31     1      0   89
sde         165435      0  20K    8       2      0    0    6       7    31     1      0   99
sde         187097      0  23K    8       0      0    0    0       8    31     1      0  100
sde         187113      0  23K    8       0      0    0    0       8    31     1      0   99
sde         157684      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186542      0  23K    8       1      0    0    4       7    31     1      0  100
sde         186721      0  23K    8       0      0    0    0       8    31     1      0   99
sde         158416      0  19K    8       0      0    0    0       8    31     1      0  100
sde         187168      0  23K    8       0      0    0    0       8    31     1      0   99
sde         186027      0  23K    8       0      0    0    0       8    31     1      0  100
sde         158366      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186028      0  23K    8       0      0    0    0       8    31     1      0   99
sde         186416      0  23K    8       0      0    0    0       8    31     1      0  100
sde         159102      0  19K    8       0      0    0    0       8    31     1      0  100
sde         186348      0  23K    8       0      0    0    0       8    31     1      0  100
sde         186859      0  23K    8       0      0    0    0       8    31     1      0   99
sde         153996      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186043      0  23K    8       0      0    0    0       8    31     1      0  100
sde         185348      0  23K    8       0      0    0    0       8    31     1      0   99
sde         158150      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186148      0  23K    8       0      0    0    0       8    31     1      0   99
sde         185900      0  23K    8       0      0    0    0       8    31     1      0   99
sde         159353      0  19K    8       0      0    0    0       8    25     1      0   99
sde         167065      0  20K    8       0      0    0    0       8    22     0      0   86
sde              0      0    0    0       0      0    0    0       0     0     0      0    0
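The collectl throughput figures divide down to the same IOPS figure FIO reported. collectl reports throughput in KBytes/sec, so at an 8 KiB I/O size:

```shell
# IOPS = KBytes per second / 8 KiB per I/O; taking one of the
# ~186 MB/s samples from the collectl output above:
echo $((186348 / 8))   # prints 23293, i.e. roughly 23K IOPS
```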

This matches the FIO output: the device can indeed do around 23K IOPS at an 8k I/O size. What about the XFS mount?

[~/fio-2.1.13]$ ./fio rand-read-sda.fio
rand-read: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=8
...
fio-2.1.13
Starting 4 processes
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
Jobs: 4 (f=4): [r(4)] [100.0% done] [191.3MB/0KB/0KB /s] [24.5K/0/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=1): err= 0: pid=6329: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49087KB/s, iops=6135, runt=106809msec
    slat (usec): min=3, max=93, avg= 7.90, stdev= 6.40
    clat (usec): min=280, max=35113, avg=1294.26, stdev=152.91
     lat (usec): min=285, max=35130, avg=1302.35, stdev=152.60
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1192], 20.00th=[ 1240],
     | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1672], 99.95th=[ 2224],
     | 99.99th=[ 3024]
    bw (KB  /s): min=46000, max=49536, per=25.02%, avg=49124.10, stdev=291.98
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.03%
    lat (msec) : 2=99.89%, 4=0.07%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.39%, sys=8.05%, ctx=505195, majf=0, minf=43
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6330: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49093KB/s, iops=6136, runt=106795msec
    slat (usec): min=3, max=114, avg= 7.55, stdev= 5.74
    clat (usec): min=263, max=35353, avg=1294.46, stdev=155.68
     lat (usec): min=271, max=35369, avg=1302.21, stdev=155.46
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1192], 20.00th=[ 1256],
     | 30.00th=[ 1272], 40.00th=[ 1288], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1672], 99.95th=[ 2224],
     | 99.99th=[ 3184]
    bw (KB  /s): min=45728, max=49472, per=25.02%, avg=49123.20, stdev=300.30
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.02%
    lat (msec) : 2=99.89%, 4=0.07%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.42%, sys=7.88%, ctx=549312, majf=0, minf=46
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6331: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49080KB/s, iops=6134, runt=106824msec
    slat (usec): min=3, max=137, avg= 7.70, stdev= 5.87
    clat (usec): min=266, max=35102, avg=1294.64, stdev=147.64
     lat (usec): min=274, max=35121, avg=1302.52, stdev=147.39
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1208], 20.00th=[ 1256],
     | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 2064], 99.95th=[ 2256],
     | 99.99th=[ 3600]
    bw (KB  /s): min=45888, max=49536, per=25.02%, avg=49110.62, stdev=295.81
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.04%
    lat (msec) : 2=99.84%, 4=0.10%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.32%, sys=8.07%, ctx=536841, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6332: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49080KB/s, iops=6134, runt=106824msec
    slat (usec): min=3, max=106, avg= 7.66, stdev= 5.93
    clat (usec): min=274, max=35268, avg=1294.67, stdev=157.16
     lat (usec): min=281, max=35286, avg=1302.53, stdev=156.91
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1208], 20.00th=[ 1256],
     | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1944], 99.95th=[ 2224],
     | 99.99th=[ 3536]
    bw (KB  /s): min=45840, max=49536, per=25.02%, avg=49109.56, stdev=303.79
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.05%
    lat (msec) : 2=99.84%, 4=0.09%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.50%, sys=7.83%, ctx=533469, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: io=20480MB, aggrb=196318KB/s, minb=49079KB/s, maxb=49092KB/s, mint=106795msec, maxt=106824msec

Disk stats (read/write):
  sda: ios=2620077/60, merge=0/2, ticks=3348979/94, in_queue=3348631, util=100.00%
[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
[~/fio-2.1.13]$

Here is some of the corresponding collectl output:

[root@ol62 queue]# collectl -sD -i 5 --iosize --dskfilt sda
waiting for 5 second sample...

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sda         196313      0  24K    8       5      0    1    9       8    31     1      0   99
sda         196681      0  24K    8       6      0    0   16       8    31     1      0  100
sda         196369      0  24K    8       7      0    1   12       8    31     1      0   99
sda         196217      0  24K    8       4      0    0   10       8    31     1      0  100
sda         196300      0  24K    8      14      0    2    6       8    31     1      0  100
sda         196699      0  24K    8       6      0    0   16       8    31     1      0   99
sda         196340      0  24K    8       3      0    0   16       8    31     1      0   99
sda         196350      0  24K    8       6      0    0   16       8    31     1      0  100
sda         196379      0  24K    8       6      0    0   16       8    31     1      0   99
sda         196560      0  24K    8       3      0    0   16       8    31     1      0   99
sda         196259      0  24K    8       8      0    1    8       8    31     1      0   99

Summary Number 1

So it would appear that the SSDs are more or less equal from FIO's point of view. If you followed closely (hurray if you did, I appreciate this is a long post!) you will have noticed the following:

  • The SLOB test performed updates (UPDATE_PCT=20), and there is just one data file for the IOPS tablespace (so there is write concurrency on a single file!)
  • The FIO test was based on random reads, no writes. Its 4 jobs also worked on 4 distinct files.

I tried to avoid the effect of file system buffering by setting buffered=0 in the fio job file, and I also used asynchronous I/O as provided by libaio (just as Oracle does).
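The job file itself is not reproduced in this section, but based on the description above (random reads, 4 jobs on 4 distinct files, no buffering, libaio) it would have looked roughly like this sketch; the directory, size and runtime values are placeholders rather than the actual settings:

```ini
[global]
ioengine=libaio   # asynchronous I/O, as Oracle uses
buffered=0        # bypass the file system page cache
rw=randread       # random reads only, no writes
bs=8k             # Oracle block size
size=1g           # placeholder
runtime=120       # placeholder
time_based=1

[randread]
directory=/path/to/mountpoint  # placeholder: the file system under test
numjobs=4         # four jobs, each working on its own file
```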

Theory Number 2

It would appear that XFS is better suited for concurrent writes. Let's put that theory to the test. I reformatted the SATA3 SSD with XFS and ran another test. Feast your eyes on the result: a comparison of the AWR reports for the EXT4 test execution and XFS on /dev/sde:

WORKLOAD REPOSITORY COMPARE PERIOD REPORT

Snapshot Set    DB Id    Instance     Inst Num Release     Cluster Host          Std Block Size
------------ ----------- ------------ -------- ----------- ------- ------------ ---------------
First (1st)   1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192
Second (2nd)  1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192

Snapshot Set  Begin Snap Id Begin Snap Time            End Snap Id End Snap Time                  Avg Active Users           Elapsed Time (min)            DB time (min)
------------ -------------- ------------------------- ------------ ------------------------- -------------------------- -------------------------- --------------------------
1st                      49 31-Oct-14 12:38:18 (Fri)           50 31-Oct-14 12:40:20 (Fri)                      15.8                       2.0                      32.3
2nd                      55 31-Oct-14 15:35:23 (Fri)           56 31-Oct-14 15:37:25 (Fri)                      15.7                       2.0                      32.1
                                                                                      ~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                                                      %Diff:                     0.2%                      -0.5%                      -0.6%

Host Configuration Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                     1st                  2nd                 Diff     %Diff
----------------------------------- -------------------- -------------------- -------------------- ---------
Number of CPUs:                                       24                   24                    0       0.0
Number of CPU Cores:                                  24                   24                    0       0.0
Number of CPU Sockets:                                 2                    2                    0       0.0
Physical Memory:                                64413.7M             64413.7M                   0M       0.0
Load at Start Snapshot:                             9.57                 1.77                 -7.8     -81.5
Load at End Snapshot:                               15.6                14.15                -1.45      -9.3
%User Time:                                          .64                 2.07                 1.43     223.4
%System Time:                                        .74                 1.48                  .75     100.0
%Idle Time:                                         98.6                 96.4                 -2.2      -2.2
%IO Wait Time:                                      9.06                53.29                44.23     488.2
Cache Sizes
~~~~~~~~~~~
                          1st (M)    2nd (M)   Diff (M)    %Diff
---------------------- ---------- ---------- ---------- --------
Memory Target
.....SGA Target
..........Buffer Cache       16.0       16.0        0.0      0.0
..........Shared Pool     1,024.0    1,024.0        0.0      0.0
..........Large Pool
..........Java Pool          16.0       16.0        0.0      0.0
..........Streams Pool
.....PGA Target           1,024.0    1,024.0        0.0      0.0
Log Buffer                  128.0      128.0        0.0      0.0
In-Memory Area

Workload Comparison
~~~~~~~~~~~~~~~~~~~                      1st Per Sec          2nd Per Sec      %Diff              1st Per Txn          2nd Per Txn      %Diff
                                     ---------------      ---------------     ------          ---------------      ---------------     ------
                       DB time:                 15.8                 15.8        0.2                      4.4                  1.5      -66.7
                      CPU time:                  0.3                  0.9      175.8                      0.1                  0.1        0.0
           Background CPU time:                  0.1                  0.1      133.3                      0.0                  0.0      -50.0
             Redo size (bytes):          4,652,147.0         15,186,494.1      226.4              1,311,306.6          1,426,136.0        8.8
         Logical read (blocks):              5,231.1             16,957.8      224.2                  1,474.5              1,592.5        8.0
                 Block changes:              2,640.2              8,922.1      237.9                    744.2                837.9       12.6
        Physical read (blocks):              3,750.0             12,560.9      235.0                  1,057.0              1,179.6       11.6
       Physical write (blocks):              1,729.0              5,855.0      238.6                    487.4                549.8       12.8
              Read IO requests:              3,749.3             12,560.8      235.0                  1,056.8              1,179.6       11.6
             Write IO requests:              1,538.2              5,219.5      239.3                    433.6                490.2       13.0
                  Read IO (MB):                 29.3                 98.1      234.9                      8.3                  9.2       11.6
                 Write IO (MB):                 13.5                 45.7      238.6                      3.8                  4.3       12.9
                  IM scan rows:                  0.0                  0.0        0.0                      0.0                  0.0        0.0
       Session Logical Read IM:
                    User calls:                  1.4                  1.4        2.9                      0.4                  0.1      -66.7
                  Parses (SQL):                  6.3                 10.3       62.7                      1.8                  1.0      -45.5
             Hard parses (SQL):                  0.3                  0.0      -86.2                      0.1                  0.0     -100.0
            SQL Work Area (MB):                  0.4                  0.2      -48.8                      0.1                  0.0      -48.8
                        Logons:                  0.1                  0.1        0.0                      0.0                  0.0      -50.0
                Executes (SQL):                 23.7                 58.0      144.3                      6.7                  5.4      -18.5
                  Transactions:                  3.5                 10.6      200.0

                                               First               Second       Diff
                                     ---------------      ---------------     ------
     % Blocks changed per Read:                 50.5                 52.6        2.1
              Recursive Call %:                 97.6                 99.0        1.5
    Rollback per transaction %:                  0.0                  0.0        0.0
                 Rows per Sort:                 35.4                 49.6       14.2
    Avg DB time per Call (sec):                 11.5                 11.3       -0.3

Top Timed Events   First DB/Inst: SLOB12/SLOB12 Snaps: 49-50 (Elapsed time: 122.614 sec  DB time: 1935.33 sec),  Second DB/Inst: SLOB12/SLOB12 Snaps: 55-56 (Elapsed time: 121.705 sec  DB time: 1923.77 sec)
-> Events with a "-" did not make the Top list in this set of snapshots, but are displayed for comparison purposes

                                               1st                                                                                                2nd
------------------------------------------------------------------------------------------------   ------------------------------------------------------------------------------------------------
Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time   Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time
------------------------------ ------------- ------------ ------------ ------------- -----------   ------------------------------ ------------- ------------ ------------ ------------- -----------
 db file sequential read       User I/O           459,637      1,920.9           4.2        99.3    db file sequential read       User I/O         1,528,725      1,840.4           1.2        95.7
 db file async I/O submit      System I/O           2,975         97.2          32.7         5.0    CPU time                                             N/A        110.3           N/A         5.7
 CPU time                                             N/A         40.2           N/A         2.1    db file parallel write        System I/O         114,997         34.3           0.3         1.8
 log file parallel write       System I/O           1,881          3.2           1.7         0.2    free buffer waits             Configuration        2,318         28.0          12.1         1.5
 db file parallel write        System I/O           4,670          0.9           0.2         0.0    log file parallel write       System I/O           6,897         14.3           2.1         0.7
 control file sequential read  System I/O             499          0.2           0.3         0.0    db file async I/O submit      System I/O          12,819          5.2           0.4         0.3
 oracle thread bootstrap       Other                    7          0.1          18.1         0.0    buffer busy waits             Concurrency            407          2.7           6.7         0.1
 Disk file operations I/O      User I/O                77          0.1           0.9         0.0    Data file init write          User I/O               152          0.9           5.7         0.0
 library cache: mutex X        Concurrency             22          0.1           2.7         0.0    read by other session         User I/O                60          0.7          11.9         0.0
 Data file init write          User I/O                 5          0.0           5.6         0.0    enq: HW - contention          Configuration           18          0.7          39.3         0.0
-read by other session         User I/O                18          0.0           0.6         0.0   -control file sequential read  System I/O             841          0.5           0.5         0.0
-buffer busy waits             Concurrency             82          0.0           0.0         0.0   -oracle thread bootstrap       Other                    7          0.1          17.2         0.0
-                                                     N/A          N/A           N/A         N/A   -Disk file operations I/O      User I/O               136          0.1           0.8         0.0
-                                                     N/A          N/A           N/A         N/A   -library cache: mutex X        Concurrency             21          0.0           1.4         0.0
                          --------------------------------------------------------------------------------------------------------------------

Everything is just better: the I/O times, the redo size, the number of IOPS and so on. That's what I would have expected from the outset. Good to know; I'll no longer use EXT4 for Oracle databases. After discussing this with @fritshoogland he pointed me to a post by @kevinclosson that hadn't come to mind at the time:

Yes, File Systems Still Need To Support Concurrent Writes! Yet Another Look At XFS Versus EXT4.
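Incidentally, the headline numbers in the comparison can be reproduced straight from the report: the average "db file sequential read" latency is total wait time divided by the number of waits, and the %Diff for read IO requests follows from the per-second figures. A quick check with the values copied from the AWR output above:

```shell
# average single block read: 1,920.9 s over 459,637 waits (ext4)
# versus 1,840.4 s over 1,528,725 waits (xfs)
awk 'BEGIN { printf "ext4: %.1f ms  xfs: %.1f ms\n",
             1920.9 / 459637 * 1000, 1840.4 / 1528725 * 1000 }'
# → ext4: 4.2 ms  xfs: 1.2 ms

# read IO requests/s: 3,749.3 (ext4) vs 12,560.8 (xfs)
awk 'BEGIN { printf "%+.1f%%\n", (12560.8 - 3749.3) / 3749.3 * 100 }'
# → +235.0%
```

This confirms the 4.2 ms vs 1.2 ms averages and the 235% increase in read IO requests shown in the report.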

Response

  1. What I’d find interesting now would be if you had time to reformat the drive with ASM and see how that AWR report compared to xfs!
