
Interesting observations executing SLOB2 with ext4 and xfs on SSD

My lab server has two SSDs: one connected via SATA 2 and the other via SATA 3. I’d expect the SATA 3 connected device to be at least as well equipped for work as the one on the “old” interface. I ran SLOB on both devices to find out if there was a difference. To my great surprise the SATA 2 connected SSD performed a lot better than the SATA 3 device, as shown in the AWR report! Initially I was not entirely sure why, since the FIO results on both devices are roughly equal. You will see why when reading this post. In summary: use XFS for any concurrent writes. Or maybe ASM.

Let’s investigate

Let’s do a little I/O investigation because a) it’s cool and b) you can.

This is Oracle Linux 6.5 with UEK 3. A word of warning though: ext4 has been, let’s say, a little flaky. I first found out about that in 2011 (https://martincarstenbach.wordpress.com/2011/11/04/an-interesting-problem-with-ext4-on-oracle-linux-5-5/) and Oracle has since published a MOS note: ORA-1578 ORA-353 ORA-19599 Corrupt blocks with zeros when filesystemio_options=SETALL on ext4 file system using Linux (Doc ID 1487957.1). I am on a reasonably recent version so the observations in this post are probably not related to the above note. The device /dev/sda is connected via SATA 2 and the file system /u01/app/oracle/oradata on it is XFS formatted. The other SSD, /dev/sde, uses EXT4 and is connected via SATA 3 provided by a PCIe v2 x4 card. I wish I had PCIe v3 but my Opteron 6200 system doesn’t support native SATA 6G.

SLOB

I created a 12c non-CDB database for SLOB with the IOPS tablespace on my SATA 2 SSD that’s mounted on XFS. Every test has been repeated four times. Should you want to follow along, here is my slob.conf:

[/m/kvm/oracle]$ cat ~/SLOB2/SLOB/slob.conf 

UPDATE_PCT=20
RUN_TIME=120
WORK_LOOP=0
SCALE=10000
WORK_UNIT=256
REDO_STRESS=HEAVY
LOAD_PARALLEL_DEGREE=8
SHARED_DATA_MODULUS=0

# Settings for SQL*Net connectivity:
#ADMIN_SQLNET_SERVICE=slob
#SQLNET_SERVICE_BASE=slob
#SQLNET_SERVICE_MAX=2
#SYSDBA_PASSWD="change_on_install"

export UPDATE_PCT RUN_TIME WORK_LOOP SCALE WORK_UNIT LOAD_PARALLEL_DEGREE REDO_STRESS SHARED_DATA_MODULUS

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]

Notice the UPDATE_PCT: that will become important later. The SLOB configuration has been created using a call to ./setup.sh with IOPS as the target tablespace and 128 users, which is the default. My database uses the initialisation file shown below, which is more or less the same as the default (and my buffer cache is indeed 16M):

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
[/m/kvm/oracle]$ cat /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/initSLOB12.ora 

#set the following to paths that make sense on your system:
db_create_file_dest = '/u01/app/oracle/oradata'
control_files='/u01/app/oracle/oradata/SLOB12/controlfile/o1_mf_b52b4lvk_.ctl'

db_name = SLOB12
compatible = 12.1.0.2.0
undo_management = auto
db_block_size = 8192
db_files = 2000
processes = 500
shared_pool_size = 1G
db_cache_size = 10M     # Will force maximum physical I/O
#db_cache_size = 15G    # Sufficient for cached runs with reference scale (10,000 rows) and 128 sessions
filesystemio_options = setall
parallel_max_servers = 0
log_buffer = 134217728
cpu_count = 2
pga_aggregate_target = 1G
remote_login_passwordfile = exclusive
workarea_size_policy = auto
recyclebin = off

# The following are needed to force db file sequential read. Omit for direct path short scans.
# If you would like to investigate a blend of short scans mixed with single-block random I/O
# then comment out these three parameters :
_db_block_prefetch_limit = 0
_db_block_prefetch_quota = 0
_db_file_noncontig_mblock_read_count = 0

# Uncomment the following if suffering library cache contention (at larger numbers of sessions) or high log file sync waits
# at CPU-saturated levels.
#_cursor_obsolete_threshold=130
#_high_priority_processes=LGWR

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]

After the fourth execution of ./runit.sh 16 I moved the IOPS data file from /dev/sda to /dev/sde and ran another 4 tests.

Observations

I ran an AWR difference report to compare the performance of the XFS and EXT4 SSDs. The most interesting bit is right at the top. Apologies for the very wide output; that’s the way the text report lays it out.

WORKLOAD REPOSITORY COMPARE PERIOD REPORT

Snapshot Set    DB Id    Instance     Inst Num Release     Cluster Host          Std Block Size
------------ ----------- ------------ -------- ----------- ------- ------------ ---------------
First (1st)   1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192
Second (2nd)  1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192

Snapshot Set  Begin Snap Id Begin Snap Time            End Snap Id End Snap Time                  Avg Active Users           Elapsed Time (min)            DB time (min)
------------ -------------- ------------------------- ------------ ------------------------- -------------------------- -------------------------- --------------------------
1st                      41 31-Oct-14 12:18:14 (Fri)           42 31-Oct-14 12:20:17 (Fri)                      15.8                       2.0                      32.1
2nd                      49 31-Oct-14 12:38:18 (Fri)           50 31-Oct-14 12:40:20 (Fri)                      15.8                       2.0                      32.3
                                                                                      ~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                                                      %Diff:                     0.2%                       0.0%                       0.5%

Host Configuration Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                     1st                  2nd                 Diff     %Diff
----------------------------------- -------------------- -------------------- -------------------- ---------
Number of CPUs:                                       24                   24                    0       0.0
Number of CPU Cores:                                  24                   24                    0       0.0
Number of CPU Sockets:                                 2                    2                    0       0.0
Physical Memory:                                64413.7M             64413.7M                   0M       0.0
Load at Start Snapshot:                             9.13                 9.57                  .44       4.8
Load at End Snapshot:                              15.38                 15.6                  .22       1.4
%User Time:                                         1.24                  .64                  -.6     -48.4
%System Time:                                        .75                  .74                 -.02      -1.3
%Idle Time:                                        97.89                 98.6                  .71       0.7
%IO Wait Time:                                     35.21                 9.06               -26.15     -74.3
Cache Sizes
~~~~~~~~~~~
                          1st (M)    2nd (M)   Diff (M)    %Diff
---------------------- ---------- ---------- ---------- --------
Memory Target
.....SGA Target
..........Buffer Cache       16.0       16.0        0.0      0.0
..........Shared Pool     1,024.0    1,024.0        0.0      0.0
..........Large Pool
..........Java Pool          16.0       16.0        0.0      0.0
..........Streams Pool
.....PGA Target           1,024.0    1,024.0        0.0      0.0
Log Buffer                  128.0      128.0        0.0      0.0
In-Memory Area

Workload Comparison
~~~~~~~~~~~~~~~~~~~                      1st Per Sec          2nd Per Sec      %Diff              1st Per Txn          2nd Per Txn      %Diff
                                     ---------------      ---------------     ------          ---------------      ---------------     ------
                       DB time:                 15.8                 15.8        0.2                      2.5                  4.4       78.7
                      CPU time:                  0.5                  0.3      -29.8                      0.1                  0.1       28.6
           Background CPU time:                  0.1                  0.1      -40.0                      0.0                  0.0        0.0
             Redo size (bytes):          8,721,293.5          4,652,147.0      -46.7              1,379,498.8          1,311,306.6       -4.9
         Logical read (blocks):             10,403.9              5,231.1      -49.7                  1,645.6              1,474.5      -10.4
                 Block changes:              5,216.5              2,640.2      -49.4                    825.1                744.2       -9.8
        Physical read (blocks):              7,412.5              3,750.0      -49.4                  1,172.5              1,057.0       -9.8
       Physical write (blocks):              3,369.0              1,729.0      -48.7                    532.9                487.4       -8.5
              Read IO requests:              7,411.8              3,749.3      -49.4                  1,172.4              1,056.8       -9.9
             Write IO requests:              3,018.0              1,538.2      -49.0                    477.4                433.6       -9.2
                  Read IO (MB):                 57.9                 29.3      -49.4                      9.2                  8.3       -9.8
                 Write IO (MB):                 26.3                 13.5      -48.7                      4.2                  3.8       -8.4
                  IM scan rows:                  0.0                  0.0        0.0                      0.0                  0.0        0.0
       Session Logical Read IM:
                    User calls:                  1.4                  1.4       -1.4                      0.2                  0.4       77.3
                  Parses (SQL):                  5.9                  6.3        7.5                      0.9                  1.8       91.4
             Hard parses (SQL):                  0.1                  0.3      107.1                      0.0                  0.1      300.0
            SQL Work Area (MB):                  0.6                  0.4      -25.9                      0.1                  0.1      -25.9
                        Logons:                  0.1                  0.1        0.0                      0.0                  0.0      100.0
                Executes (SQL):                 38.1                 23.7      -37.6                      6.0                  6.7       11.1
                  Transactions:                  6.3                  3.5      -43.8

                                               First               Second       Diff
                                     ---------------      ---------------     ------
     % Blocks changed per Read:                 50.1                 50.5        0.3
              Recursive Call %:                 98.1                 97.6       -0.6
    Rollback per transaction %:                  0.0                  0.0        0.0
                 Rows per Sort:                 68.3                 35.4      -32.9
    Avg DB time per Call (sec):                 11.3                 11.5        0.2
...

I copied the header from the report to show you that nothing really changed between the test executions, and that I’m going to do a lot of I/O since the buffer cache is only 16 MB.
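A quick back-of-the-envelope check against the per-second figures in the workload comparison shows just how little the tiny cache helps; a minimal sketch, with the numbers copied from the 1st snapshot set:

```python
# Per-second figures from the AWR workload comparison (1st snapshot set)
logical_reads = 10_403.9   # Logical read (blocks)
physical_reads = 7_412.5   # Physical read (blocks)

# Classic buffer cache hit ratio: fraction of logical reads served from cache
hit_ratio = 1 - physical_reads / logical_reads
print(f"buffer cache hit ratio: {hit_ratio:.1%}")  # roughly 29%
```

In other words, close to three out of four logical reads go to disk, which is exactly what SLOB’s physical I/O testing needs.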

What is striking is that the physical reads and writes are roughly half of what I saw during the first batch of executions. The culprit is found quickly: in the scenario I chose, SLOB is primarily performing single block reads (db file sequential read):

Top Timed Events   First DB/Inst: SLOB12/SLOB12 Snaps: 41-42 (Elapsed time: 122.27 sec  DB time: 1925.83 sec),  Second DB/Inst: SLOB12/SLOB12 Snaps: 49-50 (Elapsed time: 122.614 sec  DB time: 1935.33 sec)
-> Events with a "-" did not make the Top list in this set of snapshots, but are displayed for comparison purposes

                                               1st                                                                                                2nd
------------------------------------------------------------------------------------------------   ------------------------------------------------------------------------------------------------
Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time   Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time
------------------------------ ------------- ------------ ------------ ------------- -----------   ------------------------------ ------------- ------------ ------------ ------------- -----------
 db file sequential read       User I/O           906,150      1,899.1           2.1        98.6    db file sequential read       User I/O           459,637      1,920.9           4.2        99.3
 db file parallel write        System I/O         116,318         60.0           0.5         3.1    db file async I/O submit      System I/O           2,975         97.2          32.7         5.0
 CPU time                                             N/A         57.7           N/A         3.0    CPU time                                             N/A         40.2           N/A         2.1
 log file parallel write       System I/O           3,485         47.4          13.6         2.5    log file parallel write       System I/O           1,881          3.2           1.7         0.2
 db file async I/O submit      System I/O           7,425          3.7           0.5         0.2    db file parallel write        System I/O           4,670          0.9           0.2         0.0
...
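The top timed events can be cross-checked by dividing total wait time by the number of waits, which reproduces the reported average wait time. A small sketch using the figures from the comparison above:

```python
# "db file sequential read" figures from the two snapshot sets
first = {"waits": 906_150, "time_s": 1_899.1}   # XFS on SATA 2
second = {"waits": 459_637, "time_s": 1_920.9}  # EXT4 on SATA 3

for name, stats in (("1st", first), ("2nd", second)):
    avg_ms = stats["time_s"] / stats["waits"] * 1000
    print(f"{name}: {avg_ms:.1f} ms per single block read")  # 2.1 ms and 4.2 ms
```

Roughly half the reads complete in the same elapsed time, at roughly double the latency each.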

The I/O times for the second interval (the EXT4 case) are twice what they are in the first. But why? I thought at first that the PCIe card added extra latency, and I came up with a few other ideas that all remained just that: ideas. Ideas are nice, but proof is better. Maybe there is indeed something odd with the device? Or maybe there is something in the way that Oracle uses it? To find out I used FIO, my favourite I/O benchmark when it comes to measuring raw performance. If the FIO results mirrored the Oracle results then there must indeed be something wrong with the SSD or the way it is addressed. I spent a bit of time compiling FIO and ran a set of tests.

FIO

I was really quite curious to see what was happening during the execution of FIO: was the SATA 3 SSD really slower? So I started FIO tests, using FIO 2.1.13. Both runs use the same test harness, just the destination is different:

[~/fio-2.1.13]$ cat rand-read-sd*
[rand-read]
numjobs=4
rw=randread
directory=/u01/fio/data
size=5G
bs=8k
ioengine=libaio
buffered=0
iodepth=8

[rand-read]
numjobs=4
rw=randread
directory=/m/kvm/oracle
size=5G
bs=8k
ioengine=libaio
buffered=0
iodepth=8

[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
[~/fio-2.1.13]$

The /u01 file system sits on /dev/sda, and uses XFS. /m/kvm/oracle is on /dev/sde, using EXT4. Here is the output from the tests:

[~/fio-2.1.13]$ ./fio rand-read-sde.fio
rand-read: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=8
...
fio-2.1.13
Starting 4 processes
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
Jobs: 3 (f=3): [r(3),_(1)] [99.2% done] [192.8MB/0KB/0KB /s] [24.7K/0/0 iops] [eta 00m:01s]
rand-read: (groupid=0, jobs=1): err= 0: pid=6124: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=44174KB/s, iops=5521, runt=118686msec
    slat (usec): min=4, max=109, avg=13.06, stdev= 6.14
    clat (usec): min=128, max=43082, avg=1434.23, stdev=586.65
     lat (usec): min=140, max=43094, avg=1447.47, stdev=586.41
    clat percentiles (usec):
     |  1.00th=[  516],  5.00th=[  708], 10.00th=[  820], 20.00th=[  988],
     | 30.00th=[ 1112], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1480],
     | 70.00th=[ 1608], 80.00th=[ 1816], 90.00th=[ 2160], 95.00th=[ 2480],
     | 99.00th=[ 3184], 99.50th=[ 3472], 99.90th=[ 4192], 99.95th=[ 4576],
     | 99.99th=[ 6752]
    bw (KB  /s): min=27088, max=70928, per=24.99%, avg=44119.50, stdev=6314.88
    lat (usec) : 250=0.01%, 500=0.78%, 750=6.02%, 1000=14.12%
    lat (msec) : 2=65.54%, 4=13.37%, 10=0.15%, 20=0.01%, 50=0.01%
  cpu          : usr=1.89%, sys=11.05%, ctx=421189, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6125: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=44138KB/s, iops=5517, runt=118784msec
    slat (usec): min=4, max=108, avg=13.14, stdev= 6.14
    clat (usec): min=117, max=43252, avg=1435.31, stdev=587.66
     lat (usec): min=124, max=43264, avg=1448.64, stdev=587.40
    clat percentiles (usec):
     |  1.00th=[  502],  5.00th=[  700], 10.00th=[  820], 20.00th=[  988],
     | 30.00th=[ 1128], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1480],
     | 70.00th=[ 1624], 80.00th=[ 1816], 90.00th=[ 2160], 95.00th=[ 2480],
     | 99.00th=[ 3152], 99.50th=[ 3472], 99.90th=[ 4192], 99.95th=[ 4640],
     | 99.99th=[ 6688]
    bw (KB  /s): min=27504, max=69760, per=24.91%, avg=43972.22, stdev=6169.69
    lat (usec) : 250=0.07%, 500=0.92%, 750=5.84%, 1000=13.94%
    lat (msec) : 2=65.68%, 4=13.40%, 10=0.16%, 20=0.01%, 50=0.01%
  cpu          : usr=1.89%, sys=11.10%, ctx=421379, majf=0, minf=46
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6126: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=44230KB/s, iops=5528, runt=118538msec
    slat (usec): min=4, max=120, avg=12.59, stdev= 5.61
    clat (usec): min=185, max=42871, avg=1433.09, stdev=581.75
     lat (usec): min=196, max=42881, avg=1445.85, stdev=581.45
    clat percentiles (usec):
     |  1.00th=[  532],  5.00th=[  716], 10.00th=[  828], 20.00th=[  988],
     | 30.00th=[ 1128], 40.00th=[ 1240], 50.00th=[ 1352], 60.00th=[ 1464],
     | 70.00th=[ 1608], 80.00th=[ 1800], 90.00th=[ 2128], 95.00th=[ 2480],
     | 99.00th=[ 3152], 99.50th=[ 3440], 99.90th=[ 4192], 99.95th=[ 4576],
     | 99.99th=[ 7008]
    bw (KB  /s): min=27600, max=65952, per=25.01%, avg=44149.65, stdev=5945.98
    lat (usec) : 250=0.01%, 500=0.63%, 750=5.90%, 1000=14.17%
    lat (msec) : 2=65.89%, 4=13.25%, 10=0.15%, 20=0.01%, 50=0.01%
  cpu          : usr=1.74%, sys=11.20%, ctx=426219, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6127: Thu Oct 30 15:41:15 2014
  read : io=5120.0MB, bw=47359KB/s, iops=5919, runt=110704msec
    slat (usec): min=4, max=114, avg=12.83, stdev= 5.90
    clat (usec): min=188, max=42988, avg=1337.06, stdev=571.06
     lat (usec): min=198, max=43000, avg=1350.07, stdev=570.63
    clat percentiles (usec):
     |  1.00th=[  446],  5.00th=[  620], 10.00th=[  748], 20.00th=[  900],
     | 30.00th=[ 1020], 40.00th=[ 1144], 50.00th=[ 1256], 60.00th=[ 1368],
     | 70.00th=[ 1512], 80.00th=[ 1704], 90.00th=[ 2040], 95.00th=[ 2352],
     | 99.00th=[ 3024], 99.50th=[ 3280], 99.90th=[ 3984], 99.95th=[ 4384],
     | 99.99th=[ 6624]
    bw (KB  /s): min=33664, max=51008, per=26.83%, avg=47369.34, stdev=4051.34
    lat (usec) : 250=0.01%, 500=1.45%, 750=8.72%, 1000=17.95%
    lat (msec) : 2=61.12%, 4=10.65%, 10=0.09%, 20=0.01%, 50=0.01%
  cpu          : usr=1.98%, sys=11.60%, ctx=410433, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: io=20480MB, aggrb=176551KB/s, minb=44137KB/s, maxb=47359KB/s, mint=110704msec, maxt=118784msec

Disk stats (read/write):
  sde: ios=2619049/4, merge=0/1, ticks=3663514/6, in_queue=3662827, util=100.00%
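As a side note, the aggregate bandwidth in the run status line is simply the total amount of data divided by the runtime of the slowest job; the per-job bandwidths cannot just be added up because the jobs finish at different times. A quick sketch with the numbers from the run above:

```python
# From the "Run status group 0" line: io=20480MB, maxt=118784msec
total_kb = 20480 * 1024   # total data read across all four jobs, in KB
maxt_s = 118_784 / 1000   # runtime of the slowest job, in seconds

aggrb = total_kb / maxt_s
print(f"aggregate bandwidth: {aggrb:.0f} KB/s")  # ~176,551 KB/s, as reported
```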

The best tool (in my opinion) to look at I/O without tracing block devices is collectl. I ran its extended disk stats in parallel:

[root@ol62 queue]# collectl -sD -i 5 --iosize --dskfilt sde
waiting for 5 second sample...

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sde              0      0    0    0  125340    129  123 1021    1020   219  1749      8   99
sde              0      0    0    0  119608     58  118 1017    1017   156  1666      7   88
sde         146968      0  18K    8       0      0    0    0       8    31     1      0   89
sde         165435      0  20K    8       2      0    0    6       7    31     1      0   99
sde         187097      0  23K    8       0      0    0    0       8    31     1      0  100
sde         187113      0  23K    8       0      0    0    0       8    31     1      0   99
sde         157684      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186542      0  23K    8       1      0    0    4       7    31     1      0  100
sde         186721      0  23K    8       0      0    0    0       8    31     1      0   99
sde         158416      0  19K    8       0      0    0    0       8    31     1      0  100
sde         187168      0  23K    8       0      0    0    0       8    31     1      0   99
sde         186027      0  23K    8       0      0    0    0       8    31     1      0  100
sde         158366      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186028      0  23K    8       0      0    0    0       8    31     1      0   99
sde         186416      0  23K    8       0      0    0    0       8    31     1      0  100
sde         159102      0  19K    8       0      0    0    0       8    31     1      0  100
sde         186348      0  23K    8       0      0    0    0       8    31     1      0  100
sde         186859      0  23K    8       0      0    0    0       8    31     1      0   99
sde         153996      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186043      0  23K    8       0      0    0    0       8    31     1      0  100
sde         185348      0  23K    8       0      0    0    0       8    31     1      0   99
sde         158150      0  19K    8       0      0    0    0       8    31     1      0   99
sde         186148      0  23K    8       0      0    0    0       8    31     1      0   99
sde         185900      0  23K    8       0      0    0    0       8    31     1      0   99
sde         159353      0  19K    8       0      0    0    0       8    25     1      0   99
sde         167065      0  20K    8       0      0    0    0       8    22     0      0   86
sde              0      0    0    0       0      0    0    0       0     0     0      0    0

This matches the FIO output; and indeed, the device can do around 23k IOPS with an 8k I/O size. What about the XFS mount?
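The latencies and IOPS are consistent with each other, too: with 4 jobs at an iodepth of 8 there are 32 I/Os in flight, and by Little’s Law the sustained IOPS multiplied by the average completion latency must equal that concurrency. A sketch, with the caveat that the 23.5k figure is read off the collectl output above:

```python
# Little's Law for a saturated queue: outstanding I/Os = IOPS * latency
jobs, iodepth = 4, 8
outstanding = jobs * iodepth      # 32 concurrent I/Os

iops = 23_500                     # approximate rate from collectl
expected_latency_ms = outstanding / iops * 1000
print(f"expected completion latency: {expected_latency_ms:.2f} ms")
# ~1.36 ms, in line with the ~1.4 ms average clat FIO reports
```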

[~/fio-2.1.13]$ ./fio rand-read-sda.fio
rand-read: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=8
...
fio-2.1.13
Starting 4 processes
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
rand-read: Laying out IO file(s) (1 file(s) / 5120MB)
Jobs: 4 (f=4): [r(4)] [100.0% done] [191.3MB/0KB/0KB /s] [24.5K/0/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=1): err= 0: pid=6329: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49087KB/s, iops=6135, runt=106809msec
    slat (usec): min=3, max=93, avg= 7.90, stdev= 6.40
    clat (usec): min=280, max=35113, avg=1294.26, stdev=152.91
     lat (usec): min=285, max=35130, avg=1302.35, stdev=152.60
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1192], 20.00th=[ 1240],
     | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1672], 99.95th=[ 2224],
     | 99.99th=[ 3024]
    bw (KB  /s): min=46000, max=49536, per=25.02%, avg=49124.10, stdev=291.98
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.03%
    lat (msec) : 2=99.89%, 4=0.07%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.39%, sys=8.05%, ctx=505195, majf=0, minf=43
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6330: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49093KB/s, iops=6136, runt=106795msec
    slat (usec): min=3, max=114, avg= 7.55, stdev= 5.74
    clat (usec): min=263, max=35353, avg=1294.46, stdev=155.68
     lat (usec): min=271, max=35369, avg=1302.21, stdev=155.46
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1192], 20.00th=[ 1256],
     | 30.00th=[ 1272], 40.00th=[ 1288], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1672], 99.95th=[ 2224],
     | 99.99th=[ 3184]
    bw (KB  /s): min=45728, max=49472, per=25.02%, avg=49123.20, stdev=300.30
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.02%
    lat (msec) : 2=99.89%, 4=0.07%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.42%, sys=7.88%, ctx=549312, majf=0, minf=46
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6331: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49080KB/s, iops=6134, runt=106824msec
    slat (usec): min=3, max=137, avg= 7.70, stdev= 5.87
    clat (usec): min=266, max=35102, avg=1294.64, stdev=147.64
     lat (usec): min=274, max=35121, avg=1302.52, stdev=147.39
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1208], 20.00th=[ 1256],
     | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 2064], 99.95th=[ 2256],
     | 99.99th=[ 3600]
    bw (KB  /s): min=45888, max=49536, per=25.02%, avg=49110.62, stdev=295.81
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.04%
    lat (msec) : 2=99.84%, 4=0.10%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.32%, sys=8.07%, ctx=536841, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
rand-read: (groupid=0, jobs=1): err= 0: pid=6332: Thu Oct 30 15:52:03 2014
  read : io=5120.0MB, bw=49080KB/s, iops=6134, runt=106824msec
    slat (usec): min=3, max=106, avg= 7.66, stdev= 5.93
    clat (usec): min=274, max=35268, avg=1294.67, stdev=157.16
     lat (usec): min=281, max=35286, avg=1302.53, stdev=156.91
    clat percentiles (usec):
     |  1.00th=[ 1112],  5.00th=[ 1160], 10.00th=[ 1208], 20.00th=[ 1256],
     | 30.00th=[ 1272], 40.00th=[ 1272], 50.00th=[ 1288], 60.00th=[ 1304],
     | 70.00th=[ 1320], 80.00th=[ 1352], 90.00th=[ 1384], 95.00th=[ 1416],
     | 99.00th=[ 1480], 99.50th=[ 1512], 99.90th=[ 1944], 99.95th=[ 2224],
     | 99.99th=[ 3536]
    bw (KB  /s): min=45840, max=49536, per=25.02%, avg=49109.56, stdev=303.79
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.05%
    lat (msec) : 2=99.84%, 4=0.09%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.50%, sys=7.83%, ctx=533469, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=655360/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: io=20480MB, aggrb=196318KB/s, minb=49079KB/s, maxb=49092KB/s, mint=106795msec, maxt=106824msec

Disk stats (read/write):
  sda: ios=2620077/60, merge=0/2, ticks=3348979/94, in_queue=3348631, util=100.00%
[oracle@ol62 - /u01/app/oracle/product/12.1.0.2/dbhome_1: SLOB12]
[~/fio-2.1.13]$

Here is some of the corresponding collectl output:

[root@ol62 queue]# collectl -sD -i 5 --iosize --dskfilt sda
waiting for 5 second sample...

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sda         196313      0  24K    8       5      0    1    9       8    31     1      0   99
sda         196681      0  24K    8       6      0    0   16       8    31     1      0  100
sda         196369      0  24K    8       7      0    1   12       8    31     1      0   99
sda         196217      0  24K    8       4      0    0   10       8    31     1      0  100
sda         196300      0  24K    8      14      0    2    6       8    31     1      0  100
sda         196699      0  24K    8       6      0    0   16       8    31     1      0   99
sda         196340      0  24K    8       3      0    0   16       8    31     1      0   99
sda         196350      0  24K    8       6      0    0   16       8    31     1      0  100
sda         196379      0  24K    8       6      0    0   16       8    31     1      0   99
sda         196560      0  24K    8       3      0    0   16       8    31     1      0   99
sda         196259      0  24K    8       8      0    1    8       8    31     1      0   99

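The collectl columns for /dev/sda are internally consistent as well: dividing the read throughput in KBytes/s by the 8 KB I/O size recovers the rate shown in the rounded IOs column. A minimal check, using the first sample line:

```python
kbytes_per_s = 196_313  # read throughput from the first sda sample
io_size_kb = 8          # the Size column

iops = kbytes_per_s / io_size_kb
print(f"approx. {iops:,.0f} read IOPS")  # ~24.5k, matching the "24K" IOs column
```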
Summary Number 1

So it would appear that the SSDs are more or less equal from FIO’s point of view. If you followed closely (hurray if you did, I appreciate this is a long post!) you will have noticed the following:

  • The SLOB test performed updates (20%). Furthermore there is just one data file for the IOPS tablespace (-> there is concurrency!)
  • The FIO test was based on random reads, no writes. Its 4 jobs also worked on 4 distinct files.

I have tried to avoid the effect of file system buffering by setting buffered to 0, and I’m also using asynchronous I/O as provided by libaio (just as Oracle does).
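For reference, a fio job file along the lines of that test might look like this. This is a sketch only: the directory, file size and queue depth are my assumptions, not the exact job file used above.

```ini
[global]
# asynchronous I/O via libaio, just as Oracle does with filesystemio_options=SETALL
ioengine=libaio
# avoid file system buffering (equivalent to direct=1)
buffered=0
# random reads only, no writes
rw=randread
# match the Oracle block size
bs=8k
directory=/u01/app/oracle/oradata
size=1g
iodepth=32
runtime=120
time_based

[randread]
# four jobs, each working on its own distinct file
numjobs=4
```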

Theory Number 2

It would appear that XFS is better suited for concurrent writes. Let’s put that theory into action. I reformatted the SATA3 SSD with XFS and ran another test. Feast your eyes on the comparison of the AWR reports: the EXT4 test execution versus XFS on /dev/sde:

WORKLOAD REPOSITORY COMPARE PERIOD REPORT

Snapshot Set    DB Id    Instance     Inst Num Release     Cluster Host          Std Block Size
------------ ----------- ------------ -------- ----------- ------- ------------ ---------------
First (1st)   1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192
Second (2nd)  1269760792 SLOB12              1 12.1.0.2.0  NO      ol62.localdo     8192

Snapshot Set  Begin Snap Id Begin Snap Time            End Snap Id End Snap Time                  Avg Active Users           Elapsed Time (min)            DB time (min)
------------ -------------- ------------------------- ------------ ------------------------- -------------------------- -------------------------- --------------------------
1st                      49 31-Oct-14 12:38:18 (Fri)           50 31-Oct-14 12:40:20 (Fri)                      15.8                       2.0                      32.3
2nd                      55 31-Oct-14 15:35:23 (Fri)           56 31-Oct-14 15:37:25 (Fri)                      15.7                       2.0                      32.1
                                                                                      ~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                                                      %Diff:                     0.2%                      -0.5%                      -0.6%

Host Configuration Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                     1st                  2nd                 Diff     %Diff
----------------------------------- -------------------- -------------------- -------------------- ---------
Number of CPUs:                                       24                   24                    0       0.0
Number of CPU Cores:                                  24                   24                    0       0.0
Number of CPU Sockets:                                 2                    2                    0       0.0
Physical Memory:                                64413.7M             64413.7M                   0M       0.0
Load at Start Snapshot:                             9.57                 1.77                 -7.8     -81.5
Load at End Snapshot:                               15.6                14.15                -1.45      -9.3
%User Time:                                          .64                 2.07                 1.43     223.4
%System Time:                                        .74                 1.48                  .75     100.0
%Idle Time:                                         98.6                 96.4                 -2.2      -2.2
%IO Wait Time:                                      9.06                53.29                44.23     488.2
Cache Sizes
~~~~~~~~~~~
                          1st (M)    2nd (M)   Diff (M)    %Diff
---------------------- ---------- ---------- ---------- --------
Memory Target
.....SGA Target
..........Buffer Cache       16.0       16.0        0.0      0.0
..........Shared Pool     1,024.0    1,024.0        0.0      0.0
..........Large Pool
..........Java Pool          16.0       16.0        0.0      0.0
..........Streams Pool
.....PGA Target           1,024.0    1,024.0        0.0      0.0
Log Buffer                  128.0      128.0        0.0      0.0
In-Memory Area

Workload Comparison
~~~~~~~~~~~~~~~~~~~                      1st Per Sec          2nd Per Sec      %Diff              1st Per Txn          2nd Per Txn      %Diff
                                     ---------------      ---------------     ------          ---------------      ---------------     ------
                       DB time:                 15.8                 15.8        0.2                      4.4                  1.5      -66.7
                      CPU time:                  0.3                  0.9      175.8                      0.1                  0.1        0.0
           Background CPU time:                  0.1                  0.1      133.3                      0.0                  0.0      -50.0
             Redo size (bytes):          4,652,147.0         15,186,494.1      226.4              1,311,306.6          1,426,136.0        8.8
         Logical read (blocks):              5,231.1             16,957.8      224.2                  1,474.5              1,592.5        8.0
                 Block changes:              2,640.2              8,922.1      237.9                    744.2                837.9       12.6
        Physical read (blocks):              3,750.0             12,560.9      235.0                  1,057.0              1,179.6       11.6
       Physical write (blocks):              1,729.0              5,855.0      238.6                    487.4                549.8       12.8
              Read IO requests:              3,749.3             12,560.8      235.0                  1,056.8              1,179.6       11.6
             Write IO requests:              1,538.2              5,219.5      239.3                    433.6                490.2       13.0
                  Read IO (MB):                 29.3                 98.1      234.9                      8.3                  9.2       11.6
                 Write IO (MB):                 13.5                 45.7      238.6                      3.8                  4.3       12.9
                  IM scan rows:                  0.0                  0.0        0.0                      0.0                  0.0        0.0
       Session Logical Read IM:
                    User calls:                  1.4                  1.4        2.9                      0.4                  0.1      -66.7
                  Parses (SQL):                  6.3                 10.3       62.7                      1.8                  1.0      -45.5
             Hard parses (SQL):                  0.3                  0.0      -86.2                      0.1                  0.0     -100.0
            SQL Work Area (MB):                  0.4                  0.2      -48.8                      0.1                  0.0      -48.8
                        Logons:                  0.1                  0.1        0.0                      0.0                  0.0      -50.0
                Executes (SQL):                 23.7                 58.0      144.3                      6.7                  5.4      -18.5
                  Transactions:                  3.5                 10.6      200.0

                                               First               Second       Diff
                                     ---------------      ---------------     ------
     % Blocks changed per Read:                 50.5                 52.6        2.1
              Recursive Call %:                 97.6                 99.0        1.5
    Rollback per transaction %:                  0.0                  0.0        0.0
                 Rows per Sort:                 35.4                 49.6       14.2
    Avg DB time per Call (sec):                 11.5                 11.3       -0.3

Top Timed Events   First DB/Inst: SLOB12/SLOB12 Snaps: 49-50 (Elapsed time: 122.614 sec  DB time: 1935.33 sec),  Second DB/Inst: SLOB12/SLOB12 Snaps: 55-56 (Elapsed time: 121.705 sec  DB time: 1923.77 sec)
-> Events with a "-" did not make the Top list in this set of snapshots, but are displayed for comparison purposes

                                               1st                                                                                                2nd
------------------------------------------------------------------------------------------------   ------------------------------------------------------------------------------------------------
Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time   Event                          Wait Class           Waits      Time(s)  Avg Time(ms)    %DB time
------------------------------ ------------- ------------ ------------ ------------- -----------   ------------------------------ ------------- ------------ ------------ ------------- -----------
 db file sequential read       User I/O           459,637      1,920.9           4.2        99.3    db file sequential read       User I/O         1,528,725      1,840.4           1.2        95.7
 db file async I/O submit      System I/O           2,975         97.2          32.7         5.0    CPU time                                             N/A        110.3           N/A         5.7
 CPU time                                             N/A         40.2           N/A         2.1    db file parallel write        System I/O         114,997         34.3           0.3         1.8
 log file parallel write       System I/O           1,881          3.2           1.7         0.2    free buffer waits             Configuration        2,318         28.0          12.1         1.5
 db file parallel write        System I/O           4,670          0.9           0.2         0.0    log file parallel write       System I/O           6,897         14.3           2.1         0.7
 control file sequential read  System I/O             499          0.2           0.3         0.0    db file async I/O submit      System I/O          12,819          5.2           0.4         0.3
 oracle thread bootstrap       Other                    7          0.1          18.1         0.0    buffer busy waits             Concurrency            407          2.7           6.7         0.1
 Disk file operations I/O      User I/O                77          0.1           0.9         0.0    Data file init write          User I/O               152          0.9           5.7         0.0
 library cache: mutex X        Concurrency             22          0.1           2.7         0.0    read by other session         User I/O                60          0.7          11.9         0.0
 Data file init write          User I/O                 5          0.0           5.6         0.0    enq: HW - contention          Configuration           18          0.7          39.3         0.0
-read by other session         User I/O                18          0.0           0.6         0.0   -control file sequential read  System I/O             841          0.5           0.5         0.0
-buffer busy waits             Concurrency             82          0.0           0.0         0.0   -oracle thread bootstrap       Other                    7          0.1          17.2         0.0
-                                                     N/A          N/A           N/A         N/A   -Disk file operations I/O      User I/O               136          0.1           0.8         0.0
-                                                     N/A          N/A           N/A         N/A   -library cache: mutex X        Concurrency             21          0.0           1.4         0.0
                          --------------------------------------------------------------------------------------------------------------------

Everything is just better – the IO times, the redo size, the number of IOPS etc. That’s what I would have expected from the outset. Good to know: I’ll no longer use EXT4 for Oracle databases. After discussing this with @fritshoogland he pointed me to a post by @kevinclosson that hadn’t come to mind at the time:

Yes, File Systems Still Need To Support Concurrent Writes! Yet Another Look At XFS Versus EXT4.

Exadata and Virtual Private Database: will it offload my query?

During one of the classes I taught about Exadata optimisations I had an interesting question:

If I am using VPD, will Exadata still offload the query?

Background is that we discussed function offloading, and the meta-view v$sqlfn_metadata. It turned out that SYS_CONTEXT() is not offloadable in 11.2.0.4.

SQL> select name,datatype,analytic,aggregate,offloadable,descr
  2  from v$sqlfn_metadata where name = 'SYS_CONTEXT';

NAME                                               DATATYPE ANA AGG OFF DESCR
-------------------------------------------------- -------- --- --- --- ------------------------------
SYS_CONTEXT                                        UNKNOWN  NO  NO  NO  SYS_CONTEXT

Since I’m a great fan of the Tom Kyte method (don’t say it is so, show it!) I needed a quick example. Of all the Oracle books I read “Effective Oracle by Design” was among the most inspiring.

Where to check?

My first idea was to check v$sqlfn_metadata to see if the ever present SYS_CONTEXT() was offloadable:

SQL> select name,offloadable
  2  from v$sqlfn_metadata
  3  where name = 'SYS_CONTEXT';

NAME                    OFF
----------------------- ---
SYS_CONTEXT             NO

SYS:dbm011> select banner from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE	11.2.0.4.0	Production
TNS for Linux: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 - Production

OK, so it does not appear to be offloadable. Does it matter? As you will see, it does not, although it could startle you at first.

Reproduce a test case: 11.2.0.4/11.2.3.3.1.140529.1

After finishing the example from the Oracle documentation I decided that I needed a few more rows in the table to get to a more realistic data distribution for an Exadata system. You wouldn’t see a smart scan on a table with < 10 rows. This can be done quite easily, and the end result was:

SYS> select count(*) from scott.orders_tab;

   COUNT(*)
-----------
   12000003

1 row selected.
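Getting from the documentation’s handful of rows to 12 million can be sketched like this. The loop count is an assumption on my part; the idea is simply to keep doubling the table with insert-as-select until a full scan becomes worthwhile.

```sql
-- hypothetical row multiplication: double the table a number of times
begin
  for i in 1 .. 22 loop
    insert into scott.orders_tab select * from scott.orders_tab;
    commit;
  end loop;
end;
/
```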

I had to ensure that the query was offloaded first: simplez! Kerry Osborne has a script for this:

SYS> select /*+ gather_plan_statistics rowcount_sys_stats */
  2  count(*) from scott.orders_tab;

   COUNT(*)
-----------
   12000003

1 row selected.

Elapsed: 00:00:00.81

SYS:OGGSRC> @scripts/fsx
Enter value for sql_text: %rowcount_sys_stats%
Enter value for sql_id:

SQL_ID	       CHILD   PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ----------- ------ ---------- ------ ------- ---------- ----------------------------------------------------------------------
8bv5b04mjku08	   0  1093340548      1        .81	0 Yes	       38.47 select /*+ gather_plan_statistics rowcount_sys_stats */ count(*) from

1 row selected.

SYS:OGGSRC> @scripts/dplan
Enter value for sql_id: 8bv5b04mjku08
Enter value for child_no:

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	8bv5b04mjku08, child number 0
-------------------------------------
select /*+ gather_plan_statistics rowcount_sys_stats */ count(*) from
scott.orders_tab

Plan hash value: 1093340548

---------------------------------------------------------------------------------
| Id  | Operation                  | Name       | Rows  | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |            |       |  7521 (100)|          |
|   1 |  SORT AGGREGATE            |            |     1 |            |          |
|   2 |   TABLE ACCESS STORAGE FULL| ORDERS_TAB |    12M|  7521   (1)| 00:01:31 |
---------------------------------------------------------------------------------

15 rows selected.

So I guess it is.

Enter VPD

With all the VPD bells and whistles turned on I connected as one of the users for which there is an explicit context set and ran my query:

TBROOKE> select /*+ gather_plan_statistics tbrooke_query_001 */
  2  count(*) from scott.orders_tab;

   COUNT(*)
-----------
          4

1 row selected.

Elapsed: 00:00:00.15

1 row selected.

Which shows that the VPD policy works. What do my tools indicate?

SYS> @scripts/fsx
Enter value for sql_text: %tbrooke_query_001%
Enter value for sql_id:

SQL_ID	       CHILD   PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ----------- ------ ---------- ------ ------- ---------- ----------------------------------------------------------------------
bs9gq0bdqazzu	   0  1093340548      1        .15	0 Yes	       99.98 select /*+ gather_plan_statistics tbrooke_query_001 */ count(*) from s

1 row selected.

SYS> @scripts/dplan
Enter value for sql_id: bs9gq0bdqazzu
Enter value for child_no: 0

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	bs9gq0bdqazzu, child number 0
-------------------------------------
select /*+ gather_plan_statistics tbrooke_query_001 */ count(*) from
scott.orders_tab

Plan hash value: 1093340548

-----------------------------------------------------------------------------------------
| Id  | Operation                  | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |            |       |       |  7651 (100)|          |
|   1 |  SORT AGGREGATE            |            |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL| ORDERS_TAB |     1 |     6 |  7651   (3)| 00:01:32 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("CUST_NO"=TO_NUMBER(SYS_CONTEXT('orders_ctx','cust_no')))
       filter("CUST_NO"=TO_NUMBER(SYS_CONTEXT('orders_ctx','cust_no')))

21 rows selected.

So using the script it is easily visible that a smart scan has happened, and in fact it saved 99.98% of IO. This is not surprising given that only 4 rows out of the whole table have been returned. The Real Time SQL Monitor report confirmed the finding, by the way. I had to sneak in another hint (monitor), otherwise the statement wouldn’t have been captured in SQL Monitor (only “long running” statements are captured by default).
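The SQL Monitoring report that follows can be pulled with DBMS_SQLTUNE, passing the sql_id of the monitored statement:

```sql
-- generate a text-format Real Time SQL Monitor report for a given sql_id
select dbms_sqltune.report_sql_monitor(
         sql_id => '8ydqam3fuwt2z',
         type   => 'TEXT'
       ) as report
  from dual;
```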

SQL Monitoring Report

SQL Text
------------------------------
select /*+ gather_plan_statistics monitor tbrooke_query_002 */ count(*) from scott.orders_tab

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  TBROOKE (33:239)
 SQL ID              :  8ydqam3fuwt2z
 SQL Execution ID    :  16777216
 Execution Started   :  10/14/2014 04:28:24
 First Refresh Time  :  10/14/2014 04:28:24
 Last Refresh Time   :  10/14/2014 04:28:24
 Duration            :  .100139s
 Module/Action       :  SQL*Plus/-
 Service             :  SYS$USERS
 Program             :  sqlplus@enkdb01.enkitec.com (TNS V1-V3)
 Fetch Calls         :  1

Global Stats
====================================================================================
| Elapsed |   Cpu   |    IO    | PL/SQL  | Fetch | Buffer | Read | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) | Time(s) | Calls |  Gets  | Reqs | Bytes | Offload |
====================================================================================
|    0.10 |    0.01 |     0.09 |    0.00 |     1 |  27352 |  228 | 214MB |  99.98% |
====================================================================================

SQL Plan Monitoring Details (Plan Hash Value=1093340548)
=========================================================================================================================================================================
| Id |          Operation           |    Name    |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Cell   |  Mem  | Activity | Activity Detail |
|    |                              |            | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload | (Max) |   (%)    |   (# samples)   |
=========================================================================================================================================================================
|  0 | SELECT STATEMENT             |            |         |      |         1 |     +0 |     1 |        1 |      |       |         |       |          |                 |
|  1 |   SORT AGGREGATE             |            |       1 |      |         1 |     +0 |     1 |        1 |      |       |         |       |          |                 |
|  2 |    TABLE ACCESS STORAGE FULL | ORDERS_TAB |       1 | 7651 |         1 |     +0 |     1 |        4 |  228 | 214MB |  99.98% |    2M |          |                 |
=========================================================================================================================================================================

This confirms that a full table scan happened, and it must have been executed as a direct path read. A DPR on Exadata most often transforms into a smart scan. As you can see I didn’t specify any other predicate, and yet the VPD query was offloaded.

Reproduce a test case: 12.1.0.2.0/12.1.1.1.1.140712

Interestingly in 12.1.0.2.0 the function SYS_CONTEXT is offloadable:

SQL> select name, offloadable
  2  from v$sqlfn_metadata where name = 'SYS_CONTEXT';

NAME                           OFF
------------------------------ ---
SYS_CONTEXT                    YES

1 row selected.

For completeness’ sake I have repeated my test here. I copied the table via network link and created the same users and VPD. Not too much of a difference. Information is shown here without comments:
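The copy over the network link could have been done along these lines; the database link name is made up for illustration:

```sql
-- hypothetical copy of the test table via a database link
create table scott.orders_tab as
  select * from scott.orders_tab@source_link;
```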

SQL> select banner from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
PL/SQL Release 12.1.0.2.0 - Production
CORE    12.1.0.2.0      Production
TNS for Linux: Version 12.1.0.2.0 - Production
NLSRTL Version 12.1.0.2.0 - Production

TBROOKE> select /*+ gather_plan_statistics tbrooke_query_013 */ count(*) from scott.orders_tab;

  COUNT(*)
----------
         4

Elapsed: 00:00:00.02

SQL> @scripts/fsx
Enter value for sql_text: %tbrooke_query_013%
Enter value for sql_id:

SQL_ID         CHILD  PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ---------- ------ ---------- ------ ------- ---------- ----------------------------------------------------------------------
bf0s5hzr7x9r5      0 1093340548      1        .02      0 Yes          99.97 select /*+ gather_plan_statistics tbrooke_query_013 */ count(*) from s

1 row selected.

SQL> @scripts/dplan
Enter value for sql_id: bf0s5hzr7x9r5
Enter value for child_no: 0

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  bf0s5hzr7x9r5, child number 0
-------------------------------------
select /*+ gather_plan_statistics tbrooke_query_013 */ count(*) from
scott.orders_tab

Plan hash value: 1093340548

-----------------------------------------------------------------------------------------
| Id  | Operation                  | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |            |       |       |     2 (100)|          |
|   1 |  SORT AGGREGATE            |            |     1 |    13 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL| ORDERS_TAB |     1 |    13 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("CUST_NO"=TO_NUMBER(SYS_CONTEXT('orders_ctx','cust_no')))
       filter("CUST_NO"=TO_NUMBER(SYS_CONTEXT('orders_ctx','cust_no')))

21 rows selected.

SQL Monitoring Report

SQL Text
------------------------------
select /*+ gather_plan_statistics monitor tbrooke_query_014 */ count(*) from scott.orders_tab

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  TBROOKE (1042:32766)
 SQL ID              :  55yy67scgw2sf
 SQL Execution ID    :  16777216
 Execution Started   :  10/14/2014 05:13:30
 First Refresh Time  :  10/14/2014 05:13:30
 Last Refresh Time   :  10/14/2014 05:13:30
 Duration            :  .016025s
 Module/Action       :  SQL*Plus/-
 Service             :  SYS$USERS
 Program             :  sqlplus@enkdb03.enkitec.com (TNS V1-V3)
 Fetch Calls         :  1

Global Stats
==================================================================================================
| Elapsed |   Cpu   |    IO    | Application | PL/SQL  | Fetch | Buffer | Read | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Time(s) | Calls |  Gets  | Reqs | Bytes | Offload |
==================================================================================================
|    0.02 |    0.01 |     0.01 |        0.00 |    0.00 |     1 |  27441 |  223 | 214MB |  99.99% |
==================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=1093340548)
=========================================================================================================================================================================
| Id |          Operation           |    Name    |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Cell   |  Mem  | Activity | Activity Detail |
|    |                              |            | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload | (Max) |   (%)    |   (# samples)   |
=========================================================================================================================================================================
|  0 | SELECT STATEMENT             |            |         |      |         1 |     +0 |     1 |        1 |      |       |         |       |          |                 |
|  1 |   SORT AGGREGATE             |            |       1 |      |         1 |     +0 |     1 |        1 |      |       |         |       |          |                 |
|  2 |    TABLE ACCESS STORAGE FULL | ORDERS_TAB |       1 |    2 |         1 |     +0 |     1 |        4 |  223 | 214MB |  99.99% |    3M |          |                 |
=========================================================================================================================================================================

In the last example the flash cache and storage indexes contributed greatly to the quick execution time. Using Adrian Billington’s mystats I can see more detail. I have removed what’s not needed from the report.

SQL> @scripts/mystats stop t=1

==========================================================================================
MyStats report : 14-OCT-2014 05:27:44
==========================================================================================

------------------------------------------------------------------------------------------
1. Summary Timings
------------------------------------------------------------------------------------------

Type    Statistic Name                                                               Value
------  ----------------------------------------------------------------  ----------------
TIMER   snapshot interval (seconds)                                                  13.59
TIMER   CPU time used (seconds)                                                       0.01

------------------------------------------------------------------------------------------
2. Statistics Report
------------------------------------------------------------------------------------------

Type    Statistic Name                                                               Value
------  ----------------------------------------------------------------  ----------------
STAT    CPU used by this session                                                         7
STAT    CPU used when call started                                                       7
STAT    DB time                                                                          9
...
STAT    cell IO uncompressed bytes                                               3,031,040
STAT    cell blocks helped by minscn optimization                                      370
STAT    cell blocks processed by cache layer                                           370
STAT    cell blocks processed by data layer                                            370
STAT    cell blocks processed by txn layer                                             370
STAT    cell flash cache read hits                                                      10
STAT    cell num smartio automem buffer allocation attempts                              1
STAT    cell physical IO bytes eligible for predicate offload                  224,763,904
STAT    cell physical IO bytes saved by storage index                          221,732,864
STAT    cell physical IO interconnect bytes                                          5,360
STAT    cell physical IO interconnect bytes returned by smart scan                   5,360
STAT    cell scans                                                                       1
...
STAT    consistent gets                                                             27,816
STAT    consistent gets direct                                                      27,437
STAT    consistent gets examination                                                      3
STAT    consistent gets examination (fastpath)                                           3
STAT    consistent gets from cache                                                     379
STAT    consistent gets pin                                                            376
STAT    consistent gets pin (fastpath)                                                 376
STAT    db block changes                                                               787
STAT    db block gets                                                                2,879
STAT    db block gets from cache                                                     2,879
...
STAT    logical read bytes from cache                                           26,689,536
...
STAT    physical read bytes                                                    224,763,904
STAT    physical read requests optimized                                               223
STAT    physical read total IO requests                                                223
STAT    physical read total bytes                                              224,763,904
STAT    physical read total bytes optimized                                    224,763,904
STAT    physical read total multi block requests                                       215
STAT    physical reads                                                              27,437
STAT    physical reads direct                                                       27,437
...
STAT    table fetch by rowid                                                             1
STAT    table scan blocks gotten                                                         3
STAT    table scan disk non-IMC rows gotten                                          1,685
STAT    table scan rows gotten                                                       1,685
STAT    table scans (direct read)                                                        1
STAT    table scans (long tables)                                                        1
...

------------------------------------------------------------------------------------------
3. About
------------------------------------------------------------------------------------------
- MyStats v2.01 by Adrian Billington (http://www.oracle-developer.net)
- Based on the SNAP_MY_STATS utility by Jonathan Lewis

==========================================================================================
End of report
==========================================================================================

In the case of VPD the fact that SYS_CONTEXT() is offloadable did not play a major role.

Little things worth knowing-troubleshooting parallel statement queueing

This is yet another one of these posts that hopefully help you as much as they are going to help me in the future. Recently I enjoyed troubleshooting a problem related to parallel execution. Since I have never really written down how to tackle such a problem I thought it might be nice to do that now.

This is 12.1.0.2.0 on Exadata, but the platform doesn’t really matter for the troubleshooting technique.

What is parallel statement queueing?

Statement queueing is a pretty nifty feature introduced in 11.2. Instead of downgrading the requested Degree of Parallelism (DOP) as was the case with parallel_adaptive_multi_user, your session will queue until the requested number of parallel servers becomes available. The rationale behind this feature is that you might queue for 1 minute until all the requested PX servers become available, but once they are, your session can finish in 2 minutes with a DOP of 16 rather than in 8 minutes with a downgraded DOP of 2 (the old behaviour). The 1 minute was well spent waiting. If you want to fine-tune statement queueing, Database Resource Manager allows you to.
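
The arithmetic behind that trade-off is trivial but worth spelling out. A tiny sketch using the hypothetical numbers from the paragraph above:

```python
# Hypothetical numbers from the example above: queueing for the full DOP
# beats starting immediately with a downgraded DOP.
queue_wait = 60        # seconds spent queueing until all 16 slaves are free
run_full_dop = 120     # seconds to finish with DOP 16
run_downgraded = 480   # seconds to finish immediately with DOP 2

with_queueing = queue_wait + run_full_dop   # 180 seconds total
without_queueing = run_downgraded           # 480 seconds total
print(with_queueing < without_queueing)     # True: the wait was well spent
```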

The downside is that you don’t get statement queueing without enabling automatic DOP unless you play with underscore parameters. Automatic DOP is quite often the reason why you see statement queueing in the first place.

How does this statement queueing work?

With parallel_degree_policy set to AUTO (warning again-this can be dangerous!) the optimiser estimates the time it takes to execute a SQL statement during the hard parse. If the estimated execution time exceeds parallel_min_time_threshold (default is AUTO, which means 10 seconds) then the optimiser will try to work out the “optimal” DOP, calculated as min(parallel_degree_limit, calculated ideal DOP). The default parallel_degree_limit is set to CPU, which is documented as follows: “The maximum degree of parallelism is limited by the number of CPUs in the system”. Ooops-that can be a lot. It might be better to set parallel_degree_limit to an integer signifying the maximum allowed DOP for a statement executing in parallel. On Exadata you don’t normally need very high degrees of parallelism, so a more conservative setting might be in order.
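
The decision logic can be sketched in a few lines of Python. This is a simplified model of the documented behaviour, not Oracle's actual algorithm; the parameter names merely mirror the init.ora parameters discussed above:

```python
# Simplified model of the AutoDOP decision: illustrative only,
# not Oracle's actual code.
def auto_dop(estimated_time_s, ideal_dop,
             parallel_min_time_threshold_s=10, parallel_degree_limit=None):
    """Return the DOP chosen under parallel_degree_policy = AUTO."""
    if estimated_time_s <= parallel_min_time_threshold_s:
        return 1  # cheap statement: stays serial
    if parallel_degree_limit is None:
        # degree limit defaults to CPU, i.e. bounded only by the CPU count
        return ideal_dop
    return min(parallel_degree_limit, ideal_dop)

# Numbers from the 10053 trace further down: computed DOP 80, degree limit 48
print(auto_dop(estimated_time_s=401, ideal_dop=80, parallel_degree_limit=48))  # 48
```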

Queueing

When does the queueing mechanism kick in? Let’s imagine that you have parallel_max_servers = parallel_servers_target = 48 defined in your system, in my case an Exadata X2 with 2s12c24t Westmere EP processors. This isn’t the default by the way. This setting allows you to execute queries 48-way parallel per instance. Let’s assume that 40 query slaves are in use already, and another user wants to execute a query 12-way parallel. The sum of currently executing parallel slaves + the requested number of slaves exceeds parallel_servers_target, triggering statement queueing for your session.
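
A minimal sketch of that condition, using the numbers from the example (illustrative only, not Oracle's implementation):

```python
# Sketch of the queueing condition described above (illustrative only).
def will_queue(busy_slaves, requested_dop, parallel_servers_target):
    """A statement queues when running slaves plus the request exceed the target."""
    return busy_slaves + requested_dop > parallel_servers_target

# 40 slaves busy, another session requests a DOP of 12, target is 48:
print(will_queue(40, 12, 48))  # True: 52 > 48, the new session queues
```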

Wolfgang it!

To create a reproducible test case I set parallel_max_servers to 48 and parallel_servers_target to 24. I also enabled automatic DOP in my session. My test queries are directed at a non-partitioned, 256 million row table of approximately 80 GB in size. Before executing the first SQL statement I enabled event 10053 to Wolfgang my query.

SQL> alter session set events '10053 trace name context forever, level 1';

Session altered.

Elapsed: 00:00:00.02
SQL> desc bigtab
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID                                                 NUMBER
 V1                                                 VARCHAR2(40)
 N1                                                 NUMBER
 N2                                                 NUMBER
 N_256K                                             NUMBER
 N_128K                                             NUMBER
 N_8K                                               NUMBER
 PADDING                                            VARCHAR2(960)

SQL> select 1 + sum(n1) + sum (n2) - sum(n_8k) from bigtab where id > 1;

1+SUM(N1)+SUM(N2)-SUM(N_8K)
---------------------------
                 1.0471E+13

Elapsed: 00:00:35.59

The 10053 trace should give me some idea about the DOP. Let’s have a look at it:

     36 **************************
     37 Automatic degree of parallelism (AUTODOP)
     38 **************************
     39 Automatic degree of parallelism is enabled for this statement in auto mode.
     40 kkopqSetForceParallelProperties: Hint:no
     41 Query: compute:yes forced:no forceDop:0
     42 Calibration statistics is enabled.
     43 Start with a serial pass, cost the DOP to use

   1935 AUTO DOP PLANS EVALUATION
   1936 ***************************************
   1937 Compilation completed with Dop: 1.
   1938   Cost_io:  2776823.000000  Cost_cpu: 154935265907
   1939   Card:     1.000000  Bytes:    19.000000
   1940   Cost:     2781428.613003  Est_time:  401168ms
   1941 kkopqCombineDop: Dop:40 Hint:no
   1942 Query: compute:yes forced:no  scanDop:40 cpuDop:1 forceDop:0
   1943 Serial plan is expensive enough to be a candidate for parallelism (2781429)
   1944 AutoDOP: kkopqAdjustDop: dop is 40 after affinity normalization
   1945 Signal reparse with DOP 40.
   1946 *****************************
   1947 Number of Compilations tried: 1
   1948 *****************************

   3853 AUTO DOP PLANS EVALUATION
   3854 ***************************************
   3855 Compilation completed with Dop: 40.
   3856   Cost_io:  77133.972222  Cost_cpu: 2387701756
   3857   Card:     1.000000  Bytes:    19.000000
   3858   Cost:     77204.949157  Est_time:  11135ms
   3859 kkopqCombineDop: Dop:80 Hint:no
   3860 Query: compute:yes forced:no  scanDop:40 cpuDop:80 forceDop:0
   3861 AutoDOP: kkopqAdjustDop: computed dop before applying degree limit was 80 and now it is 48
   3862 AutoDOP: kkopqAdjustDop: dop is 48 after affinity normalization
   3863 Adopt new plan with dop=40
   3864 Signal reparse with DOP 48.
   3865 *****************************
   3866 Number of Compilations tried: 2
   3867 *****************************

   5772 AUTO DOP PLANS EVALUATION
   5773 ***************************************
   5774 Compilation completed with Dop: 48.
   5775   Cost_io:  64278.310185  Cost_cpu: 1989751463
   5776   Card:     1.000000  Bytes:    19.000000
   5777   Cost:     64337.457631  Est_time:  9279ms
   5778 kkopqCombineDop: Dop:80 Hint:no
   5779 Query: compute:yes forced:no  scanDop:40 cpuDop:80 forceDop:0
   5780 Adopt new plan with dop=48
   5781 kkopqSetDopReason: Reason why we chose this DOP is: degree limit.
   5782 Costing completed. DOP chosen: 48.
   5783 AutoDop: kkopqSetMaxDopInCursorEnv:In the Cursor Session Env, max DOP is set to 48
   5784 *****************************
   5785 Number of Compilations tried: 3
   5786 *****************************

   5811 ============
   5812 Plan Table
   5813 ============
   5814 --------------------------------------------------+-----------------------------------+-------------------------+
   5815 | Id  | Operation                       | Name    | Rows  | Bytes | Cost  | Time      |  TQ  |IN-OUT|PQ Distrib |
   5816 --------------------------------------------------+-----------------------------------+-------------------------+
   5817 | 0   | SELECT STATEMENT                |         |       |       |   63K |           |      |      |           |
   5818 | 1   |  SORT AGGREGATE                 |         |     1 |    19 |       |           |      |      |           |
   5819 | 2   |   PX COORDINATOR                |         |       |       |       |           |      |      |           |
   5820 | 3   |    PX SEND QC (RANDOM)          | :TQ10000|     1 |    19 |       |           |:Q1000| P->S |QC (RANDOM)|
   5821 | 4   |     SORT AGGREGATE              |         |     1 |    19 |       |           |:Q1000| PCWP |           |
   5822 | 5   |      PX BLOCK ITERATOR          |         |  276M | 5248M |   63K |  00:02:09 |:Q1000| PCWC |           |
   5823 | 6   |       TABLE ACCESS STORAGE FULL | BIGTAB  |  276M | 5248M |   63K |  00:02:09 |:Q1000| PCWP |           |
   5824 --------------------------------------------------+-----------------------------------+-------------------------+
   5825 Predicate Information:
   5826 ----------------------
   5827 6 - storage(:Z>=:Z AND :Z<=:Z AND "ID">1)
   5828 6 - filter("ID">1)
   5829
   5830 Content of other_xml column
   5831 ===========================
   5832   derived_cpu_dop: 80
   5833   derived_io_dop : 40
   5834   dop_reason     : degree limit
   5835   dop            : 48
   5836   px_in_memory_imc: no
   5837   px_in_memory   : no
   5838   io_rate        : 200
   5839   derived_io_dop : 40
   5840   cpu_rate       : 1000
   5841   derived_cpu_dop: 52
   5842   cpu_rate       : 1000
   5843   derived_cpu_dop: 29
   5844   cpu_rate       : 1000
   5845   derived_cpu_dop: 29
   5846   db_version     : 12.1.0.2
   5847   parse_schema   : MARTIN
   5848   dynamic_sampling: 11
   5849   plan_hash_full : 1216263238
   5850   plan_hash      : 4050997022
   5851   plan_hash_2    : 1216263238
   5852   Outline Data:
   5853   /*+
   5854     BEGIN_OUTLINE_DATA
   5855       IGNORE_OPTIM_EMBEDDED_HINTS
   5856       OPTIMIZER_FEATURES_ENABLE('12.1.0.2')
   5857       DB_VERSION('12.1.0.2')
   5858       OPT_PARAM('optimizer_dynamic_sampling' 11)
   5859       ALL_ROWS
   5860       SHARED(48)
   5861       OUTLINE_LEAF(@"SEL$1")
   5862       FULL(@"SEL$1" "BIGTAB"@"SEL$1")
   5863     END_OUTLINE_DATA
   5864   */

There you have it! The v$sql_plan.other_xml column contains all the necessary information-no need for a 10053 trace :). The fanciest way I can come up with to get that information is to use this statement:

select t.*
  from v$sql_plan v,
       xmltable(
         '/other_xml/info'
         passing xmltype(v.other_xml)
         columns
           info_type  varchar2(30) path '@type',
           info_value varchar2(30) path '/info'
       ) t
 where v.sql_id = '&sql_id'
   and v.child_number = &child_number
   and v.other_xml is not null;

INFO_TYPE                      INFO_VALUE
------------------------------ ------------------------------
derived_cpu_dop                51
derived_io_dop                 40
dop_reason                     degree limit
dop                            48
px_in_memory_imc               no
px_in_memory                   no
io_rate                        200
derived_io_dop                 40
cpu_rate                       1000
derived_cpu_dop                26
cpu_rate                       1000
derived_cpu_dop                26
cpu_rate                       1000
derived_cpu_dop                26
db_version                     12.1.0.2
parse_schema                   "MARTIN"
plan_hash_full                 1216263238
plan_hash                      4050997022
plan_hash_2                    1216263238

19 rows selected.

But I digress… By the way-this statement used 48 PX server processes because it was the only statement active at the time.
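
Should you prefer to post-process the other_xml fragment outside the database, the same type/value pairs can be pulled out with a few lines of Python. The XML below is abbreviated from the output above; note that a dict keeps only the last occurrence of repeated types such as derived_cpu_dop:

```python
import xml.etree.ElementTree as ET

# Abbreviated other_xml fragment as found in v$sql_plan; the structure
# matches what the XMLTABLE query above relies on:
# <other_xml><info type="...">value</info>...</other_xml>
other_xml = """
<other_xml>
  <info type="dop_reason">degree limit</info>
  <info type="dop">48</info>
  <info type="derived_io_dop">40</info>
</other_xml>
"""

root = ET.fromstring(other_xml)
info = {e.get("type"): e.text for e in root.findall("info")}
print(info["dop"], info["dop_reason"])  # 48 degree limit
```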

Causing trouble

After hitting publish I noticed that I copied the wrong example…

A better example for demonstrating the effect of queueing uses a more realistic scenario where statements request different DOPs. Two sessions are enough: one that will use 12 slaves, and another that requests 36. The statement with a DOP of 12 is started first, the one with a DOP of 36 a few seconds later. In a third session I gathered some diagnostic information:

SYS:dbm011> select px.sid, px.serial#,px.qcsid,px.qcserial#,px.qcinst_id,px.degree,px.req_degree,
  2  s.username, s.sql_id, s.sql_child_number, s.event, s.state
  3  from v$px_session px, v$session s
  4  where s.sid = px.sid
  5    and s.serial# = px.serial#
  6  order by px.qcsid;

        SID     SERIAL#       QCSID   QCSERIAL#   QCINST_ID      DEGREE  REQ_DEGREE USERNAME             SQL_ID        SQL_CHILD_NUMBER EVENT                                                            STATE
----------- ----------- ----------- ----------- ----------- ----------- ----------- -------------------- ------------- ---------------- ---------------------------------------------------------------- --------------------
        265       14828         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITED SHORT TIME
        330       21719         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        395       59827         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITED SHORT TIME
        461       17363         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        525        4955         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        592        2687         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        526       60805         526                                                 MARTIN               9md3ukhghgj58                0 PX Deq: Execute Reply                                            WAITING
        720       27626         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        786       47540         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITED KNOWN TIME
        850       41110         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITED SHORT TIME
        918       46836         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        983       12699         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        657       22654         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING

13 rows selected.

SYS:dbm011> select count(qcsid),qcsid from v$px_session group by qcsid;

COUNT(QCSID)       QCSID
------------ -----------
          13         526

SYS:dbm011>
SYS:dbm011> select count(event),event from v$session where username = 'MARTIN' group by event;

COUNT(EVENT) EVENT
------------ ----------------------------------------------------------------
          12 cell smart table scan
           1 PX Deq: Execute Reply
           1 resmgr:pq queued

SYS:dbm011>

So my first session is well underway, 12 way parallel as requested. The other session had to wait but kicked into motion as soon as my first session finished.

SYS:dbm011> select px.sid, px.serial#,px.qcsid,px.qcserial#,px.qcinst_id,px.degree,px.req_degree,
  2  s.username, s.sql_id, s.sql_child_number, s.event, s.state
  3  from v$px_session px, v$session s
  4  where s.sid = px.sid
  5    and s.serial# = px.serial#
  6  order by px.qcsid;

        SID     SERIAL#       QCSID   QCSERIAL#   QCINST_ID      DEGREE  REQ_DEGREE USERNAME             SQL_ID        SQL_CHILD_NUMBER EVENT                                                            STATE
----------- ----------- ----------- ----------- ----------- ----------- ----------- -------------------- ------------- ---------------- ---------------------------------------------------------------- --------------------
        265       43575         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        525       17778         525                                                 MARTIN               4dz3dapm4uy8q                0 PX Deq: Execute Reply                                            WAITING
        395       10504         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        462       57813         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        526       12648         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        591       42643         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        657        5236         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        720       19243         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        786       61636         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        850       17734         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED KNOWN TIME
        918       60016         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        983       24489         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
       1044       41056         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
       1108        8936         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
       1178       52303         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED KNOWN TIME
       1242       37008         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
       1307       28798         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED SHORT TIME
       1372       22172         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED SHORT TIME
       1437       59990         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED SHORT TIME
       1501        9307         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
          6       39400         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
         72       28569         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        137       18368         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        202       56832         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        333        6482         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        397       28334         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        461       17550         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        524         958         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        593       41235         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        656       51686         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED KNOWN TIME
        719       40346         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED SHORT TIME
        785        5157         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        851       35243         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED KNOWN TIME
        917       45635         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        980       18504         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
       1046       27779         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        330       10724         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING

37 rows selected.

SYS:dbm011>
SYS:dbm011> select count(qcsid),qcsid from v$px_session group by qcsid;

COUNT(QCSID)       QCSID
------------ -----------
          37         525

SYS:dbm011>
SYS:dbm011> select count(event),event from v$session where username = 'MARTIN' group by event;

COUNT(EVENT) EVENT
------------ ----------------------------------------------------------------
          36 cell smart table scan
           1 PX Deq: Execute Reply

The effect of queueing can be seen in the extended execution time:

  • 12 way parallel-no queueing: Elapsed: 00:00:07.16
  • 36 way parallel-with a bit of queueing: Elapsed: 00:00:12.92

V$PX_SESSION allows you to identify the culprit causing other sessions to queue-have a look at it and see if you can optimise the statement or run it later. This should unblock you. Also have a look at parallel_servers_target-you might have limited yourself too much. I personally don’t exceed the core/thread count on a box by too much, to avoid scheduling problems.