Martins Blog

Trying to explain complex things in simple terms

Can you have high redundancy files in a normal redundancy diskgroup?

Posted by Martin Bach on September 11, 2014

One of the perks of teaching classes is that I get to research questions asked. In the last Exadata Administration Class I taught someone asked: can you have your disk groups in Exadata on normal redundancy yet have certain databases use high redundancy? This would be a good interview question …

The answer is yes, which I remembered from researching material on the 11g RAC book but I wanted to prove that it is the case.

Update: I planned a second blog post where I wanted to test the effect but Alex Fatkulin was quicker, and I promise I didn’t see his post when I wrote mine. Otherwise there probably wouldn’t have been one :) In summary, you aren’t really any better protected. The disk group remains at normal redundancy, even with the data files in high. Looking at Alex’s results (and I encourage you to do so) I concur with his summary that although you have a 3rd copy of the extent protecting you from corruption, you don’t have higher resilience.

This is not Exadata specific by the way. When I face a question around ASM and internals my first idea is to use an Internet search engine and look up the ASM work by Luca Canali. This page is almost always of relevance when looking at ASM file structures: https://twiki.cern.ch/twiki/bin/view/PDBService/ASM_Internals.

ASM Templates

I am not aware of many users of ASM who know about ASM templates, although we all make use of them. The templates define the striping and mirroring of files, and are explained in the ASM Admin Guide chapter 5. You are not limited to using the Oracle provided templates, you can create your own as well, which is key to this post. I headed over to the ASM instance on my cluster and created the template:

SQL> alter diskgroup data add template adminclassHigh attributes(high);

Diskgroup altered.

With the template in place I can create a high redundancy tablespace for example in my normal redundancy diskgroup:

SQL> select name, group_number, type from v$asm_diskgroup;

NAME                           GROUP_NUMBER TYPE
------------------------------ ------------ ------
RECO                                      3 NORMAL
DBFS_DG                                   2 NORMAL
DATA                                      1 NORMAL

Unlike what the documentation suggests you do not need to change db_create_file_dest for this to work.

SQL> show parameter db_create_file_dest

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_create_file_dest                  string      +DATA

Let’s put that template to use.

SQL> create tablespace testhigh datafile '+data(adminclassHigh)' size 10m;

Tablespace created.

SQL> select file_name from dba_data_Files;

FILE_NAME
-----------------------------------------------------------------------------------------------
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/system.363.857664247
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/sysaux.364.857664247
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/users.365.857664253
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/testhigh.357.857880753

Note that I specify the newly created template in () in the datafile clause. The tablespace has been created, next I need the file number and the incarnation to look up how the extents are distributed.

SQL> select DISK_KFFXP,AU_KFFXP,PXN_KFFXP,XNUM_KFFXP,LXN_KFFXP 
  2   from x$kffxp  where number_kffxp = 357 and incarn_kffxp = 857880753;

DISK_KFFXP   AU_KFFXP  PXN_KFFXP XNUM_KFFXP  LXN_KFFXP
---------- ---------- ---------- ---------- ----------
         4      73706          0          0          0
        23      73767          1          0          1
        24      28895          2          0          2
         1      73679          3          1          0
        27      30015          4          1          1
        22      73717          5          1          2
        21      73725          6          2          0
         7      73698          7          2          1
        33      73707          8          2          2

9 rows selected.

What looks a little cryptic can be translated using Luca’s notes as:

  • DISK is the ASM disk where the extent is located (as in V$ASM_DISK)
  • AU indicates the relative position of the allocation unit from the beginning of the disk
  • PXN is the progressive extent number ( = actual extent)
  • XNUM is the ASM file extent number (mirrored extent pairs have the same extent value)
  • LXN indicates the type of extent: 0 = primary extent, 1 = first mirror copy, 2 = second mirror copy

So you can see that each primary extent has 2 mirror copies, also known as high redundancy.

If you omit the template in the datafile clause you get what you’d expect: normal redundancy.

SQL> drop tablespace testhigh including contents and datafiles;

Tablespace dropped.

SQL> create tablespace testhigh datafile '+data' size 10m;

Tablespace created.

SQL>  select file_name from dba_data_Files;

FILE_NAME
-----------------------------------------------------------------------------------------------
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/system.363.857664247
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/sysaux.364.857664247
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/users.365.857664253
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/testhigh.357.857880927

And on ASM:

SQL> select DISK_KFFXP,AU_KFFXP,PXN_KFFXP,XNUM_KFFXP,LXN_KFFXP
  2  from x$kffxp  where number_kffxp = 357 and incarn_kffxp = 857880927;

DISK_KFFXP   AU_KFFXP  PXN_KFFXP XNUM_KFFXP  LXN_KFFXP
---------- ---------- ---------- ---------- ----------
        30      73724          0          0          0
        16      73749          1          0          1
         0      73690          2          1          0
        28      73682          3          1          1
        17      73722          4          2          0
        10      73697          5          2          1

6 rows selected.

As you can see each extent has 1 mirror copy, not 2.

Posted in Automatic Storage Management | 1 Comment »

Intra-Database IORM in action

Posted by Martin Bach on September 10, 2014

I have been teaching the Enkitec Exadata Administration Class this week and made an interesting observation I thought was worth sharing with regards to IO Resource Management on Exadata.

I have created a Database Resource Manager (DBRM) Plan that specifically puts a resource consumer group to a disadvantage. Actually, quite severely so but the following shouldn’t be a realistic example in the first place: I wanted to prove a point. Hang-on I hear you say: you created a DBRM plan-the post has IORM in the subject though: what gives? Please allow me to explain.

Exadata offers 3 different ways to implement IORM to the keen engineer:

  • Intra-Database IORM
  • Inter-Database IORM
  • Category IORM

The latter 2 need to be implemented on the cells using the cellcli “alter IORMPLAN” directive, documented in the Storage Server User’s Guide in chapter 6. The first one though is pushed down to the cells when activated on the database, an activity that is visible in the cell alert.log. So when you create and enable a DBRM plan it will automatically become an IORM plan as well.

Setup

The code to create the DBRM plan is quite simple. I create a plan named ENKITEC_DBRM and two new consumer groups:

  • LOWPRIO_GROUP
  • HIGHPRIO_GROUP

HIGHPRIO_GROUP gets 100% of mgmt_p2 where LOWPRIO_GROUP gets 25% of the remaining resources on level mgmt_p3 and has to share them with the OTHER_GROUP as well. Not that you would do that in real life…. There was no change to the cpu_count, which is 24 on an X2 compute node (2s12c24t Westmere Xeon processors).

After the setup was complete I created a SQL script that I’m planning on executing in 50 sessions: 25 in low priority, and 25 in high priority sessions. The script is a simple “select count(*) from reallybigtable” – in this case an 80 GB table with 256 million rows, uncompressed.

Running SQLPLUS concurrently

I wrote a small shell script that is capable of launching a user-configurable number of SQLPLUS sessions against the database and included some timing information to measure the time-to-completion for all sessions. With that at hand I started 25 sessions for the low priority group and 25 for the high priority consumer group. In a third SSH-session I connected to the first cell in the quarter rack (it’s an X2 by the way) and repeatedly executed metric_iorm.pl. This little gem was written for cellserv 11.2.x and lists IORM metrics for each database in the system as well as a summary of IO activity against the cell. I explicitly mentioned cellsv 11.2.x because the new IORM instrumentation for PDBs is missing from the output. So far you can only realistically monitor non-CDBs with it when using RDBMS 12c. The perl script is available from “Tool for Gathering I/O Resource Manager Metrics: metric_iorm.pl (Doc ID 1337265.1)”. Deployed to celladmin’s home I was ready to start the test.

Test result

Executing metric_iorm.pl immediately after initiating the heavy IO load against the cells does not report a lot of useful data. It appears as if the contents of “metriccurrent” isn’t updated straight away but with a slight delay. Note there is no typical resource manager wait event such as “cpu quantum” here:

SQL> select username,event,state,sql_id from v$session where username like '%PRIO';

USERNAME                       EVENT                          STATE               SQL_ID
------------------------------ ------------------------------ ------------------- -------------
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITED KNOWN TIME   6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITED SHORT TIME   6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz

50 rows selected.

Almost all of the sessions were happily scanning the table, I am not aware of a way of finding out about IORM throttling on the RDBMS level but that doesn’t mean there is a way of finding out! In the end the sessions of LOWPRIO_GROUP took 320 seconds to scan the table, the sessions in HIGHPRIO_GROUP only needed 199 seconds to finish their scans of the table. So clearly IORM was at work here.

IO Performance

The end result was to be expected, but I wanted to know more. In this case the interesting bit is found on the cell. Please note that I only looked at the first cell in this X2 quarter rack, I didn’t check the combined throughput across all cells. The query I executed wasn’t making any use of Exadata features either, it was a brute force scan across all rows in the table in the absence of a where-clause. I just needed a lot of IO to demonstrate the use of IORM.

When all 50 sessions were active I could observe the following information from metric_iorm.pl for my database:

Database: DBM01
Utilization:     Small=0%    Large=97%
Flash Cache:     IOPS=45702
Disk Throughput: MBPS=0
Small I/O's:     IOPS=0.7    Avg qtime=8.8ms
Large I/O's:     IOPS=1115    Avg qtime=1941ms
        Consumer Group: HIGHPRIO_GROUP
        Utilization:     Small=0%    Large=94%
        Flash Cache:     IOPS=43474
        Disk Throughput: MBPS=0
        Small I/O's:     IOPS=0.0 
        Large I/O's:     IOPS=1064 
        Consumer Group: _ORACLE_BACKGROUND_GROUP_
        Utilization:     Small=0%    Large=0%
        Flash Cache:     IOPS=6.6
        Disk Throughput: MBPS=0
        Small I/O's:     IOPS=0.7
        Large I/O's:     IOPS=0.0
        Consumer Group: LOWPRIO_GROUP
        Utilization:     Small=0%    Large=3%
        Flash Cache:     IOPS=2222
        Disk Throughput: MBPS=52
        Small I/O's:     IOPS=0.0
        Large I/O's:     IOPS=50.4

You can see that IORM is in action: the low priority group is literally starved out, which is correct if you consider the plan I implemented. I have previously put my reallybigtable to good use and ended up with a lot of it in flash cache (thanks to cellsrv 11.2.3.3.x+ you benefit from flash cache during smart scans without having to pin the segment to it). Interestingly the disk throughput is 0 MB per second for the high priority group (which looks quite wrong to me, I need to debug the script). The low priority consumer group apparently makes use of spinning disk, although the combined throughput is lower.

After the 25 sessions from the high priority resource consumer group finished, it looked better for the sessions in the low priority group:

Database: DBM01
Utilization:     Small=0%    Large=88%
Flash Cache:     IOPS=44351
Disk Throughput: MBPS=0
Small I/O's:     IOPS=3.2    Avg qtime=0.0ms
Large I/O's:     IOPS=1080    Avg qtime=830ms
        Consumer Group: _ORACLE_BACKGROUND_GROUP_
        Utilization:     Small=0%    Large=0%
        Flash Cache:     IOPS=8.4
        Disk Throughput: MBPS=0
        Small I/O's:     IOPS=3.2  
        Large I/O's:     IOPS=0.0
        Consumer Group: LOWPRIO_GROUP
        Utilization:     Small=0%    Large=88%
        Flash Cache:     IOPS=44342
        Disk Throughput: MBPS=0
        Small I/O's:     IOPS=0.0
        Large I/O's:     IOPS=1080

You can no longer see that there are other sessions than background and LOWPRIO_GROUP active. The number of IOPS from Flash Cache went up significantly to the extent that disk isn’t reported to be used anymore. I need to verify that-it doesn’t look quite right to me either.

Summary

IO Resource Manager sadly is a much underused tool and few Exadata users seem to know it is active as soon as you implement a DBRM plan (the objective parameter notwithstanding). I am hoping that this post shows that using DBRM/IORM on Exadata is worth investigating time and effort to as it is a key enabler for database consolidation and more fancy topics such as DBaaS. And you don’t get something comparable on anything else that runs the Oracle software.

Posted in 12c Release 1, Exadata | Tagged: | Leave a Comment »

A brief history of time^H^H Oracle session statistics

Posted by Martin Bach on August 6, 2014

I didn’t intend to write another blog post yesterday evening at all, but found something that was worth sharing and got me excited… And when I started writing I intended it to be a short post, too.

If you have been digging around Oracle session performance counters a little you undoubtedly noticed how their number has increased with every release, and even with every patch set. Unfortunately I don’t have a 11.1 system (or earlier) at my disposal to test, but here is a comparison of how Oracle has instrumented the database. I have already ditched my 12.1.0.1 system as well, so no comparison there either :( This is Oracle on Linux.

The script

In the following examples I am going to use a simple query to list the session statistics by their class. The decode statement is based on the official documentation set. There you find the definition of v$statname plus an explanation of the meaning of the class-column. Here is the script:

with stats as (
        select name, decode(class,
                1, 'USER',
                2, 'REDO',
                4, 'ENQUEUE',
                8, 'CACHE',
                16, 'OS',
                32, 'RAC',
                64, 'SQL',
                128, 'DEBUG',
                'NA'
        ) as decoded_class from v$statname
)
select count(decoded_class), decoded_class
 from stats
 group by rollup(decoded_class)
 order by 1
/

Oracle 11.2.0.3

11.2.0.3 is probably the most common 11g Release 2 version currently out there in the field. Or at least that’s my observation. According to MOS Doc ID 742060.1 11.2.0.3 was released on 23 September 2011 (is that really that long ago?) and already out of error correction support by the way.

Executing the above-mentioned script gives me the following result:

COUNT(DECODED_CLASS) DECODED
-------------------- -------
                   9 ENQUEUE
                  16 OS
                  25 RAC
                  32 REDO
                  47 NA
                  93 SQL
                 107 USER
                 121 CACHE
                 188 DEBUG
                 638

So there are 638 of these counters. Let’s move on to 11.2.0.4

Oracle 11.2.0.4

Oracle 11.2.0.4 is interesting as it has been released after 12.1.0.1. It is the terminal release for Oracle 11.2, and you should consider migrating to it as it is in error correction support. The patch set came out on 28 August 2013. What about the session statistics?

COUNT(DECODED_CLASS) DECODED
-------------------- -------
                   9 ENQUEUE
                  16 OS
                  25 RAC
                  34 REDO
                  48 NA
                  96 SQL
                 117 USER
                 127 CACHE
                 207 DEBUG
                 679

A few more, all within what can be expected.

Oracle 12.1.0.2

Oracle 12.1.0.2 is fresh off the press, released just a few weeks ago. Unsurprisingly the number of session statistics has been increased again. What did surprise me was the number of statistics now available for every session! Have a look at this:

COUNT(DECODED_CLASS) DECODED
-------------------- -------
                   9 ENQUEUE
                  16 OS
                  35 RAC
                  68 REDO
                  74 NA
                 130 SQL
                 130 USER
                 151 CACHE
                 565 DEBUG
                1178

That’s nearly double what you found for 11.2.0.3. Incredible, and hence this post. Comparing 11.2.0.4 with 12.1.0.2 you will notice the:

  • same number of enqueue stats
  • same number of OS stats
  • 10 additional RAC stats
  • twice the number of REDO related statistics
  • quite a few more not classified (26)
  • 34 more sql related
  • 13 more in the user-class
  • 24 additional stats in the cache-class
  • and a whopping 298 (!) in the debug class

The debug class (128) shows lots of statistics (including spare ones) for the in-memory option (IM):

SQL> select count(1), class from v$statname where name like 'IM%' group by class;

  COUNT(1)      CLASS
---------- ----------
       211        128

Happy troubleshooting! Reminds me to look into the IM-option in more detail.

Posted in 11g Release 2, 12c Release 1 | Tagged: , , | 2 Comments »

A first look at RAC 12c (part I)

Posted by Martin Bach on August 4, 2014

I have recently upgraded my RAC 12.1.0.1.3 system to RAC 12.1.0.2 including the RDBMS installation. Currently I am updating my skills with information relevant to what I would normally have called 12c Release 2 (so that would also answer the question: when is 12c Release 2 coming out?). Then I realised I haven’t posted a first look at RAC 12c post yet-so here it comes.

There are a few things that aren’t specifically mentioned in the new features guide that caught my eye. First of all, RAC 12 does a few really cool things. Have a look at the srvctl command output:

[oracle@rac12node1 ~]$ srvctl
Usage: srvctl <command> <object> [<options>]
    commands: enable|disable|export|import|start|stop|relocate|status|add|remove|modify|getenv|setenv|unsetenv|
       config|convert|update|upgrade|downgrade|predict
    objects: database|instance|service|nodeapps|vip|network|asm|diskgroup|listener|srvpool|server|scan|scan_listener|
        oc4j|home|filesystem|gns|cvu|havip|exportfs|rhpserver|rhpclient|mgmtdb|mgmtlsnr|volume|mountfs
For detailed help on each command and object and its options use:
  srvctl <command> -help [-compatible] or
  srvctl <command> <object> -help [-compatible]
[oracle@rac12node1 ~]$

Quite a few more than with 11.2.0.3:

[oracle@rac112node1 ~]$ srvctl
Usage: srvctl <command> <object> [<options>]
 commands: enable|disable|start|stop|relocate|status|add|remove|modify|getenv|setenv|unsetenv|config|convert|upgrade
 objects: database|instance|service|nodeapps|vip|network|asm|diskgroup|listener|srvpool|server|scan|scan_listener|oc4j|home|filesystem|gns|cvu
For detailed help on each command and object and its options use:
 srvctl <command> -h or
 srvctl <command> <object> -h

I will detail the meaning of some of these later in this post or another one to follow.

Evaluation and Prediction

When you are working with policy managed databases RAC 12c already gave you a “what if” option in form of the -eval flag. If for example you wanted to grow your server pool from 2 to 3 nodes:

[oracle@rac12node1 ~]$ srvctl modify srvpool -serverpool pool1 -max 3 -eval -verbose
Database two will be started on node rac12node3
Server rac12node3 will be moved from pool Free to pool pool1
[oracle@rac12node1 ~]$

Now you will be able to predict a resource failure as well:

[oracle@rac12node1 ~]$ srvctl predict -h

The SRVCTL predict command evaluates the consequences of resource failure.

Usage: srvctl predict database -db <database_name> [-verbose]
Usage: srvctl predict service -db <database_name> -service <service_name> [-verbose]
Usage: srvctl predict asm [-node <node_name>] [-verbose]
Usage: srvctl predict diskgroup -diskgroup <diskgroup_name> [-verbose]
Usage: srvctl predict filesystem -device <volume_device> [-verbose]
Usage: srvctl predict vip -vip <vip_name> [-verbose]
Usage: srvctl predict network [-netnum <network_number>] [-verbose]
Usage: srvctl predict listener -listener <listener_name> [-verbose]
Usage: srvctl predict scan -scannumber <scan_ordinal_number> [-netnum <network_number>] [-verbose]
Usage: srvctl predict scan_listener -scannumber <scan_ordinal_number> [-netnum <network_number>] [-verbose]
Usage: srvctl predict oc4j [-verbose]

So what would happen if a disk group failed?

[oracle@rac12node1 ~]$ srvctl predict diskgroup -diskgroup DATA -verbose
Resource ora.DATA.dg will be stopped
Resource ora.DATA.ORAHOMEVOL.advm will be stopped
[oracle@rac12node1 ~]$

What it doesn’t do at this stage seems to be an assessment of cascading further problems. If +DATA went down, it would pretty much drag the whole cluster with it, too.

Status

Interestingly you can see a lot more detail with 12.1.0.2 than previously. Here is an example of a policy-managed RAC One Node database:

[oracle@rac12node1 ~]$ srvctl config database -d RONNCDB
Database unique name: RONNCDB
Database name: RONNCDB
Oracle home: /u01/app/oracle/product/12.1.0.2/dbhome_1
Oracle user: oracle
Spfile: +DATA/RONNCDB/PARAMETERFILE/spfile.319.854718651
Password file: +DATA/RONNCDB/PASSWORD/pwdronncdb.290.854718263
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: ronpool1
Disk Groups: RECO,DATA
Mount point paths:
Services: NCDB
Type: RACOneNode
Online relocation timeout: 30
Instance name prefix: RONNCDB
Candidate servers:
OSDBA group: dba
OSOPER group:
Database instances:
Database is policy managed

Did you spot the OSDBA and OSOPER group mappings in the output? DBCA by default creates the password file and server parameter file into ASM since 12.1.0.1.

You can get a lot more status information in 12.1.0.2 then previously, especially when compared to 11.2:

[oracle@rac12node1 ~]$ srvctl status -h

The SRVCTL status command displays the current state of the object.

Usage: srvctl status database {-db <db_unique_name> [-serverpool <serverpool_name>] | -serverpool <serverpool_name> | -thisversion | -thishome} [-force] [-verbose]
Usage: srvctl status instance -db <db_unique_name> {-node <node_name> | -instance <inst_name_list>} [-force] [-verbose]
Usage: srvctl status service {-db <db_unique_name> [-service  "<service_name_list>"] | -serverpool <serverpool_name> [-db <db_unique_name>]} [-force] [-verbose]
Usage: srvctl status nodeapps [-node <node_name>]
Usage: srvctl status vip {-node <node_name> | -vip <vip_name>} [-verbose]
Usage: srvctl status listener [-listener <lsnr_name>] [-node <node_name>] [-verbose]
Usage: srvctl status asm [-proxy] [-node <node_name>] [-detail] [-verbose]
Usage: srvctl status scan [[-netnum <network_number>] [-scannumber <scan_ordinal_number>] | -all] [-verbose]
Usage: srvctl status scan_listener [[-netnum <network_number>] [-scannumber <scan_ordinal_number>] | -all] [-verbose]
Usage: srvctl status srvpool [-serverpool <pool_name>] [-detail]
Usage: srvctl status server -servers "<server_list>" [-detail]
Usage: srvctl status oc4j [-node <node_name>] [-verbose]
Usage: srvctl status rhpserver
Usage: srvctl status rhpclient
Usage: srvctl status home -oraclehome <oracle_home> -statefile <state_file> -node <node_name>
Usage: srvctl status filesystem [-device <volume_device>] [-verbose]
Usage: srvctl status volume [-device <volume_device>] [-volume <volume_name>] [-diskgroup <group_name>] [-node <node_list> | -all]
Usage: srvctl status diskgroup -diskgroup <dg_name> [-node "<node_list>"] [-detail] [-verbose]
Usage: srvctl status cvu [-node <node_name>]
Usage: srvctl status gns [-node <node_name>] [-verbose]
Usage: srvctl status mgmtdb [-verbose]
Usage: srvctl status mgmtlsnr [-verbose]
Usage: srvctl status exportfs [-name <expfs_name> |-id <havip id>]
Usage: srvctl status havip [-id <id>]
Usage: srvctl status mountfs -name <mountfs_name>
For detailed help on each command and object and its options use:
  srvctl <command> <object> -help [-compatible]

RAC 12.1.0.2 adds a nifty few little flags: thisversion and thishome to srvctl status database. That works really well where you have multiple versions of Oracle on the same machine (think consolidation):

[oracle@rac12node1 ~]$ srvctl status database -thisversion
Database unique name: RONNCDB
Instance RONNCDB_1 is running on node rac12node4
Online relocation: INACTIVE

Database unique name: TWO
Instance TWO_1 is running on node rac12node1
Instance TWO_2 is running on node rac12node2

Verbose!

Some commands are actually more verbose when you specify the -verbose flag:

[oracle@rac12node1 ~]$ srvctl status database -d RONNCDB -verbose
Instance RONNCDB_1 is running on node rac12node4 with online services NCDB. Instance status: Open.
Online relocation: INACTIVE
[oracle@rac12node1 ~]$ srvctl status database -d RONNCDB
Instance RONNCDB_1 is running on node rac12node4
Online relocation: INACTIVE
[oracle@rac12node1 ~]$

But that’s not new in 12.1.0.2 I believe.

Interesting changes for database logging

The database itself will also tell you more about memory allocation:

**********************************************************************
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)
 Per process system memlock (soft) limit = 64K
Thu Jul 31 13:34:58 2014
 Expected per process system memlock (soft) limit to lock
 SHARED GLOBAL AREA (SGA) into memory: 1538M
Thu Jul 31 13:34:58 2014
 Available system pagesizes:
  4K, 2048K
 Supported system pagesize(s):
  PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)
        4K       Configured               3          393219        NONE
     2048K                0             769               0        NONE

RECOMMENDATION:
 1. For optimal performance, configure system with expected number
 of pages for every supported system pagesize prior to the next
 instance restart operation.
 2. Increase per process memlock (soft) limit to at least 1538MB
 to lock 100% of SHARED GLOBAL AREA (SGA) pages into physical memory

As you can see I am not using large pages here at all, which I did for demonstration purposes only. I don’t see any reason not to use large pages on a 64bit system these days. I’m curious to see whether the AIX port supports all the AIX page sizes here.

End of part I

This has already turned into a longer post than I expected it to be when I started writing. I think I’ll continue the series in a couple of weeks when I’m finding the time.

Posted in 12c Release 1, RAC | Leave a Comment »

How to get insights into the Linux Kernel

Posted by Martin Bach on August 1, 2014

This is probably as much a note-to-self as it can possibly be. Recently I have enjoyed some more in-depth research about how the Linux kernel works. To that extent I started fairly low-level. Theoretically speaking, you need to understand the hardware-software interface first before you can understand the upper levels. But in practice you get by with less knowledge. But if you are truly interested in how computers work you might want to consider reading up on some background. Some very knowledgable people I deeply respect have recommended books by David A. Patterson and John L. Hennessy. I have these two:

  • Computer Organization and Design, Fifth Edition: The Hardware/Software Interface
  • Computer Architecture, Fifth Edition: A Quantitative Approach

I think I found a few references to the above books in James Morle’s recent blog article about the true cost of licensing the in-memory database option and he definitely refers to the second book in his Sane SAN paper. I complemented these books with The Linux Programming Interface: A Linux and UNIX System Programming Handbook to get an overview of the Linux API. Oh and Linux from Scratch is a great resource too!

The Foundation is set

But now-what next? The Linux kernel evolves rather quickly, and don’t be fooled by version numbers. The “enterprise” kernels keep a rather conservative, static version number scheme. Remember 2.6.18? The kernel with RHEL 5.10 has little in common with the one released years and years ago with RHEL 5.0. SuSE seems to be more aggressive, naming kernels differently. A good discussion of the pros and cons for that approach can be found on LWN: http://lwn.net/Articles/486304/ Long story short: the Linux kernel developers keep pushing the limits with the “upstream” or “vanilla” kernel. You can follow the development on the LKML or Linux Kernel Mailing List. But that list is busy… The distribution vendors in turn take a stable version of the kernel and add features they need. That includes back-porting as well, which is why it’s so hard to see what’s going on with a kernel internally. But there are exceptions.

The inner workings

Apologies to all SuSE and Red Hat geeks: I haven’t been able to find a web-repository for the kernel code! If you know of one and have the URL, let me know and I’ll add it here. I don’t want to sound biased but it simply happens to be that I know Oracle Linux best.

Now to really dive into the internals and implementation you need to look at the source code. When browsing the code it helps to understand the C-programming language. And maybe some Assembler. I would love to know more about Assembler than I do but I don’t believe it’s strictly speaking necessary.

Oracle publishes the kernel code at the GIT repositories on  oss.oracle.com:

Oracle also provides patches for Red Hat kernels in project Red Patch. If I understand things correctly then Red Hat provides changes to the kernel in a massive tarball with the patches already applied. Previously it appears to have shipped the kernel + patches, which caused some controversy.

The Linux Cross Reference gives you insights into the upstream kernel.

NB: Kernel documentation can be found in the Documentation subdirectory. This is very useful stuff!

Now why would you want to do this?

My use case! I wanted to find out if/how I could do NFS over RDMA. When in doubt, use an Internet search engine and common sense. In this case: use the kernel documentation and sure enough, NFS-RDMA seems possible.

https://www.kernel.org/doc/Documentation/filesystems/nfs/nfs-rdma.txt

The link suggests a few module names and pre-requisites on enabling NFA-RDMA. The nfs-utils package must be version 1.1.2 or later, and the kernel NFS server must be built with RDMA support. Using the kernel source RPM you can check the options being used for compiling the kernel. Normally you’d use make menuconfig or an equivalent to enable/disable options or to build them as modules (refer to the excellent Linux From Scratch). Except that you don’t do that with the enterprise distributions of course. Building kernels for fun is off limits on these. If you have a problem with the Linux kernel (like a buggy kernel module), your vendor provides the fix, not the Linux engineer. But I digress… Each subtree in the kernel has a Kconfig file that lists the configuration option and meaning.

For the purpose of NFS-RDMA Infiniband support must be enabled (no brainer), but also IPoIB and then the RDMA support for NFS (“sunrpc”).

Back to the source RPM: it installs a file called .config in /usr/src/kernels/nameAndVersion/ listing all the build options. Grepping for RDMA in the file shows the following for UEK 3:

[root@rac12node1 3.8.13-35.3.3.el6uek.x86_64]# grep -i rdma .config
CONFIG_RDS_RDMA=m
CONFIG_NET_9P_RDMA=m
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
# CONFIG_INFINIBAND_OCRDMA is not set
CONFIG_SUNRPC_XPRT_RDMA_CLIENT=m
# CONFIG_SUNRPC_XPRT_RDMA_CLIENT_ALLPHYSICAL is not set
CONFIG_SUNRPC_XPRT_RDMA_SERVER=m

And here is the same for UEK 2:

[root@server1 2.6.39-400.17.1.el6uek.x86_64]# grep -i rdma .config
CONFIG_RDS_RDMA=m
CONFIG_NET_9P_RDMA=m
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
CONFIG_SUNRPC_XPRT_RDMA=m

So that looks promising, the letter “m” stands for “module”. But what do these options mean? The Kconfig file to the rescue again, but I first have to find the correct one. This example is for UEK 2:

[root@server1 2.6.39-400.17.1.el6uek.x86_64]# for file in $(rpm -qil kernel-uek-devel | grep Kconfig );
> do grep -i SUNRPC_XPRT_RDMA $file /dev/null;
> done
/usr/src/kernels/2.6.39-400.17.1.el6uek.x86_64/net/sunrpc/Kconfig:config SUNRPC_XPRT_RDMA

Found you! Notice that I’m adding /dev/null to the grep command to get the file name where grep found a match. Looking at the file just found:

config SUNRPC_XPRT_RDMA
        tristate
        depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS && EXPERIMENTAL
        default SUNRPC && INFINIBAND
        help
          This option allows the NFS client and server to support
          an RDMA-enabled transport.

          To compile RPC client RDMA transport support as a module,
          choose M here: the module will be called xprtrdma.

          If unsure, say N.

All that remained to be done was to check if these other configurationvariables (INFINIBAND, INFINIBAND_ADDR_TRANS etc) were set in the top level .config file and they were.

Posted in Linux | Tagged: , , | 1 Comment »

Upgrading clustered Grid Infrastructure to 12.1.0.2 from 12.1.0.1.3

Posted by Martin Bach on July 29, 2014

Oracle 12.1.0.2 is out, after lots of announcements the product has finally been released. I had just extended my 12.1.0.1.3 cluster to 3 nodes and was about to apply the July PSU when I saw the news. So why not try and upgrade to the brand new thing?

What struck me at first was the list of new features … Oracle’s patching strategy has really changed over time. I remember the days when Oracle didn’t usually add additional features into point releases. Have a look at the new 12.1.0.2 features and that would possibly qualify to be 12c Release 2…

In summary the upgrade process is actually remarkably simple, and hasn’t changed much since earlier versions of the software. Here are the steps in chronological order.

./runInstaller

I don’t know how often I have type ./ruinInstaller instead of runInstaller, but here you go. This is the first wizard screen after splash screen has disappeared.

GI 12.1.0.2-001

Naturally I went for the upgrade of my cluster. Before launching the installer though I made sure that everything was in working order by means of cluvfy. On to the next screen:

GI 002

always install English only. Troubleshooting Oracle in a different language (especially if I don’t speak or understand) is really hard so I avoid it in the first place.

Over to the screen that follows and oops-my SYSDG disk group (containing OCR and voting files) is too small. Bugger. In the end I added 3 new 10GB LUNs and dropped the old ones. But it took me a couple of hours to do so. Worse: it wasn’t even needed, but proved to be a good learning exercise. The requirement to have that much free space is most likely caused by the management repository and related infrastructure.

GI 003 error

Back to this screen everything is in best order, the print screen has been taken just prior to the change to the next. Note the button to skip the updates on unreachable nodes. Not sure if I wanted to do that though.

GI 003

I haven’t got OEM agents on the servers (yet) so I’m skipping the registration for now. You can always do that later.

GI 004

This screen is familiar; I am keeping my choices from the initial installation. Grid Infrastructure is owned by Oracle despite the ASMDBA and ASMADMIN groups by the way.

GI 005

On the screen below you define where on the file system you want to install Grid Infrastructure. Remember that for clustered deployments the ORACLE_HOME cannot be in the path of the ORACLE_BASE. For this to work you have to jump to the command line and create the directory on all servers and grant ownership to the GI owner account (oracle in this case, could be grid as well).

GI 006

Since I like to be in control I don’t allow Oracle to run the root scripts. I didn’t in 12.1.0.1 either:

GI 007

In that screen you notice the familiar checking of requirements.

GI 008

In my case there were only a few new ones shown here. This is a lab server so I don’t plan on using swap, but the kernel parameter “panic_on_oops” is new. I also didn’t set the reverse path filtering which I corrected before continuing. Interestingly the installer points out that there is a change in the asm_diskstring with its implications.

One thing I haven’t recorded here (because I am using Oracle Linux 6.5 with UEK3) is the requirement for using a 2.6.39 kernel – that sounds like UEK2 to me.

Update: my system is Oracle Linux 6.5, not Red Hat. See Sigrid’s comments below: for Red Hat Linux there doesn’t seem to be a similar requirement to use UEK 2, which matches the documentation (Installation guide for Grid Infrastructure/Linux).

Another interesting case was that the kernel_core pattern wasn’t equal on all nodes. Turned out that 2 nodes had the package abrt installed, and the other two didn’t. Once the packages were installed on all nodes, the warning went away.

GI 009

Unfortunately I didn’t take a print screen of the summary in case you wonder where that is. I went straight into the installation phase:

GI 010

At the end of which you are prompted to run the upgrade scripts. Remember to run them in screen and pay attention to the order you run them in.

GI 011

The output from the last node is shown here:

[root@rac12node3 ~]# /u01/app/12.1.0.2/grid/rootupgrade.sh
Performing root user operation.

The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME=  /u01/app/12.1.0.2/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The file "oraenv" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]:
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/12.1.0.2/grid/crs/install/crsconfig_params
2014/07/26 16:15:51 CLSRSC-4015: Performing install or upgrade action for Oracle
Trace File Analyzer (TFA) Collector.

2014/07/26 16:19:58 CLSRSC-4003: Successfully patched Oracle Trace File Analyzer (TFA) Collector.

2014/07/26 16:20:02 CLSRSC-464: Starting retrieval of the cluster configuration data

2014/07/26 16:20:51 CLSRSC-465: Retrieval of the cluster configuration data has successfully completed.

2014/07/26 16:20:51 CLSRSC-363: User ignored prerequisites during installation

ASM configuration upgraded in local node successfully.

2014/07/26 16:21:16 CLSRSC-466: Starting shutdown of the current Oracle Grid Infrastructure stack

2014/07/26 16:22:51 CLSRSC-467: Shutdown of the current Oracle Grid Infrastructure stack has successfully
completed.

OLR initialization - successful
2014/07/26 16:26:53 CLSRSC-329: Replacing Clusterware entries in file 'oracle-ohasd.conf'

CRS-4133: Oracle High Availability Services has been stopped.
CRS-4123: Oracle High Availability Services has been started.
2014/07/26 16:34:34 CLSRSC-343: Successfully started Oracle Clusterware stack

clscfg: EXISTING configuration version 5 detected.
clscfg: version 5 is 12c Release 1.
Successfully taken the backup of node specific configuration in OCR.
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
2014/07/26 16:35:55 CLSRSC-478: Setting Oracle Clusterware active version on the last node to be upgraded

2014/07/26 16:35:55 CLSRSC-482: Running command: '/u01/app/12.1.0.2/grid/bin/crsctl set crs activeversion'

Started to upgrade the Oracle Clusterware. This operation may take a few minutes.
Started to upgrade the CSS.
The CSS was successfully upgraded.
Started to upgrade Oracle ASM.
Started to upgrade the CRS.
The CRS was successfully upgraded.
Successfully upgraded the Oracle Clusterware.
Oracle Clusterware operating version was successfully set to 12.1.0.2.0
2014/07/26 16:38:51 CLSRSC-479: Successfully set Oracle Clusterware active version

2014/07/26 16:39:13 CLSRSC-476: Finishing upgrade of resource types

2014/07/26 16:39:26 CLSRSC-482: Running command: 'upgrade model  -s 12.1.0.1.0 -d 12.1.0.2.0 -p last'

2014/07/26 16:39:26 CLSRSC-477: Successfully completed upgrade of resource types

2014/07/26 16:40:17 CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded

Did you notice that TFA has been added? Trace File Analyzer is another of these cool things to play with, it was available with 11.2.0.4 and as an add-on to 12.1.0.1.

Result!

Back to OUI to complete the upgrade. After which cluvfy performs a final check and I’m done. Prove it worked:

[oracle@rac12node1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Thu Jul 26 17:13:02 2014

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> select banner from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
PL/SQL Release 12.1.0.2.0 - Production
CORE	12.1.0.2.0	Production
TNS for Linux: Version 12.1.0.2.0 - Production
NLSRTL Version 12.1.0.2.0 - Production

SQL>

In another post I’ll detail the upgrade for my databases. I am particularly interested about the unplug/plug way of migrating…

Posted in 12c Release 1, KVM, Linux, RAC | Tagged: , | 6 Comments »

Implications of threaded_execution = true in 12c

Posted by Martin Bach on July 21, 2014

I had an interesting discussion as part of my latest presentation at the UKOUG RAC CIA & Database Combined SIG. Part of my talk was about the implications of the new threaded execution model in Oracle.

Since “we do not use Windows” (except for gaming) I can’t compare the Windows thread model to the new 12c implementation on UNIX. There are however interesting implications when switching to the new model, some of which I’d like to demonstrate here. First of all, threaded execution is not enabled by default. With 12.1.0.1.3 on top of Oracle Restart you get the either all or a subset of the following background and auxiliary processes for a CDB:

[oracle@london ~]$ ps -eLf | egrep -i "UID|$ORACLE_SID"
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
oracle    6041     1  6041  0    1 08:56 ?        00:00:00 ora_pmon_CDB1
oracle    6043     1  6043  0    1 08:56 ?        00:00:00 ora_psp0_CDB1
oracle    6045     1  6045  2    1 08:56 ?        00:00:08 ora_vktm_CDB1
oracle    6049     1  6049  0    1 08:56 ?        00:00:00 ora_gen0_CDB1
oracle    6051     1  6051  0    1 08:56 ?        00:00:00 ora_mman_CDB1
oracle    6055     1  6055  0    1 08:56 ?        00:00:00 ora_diag_CDB1
oracle    6057     1  6057  0    1 08:56 ?        00:00:00 ora_dbrm_CDB1
oracle    6059     1  6059  0    1 08:56 ?        00:00:00 ora_dia0_CDB1
oracle    6061     1  6061  0    1 08:56 ?        00:00:00 ora_dbw0_CDB1
oracle    6063     1  6063  0    1 08:56 ?        00:00:00 ora_lgwr_CDB1
oracle    6065     1  6065  0    1 08:56 ?        00:00:00 ora_ckpt_CDB1
oracle    6067     1  6067  0    1 08:56 ?        00:00:00 ora_lg00_CDB1
oracle    6069     1  6069  0    1 08:56 ?        00:00:00 ora_lg01_CDB1
oracle    6071     1  6071  0    1 08:56 ?        00:00:00 ora_smon_CDB1
oracle    6073     1  6073  0    1 08:56 ?        00:00:00 ora_reco_CDB1
oracle    6075     1  6075  0    1 08:56 ?        00:00:00 ora_lreg_CDB1
oracle    6077     1  6077  0    1 08:56 ?        00:00:00 ora_rbal_CDB1
oracle    6079     1  6079  0    1 08:56 ?        00:00:00 ora_asmb_CDB1
oracle    6081     1  6081  0    1 08:56 ?        00:00:00 ora_mmon_CDB1
oracle    6083     1  6083  0    1 08:56 ?        00:00:00 ora_mmnl_CDB1
grid      6086     1  6086  0    1 08:56 ?        00:00:00 oracle+ASM_asmb_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    6087     1  6087  0    1 08:56 ?        00:00:00 ora_d000_CDB1
oracle    6089     1  6089  0    1 08:56 ?        00:00:00 ora_s000_CDB1
oracle    6092     1  6092  0    1 08:56 ?        00:00:00 ora_mark_CDB1
oracle    6094     1  6094  0    1 08:56 ?        00:00:00 ora_ocf0_CDB1
oracle    6109     1  6109  0    1 08:56 ?        00:00:00 ora_o000_CDB1
grid      6111     1  6111  0    1 08:56 ?        00:00:00 oracle+ASM_ocf0_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      6113     1  6113  0    1 08:56 ?        00:00:00 oracle+ASM_o000_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    6118     1  6118  0    1 08:56 ?        00:00:00 ora_nss2_CDB1
oracle    6148     1  6148  0    1 08:56 ?        00:00:01 ora_p000_CDB1
oracle    6150     1  6150  0    1 08:56 ?        00:00:02 ora_p001_CDB1
oracle    6154     1  6154  0    1 08:56 ?        00:00:00 ora_o001_CDB1
grid      6156     1  6156  0    1 08:56 ?        00:00:00 oracle+ASM_o001_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    6158     1  6158  0    1 08:56 ?        00:00:00 ora_tmon_CDB1
oracle    6160     1  6160  0    1 08:56 ?        00:00:00 ora_arc0_CDB1
oracle    6162     1  6162  0    1 08:56 ?        00:00:00 ora_arc1_CDB1
oracle    6164     1  6164  0    1 08:56 ?        00:00:00 ora_arc2_CDB1
oracle    6166     1  6166  0    1 08:56 ?        00:00:00 ora_arc3_CDB1
oracle    6169     1  6169  0    1 08:57 ?        00:00:00 ora_tt00_CDB1
oracle    6171     1  6171  0    1 08:57 ?        00:00:00 ora_smco_CDB1
oracle    6176     1  6176  0    1 08:57 ?        00:00:00 ora_aqpc_CDB1
oracle    6180     1  6180  0    1 08:57 ?        00:00:00 ora_p002_CDB1
oracle    6182     1  6182  0    1 08:57 ?        00:00:00 ora_p003_CDB1
oracle    6184     1  6184  0    1 08:57 ?        00:00:00 ora_p004_CDB1
oracle    6186     1  6186  0    1 08:57 ?        00:00:00 ora_p005_CDB1
oracle    6188     1  6188  0    1 08:57 ?        00:00:00 ora_p006_CDB1
oracle    6190     1  6190  0    1 08:57 ?        00:00:00 ora_p007_CDB1
oracle    6192     1  6192  0    1 08:57 ?        00:00:00 ora_qm02_CDB1
oracle    6194     1  6194  0    1 08:57 ?        00:00:00 ora_qm00_CDB1
oracle    6196     1  6196  0    1 08:57 ?        00:00:00 ora_q002_CDB1
oracle    6200     1  6200  0    1 08:57 ?        00:00:00 ora_q004_CDB1
oracle    6202     1  6202  0    1 08:57 ?        00:00:00 ora_q005_CDB1
oracle    6204     1  6204  0    1 08:57 ?        00:00:00 ora_q006_CDB1
oracle    6206     1  6206  0    1 08:57 ?        00:00:00 ora_q007_CDB1
oracle    6208     1  6208  0    1 08:57 ?        00:00:00 ora_q008_CDB1
oracle    6210     1  6210  0    1 08:57 ?        00:00:00 ora_q009_CDB1
oracle    6212     1  6212  0    1 08:57 ?        00:00:00 ora_q00a_CDB1
oracle    6292     1  6292  0    1 08:57 ?        00:00:00 ora_w000_CDB1
oracle    6318     1  6318  0    1 08:57 ?        00:00:00 ora_cjq0_CDB1
oracle    6320     1  6320  0    1 08:57 ?        00:00:00 oracleCDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    6322     1  6322  0    1 08:57 ?        00:00:01 oracleCDB1 (LOCAL=NO)
oracle    6324     1  6324  0    1 08:57 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    6326     1  6326  0    1 08:57 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    6407     1  6407  0    1 08:57 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    6438     1  6438  0    1 08:57 ?        00:00:00 ora_p008_CDB1
oracle    6440     1  6440  0    1 08:57 ?        00:00:00 ora_p009_CDB1
oracle    6442     1  6442  0    1 08:57 ?        00:00:01 oracleCDB1 (LOCAL=NO)
oracle    6490     1  6490  0    1 08:58 ?        00:00:00 ora_p00a_CDB1
oracle    6492     1  6492  0    1 08:58 ?        00:00:00 ora_p00b_CDB1
oracle    6640  5653  6640  0    1 09:01 pts/0    00:00:00 egrep -i UID|CDB1

Quite a few more than in 9i it seems. Most of the background processes are documented in the database reference. As you can see, there are only processes (-> NLWP = 1), no threads. If you wondered about the strange output, bear with me, there is a reason I chose those flags to the ps command.

The change with the threaded model

Let’s enable threaded execution. It’s simple:

SYS@CDB$ROOT> alter system set threaded_execution=true scope=spfile;

System altered.

SYS@CDB$ROOT> shutdown immediate

Database closed.
Database dismounted.
ORACLE instance shut down.
SYS@CDB$ROOT> startup
ORA-01017: invalid username/password; logon denied

There’s the first surprise. Not to worry, it’s just that OS level authentication doesn’t work anymore, but you can continue by providing your username and password:

SYS@CDB$ROOT> conn sys as sysdba
Enter password:
Connected.
SYS@CDB$ROOT> alter database mount;

Database altered.

SYS@CDB$ROOT> alter database open;

Database altered.

SYS@CDB$ROOT>

What does that mean for our processes? First let’s use the old way of checking for Oracle processes using just ps -ef:

[oracle@london ~]$ ps -ef | egrep -i "UID|$ORACLE_SID"
UID        PID  PPID  C STIME TTY          TIME CMD
oracle    7138     1  0 09:04 ?        00:00:00 ora_pmon_CDB1
oracle    7140     1  0 09:04 ?        00:00:00 ora_psp0_CDB1
oracle    7142     1  2 09:04 ?        00:00:05 ora_vktm_CDB1
oracle    7146     1  0 09:04 ?        00:00:01 ora_u004_CDB1
oracle    7152     1  4 09:04 ?        00:00:08 ora_u005_CDB1
oracle    7158     1  0 09:04 ?        00:00:00 ora_dbw0_CDB1
grid      7173     1  0 09:04 ?        00:00:00 oracle+ASM_asmb_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7202     1  0 09:04 ?        00:00:00 oracle+ASM_ocf0_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7226     1  0 09:04 ?        00:00:00 oracle+ASM_o000_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7229     1  0 09:04 ?        00:00:00 oracle+ASM_o001_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7401     1  0 09:06 ?        00:00:00 oracle+ASM_o002_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7407     1  0 09:06 ?        00:00:00 oracleCDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7494     1  0 09:06 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7526     1  0 09:07 ?        00:00:00 oracleCDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7573     1  1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7575     1  3 09:07 ?        00:00:02 oracleCDB1 (LOCAL=NO)
oracle    7578     1  0 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7592     1  0 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7594     1  0 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7596     1  0 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7691  5653  0 09:08 pts/0    00:00:00 egrep -i UID|CDB1
[oracle@london ~]$

The list is considerably shorter, and you will notice some strange process names: u004 and u005. If you squint you see that PMON, PSP0, VKTM and DBW0 are still processes. I am checking for threads (or light weight processes as they are also called) in the next step:

[oracle@london ~]$ ps -eLf | egrep -i "UID|$ORACLE_SID"
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
oracle    7138     1  7138  0    1 09:04 ?        00:00:00 ora_pmon_CDB1
oracle    7140     1  7140  0    1 09:04 ?        00:00:00 ora_psp0_CDB1
oracle    7142     1  7142  2    1 09:04 ?        00:00:11 ora_vktm_CDB1
oracle    7146     1  7146  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7147  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7148  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7149  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7155  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7159  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7160  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7161  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7162  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7163  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7165  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7166  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7167  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7152     1  7152  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7153  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7154  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7156  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7164  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7168  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7169  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7170  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7172  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7174  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7175  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7224  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7227  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7249  0   46 09:05 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7361  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7362  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7365  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7366  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7367  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7396  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7397  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7404  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7405  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7416  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7418  0   46 09:06 ?        00:00:01 ora_u005_CDB1
oracle    7152     1  7419  0   46 09:06 ?        00:00:02 ora_u005_CDB1
oracle    7152     1  7420  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7421  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7422  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7423  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7424  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7425  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7426  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7427  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7428  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7430  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7431  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7432  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7433  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7434  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7435  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7576  0   46 09:07 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7618  0   46 09:07 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7619  0   46 09:07 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7696  0   46 09:08 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7697  0   46 09:08 ?        00:00:00 ora_u005_CDB1
oracle    7158     1  7158  0    1 09:04 ?        00:00:00 ora_dbw0_CDB1
grid      7173     1  7173  0    1 09:04 ?        00:00:00 oracle+ASM_asmb_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7226     1  7226  0    1 09:04 ?        00:00:00 oracle+ASM_o000_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7229     1  7229  0    1 09:04 ?        00:00:00 oracle+ASM_o001_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7401     1  7401  0    1 09:06 ?        00:00:00 oracle+ASM_o002_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7494     1  7494  0    1 09:06 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7526     1  7526  0    1 09:07 ?        00:00:00 oracleCDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7573     1  7573  0    1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7575     1  7575  1    1 09:07 ?        00:00:02 oracleCDB1 (LOCAL=NO)
oracle    7578     1  7578  0    1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7592     1  7592  0    1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7594     1  7594  0    1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7596     1  7596  0    1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7810     1  7810  0    1 09:10 ?        00:00:00 oracleCDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7840  5653  7840  0    1 09:11 pts/0    00:00:00 egrep -i UID|CDB1

This is where it’s interesting. First let’s match this to the database, focusing on the background processes.

SYS@CDB$ROOT> select pname, pid, sosid, spid, stid, execution_type
  2  from v$process where background = 1
  3  order by pname
  4  /

PNAME        PID SOSID                    SPID                     STID                     EXECUTION_
----- ---------- ------------------------ ------------------------ ------------------------ ----------
AQPC          43 7152_7416                7152                     7416                     THREAD
ARC0          35 7152_7362                7152                     7362                     THREAD
ARC1          36 7152_7365                7152                     7365                     THREAD
ARC2          37 7152_7366                7152                     7366                     THREAD
ARC3          38 7152_7367                7152                     7367                     THREAD
ASMB          22 7146_7167                7146                     7167                     THREAD
CJQ0          68 7152_7576                7152                     7576                     THREAD
CKPT          15 7146_7160                7146                     7160                     THREAD
DBRM          11 7146_7155                7146                     7155                     THREAD
DBW0          13 7158                     7158                     7158                     PROCESS
DIA0          12 7152_7156                7152                     7156                     THREAD
DIAG           9 7152_7154                7152                     7154                     THREAD
GEN0           5 7146_7148                7146                     7148                     THREAD
LG00          16 7146_7161                7146                     7161                     THREAD
LG01          18 7146_7163                7146                     7163                     THREAD
LGWR          14 7146_7159                7146                     7159                     THREAD
LREG          20 7146_7165                7146                     7165                     THREAD
MARK          28 7152_7175                7152                     7175                     THREAD
MMAN           7 7146_7149                7146                     7149                     THREAD
MMNL          24 7152_7169                7152                     7169                     THREAD
MMON          23 7152_7168                7152                     7168                     THREAD
NSS2          33 7152_7249                7152                     7249                     THREAD
O002          40 7152_7397                7152                     7397                     THREAD
PMON           2 7138                     7138                     7138                     PROCESS
PSP0           3 7140                     7140                     7140                     PROCESS
Q002          54 7152_7428                7152                     7428                     THREAD
Q004          56 7152_7430                7152                     7430                     THREAD
Q005          57 7152_7431                7152                     7431                     THREAD
Q006          58 7152_7432                7152                     7432                     THREAD
Q007          59 7152_7433                7152                     7433                     THREAD
Q008          60 7152_7434                7152                     7434                     THREAD
Q009          61 7152_7435                7152                     7435                     THREAD
QM00          53 7152_7427                7152                     7427                     THREAD
QM02          44 7152_7426                7152                     7426                     THREAD
RBAL          21 7146_7166                7146                     7166                     THREAD
RECO          19 7152_7164                7152                     7164                     THREAD
SCMN           6 7146_7146                7146                     7146                     THREAD
SCMN          10 7152_7152                7152                     7152                     THREAD
SMCO          41 7152_7404                7152                     7404                     THREAD
SMON          17 7146_7162                7146                     7162                     THREAD
TMON          34 7152_7361                7152                     7361                     THREAD
TT00          39 7152_7396                7152                     7396                     THREAD
VKTM           4 7142                     7142                     7142                     PROCESS
W000          42 7152_7405                7152                     7405                     THREAD

44 rows selected.

Here you clearly see which one of the background processes is a thread, and which isn’t. The threads will have multiple STIDs per SPID, or thread IDs to the process:

SYS@CDB$ROOT> select count(spid),spid,execution_type from v$process where background = 1 group by spid, execution_type;

COUNT(SPID) SPID                     EXECUTION_
----------- ------------------------ ----------
          1 7158                     PROCESS
         12 7146                     THREAD
          1 7138                     PROCESS
          1 7140                     PROCESS
          1 7142                     PROCESS
         29 7152                     THREAD

6 rows selected.

Circling back to the OS level for SPID 7146-can you see the threads there as well? You can:

[oracle@london ~]$ ps -eLf | egrep -i "UID|7146"
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
oracle    7146     1  7146  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7147  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7148  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7149  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7155  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7159  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7160  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7161  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7162  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7163  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7165  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7166  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7167  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    8156  6876  8156  0    1 09:19 pts/2    00:00:00 egrep -i UID|7146
[oracle@london ~]$

But what about user sessions?

Let’s connect to the database (a PDB in this case) as a user.

[oracle@london ~]$ sqlplus martin@localhost/pdb4
...
SQL*Plus: Release 12.1.0.1.0 Production on Fri Jul 18 09:22:39 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Fri Jul 18 2014 09:21:27 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

MARTIN@PDB4> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
280
MARTIN@PDB4>

I can see the session from the CDB root:

SYS@CDB$ROOT> select s.username, s.sid, s.serial#, s.con_id, p.spid, p.sosid, p.stid, p.execution_type
  2  from v$session s, v$process p
  3  where s.sid = 280 and s.paddr = p.addr;

USERNAME          SID    SERIAL#     CON_ID SPID                     SOSID                    STID                     EXECUTION_
---------- ---------- ---------- ---------- ------------------------ ------------------------ ------------------------ ----------
MARTIN            280        407          4 8672                     8672                     8672                     PROCESS

SYS@CDB$ROOT>

Interestingly – as the output shows – user processes seem to remain UNIX processes (there is a twist to that too, bear with me). Remember that this process has been created using a net service and the EZConnect syntax. What about my session that bypasses the net*8 layer? I connected using sqlplus sys as sysdba:

SYS@CDB$ROOT> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
------------------------------------------------------------------------------------------------------------------------------------------------------
249

SYS@CDB$ROOT> select s.username, s.sid, s.serial#, s.con_id, p.spid, p.sosid, p.stid, p.execution_type
  2  from v$session s, v$process p
  3  where s.sid = 249 and s.paddr = p.addr;

USERNAME          SID    SERIAL#     CON_ID SPID                     SOSID                    STID                     EXECUTION_
---------- ---------- ---------- ---------- ------------------------ ------------------------ ------------------------ ----------
SYS               249        685          1 7152                     7152_8696                8696                     THREAD

SYS@CDB$ROOT>

Now that’s a thread. Surprised? Let’s test on a non CDB. First I try to connect by bypassing the listener:

[oracle@london ~]$ sqlplus martin

SQL*Plus: Release 12.1.0.1.0 Production on Fri Jul 18 10:34:58 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Enter password:
Last Successful login time: Fri Jul 18 2014 10:33:25 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

MARTIN@NCDB> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
249

MARTIN@NCDB>
-- in a different session
SYS@NCDB> select s.username, s.sid, s.serial#, s.con_id, p.spid, p.sosid, p.stid, p.execution_type
  2  from v$session s, v$process p
  3  where s.sid = 249 and s.paddr = p.addr;

USERNAME                              SID    SERIAL#     CON_ID SPID                     SOSID                    STID                     EXECUTION_
------------------------------ ---------- ---------- ---------- ------------------------ ------------------------ ------------------------ ----------
MARTIN                                249         13          0 14634                    14634_15337              15337                    THREAD

This session is a thread. Now using the listener:

[oracle@london ~]$ tnsping ncdb

TNS Ping Utility for Linux: Version 12.1.0.1.0 - Production on 18-JUL-2014 10:36:33

Copyright (c) 1997, 2013, Oracle.  All rights reserved.

Used parameter files:

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = london.example.com)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = NCDB)))
OK (0 msec)
[oracle@london ~]$ sqlplus martin@ncdb

SQL*Plus: Release 12.1.0.1.0 Production on Fri Jul 18 10:36:36 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Enter password:
Last Successful login time: Fri Jul 18 2014 10:36:18 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

MARTIN@NCDB> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
249

And what’s the guess?

SYS@NCDB> r
  1  select s.username, s.sid, s.serial#, s.con_id, p.spid, p.sosid, p.stid, p.execution_type
  2  from v$session s, v$process p
  3* where s.sid = 249 and s.paddr = p.addr

USERNAME                              SID    SERIAL#     CON_ID SPID                     SOSID                    STID                     EXECUTION_
------------------------------ ---------- ---------- ---------- ------------------------ ------------------------ ------------------------ ----------
MARTIN                                249         17          0 15507                    15507                    15507                    PROCESS

You were right-a process. But so far I haven’t shown you my listener.ora file. So far in the testing it didn’t use DEDICATED_THROUGH_BROKER_listener = ON. Let’s set this, and reload the listener. Here is the complete example:

[oracle@london ~]$ sqlplus martin@ncdb

SQL*Plus: Release 12.1.0.1.0 Production on Fri Jul 18 10:45:18 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Enter password:
Last Successful login time: Fri Jul 18 2014 10:36:38 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

MARTIN@NCDB> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
14

-- and in another session

SYS@NCDB> select s.username, s.sid, s.serial#, s.con_id, p.spid, p.sosid, p.stid, p.execution_type
  2  from v$session s, v$process p
  3  where s.sid = 14 and s.paddr = p.addr;

USERNAME               SID    SERIAL#     CON_ID SPID                     SOSID                    STID                     EXECUTION_
--------------- ---------- ---------- ---------- ------------------------ ------------------------ ------------------------ ----------
MARTIN                  14         25          0 14634                    14634_16287              16287                    THREAD

The listener parameter DEDICATED_THROUGH_BROKER_<listener_name> allows you to create user sessions as threads. And now to wrap up, how do you kill that session? On my Oracle Linux 6.5 system I killed the LWP:

[oracle@london ~]$ ps -eLf | egrep -i "UID|16287"
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
oracle   14634     1 16287  0   32 10:45 ?        00:00:00 ora_u005_NCDB
oracle   16531 16503 16531  1    1 10:48 pts/4    00:00:00 egrep -i UID|16287
[oracle@london ~]$ kill -9 16287

Which terminates my session:

MARTIN@NCDB> select user from dual
  2  /
select user from dual
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 14634 Thread ID: 16287
Session ID: 14 Serial number: 25

My database stayed up and running. Please do NEVER EVER kill the SPID when threaded_execution is enabled!!! You might bring the database down.

Disclaimer

As with everything on this blog, don’t take it literally-test it! Just because it worked for me on my VM with the specific set of packages does not mean this is applicable for other Linux or even Solaris/AIX!

Posted in 12c Release 1, Linux | Tagged: , | 2 Comments »

Setting up Direct NFS on Oracle 12c

Posted by Martin Bach on July 9, 2014

Direct NFS is a great feature that I have finally had the time to investigate further. Since I always forget how to set it up and I didn’t find blog posts about this subject elsewhere I decided to put something together.

In this configuration I am using a virtual machine named server1 to export a directory to server2. Oracle is not as lenient as myself and may have certain support requirements when it comes to dNFS servers but I just wanted to get started.

The export of the NFS mount is shown here:

[root@server1 ~]# cat /etc/exports
/u01/oraback	*.example.com(rw,sync,all_squash,insecure,anonuid=54321,anongid=54321)

There is nothing too special about the export definition here. The all_squash directive normally uses a uid and gid of 65534. Since this is most likely not matching the oracle user I chose to override this behaviour. You may have already guessed that I am using the oracle preinstall RPMs which use 54321 for the oracle account and oinstall group respectively. The “all_squash” directive maps all uids and gids to the anonymous user by default.

Oracle also recommends tweaking some network related parameters in /etc/sysctl.conf. They appear to have to do with better performance, but I didn’t have time to verify that claim yet. On the other hand, it seems that from kernel 2.6 onwards Linux tunes the send and receive buffers automatically, and it’s probably a good idea not trying to outsmart it.

Using “service nfs start” I started the NFS server process on my Oracle Linux 6.4 system. I had appropriate firewall rules in place, if you have a firewall then you might need to do the same.

Mounting

Mounting the file system on the second node requires a setting in the fstab file to begin with. In this example I want to mount the exported backups in /media/backups on server2.

The corresponding entry in fstab was:

...
server1:/u01/oraback	/media/backup	nfs rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,vers=3,timeo=600	1 2

I kept the habit of enforcing NFSv3 even though 4 and 4.1 are now supported with 12c. With the fstab entry I can now mount the directory:

[root@server2 ~]# mount /media/backup
[root@server2 ~]# mount | grep backup
server1:/u01/oraback on /media/backup type nfs (rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,vers=3,timeo=600,addr=192.168.56.44)

So far kernel NFS has been used, but not direct NFS. To do so, you need to create a mapping file named oranfstab. The file resides in $ORACLE_HOME/dbs. In this file you define how you get to the NFS filer. You can define up to 4 paths if you have multiple NICs in your database server. My VM does not so the most basic setup is used:

[root@server2 ~]# cat /u01/app/oracle/product/12.1.0.1/dbhome_1/dbs/oranfstab
server: server1
local: 192.168.56.45
path:  192.168.56.44
export: /u01/oraback mount: /media/backup
export: /m/CLONE mount: /u01/oradata/CLONE
[root@server2 ~]#

The “server” directive gives the NFS filer a name. Multiple sections are introduced using the “server” keyword.

Following the server keyword you define how you get the data: the “local” IP address shown here is the IP of server2. The “path” keyword indicates the path to the NFS filer, server1 or 192.168.56.44. And finally you tell Oracle the name of the export on the filer (/u01/oraback) and where it is mounted locally (/media/backup).

The last step necessary is to enable dNFS:

[oracle@server2 lib]$ make -f ins_rdbms.mk dnfs_on
rm -f /u01/app/oracle/product/12.1.0.1/dbhome_1/lib/libodm12.so;
cp /u01/app/oracle/product/12.1.0.1/dbhome_1/lib/libnfsodm12.so
 /u01/app/oracle/product/12.1.0.1/dbhome_1/lib/libodm12.so

All done! After the next start of your database you should see the a reference to the ODM library in the alert.log similar to this one:

...
NOTE: remote asm mode is local (mode 0x1; from cluster type)
Sun Nov 17 14:30:41 2013
Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 3.0
Starting background process PMON
Starting background process PSP0
...

Hopefully you will now be able to query the metadata views too.

SQL> select * from v$dnfs_servers

        ID SVRNAME    DIRNAME                 MNTPORT    NFSPORT NFSVERSI      WTMAX      RTMAX     CON_ID
---------- ---------- -------------------- ---------- ---------- -------- ---------- ---------- ----------
         1 server1    /m/clone                  52690       2049 NFSv3.0     1048576    1048576          0

That’s all there is to say about dNFS for this post. Oh and if at first the query against v$dnfs_servers does not return anything, it doesn’t necessarily imply a problem. I just created a file on the mount point and – as if by magic – dNFS kicked in, opened the channel and opened the file descriptors.

References

MOS Doc ID 1464567.1: Collecting The Required Information For Support To Troubleshot DNFS (Direct NFS) Issues (11.1, 11.2 & 12c)

Posted in 12c Release 1, Linux | 3 Comments »

Why is P1 the only parameter populated in cell smart table scan?

Posted by Martin Bach on July 7, 2014

Anyone who has looked at Exadata might ask the question, and I did so too. After all, cell smart table scan is in wait class User IO so there should be more, right? This is what you find for a smart scan:

NAME                           PARAMETER1           PARAMETER2           PARAMETER3                     WAIT_CLASS
------------------------------ -------------------- -------------------- ------------------------------ ---------------
cell smart table scan          cellhash#                                                                User I/O
cell smart index scan          cellhash#                                                                User I/O

Compare this to the traditional IO request:

NAME                           PARAMETER1           PARAMETER2           PARAMETER3                     WAIT_CLASS
------------------------------ -------------------- -------------------- ------------------------------ ---------------
db file sequential read        file#                block#               blocks                         User I/O
db file scattered read         file#                block#               blocks                         User I/O
direct path read               file number          first dba            block cnt                      User I/O

At first I didn’t understand as to why that was the. Until tracing a lot more. Using the method described by my colleague Frits Hoogland I traced Exadata calls as well as the IB interconnect in addition to a SQL trace (level 8). Please don’t do this outside the lab, the trace will get huge!

Now that produces an interesting file, and the SQL trace information is only 1 intermittent line drowning in a sea of other information. So I used a sed oneliner to make them stand out a bit:

$ sed  "/WAIT/{x;p;x;G;}" db12c1_ora_11916.trc > kxd_libcell_sqltace.trc

This introduces a newline before and after each line containing WAIT. The SQL Trace information is familiar:

WAIT #139970814085680: nam='cell smart table scan' ela= 251 cellhash#=822451848 p2=0 p3=0 obj#=61471 tim=3470911250866

By the way the obj# is the data_object_ID, not the object_id in DBA_OBJECTS.

Stepping through the trace

The trace information reflects what is happening during the various smart scan processing steps. This first example shows you some of the processing during the smart scan initialisation phase, when the cells are being initialised. I initially struggled mapping the “appliance” hexadecimal number to an actual cell. But it’s easy if you grep for kcfis_open_appliance_fd, and match the cellhash to v$cell.cell_hashval. Continuing the example:

kcfis_open_appliance:
throttle: initialized for appliance 0x7f4d7e9bc4c0
kcfis_find_appliance_fd: appliance fd not found. appliance 192.168.12.10/ Cached appliance fd count 0
kcfis_open_appliance_fd: 0x7f4d7e9bc4c0
oss_open called for o/192.168.12.10//box/predicate
Sending message RQ_Tag_2827385864_11: 0xf9b3f08 with frag # 0, to box my_box_refid: 2 (box inc: 12)
Command with band 1, payload length 116, payload 0xf9b4110
RQ_Tag_2827385864_11: Command name OPEN, refid 11
Open flags: readwrite - 1, sync - 0, char - 1.
 Device name - /box/predicate
Vrpc completion - RQ_Tag_2827385864_11: cmd 0xf9b3f48, req 0xf9b3a40, frag # 1
Vrpc reply 0xf9b4000 with refid 11 for existing message: 0xf9b3f08
Reply with band 0, payload length 0, payload (nil)
RQ_Tag_2827385864_11: Command name REPLY OPEN, refid 11
Returned open file descriptor  1
opening file number 11

WAIT #139970814085680: nam='cell smart table scan' ela= 259 cellhash#=822451848 p2=0 p3=0 obj#=61471 tim=3470911167788

So here the cell with IP 192.168.12.10 maps to 0x7f4d7e9bc4c0 in the trace. As you can see, immediately before the wait event is written to the trace file a file is opened on the cell. This causes the line to be emitted in the SQL trace file. So far there hasn’t been any significant scanning activity against the data. But the trace line makes sense: the IO request was issued against the cell with hash 822451848. We don’t know what it did though, and which file it accessed.

SYS:db12c1> select cell_path from v$cell where cell_hashval = '822451848';

CELL_PATH
----------------------------
192.168.12.10

I was initially quite pleased with myself initially when I found this out.

More trace information. A lot more.

Lesson learned trawling through the trace: just because there is a single line in the trace file doesn’t mean there aren’t other requests either in flight or under preparation.

The next few thousand (!) lines show you the mapping between RDBMS and cell addressing. This part is well explained in Frits’ post I referenced earlier. Further down in the trace it gets more interesting from a SQL trace perspective again. Here payload maps are generated and issued against the cells. Remember that these IO requests to cells are performed asynchronously. I found quite a few calls to the code generating payloads for the cells between individual SQL Trace WAIT lines. This makes reading the trace a bit confusing, it almost appears as if the SQL Trace couldn’t keep up with the speed the requests are created/sent.

Looking at the file I found the following pattern preceding a smart scan line in the first half of the process (before data comes back from the cells). Beginning immediately after the last WAIT line, in abbreviated format:

Ioctl completed.
appliance 0x7f4d7e9bc4c0 : cellsrv pid: 9563: predicate: /box/predicate37843
...
kcfis_create_maps_payload. appliance 0x7f4d7ea65570 num_disks 12
throttle: allowing map push: appliance 0x7f4d7ea65570, maps_to_push 7
...
Pushing request : disknumber = 0, offset = 936606973952, len = 1032192 rdba: 43778306 version 0 reqid=10 cver=1
disk=0x7f4d7e9c0aa8 state=1
trans_req_element = 0x7f4d7e9cba00

The appliance name, cellsrv (thread) ID, and the box predicate appear to remain constant for the lifetime of the smart scan. You can find them in other occurrences of the trace as well (but not used in the same context).

The preceding lines are then followed by many more lines beginning with “Pushing request”, iterating over all 12 disks (-> num_disks), for a range of Data Block Addresses. After that map payload has been created it is time to push it to the cell. Here it gets a bit murky: the KXD.* traces show requests being pushed, but not necessarily to the cell mentioned in the event line. Consider this:

kcfis_create_maps_payload. alloc_len 272 num maps 2
throttle: mappayload: maps_to_push 4 iosize_being_pushed 8388608
kcfis_push: pushing maps to appliance 0x7f4d7e9bc4c0
kcfis_issue_ioctl: payload_type 2
oss_ioctl open file: 11, opcode: 99
Sending message RQ_Tag_2827385864_54: 0xf9c5728 with frag # 0, to box my_box_refid: 2 (box inc: 12)
Command with band 1, payload length 272, payload 0x7f4d7e9b9ef8
RQ_Tag_2827385864_54: Command name IOCTL, refid 54
Ioctl arguments fd 1 opcode 99 size 272
oss_ioctl done
oss_wait called for request: 0xf9b3a40
Vrpc completion - RQ_Tag_2341095748_52: cmd 0xf9c6ec8, req 0xf9b3a40, frag # 1
Vrpc reply 0xf9c6f80 with refid 52 for existing message: 0xf9c6e88
Reply with band 1, payload length 40, payload 0x7f4d7ea65a90
RQ_Tag_2341095748_52: Command name REPLY IOCTL, refid 52
oss_wait done, request 0xf9b3a40, result 0x7f4d837e4c48

WAIT #139970814085680: nam='cell smart table scan' ela= 44 cellhash#=3249924569 p2=0 p3=0 obj#=61471 tim=3470911270172

The kcfis_push call is addressed to appliance 0x7f4d7e9bc4c0, which maps to 192.168.12.10. The cell hash from the event (3249924569) maps to a different cell though. You can find the corresponding kcfis_push earlier in the trace. In fact, it is the first occurrence after the last WAIT event line.

trans_req_element = 0x7f4d7e870780
Pushing request : disknumber = 5, offset = 937011707904, len = 4194304 rdba: 44570624 version 0 reqid=1021 cver=1
disk=0x7f4d7e9b5980 state=1
disk=0x7f4d7e9b4968 state=1
disk=0x7f4d7e9b3460 state=1
disk=0x7f4d7e9b2f70 state=1
disk=0x7f4d7e9df5a0 state=1
disk=0x7f4d7e9df0b0 state=1
disk=0x7f4d7e9dea78 state=1
kcfis_create_maps_payload. alloc_len 2648 num maps 35
throttle: mappayload: maps_to_push 10 iosize_being_pushed 102039552
kcfis_push: pushing maps to appliance 0x7f4d7ea65570
kcfis_issue_ioctl: payload_type 2
SKGXP-connecting to the box 192.168.12.8 OSS OS Pid 4541
oss_ioctl open file: 2, opcode: 99
Sending message RQ_Tag_2341095748_52: 0xf9c6e88 with frag # 0, to box my_box_refid: 0 (box inc: 16)
Command with band 3, payload length 2648, payload 0x7f4d7ea62e98
RQ_Tag_2341095748_52: Command name IOCTL, refid 52
Ioctl arguments fd 1 opcode 99 size 2648
oss_ioctl done

This is getting too complex, can you say it again?

Now let’s put this together again, simplified:

WAIT #139970814085680: nam='cell smart table scan' ela= 24 cellhash#=822451848 p2=0 p3=0 obj#=61471 tim=3470911267234

kcfis_push: num-appliances 3. payload_size 0x7f4d837e4cf0 ioctl_issued 0x7f4d837e4cc8 results 0x7f4d837e4c48 payload_type 0x7f4d837e4d20
kcfis_create_maps_payload. appliance 0x7f4d7ea65570 num_disks 12
...
kcfis_create_maps_payload. alloc_len 2648 num maps 35
throttle: mappayload: maps_to_push 10 iosize_being_pushed 102039552
kcfis_push: pushing maps to appliance 0x7f4d7ea65570
kcfis_issue_ioctl: payload_type 2
SKGXP-connecting to the box 192.168.12.8 OSS OS Pid 4541
oss_ioctl open file: 2, opcode: 99
Sending message RQ_Tag_2341095748_52: 0xf9c6e88 with frag # 0, to box my_box_refid: 0 (box inc: 16)
Command with band 3, payload length 2648, payload 0x7f4d7ea62e98
RQ_Tag_2341095748_52: Command name IOCTL, refid 52
Ioctl arguments fd 1 opcode 99 size 2648
oss_ioctl done
...
kcfis_create_maps_payload. appliance 0x7f4d7e9bf518 num_disks 12
...
kcfis_create_maps_payload. alloc_len 1208 num maps 15
throttle: mappayload: maps_to_push 10 iosize_being_pushed 34471936
kcfis_push: pushing maps to appliance 0x7f4d7e9bf518
kcfis_issue_ioctl: payload_type 2
SKGXP-connecting to the box 192.168.12.9 OSS OS Pid 9547
...
oss_ioctl done
...
kcfis_create_maps_payload. appliance 0x7f4d7e9bc4c0 num_disks 12
...
kcfis_create_maps_payload. alloc_len 272 num maps 2
throttle: mappayload: maps_to_push 4 iosize_being_pushed 8388608
kcfis_push: pushing maps to appliance 0x7f4d7e9bc4c0
...
oss_ioctl done
...
oss_wait done, request 0xf9b3a40, result 0x7f4d837e4c48

WAIT #139970814085680: nam='cell smart table scan' ela= 44 cellhash#=3249924569 p2=0 p3=0 obj#=61471 tim=3470911270172

Interestingly the results for the requests come back in that order as well. It seems coincidence, judging by the rest of the trace. Compare the appliance values with the summary you just read:

WAIT #139970814085680: nam='cell smart table scan' ela= 44 cellhash#=3249924569 p2=0 p3=0 obj#=61471 tim=3470911270172

Ioctl completed. Payload type 2
Ioctl quarantine response 1 for appliance 0x7f4d7ea65570
appliance 0x7f4d7ea65570 : cellsrv pid: 4541: predicate: /box/predicate114
oss_wait called for request: 0xf9c69c0
Vrpc completion - RQ_Tag_2825762783_53: cmd 0xf9c6768, req 0xf9c69c0, frag # 1
Vrpc reply 0xf9c6820 with refid 53 for existing message: 0xf9c6728
Reply with band 1, payload length 40, payload 0x7f4d7e9bfa38
RQ_Tag_2825762783_53: Command name REPLY IOCTL, refid 53
oss_wait done, request 0xf9c69c0, result 0x7f4d837e4c68

WAIT #139970814085680: nam='cell smart table scan' ela= 64 cellhash#=674246789 p2=0 p3=0 obj#=61471 tim=3470911270269

Ioctl completed. Payload type 2
Ioctl quarantine response 1 for appliance 0x7f4d7e9bf518
appliance 0x7f4d7e9bf518 : cellsrv pid: 9547: predicate: /box/predicate37211
oss_wait called for request: 0xf9c5260
Vrpc completion - RQ_Tag_2827385864_54: cmd 0xf9c5768, req 0xf9c5260, frag # 1
Vrpc reply 0xf9c5820 with refid 54 for existing message: 0xf9c5728
Reply with band 1, payload length 40, payload 0x7f4d7e9bc9e0
RQ_Tag_2827385864_54: Command name REPLY IOCTL, refid 54
oss_wait done, request 0xf9c5260, result 0x7f4d837e4c88

WAIT #139970814085680: nam='cell smart table scan' ela= 54 cellhash#=822451848 p2=0 p3=0 obj#=61471 tim=3470911270351

Ioctl completed. Payload type 2
Ioctl quarantine response 1 for appliance 0x7f4d7e9bc4c0

SQL trace wait events frequently seem to be written after an IOCTL (standard Linux call) or oss_wait (Oracle) has completed.  So it appears as if requests are created and treated on a cell-by-cell basis. This explains why you get so many cell table smart scan events in general. It’s not because they are necessarily small in size, the payloads always seem to include more than 1 disk to scan. Thinking about it now it makes sense, but as an analyst I would have appreciated more insights into what happens under the covers. But we can trace, right?

Posted in 11g Release 2, 12c Release 1, Exadata | Leave a Comment »

Why does the Optimiser not respect my qb_name() hint?

Posted by Martin Bach on June 30, 2014

I recently was involved in an investigation on a slow-running report on an Exadata system. This was rather interesting, the raw text file with the query was 33kb in size, and SQL Developer formatted the query to > 1000 lines. There were lots of interesting constructs in the query and the optimiser did its best to make sense of the inline views and various joins.

This almost lead to a different post about the importance of understanding query transformation. You will see in a bit what QT has to do with this post. Although access paths and join methods are still very relevant when trying to understand execution plans you will be amazed by the extent of the work happening during query transformation. I have said so on twitter already, but in case you missed it: Oracle has published a SQL Tuning Guide with the 12c documentation set. This is a really, really good, well-written piece of documentation. And it contains a lot of information about query transformation in chapter 5.

Query Blocks

One of the lessons I learned when looking at complex SQL statements is to think about Query Blocks. Let’s start with an example. I created some test tables, T1 and T2, both with rather random data. T1 is bigger, about 33 million rows, T2 has about 1 million rows. All of them are based on Jonathan Lewis’s presentation about “creating tests“. This is Oracle 12.1.0.1.3 on Exadata.

Here is the statement used for this blog post, unhinted. To keep things simple there are no indexes on T1 or T2.

select /*+ gather_plan_statistics */
  /* test0013 */
 t1.id,t1.N1
from t1
where exists ( select 1 from t2 where t1.id = t2.id and t2.id between 10 and 20)
and t1.n1 = (select n1 from t1 where id = 1 and rownum = 1)   -- the subquery would return 32 rows without the count stopkey
union all
select  t2.id, t2.n1 from t2
where t2.id between 10 and 12
union all
select t2.id, t2.n1 from t2
where t2.id between 13 and 15;

The statement does not make sense for humans, but it should be logically correct (this is my first post on the optimiser, I am hoping that the more knowledgeable readers point out any inaccuracies should there be any). The intent is to introduce a few query blocks.

This results in the following execution plan:

SQL> select * from table(dbms_xplan.display_cursor(format=>'allstats last +alias'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  5zrwsqnwjzzry, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   /* test0013 */  t1.id,t1.N1 from
t1 where exists ( select 1 from t2 where t1.id = t2.id and t2.id
between 10 and 20) and t1.n1 = (select n1 from t1 where id = 1 and
rownum = 1)  union all select  t2.id, t2.n1 from t2 where t2.id between 10
and 12 union all select t2.id, t2.n1 from t2 where t2.id between 13 and
15

Plan hash value: 319705157

-----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |      |      1 |        |      6 |00:00:00.07 |     516K|    516K|       |       |          |
|   1 |  UNION-ALL                               |      |      1 |        |      6 |00:00:00.07 |     516K|    516K|       |       |          |
|*  2 |   HASH JOIN SEMI                         |      |      1 |      1 |      0 |00:00:00.07 |     485K|    485K|  1421K|  1421K|  540K (0)|
|*  3 |    TABLE ACCESS STORAGE FULL             | T1   |      1 |      1 |      0 |00:00:00.07 |     485K|    485K|  1025K|  1025K| 3085K (0)|
|*  4 |     COUNT STOPKEY                        |      |      1 |        |      1 |00:00:00.01 |       4 |      2 |       |       |          |
|*  5 |      TABLE ACCESS STORAGE FULL FIRST ROWS| T1   |      1 |      2 |      1 |00:00:00.01 |       4 |      2 |  1025K|  1025K| 1029K (0)|
|*  6 |    TABLE ACCESS STORAGE FULL             | T2   |      0 |     12 |      0 |00:00:00.01 |       0 |      0 |  1025K|  1025K|          |
|*  7 |   TABLE ACCESS STORAGE FULL              | T2   |      1 |      4 |      3 |00:00:00.01 |   15204 |  15200 |  1025K|  1025K| 3085K (0)|
|*  8 |   TABLE ACCESS STORAGE FULL              | T2   |      1 |      4 |      3 |00:00:00.01 |   15204 |  15200 |  1025K|  1025K| 3085K (0)|
-----------------------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SET$1
   2 - SEL$5DA710D3
   3 - SEL$5DA710D3 / T1@SEL$1
   4 - SEL$3
   5 - SEL$3        / T1@SEL$3
   6 - SEL$5DA710D3 / T2@SEL$2
   7 - SEL$4        / T2@SEL$4
   8 - SEL$5        / T2@SEL$5

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="T2"."ID")
   3 - storage(("T1"."ID"<=20 AND "T1"."ID">=10 AND "T1"."N1"=))
       filter(("T1"."ID"<=20 AND "T1"."ID">=10 AND "T1"."N1"=))
   4 - filter(ROWNUM=1)
   5 - storage("ID"=1)
       filter("ID"=1)
   6 - storage(("T2"."ID"<=20 AND "T2"."ID">=10))
       filter(("T2"."ID"<=20 AND "T2"."ID">=10))
   7 - storage(("T2"."ID"<=12 AND "T2"."ID">=10))
       filter(("T2"."ID"<=12 AND "T2"."ID">=10))
   8 - storage(("T2"."ID"<=15 AND "T2"."ID">=13))
       filter(("T2"."ID"<=15 AND "T2"."ID">=13))

54 rows selected.

So far so good, nothing really too surprising here but keep in mind this is a simple example. Now if I wanted to pass hints to the optimiser on how to deal with the subqueries I have to refer to the query blocks. Jonathan Lewis has described this in an older but still relevant blog post, and you can find presentations from others on the subject as well. Easy enough, I simply add a /*+ qb_name(name) */ immediately after every select, provide the query block name in the hint (/*+ full(@block_a t1@block_a) */) and I’m done. So let’s try:

select /*+ qb_name(block_a) gather_plan_statistics */
  /* test0013 */
 t1.id,t1.N1
from t1
where exists ( select /*+ qb_name(block_c) */ 1 from t2 where t1.id = t2.id and t2.id between 10 and 20)
and t1.n1 = (select /*+ qb_name(block_d) */ n1 from t1 where id = 1 and rownum = 1)
union all
select /*+ qb_name(block_e) */ t2.id, t2.n1 from t2
where t2.id between 10 and 12
union all
select /*+ qb_name(block_f) */ t2.id, t2.n1 from t2
where t2.id between 13 and 15;

Now surely I should get query blocks block_[ac-f] in the execution plan, right? When going over the post I noticed that block_b was missing in the SQL statement by the way, for some reasons it went missing, never mind.

SQL> select * from table(dbms_xplan.display_cursor(format=>'allstats last +alias'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  66ytd4g9xmwxb, child number 0
-------------------------------------
select /*+ qb_name(block_a) gather_plan_statistics */   /* test0012 */
t1.id,t1.N1 from t1 where exists ( select /*+ qb_name(block_c) */ 1
from t2 where t1.id = t2.id and t2.id between 10 and 20) and t1.n1 =
(select /*+ qb_name(block_d) */ n1 from t1 where id = 1 and rownum = 1)
union all select /*+ qb_name(block_e) */ t2.id, t2.n1 from t2 where
t2.id between 10 and 12 union all select /*+ qb_name(block_f) */ t2.id,
t2.n1 from t2 where t2.id between 13 and 15

Plan hash value: 319705157

-----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |      |      1 |        |      6 |00:00:00.04 |     516K|    516K|       |       |          |
|   1 |  UNION-ALL                               |      |      1 |        |      6 |00:00:00.04 |     516K|    516K|       |       |          |
|*  2 |   HASH JOIN SEMI                         |      |      1 |      1 |      0 |00:00:00.04 |     485K|    485K|  1421K|  1421K|  385K (0)|
|*  3 |    TABLE ACCESS STORAGE FULL             | T1   |      1 |      1 |      0 |00:00:00.04 |     485K|    485K|  1025K|  1025K| 3085K (0)|
|*  4 |     COUNT STOPKEY                        |      |      1 |        |      1 |00:00:00.01 |       4 |      2 |       |       |          |
|*  5 |      TABLE ACCESS STORAGE FULL FIRST ROWS| T1   |      1 |      2 |      1 |00:00:00.01 |       4 |      2 |  1025K|  1025K| 1029K (0)|
|*  6 |    TABLE ACCESS STORAGE FULL             | T2   |      0 |     12 |      0 |00:00:00.01 |       0 |      0 |  1025K|  1025K|          |
|*  7 |   TABLE ACCESS STORAGE FULL              | T2   |      1 |      4 |      3 |00:00:00.01 |   15204 |  15200 |  1025K|  1025K| 3085K (0)|
|*  8 |   TABLE ACCESS STORAGE FULL              | T2   |      1 |      4 |      3 |00:00:00.01 |   15204 |  15200 |  1025K|  1025K| 3085K (0)|
-----------------------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SET$1
   2 - SEL$7CA7EA44
   3 - SEL$7CA7EA44 / T1@BLOCK_A
   4 - BLOCK_D
   5 - BLOCK_D      / T1@BLOCK_D
   6 - SEL$7CA7EA44 / T2@BLOCK_C
   7 - BLOCK_E      / T2@BLOCK_E
   8 - BLOCK_F      / T2@BLOCK_F

Surprise, surprise

Umm, where are the query blocks A and C (remember there is no hint for block_b)? I was hoping to get rid of all the system generated names, but no luck. Actually, that’s not quite right, and to understand this you have to look at the 10053 trace, which is mostly harmless. In the trace you find all the query block names, right at the beginning:

Registered qb: BLOCK_A 0x3bc2578 (HINT BLOCK_A)
---------------------
QUERY BLOCK SIGNATURE
---------------------
  signature (): qb_name=BLOCK_A nbfros=1 flg=0
    fro(0): flg=4 objn=61471 hint_alias="T1"@"BLOCK_A"

Registered qb: BLOCK_C 0x3bc5610 (HINT BLOCK_C)
---------------------
QUERY BLOCK SIGNATURE
---------------------
  signature (): qb_name=BLOCK_C nbfros=1 flg=0
    fro(0): flg=4 objn=61485 hint_alias="T2"@"BLOCK_C"

Registered qb: BLOCK_D 0xfedf8560 (HINT BLOCK_D)
---------------------
QUERY BLOCK SIGNATURE
---------------------
  signature (): qb_name=BLOCK_D nbfros=1 flg=0
    fro(0): flg=4 objn=61471 hint_alias="T1"@"BLOCK_D"

Registered qb: BLOCK_E 0xfedf7290 (HINT BLOCK_E)
---------------------
QUERY BLOCK SIGNATURE
---------------------
  signature (): qb_name=BLOCK_E nbfros=1 flg=0
    fro(0): flg=4 objn=61485 hint_alias="T2"@"BLOCK_E"

Registered qb: BLOCK_F 0xfedf5f50 (HINT BLOCK_F)
---------------------
QUERY BLOCK SIGNATURE
---------------------
  signature (): qb_name=BLOCK_F nbfros=1 flg=0
    fro(0): flg=4 objn=61485 hint_alias="T2"@"BLOCK_F"

Registered qb: SET$1 0xfedf4b28 (PARSER)

So why is there no BLOCK_A in the Query Block Name / Object Alias section of the plan? Now it’s time to scroll down (a lot) or to search for the query block from the execution plan (SEL$7CA7EA44 for example).

I found it in the query transformation part of the optimiser trace, more specifically in the cost-based query transformation part. There you can see how the optimiser tries various optimisations to the SQL statement you issued. Not really surprisingly the optimiser works on the EXISTS part trying to unnest it. Here’s the excerpt from the trace:

*****************************
Cost-Based Subquery Unnesting
*****************************
SU: Unnesting query blocks in query block SET$1 (#1) that are valid to unnest.
Subquery removal for query block BLOCK_D (#5)
RSW: Not valid for subquery removal BLOCK_D (#5)
Subquery unchanged.
Subquery Unnesting on query block BLOCK_A (#4)SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block BLOCK_A (#4).
SU:   Checking validity of unnesting subquery BLOCK_C (#6)
SU:   Passed validity checks.
SU:   Transforming EXISTS subquery to a join.
Registered qb: SEL$7CA7EA44 0xfedfa350 (SUBQUERY UNNEST BLOCK_A; BLOCK_C)
---------------------
QUERY BLOCK SIGNATURE
---------------------
  signature (): qb_name=SEL$7CA7EA44 nbfros=2 flg=0
    fro(0): flg=0 objn=61471 hint_alias="T1"@"BLOCK_A"
    fro(1): flg=0 objn=61485 hint_alias="T2"@"BLOCK_C"

Notice the new query block name: Registered qb: SEL$7CA7EA44 0xfedfa350 (SUBQUERY UNNEST BLOCK_A; BLOCK_C). And there you go, the rewritten part of the SQL statement “replaces” the original blocks A and C.

Many thanks go to Karen Morton who helped me along the way and had so many good suggestions!

Posted in 12c Release 1, Exadata, Linux | 5 Comments »

 
Follow

Get every new post delivered to your Inbox.

Join 2,317 other followers