Martins Blog

Trying to explain complex things in simple terms

Exadata and Virtual Private Database: will it offload my query?

Posted by Martin Bach on October 14, 2014

During one of the classes I taught about Exadata optimisations I had an interesting question:

If I am using VPD, will Exadata still offload the query?

Background is that we discussed function offloading, and the meta-view v$sqlfn_metadata. It turned out that SYS_CONTEXT() is not offloadable in 11.2.0.4.

SQL> select name,datatype,analytic,aggregate,offloadable,descr
  2  from v$sqlfn_metadata where name = 'SYS_CONTEXT';

NAME                                               DATATYPE ANA AGG OFF DESCR
-------------------------------------------------- -------- --- --- --- ------------------------------
SYS_CONTEXT                                        UNKNOWN  NO  NO  NO  SYS_CONTEXT

Since I’m a great fan of the Tom Kyte method (don’t say it is so, show it!) I needed a quick example. Of all the Oracle books I read “Effective Oracle by Design” was among the most inspiring.

Where to check?

My first idea was to check v$sqlfn_metadata to see if the ever present SYS_CONTEXT() was offloadable:

SQL> select name,offloadable
  2  from v$sqlfn_metadata
  3  where name = 'SYS_CONTEXT';

NAME                    OFF
----------------------- ---
SYS_CONTEXT             NO

SYS:dbm011> select banner from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE	11.2.0.4.0	Production
TNS for Linux: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 - Production

OK so it does not appear to be offloadable. Does it matter? As you will see, it does not although it could startle you at first.

Reproduce a test case: 11.2.0.4/11.2.3.3.1.140529.1

After finishing the example from the Oracle documentation I decided that I needed a few more rows in the table to get to a more realistic data distribution for an Exadata system. You wouldn’t see a smart scan on a row with < 10 rows. This can be done quite easily, and the end result was:

SYS> select count(*) from scott.orders_tab;

   COUNT(*)
-----------
   12000003

1 row selected.

I had to ensure that the query is offloaded first-simplez! Kerry Osborne has a script for this:

SYS> select /*+ gather_plan_statistics rowcount_sys_stats */
  2  count(*) from scott.orders_tab;

   COUNT(*)
-----------
   12000003

1 row selected.

Elapsed: 00:00:00.81

SYS:OGGSRC> @scripts/fsx
Enter value for sql_text: %rowcount_sys_stats%
Enter value for sql_id:

SQL_ID	       CHILD   PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ----------- ------ ---------- ------ ------- ---------- ----------------------------------------------------------------------
8bv5b04mjku08	   0  1093340548      1        .81	0 Yes	       38.47 select /*+ gather_plan_statistics rowcount_sys_stats */ count(*) from

1 row selected.

SYS:OGGSRC> @scripts/dplan
Enter value for sql_id: 8bv5b04mjku08
Enter value for child_no:

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	8bv5b04mjku08, child number 0
-------------------------------------
select /*+ gather_plan_statistics rowcount_sys_stats */ count(*) from
scott.orders_tab

Plan hash value: 1093340548

---------------------------------------------------------------------------------
| Id  | Operation                  | Name       | Rows  | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |            |       |  7521 (100)|          |
|   1 |  SORT AGGREGATE            |            |     1 |            |          |
|   2 |   TABLE ACCESS STORAGE FULL| ORDERS_TAB |    12M|  7521   (1)| 00:01:31 |
---------------------------------------------------------------------------------

15 rows selected.

So I guess it is.

Enter VPD

With all the VPD bells and whistles turned on I connected as on the the users for which there is an explicit context set and ran my query:

TBROOKE> select /*+ gather_plan_statistics tbrooke_query_001 */
  2  count(*) from scott.orders_tab;

   COUNT(*)
-----------
          4

1 row selected.

Elapsed: 00:00:00.15

1 row selected.

Which shows that the VPD policy works. What do my tools indicate?

SYS> @scripts/fsx
Enter value for sql_text: %tbrooke_query_001%
Enter value for sql_id:

SQL_ID	       CHILD   PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ----------- ------ ---------- ------ ------- ---------- ----------------------------------------------------------------------
bs9gq0bdqazzu	   0  1093340548      1        .15	0 Yes	       99.98 select /*+ gather_plan_statistics tbrooke_query_001 */ count(*) from s

1 row selected.

SYS> @scripts/dplan
Enter value for sql_id: bs9gq0bdqazzu
Enter value for child_no: 0

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	bs9gq0bdqazzu, child number 0
-------------------------------------
select /*+ gather_plan_statistics tbrooke_query_001 */ count(*) from
scott.orders_tab

Plan hash value: 1093340548

-----------------------------------------------------------------------------------------
| Id  | Operation                  | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |            |       |       |  7651 (100)|          |
|   1 |  SORT AGGREGATE            |            |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL| ORDERS_TAB |     1 |     6 |  7651   (3)| 00:01:32 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("CUST_NO"=TO_NUMBER(SYS_CONTEXT('orders_ctx','cust_no')))
       filter("CUST_NO"=TO_NUMBER(SYS_CONTEXT('orders_ctx','cust_no')))


21 rows selected.

So using the script it is easily visible that a smart scan has happened and in fact it saved 99.98% of IO. Which is not surprising giving that only 4 rows out of the whole result set have been returned. The Real Time SQL Monitor Report confirmed the finding by the way. I had to sneak in another hint (+monitor) otherwise the statement wouldn’t be captured in SQL Monitor (only “long running” statements are captured by default)

SQL Monitoring Report

SQL Text
------------------------------
select /*+ gather_plan_statistics monitor tbrooke_query_002 */ count(*) from scott.orders_tab

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  TBROOKE (33:239)
 SQL ID              :  8ydqam3fuwt2z
 SQL Execution ID    :  16777216
 Execution Started   :  10/14/2014 04:28:24
 First Refresh Time  :  10/14/2014 04:28:24
 Last Refresh Time   :  10/14/2014 04:28:24
 Duration            :  .100139s
 Module/Action       :  SQL*Plus/-
 Service             :  SYS$USERS
 Program             :  sqlplus@enkdb01.enkitec.com (TNS V1-V3)
 Fetch Calls         :  1

Global Stats
====================================================================================
| Elapsed |   Cpu   |    IO    | PL/SQL  | Fetch | Buffer | Read | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) | Time(s) | Calls |  Gets  | Reqs | Bytes | Offload |
====================================================================================
|    0.10 |    0.01 |     0.09 |    0.00 |     1 |  27352 |  228 | 214MB |  99.98% |
====================================================================================

SQL Plan Monitoring Details (Plan Hash Value=1093340548)
=========================================================================================================================================================================
| Id |          Operation           |    Name    |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Cell   |  Mem  | Activity | Activity Detail |
|    |                              |            | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload | (Max) |   (%)    |   (# samples)   |
=========================================================================================================================================================================
|  0 | SELECT STATEMENT             |            |         |      |         1 |     +0 |     1 |        1 |      |       |         |       |          |                 |
|  1 |   SORT AGGREGATE             |            |       1 |      |         1 |     +0 |     1 |        1 |      |       |         |       |          |                 |
|  2 |    TABLE ACCESS STORAGE FULL | ORDERS_TAB |       1 | 7651 |         1 |     +0 |     1 |        4 |  228 | 214MB |  99.98% |    2M |          |                 |
=========================================================================================================================================================================

This confirms that a full table scan happened, and it must have been executed as a direct path read. A DPR on Exadata most oftent transforms into a smart scan. As you can see I didn’t specify any other predicate, and yet the VPD was offloaded.

Reproduce a test case: 12.1.0.2.0/12.1.1.1.1.140712

Interestingly in 12.1.0.2.0 the function SYS_CONTEXT is offloadable:

SQL> select name, offloadable
  2  from v$sqlfn_metadata where name = 'SYS_CONTEXT';

NAME                           OFF
------------------------------ ---
SYS_CONTEXT                    YES

1 row selected.

For completeness sake I have repeated my test here. I copied the table via network link and created the same users and VPD. Not too much of a difference. Information is shown here without comments:

SQL> select banner from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
PL/SQL Release 12.1.0.2.0 - Production
CORE    12.1.0.2.0      Production
TNS for Linux: Version 12.1.0.2.0 - Production
NLSRTL Version 12.1.0.2.0 - Production


TBROOKE> select /*+ gather_plan_statistics tbrooke_query_013 */ count(*) from scott.orders_tab;

  COUNT(*)
----------
         4

Elapsed: 00:00:00.02

SQL> @scripts/fsx
Enter value for sql_text: %tbrooke_query_013%
Enter value for sql_id:

SQL_ID         CHILD  PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ---------- ------ ---------- ------ ------- ---------- ----------------------------------------------------------------------
bf0s5hzr7x9r5      0 1093340548      1        .02      0 Yes          99.97 select /*+ gather_plan_statistics tbrooke_query_013 */ count(*) from s

1 row selected.

SQL> @scripts/dplan
Enter value for sql_id: bf0s5hzr7x9r5
Enter value for child_no: 0

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  bf0s5hzr7x9r5, child number 0
-------------------------------------
select /*+ gather_plan_statistics tbrooke_query_013 */ count(*) from
scott.orders_tab

Plan hash value: 1093340548

-----------------------------------------------------------------------------------------
| Id  | Operation                  | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |            |       |       |     2 (100)|          |
|   1 |  SORT AGGREGATE            |            |     1 |    13 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL| ORDERS_TAB |     1 |    13 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("CUST_NO"=TO_NUMBER(SYS_CONTEXT('orders_ctx','cust_no')))
       filter("CUST_NO"=TO_NUMBER(SYS_CONTEXT('orders_ctx','cust_no')))


21 rows selected.

SQL Monitoring Report

SQL Text
------------------------------
select /*+ gather_plan_statistics monitor tbrooke_query_014 */ count(*) from scott.orders_tab

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  TBROOKE (1042:32766)
 SQL ID              :  55yy67scgw2sf
 SQL Execution ID    :  16777216
 Execution Started   :  10/14/2014 05:13:30
 First Refresh Time  :  10/14/2014 05:13:30
 Last Refresh Time   :  10/14/2014 05:13:30
 Duration            :  .016025s
 Module/Action       :  SQL*Plus/-
 Service             :  SYS$USERS
 Program             :  sqlplus@enkdb03.enkitec.com (TNS V1-V3)
 Fetch Calls         :  1

Global Stats
==================================================================================================
| Elapsed |   Cpu   |    IO    | Application | PL/SQL  | Fetch | Buffer | Read | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Time(s) | Calls |  Gets  | Reqs | Bytes | Offload |
==================================================================================================
|    0.02 |    0.01 |     0.01 |        0.00 |    0.00 |     1 |  27441 |  223 | 214MB |  99.99% |
==================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=1093340548)
=========================================================================================================================================================================
| Id |          Operation           |    Name    |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Cell   |  Mem  | Activity | Activity Detail |
|    |                              |            | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload | (Max) |   (%)    |   (# samples)   |
=========================================================================================================================================================================
|  0 | SELECT STATEMENT             |            |         |      |         1 |     +0 |     1 |        1 |      |       |         |       |          |                 |
|  1 |   SORT AGGREGATE             |            |       1 |      |         1 |     +0 |     1 |        1 |      |       |         |       |          |                 |
|  2 |    TABLE ACCESS STORAGE FULL | ORDERS_TAB |       1 |    2 |         1 |     +0 |     1 |        4 |  223 | 214MB |  99.99% |    3M |          |                 |
=========================================================================================================================================================================

In the last example flash cache and storage indexes attributed greatly to the quick execution time. Using Adrian Billington’s mystats I can see more detail. I have removed what’s not needed from the report.

SQL> @scripts/mystats stop t=1

==========================================================================================
MyStats report : 14-OCT-2014 05:27:44
==========================================================================================


------------------------------------------------------------------------------------------
1. Summary Timings
------------------------------------------------------------------------------------------

Type    Statistic Name                                                               Value
------  ----------------------------------------------------------------  ----------------
TIMER   snapshot interval (seconds)                                                  13.59
TIMER   CPU time used (seconds)                                                       0.01


------------------------------------------------------------------------------------------
2. Statistics Report
------------------------------------------------------------------------------------------

Type    Statistic Name                                                               Value
------  ----------------------------------------------------------------  ----------------
STAT    CPU used by this session                                                         7
STAT    CPU used when call started                                                       7
STAT    DB time                                                                          9
...
STAT    cell IO uncompressed bytes                                               3,031,040
STAT    cell blocks helped by minscn optimization                                      370
STAT    cell blocks processed by cache layer                                           370
STAT    cell blocks processed by data layer                                            370
STAT    cell blocks processed by txn layer                                             370
STAT    cell flash cache read hits                                                      10
STAT    cell num smartio automem buffer allocation attempts                              1
STAT    cell physical IO bytes eligible for predicate offload                  224,763,904
STAT    cell physical IO bytes saved by storage index                          221,732,864
STAT    cell physical IO interconnect bytes                                          5,360
STAT    cell physical IO interconnect bytes returned by smart scan                   5,360
STAT    cell scans                                                                       1
...
STAT    consistent gets                                                             27,816
STAT    consistent gets direct                                                      27,437
STAT    consistent gets examination                                                      3
STAT    consistent gets examination (fastpath)                                           3
STAT    consistent gets from cache                                                     379
STAT    consistent gets pin                                                            376
STAT    consistent gets pin (fastpath)                                                 376
STAT    db block changes                                                               787
STAT    db block gets                                                                2,879
STAT    db block gets from cache                                                     2,879
...
STAT    logical read bytes from cache                                           26,689,536
...
STAT    physical read bytes                                                    224,763,904
STAT    physical read requests optimized                                               223
STAT    physical read total IO requests                                                223
STAT    physical read total bytes                                              224,763,904
STAT    physical read total bytes optimized                                    224,763,904
STAT    physical read total multi block requests                                       215
STAT    physical reads                                                              27,437
STAT    physical reads direct                                                       27,437
...
STAT    table fetch by rowid                                                             1
STAT    table scan blocks gotten                                                         3
STAT    table scan disk non-IMC rows gotten                                          1,685
STAT    table scan rows gotten                                                       1,685
STAT    table scans (direct read)                                                        1
STAT    table scans (long tables)                                                        1
...

------------------------------------------------------------------------------------------
3. About
------------------------------------------------------------------------------------------
- MyStats v2.01 by Adrian Billington (http://www.oracle-developer.net)
- Based on the SNAP_MY_STATS utility by Jonathan Lewis

==========================================================================================
End of report
==========================================================================================

In the case of VPD the fact that SYS_CONTEXT() is offloadable did not play a major role.

Posted in 11g Release 2, 12c Release 1, Exadata | Tagged: , | Leave a Comment »

Little things worth knowing-troubleshooting parallel statement queueing

Posted by Martin Bach on October 9, 2014

This is yet another one of these posts that hopefully help you as much as they are going to help me in the future. Recently I enjoyed troubleshooting a problem related to parallel execution. Since I have never really written down how to tackle such a problem I thought it might be nice to do that now.

This is 12.1.0.2.0 on Exadata, but the platform doesn’t really matter for the troubleshooting technique.

What is parallel statement queueing

Statement queueing is a pretty nifty tool new with 11.2. Instead of downgrading the requested Degree of Parallelism (DOP) as it was the case with parallel_adaptive_multi_user your session will queue until the requested number of parallel servers becomes available. The rationale behind this feature is that you might end up queueing for 1 minute until all the requested PX servers become available, but once they are available your session can finish in 2 minutes with a DOP of 16 rather than 8 minutes with a (old behaviour) downgraded DOP of 2 for example. The 1 minute was well spent waiting. If you want to fine tune statement queueing, database resource manager allows you to.

The downside is that you don’t get statement queueing without enabling automatic DOP unless you play with underscore parameters. Automatic DOP is quite often the reason why you see statement queueing in the first place.

How does this statement queueing work?

With parallel_degree_policy set to AUTO (warning again-this can be dangerous!) the optimiser estimates the time it takes to execute a SQL statement during the hard parse. If the estimated execution time exceeds parallel_min_time_threshold (default is AUTO which is 10 seconds) then the optimiser will try to work out the “optimal” DOP which is calculated as min(parallel_degree_limit,  calculated ideal DOP). The default parallel_degree_limit is set to CPU, which is documented as follows: “The maximum degree of parallelism is limited by the number of CPUs in the system”. Ooops-that can be a lot. It might be better to set the parallel_degree_limit to an integer, signifying the maximum allowed DOP for a statement executing in parallel. On Exadata you don’t normally need very high degrees of parallelism, so a more conservative settings might be in order.

Queueing

When does the queueing mechanism kick in? Let’s imagine that you have parallel_max_servers = parallel_servers_target = 48 defined in your system, in my case an Exadata X2 with 2s12c24t Westmere EP processors. This isn’t the default by the way. This setting allows you to execute queries 48-way parallel per instance. Let’s assume that 40 query slaves are in use already, and another user wants to execute a query 12-way parallel. The sum of currently executing parallel slaves + the requested number of slaves exceeds parallel_servers_target, triggering statement queueing for your session.

Wolfgang it!

To create a reproducible test case I set parallel_max_servers to 48 and parallel_servers_target to 24. I also enabled automatic DOP in my session. My test queries are directed at a 256 million rows table of approximately 80GB in size, non partitioned. Before executing the first SQL statement I enabled event 10053 to Wolfgang my query.

SQL> alter session set events '10053 trace name context forever, level 1';

Session altered.

Elapsed: 00:00:00.02
SQL> desc bigtab
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID                                                 NUMBER
 V1                                                 VARCHAR2(40)
 N1                                                 NUMBER
 N2                                                 NUMBER
 N_256K                                             NUMBER
 N_128K                                             NUMBER
 N_8K                                               NUMBER
 PADDING                                            VARCHAR2(960)

SQL> select 1 + sum(n1) + sum (n2) - sum(n_8k) from bigtab where id > 1;

1+SUM(N1)+SUM(N2)-SUM(N_8K)
---------------------------
                 1.0471E+13

Elapsed: 00:00:35.59

The 10053 trace should give me some idea about the DOP. Let’s have a look at it:

     36 **************************
     37 Automatic degree of parallelism (AUTODOP)
     38 **************************
     39 Automatic degree of parallelism is enabled for this statement in auto mode.
     40 kkopqSetForceParallelProperties: Hint:no
     41 Query: compute:yes forced:no forceDop:0
     42 Calibration statistics is enabled.
     43 Start with a serial pass, cost the DOP to use

   1935 AUTO DOP PLANS EVALUATION
   1936 ***************************************
   1937 Compilation completed with Dop: 1.
   1938   Cost_io:  2776823.000000  Cost_cpu: 154935265907
   1939   Card:     1.000000  Bytes:    19.000000
   1940   Cost:     2781428.613003  Est_time:  401168ms
   1941 kkopqCombineDop: Dop:40 Hint:no
   1942 Query: compute:yes forced:no  scanDop:40 cpuDop:1 forceDop:0
   1943 Serial plan is expensive enough to be a candidate for parallelism (2781429)
   1944 AutoDOP: kkopqAdjustDop: dop is 40 after affinity normalization
   1945 Signal reparse with DOP 40.
   1946 *****************************
   1947 Number of Compilations tried: 1
   1948 *****************************

   3853 AUTO DOP PLANS EVALUATION
   3854 ***************************************
   3855 Compilation completed with Dop: 40.
   3856   Cost_io:  77133.972222  Cost_cpu: 2387701756
   3857   Card:     1.000000  Bytes:    19.000000
   3858   Cost:     77204.949157  Est_time:  11135ms
   3859 kkopqCombineDop: Dop:80 Hint:no
   3860 Query: compute:yes forced:no  scanDop:40 cpuDop:80 forceDop:0
   3861 AutoDOP: kkopqAdjustDop: computed dop before applying degree limit was 80 and now it is 48
   3862 AutoDOP: kkopqAdjustDop: dop is 48 after affinity normalization
   3863 Adopt new plan with dop=40
   3864 Signal reparse with DOP 48.
   3865 *****************************
   3866 Number of Compilations tried: 2
   3867 *****************************

   5772 AUTO DOP PLANS EVALUATION
   5773 ***************************************
   5774 Compilation completed with Dop: 48.
   5775   Cost_io:  64278.310185  Cost_cpu: 1989751463
   5776   Card:     1.000000  Bytes:    19.000000
   5777   Cost:     64337.457631  Est_time:  9279ms
   5778 kkopqCombineDop: Dop:80 Hint:no
   5779 Query: compute:yes forced:no  scanDop:40 cpuDop:80 forceDop:0
   5780 Adopt new plan with dop=48
   5781 kkopqSetDopReason: Reason why we chose this DOP is: degree limit.
   5782 Costing completed. DOP chosen: 48.
   5783 AutoDop: kkopqSetMaxDopInCursorEnv:In the Cursor Session Env, max DOP is set to 48
   5784 *****************************
   5785 Number of Compilations tried: 3
   5786 *****************************

   5811 ============
   5812 Plan Table
   5813 ============
   5814 --------------------------------------------------+-----------------------------------+-------------------------+
   5815 | Id  | Operation                       | Name    | Rows  | Bytes | Cost  | Time      |  TQ  |IN-OUT|PQ Distrib |
   5816 --------------------------------------------------+-----------------------------------+-------------------------+
   5817 | 0   | SELECT STATEMENT                |         |       |       |   63K |           |      |      |           |
   5818 | 1   |  SORT AGGREGATE                 |         |     1 |    19 |       |           |      |      |           |
   5819 | 2   |   PX COORDINATOR                |         |       |       |       |           |      |      |           |
   5820 | 3   |    PX SEND QC (RANDOM)          | :TQ10000|     1 |    19 |       |           |:Q1000| P->S |QC (RANDOM)|
   5821 | 4   |     SORT AGGREGATE              |         |     1 |    19 |       |           |:Q1000| PCWP |           |
   5822 | 5   |      PX BLOCK ITERATOR          |         |  276M | 5248M |   63K |  00:02:09 |:Q1000| PCWC |           |
   5823 | 6   |       TABLE ACCESS STORAGE FULL | BIGTAB  |  276M | 5248M |   63K |  00:02:09 |:Q1000| PCWP |           |
   5824 --------------------------------------------------+-----------------------------------+-------------------------+
   5825 Predicate Information:
   5826 ----------------------
   5827 6 - storage(:Z>=:Z AND :Z<=:Z AND "ID">1)
   5828 6 - filter("ID">1)
   5829
   5830 Content of other_xml column
   5831 ===========================
   5832   derived_cpu_dop: 80
   5833   derived_io_dop : 40
   5834   dop_reason     : degree limit
   5835   dop            : 48
   5836   px_in_memory_imc: no
   5837   px_in_memory   : no
   5838   io_rate        : 200
   5839   derived_io_dop : 40
   5840   cpu_rate       : 1000
   5841   derived_cpu_dop: 52
   5842   cpu_rate       : 1000
   5843   derived_cpu_dop: 29
   5844   cpu_rate       : 1000
   5845   derived_cpu_dop: 29
   5846   db_version     : 12.1.0.2
   5847   parse_schema   : MARTIN
   5848   dynamic_sampling: 11
   5849   plan_hash_full : 1216263238
   5850   plan_hash      : 4050997022
   5851   plan_hash_2    : 1216263238
   5852   Outline Data:
   5853   /*+
   5854     BEGIN_OUTLINE_DATA
   5855       IGNORE_OPTIM_EMBEDDED_HINTS
   5856       OPTIMIZER_FEATURES_ENABLE('12.1.0.2')
   5857       DB_VERSION('12.1.0.2')
   5858       OPT_PARAM('optimizer_dynamic_sampling' 11)
   5859       ALL_ROWS
   5860       SHARED(48)
   5861       OUTLINE_LEAF(@"SEL$1")
   5862       FULL(@"SEL$1" "BIGTAB"@"SEL$1")
   5863     END_OUTLINE_DATA
   5864   */

There you have it! The v$sql_plan.other_xml column contains all the necessary information-no need for a 10053 trace :). The fanciest way I can come up with to get that information is to use this statement:

select t.*
 from v$sql_plan v,
  xmltable(
	'/other_xml/info'
	passing xmltype(v.other_xml)
	columns
		info_type varchar2(30) path '@type',
		info_value varchar2(30) path '/info'
  ) t
where v.sql_id = '&sql_id'
  and v.child_number = &child_number
  and other_xml is not null;

INFO_TYPE                      INFO_VALUE
------------------------------ ------------------------------
derived_cpu_dop                51
derived_io_dop                 40
dop_reason                     degree limit
dop                            48
px_in_memory_imc               no
px_in_memory                   no
io_rate                        200
derived_io_dop                 40
cpu_rate                       1000
derived_cpu_dop                26
cpu_rate                       1000
derived_cpu_dop                26
cpu_rate                       1000
derived_cpu_dop                26
db_version                     12.1.0.2
parse_schema                   "MARTIN"
plan_hash_full                 1216263238
plan_hash                      4050997022
plan_hash_2                    1216263238

19 rows selected.

But I digress… By the way-this statement used 48 PX server processes because it was the only one active at the time.

Causing trouble

After hitting publish I noticed that I copied the wrong example…

An example working better for the purpose of demonstrating the effect of queueing uses a more realistic case where statements use different DOPs. Two sessions are enough, one that will use 12 slaves, and another that requests 36. The statement with a DOP of 12 is started first, the one with a DOP of 36 a few seconds later. In a third session I gathered some diagnostic information:

SYS:dbm011> select px.sid, px.serial#,px.qcsid,px.qcserial#,px.qcinst_id,px.degree,px.req_degree,
  2  s.username, s.sql_id, s.sql_child_number, s.event, s.state
  3  from v$px_session px, v$session s
  4  where s.sid = px.sid
  5    and s.serial# = px.serial#
  6  order by px.qcsid;

        SID     SERIAL#       QCSID   QCSERIAL#   QCINST_ID      DEGREE  REQ_DEGREE USERNAME             SQL_ID        SQL_CHILD_NUMBER EVENT                                                            STATE
----------- ----------- ----------- ----------- ----------- ----------- ----------- -------------------- ------------- ---------------- ---------------------------------------------------------------- --------------------
        265       14828         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITED SHORT TIME
        330       21719         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        395       59827         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITED SHORT TIME
        461       17363         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        525        4955         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        592        2687         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        526       60805         526                                                 MARTIN               9md3ukhghgj58                0 PX Deq: Execute Reply                                            WAITING
        720       27626         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        786       47540         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITED KNOWN TIME
        850       41110         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITED SHORT TIME
        918       46836         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        983       12699         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING
        657       22654         526       60805           1          12          12 MARTIN               9md3ukhghgj58                0 cell smart table scan                                            WAITING

13 rows selected.

SYS:dbm011> select count(qcsid),qcsid from v$px_session group by qcsid;

COUNT(QCSID)       QCSID
------------ -----------
          13         526

SYS:dbm011>
SYS:dbm011> select count(event),event from v$session where username = 'MARTIN' group by event;

COUNT(EVENT) EVENT
------------ ----------------------------------------------------------------
          12 cell smart table scan
           1 PX Deq: Execute Reply
           1 resmgr:pq queued

SYS:dbm011>

So my first session is well underway, 12 way parallel as requested. The other session had to wait but kicked into motion as soon as my first session finished.

SYS:dbm011> select px.sid, px.serial#,px.qcsid,px.qcserial#,px.qcinst_id,px.degree,px.req_degree,
  2  s.username, s.sql_id, s.sql_child_number, s.event, s.state
  3  from v$px_session px, v$session s
  4  where s.sid = px.sid
  5    and s.serial# = px.serial#
  6  order by px.qcsid;

        SID     SERIAL#       QCSID   QCSERIAL#   QCINST_ID      DEGREE  REQ_DEGREE USERNAME             SQL_ID        SQL_CHILD_NUMBER EVENT                                                            STATE
----------- ----------- ----------- ----------- ----------- ----------- ----------- -------------------- ------------- ---------------- ---------------------------------------------------------------- --------------------
        265       43575         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        525       17778         525                                                 MARTIN               4dz3dapm4uy8q                0 PX Deq: Execute Reply                                            WAITING
        395       10504         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        462       57813         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        526       12648         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        591       42643         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        657        5236         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        720       19243         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        786       61636         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        850       17734         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED KNOWN TIME
        918       60016         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        983       24489         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
       1044       41056         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
       1108        8936         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
       1178       52303         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED KNOWN TIME
       1242       37008         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
       1307       28798         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED SHORT TIME
       1372       22172         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED SHORT TIME
       1437       59990         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED SHORT TIME
       1501        9307         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
          6       39400         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
         72       28569         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        137       18368         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        202       56832         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        333        6482         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        397       28334         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        461       17550         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        524         958         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        593       41235         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        656       51686         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED KNOWN TIME
        719       40346         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED SHORT TIME
        785        5157         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        851       35243         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITED KNOWN TIME
        917       45635         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        980       18504         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
       1046       27779         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING
        330       10724         525       17778           1          36          36 MARTIN               4dz3dapm4uy8q                0 cell smart table scan                                            WAITING

37 rows selected.

SYS:dbm011>
SYS:dbm011> select count(qcsid),qcsid from v$px_session group by qcsid;

COUNT(QCSID)       QCSID
------------ -----------
          37         525

SYS:dbm011>
SYS:dbm011> select count(event),event from v$session where username = 'MARTIN' group by event;

COUNT(EVENT) EVENT
------------ ----------------------------------------------------------------
          36 cell smart table scan
           1 PX Deq: Execute Reply

The effect of queueing can be seen in the extended execution time:

  • 12 way parallel-no queueing: Elapsed: 00:00:07.16
  • 36 way parallel-with a bit of queueing: Elapsed: 00:00:12.92

V$PX_SESSION allows you to identify the culprit causing other sessions to queue-have a look at it and see if you can optimise the statement or run it later. This should unblock you. Also have a look at parallel_servers_target- you might have limited yourself too much. I personally don’t exceed the core/thread count on a box by too much to avoid scheduling problems.

Posted in 11g Release 2, 12c Release 1 | Leave a Comment »

Interesting observation about standby redo logs in Data Guard

Posted by Martin Bach on September 22, 2014

Some of you might have followed the discussion around the number of standby redo logs on twitter, but since 140 characters are woefully short for the complete story here’s the writeup that prompted the question. This is a test with 12.1.0.2 on virtualised Linux, repeated on a proper platform with physical hardware.

First of all here’s my setup. I have a dbca-based database (CDB, but doesn’t matter) that features 3 groups for its online redo logs. They are all 50 MB in size-important for this test, but not realistic :) Following the Oracle documentation I created n + 1 groups (per thread) on the standby to stop Data Guard broker from complaining about missing standby redo logs (SRL).

The end result was positive, here’s what the broker thinks:

DGMGRL> show configuration

Configuration - test

  Protection Mode: MaxPerformance
  Members:
  CDB1  - Primary database
    STDBY - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 23 seconds ago)

The broker likes to bemoan you in case SRLs are missing on either primary and/or standby. So I’m all good to go! But wait a second, when I’m using one of the cool new 12c features to check the failover readiness, something is odd:

DGMGRL> validate database "STDBY"

  Database Role:     Physical standby database
  Primary Database:  CDB1

  Ready for Switchover:  Yes
  Ready for Failover:    Yes (Primary Running)

  Flashback Database Status:
    CDB1:   Off
    STDBY:  Off

  Current Log File Groups Configuration:
    Thread #  Online Redo Log Groups  Standby Redo Log Groups Status
              (CDB1)                  (STDBY)
    1         3                       2                       Insufficient SRLs

  Future Log File Groups Configuration:
    Thread #  Online Redo Log Groups  Standby Redo Log Groups Status
              (STDBY)                 (CDB1)
    1         3                       0                       Insufficient SRLs
    Warning: standby redo logs not configured for thread 1 on CDB1

DGMGRL>

Pardon me? I _do_ have 4 groups of SRLs:

SQL> r
  1* select group#,sequence#,bytes,used,status from v$standby_log

    GROUP#  SEQUENCE#      BYTES       USED STATUS
---------- ---------- ---------- ---------- ----------
         4         27   52428800       2048 ACTIVE
         5         29   52428800      11264 ACTIVE
         6          0   52428800          0 UNASSIGNED
         7          0   52428800          0 UNASSIGNED

And 3 online redo logs:

SQL> select group#,thread#,sequence#,bytes,status from v$log;

    GROUP#    THREAD#  SEQUENCE#      BYTES STATUS
---------- ---------- ---------- ---------- ----------------
         1          1         31   52428800 CURRENT
         2          1         29   52428800 INACTIVE
         3          1         30   52428800 INACTIVE

But the fun fact remains that out of my 4 standby redo logs, only 2 are ever used. @pioro suggested that it’s because there is no need to use any of the other two because of a lack of redo to be applied and that sounds plausible. To validate this I created a test environment on physical hardware with proper storage because my tests on the lab turned out to suffer from IO problems.

The setup is the same.

I created a couple of sessions that start off by creating a 1,000,000 rows table, then delete from it only to roll back after the delete completed. That’s a lot of redo for 50 MB files (again you wouldn’t use 50MB online redo logs in production-this is just a test). My observation remains: only groups 4 and 5 are used. When the redo generation on the primary gets too far ahead, there will be RFS transfers of the archived redo log. Here are a few queries with their output:

PROCESS   STATUS       CLIENT_P GROUP#                                      THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS
--------- ------------ -------- ---------------------------------------- ---------- ---------- ---------- -------------
ARCH      CLOSING      ARCH     5                                                 1        149      83968             0
ARCH      CONNECTED    ARCH     N/A                                               0          0          0             0
ARCH      CLOSING      ARCH     5                                                 1        147      86016             0
ARCH      CLOSING      ARCH     4                                                 1        148      83968             0
MRP0      APPLYING_LOG N/A      N/A                                               1        147      85766            25
RFS       IDLE         UNKNOWN  N/A                                               0          0          0             0
RFS       IDLE         UNKNOWN  N/A                                               0          0          0             0
RFS       RECEIVING    LGWR     3                                                 1        150      85744             0
RFS       IDLE         ARCH     N/A                                               0          0          0             0

Is this a problem? Not as far as I can tell. And maybe I did something incorrectly on my side too. There wasn’t a problem-my archiver process was quick enough to archive the SRLs and as soon as it fell behind it resorted to fetch archived redo logs from the primary. When it uses the FAL process you see line like this in the alert.log:

2014-09-17 02:34:45.164000 -05:00
RFS[4]: Selected log 4 for thread 1 sequence 211 dbid 788205474 branch 857548261
Media Recovery Log +RECO/STDBY/ARCHIVELOG/2014_09_17/thread_1_seq_209.368.858479683
Archived Log entry 141 added for thread 1 sequence 210 ID 0x2efbcba0 dest 1:
2014-09-17 02:34:46.215000 -05:00

When Real-Time-Apply was active (whenever redo transfer rates dropped), I saw this, alternating between group 4 and 5:

2014-09-17 02:35:33.653000 -05:00
Media Recovery Waiting for thread 1 sequence 230 (in transit)
Recovery of Online Redo Log: Thread 1 Group 5 Seq 230 Reading mem 0
  Mem# 0: +DATA/STDBY/ONLINELOG/group_5.380.858076769
  Mem# 1: +RECO/STDBY/ONLINELOG/group_5.342.858076769

So for some reason yet unknown to me there are only ever 2 SRLs in use.

Posted in 12c Release 1, Linux | 3 Comments »

Can you have high redundancy files in a normal redundancy diskgroup?

Posted by Martin Bach on September 11, 2014

One of the perks of teaching classes is that I get to research questions asked. In the last Exadata Administration Class I taught someone asked: can you have your disk groups in Exadata on normal redundancy yet have certain databases use high redundancy? This would be a good interview question …

The answer is yes, which I remembered from researching material on the 11g RAC book but I wanted to prove that it is the case.

Update: I planned a second blog post where I wanted to test the effect but Alex Fatkulin was quicker, and I promise I didn’t see his post when I wrote mine. Otherwise there probably wouldn’t have been one :) In summary, you aren’t really any better protected. The disk group remains at normal redundancy, even with the data files in high. Looking at Alex’s results (and I encourage you to do so) I concur with his summary that although you have a 3rd copy of the extent protecting you from corruption, you don’t have higher resilience.

This is not Exadata specific by the way. When I face a question around ASM and internals my first idea is to use an Internet search engine and look up the ASM work by Luca Canali. This page is almost always of relevance when looking at ASM file structures: https://twiki.cern.ch/twiki/bin/view/PDBService/ASM_Internals.

ASM Templates

I am not aware of many users of ASM who know about ASM templates, although we all make use of them. The templates define the striping and mirroring of files, and are explained in the ASM Admin Guide chapter 5. You are not limited to using the Oracle provided templates, you can create your own as well, which is key to this post. I headed over to the ASM instance on my cluster and created the template:

SQL> alter diskgroup data add template adminclassHigh attributes(high);

Diskgroup altered.

With the template in place I can create a high redundancy tablespace for example in my normal redundancy diskgroup:

SQL> select name, group_number, type from v$asm_diskgroup;

NAME                           GROUP_NUMBER TYPE
------------------------------ ------------ ------
RECO                                      3 NORMAL
DBFS_DG                                   2 NORMAL
DATA                                      1 NORMAL

Unlike what the documentation suggests you do not need to change db_create_file_dest for this to work.

SQL> show parameter db_create_file_dest

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_create_file_dest                  string      +DATA

Let’s put that template to use.

SQL> create tablespace testhigh datafile '+data(adminclassHigh)' size 10m;

Tablespace created.

SQL> select file_name from dba_data_Files;

FILE_NAME
-----------------------------------------------------------------------------------------------
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/system.363.857664247
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/sysaux.364.857664247
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/users.365.857664253
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/testhigh.357.857880753

Note that I specify the newly created template in () in the datafile clause. The tablespace has been created, next I need the file number and the incarnation to look up how the extents are distributed.

SQL> select DISK_KFFXP,AU_KFFXP,PXN_KFFXP,XNUM_KFFXP,LXN_KFFXP 
  2   from x$kffxp  where number_kffxp = 357 and incarn_kffxp = 857880753;

DISK_KFFXP   AU_KFFXP  PXN_KFFXP XNUM_KFFXP  LXN_KFFXP
---------- ---------- ---------- ---------- ----------
         4      73706          0          0          0
        23      73767          1          0          1
        24      28895          2          0          2
         1      73679          3          1          0
        27      30015          4          1          1
        22      73717          5          1          2
        21      73725          6          2          0
         7      73698          7          2          1
        33      73707          8          2          2

9 rows selected.

What looks a little cryptic can be translated using Luca’s notes as:

  • DISK is the ASM disk where the extent is located (as in V$ASM_DISK)
  • AU indicates the relative position of the allocation unit from the beginning of the disk
  • PXN is the progressive extent number ( = actual extent)
  • XNUM is the ASM file extent number (mirrored extent pairs have the same extent value)
  • LXN indicates the type of extent: 0 = primary extent, 1 = first mirror copy, 2 = second mirror copy

So you can see that each primary extent has 2 mirror copies, also known as high redundancy.

If you omit the template in the datafile clause you get what you’d expect: normal redundancy.

SQL> drop tablespace testhigh including contents and datafiles;

Tablespace dropped.

SQL> create tablespace testhigh datafile '+data' size 10m;

Tablespace created.

SQL>  select file_name from dba_data_Files;

FILE_NAME
-----------------------------------------------------------------------------------------------
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/system.363.857664247
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/sysaux.364.857664247
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/users.365.857664253
+DATA/PRI/02810A3FB7F65856E053D908A8C003E1/DATAFILE/testhigh.357.857880927

And on ASM:

SQL> select DISK_KFFXP,AU_KFFXP,PXN_KFFXP,XNUM_KFFXP,LXN_KFFXP
  2  from x$kffxp  where number_kffxp = 357 and incarn_kffxp = 857880927;

DISK_KFFXP   AU_KFFXP  PXN_KFFXP XNUM_KFFXP  LXN_KFFXP
---------- ---------- ---------- ---------- ----------
        30      73724          0          0          0
        16      73749          1          0          1
         0      73690          2          1          0
        28      73682          3          1          1
        17      73722          4          2          0
        10      73697          5          2          1

6 rows selected.

As you can see each extent has 1 mirror copy, not 2.

Posted in Automatic Storage Management | 1 Comment »

Intra-Database IORM in action

Posted by Martin Bach on September 10, 2014

I have been teaching the Enkitec Exadata Administration Class this week and made an interesting observation I thought was worth sharing with regards to IO Resource Management on Exadata.

I have created a Database Resource Manager (DBRM) Plan that specifically puts a resource consumer group to a disadvantage. Actually, quite severely so but the following shouldn’t be a realistic example in the first place: I wanted to prove a point. Hang-on I hear you say: you created a DBRM plan-the post has IORM in the subject though: what gives? Please allow me to explain.

Exadata offers 3 different ways to implement IORM to the keen engineer:

  • Intra-Database IORM
  • Inter-Database IORM
  • Category IORM

The latter 2 need to be implemented on the cells using the cellcli “alter IORMPLAN” directive, documented in the Storage Server User’s Guide in chapter 6. The first one though is pushed down to the cells when activated on the database, an activity that is visible in the cell alert.log. So when you create and enable a DBRM plan it will automatically become an IORM plan as well.

Setup

The code to create the DBRM plan is quite simple. I create a plan named ENKITEC_DBRM and two new consumer groups:

  • LOWPRIO_GROUP
  • HIGHPRIO_GROUP

HIGHPRIO_GROUP gets 100% of mgmt_p2 where LOWPRIO_GROUP gets 25% of the remaining resources on level mgmt_p3 and has to share them with the OTHER_GROUP as well. Not that you would do that in real life…. There was no change to the cpu_count, which is 24 on an X2 compute node (2s12c24t Westmere Xeon processors).

After the setup was complete I created a SQL script that I’m planning on executing in 50 sessions: 25 in low priority, and 25 in high priority sessions. The script is a simple “select count(*) from reallybigtable” – in this case an 80 GB table with 256 million rows, uncompressed.

Running SQLPLUS concurrently

I wrote a small shell script that is capable of launching a user-configurable number of SQLPLUS sessions against the database and included some timing information to measure the time-to-completion for all sessions. With that at hand I started 25 sessions for the low priority group and 25 for the high priority consumer group. In a third SSH-session I connected to the first cell in the quarter rack (it’s an X2 by the way) and repeatedly executed metric_iorm.pl. This little gem was written for cellserv 11.2.x and lists IORM metrics for each database in the system as well as a summary of IO activity against the cell. I explicitly mentioned cellsv 11.2.x because the new IORM instrumentation for PDBs is missing from the output. So far you can only realistically monitor non-CDBs with it when using RDBMS 12c. The perl script is available from “Tool for Gathering I/O Resource Manager Metrics: metric_iorm.pl (Doc ID 1337265.1)”. Deployed to celladmin’s home I was ready to start the test.

Test result

Executing metric_iorm.pl immediately after initiating the heavy IO load against the cells does not report a lot of useful data. It appears as if the contents of “metriccurrent” isn’t updated straight away but with a slight delay. Note there is no typical resource manager wait event such as “cpu quantum” here:

SQL> select username,event,state,sql_id from v$session where username like '%PRIO';

USERNAME                       EVENT                          STATE               SQL_ID
------------------------------ ------------------------------ ------------------- -------------
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITED KNOWN TIME   6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITED SHORT TIME   6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
HIGHPRIO                       cell smart table scan          WAITING             6q73x4ufb26pz
LOWPRIO                        cell smart table scan          WAITING             6q73x4ufb26pz

50 rows selected.

Almost all of the sessions were happily scanning the table, I am not aware of a way of finding out about IORM throttling on the RDBMS level but that doesn’t mean there is a way of finding out! In the end the sessions of LOWPRIO_GROUP took 320 seconds to scan the table, the sessions in HIGHPRIO_GROUP only needed 199 seconds to finish their scans of the table. So clearly IORM was at work here.

IO Performance

The end result was to be expected, but I wanted to know more. In this case the interesting bit is found on the cell. Please note that I only looked at the first cell in this X2 quarter rack, I didn’t check the combined throughput across all cells. The query I executed wasn’t making any use of Exadata features either, it was a brute force scan across all rows in the table in the absence of a where-clause. I just needed a lot of IO to demonstrate the use of IORM.

When all 50 sessions were active I could observe the following information from metric_iorm.pl for my database:

Database: DBM01
Utilization:     Small=0%    Large=97%
Flash Cache:     IOPS=45702
Disk Throughput: MBPS=0
Small I/O's:     IOPS=0.7    Avg qtime=8.8ms
Large I/O's:     IOPS=1115    Avg qtime=1941ms
        Consumer Group: HIGHPRIO_GROUP
        Utilization:     Small=0%    Large=94%
        Flash Cache:     IOPS=43474
        Disk Throughput: MBPS=0
        Small I/O's:     IOPS=0.0 
        Large I/O's:     IOPS=1064 
        Consumer Group: _ORACLE_BACKGROUND_GROUP_
        Utilization:     Small=0%    Large=0%
        Flash Cache:     IOPS=6.6
        Disk Throughput: MBPS=0
        Small I/O's:     IOPS=0.7
        Large I/O's:     IOPS=0.0
        Consumer Group: LOWPRIO_GROUP
        Utilization:     Small=0%    Large=3%
        Flash Cache:     IOPS=2222
        Disk Throughput: MBPS=52
        Small I/O's:     IOPS=0.0
        Large I/O's:     IOPS=50.4

You can see that IORM is in action: the low priority group is literally starved out, which is correct if you consider the plan I implemented. I have previously put my reallybigtable to good use and ended up with a lot of it in flash cache (thanks to cellsrv 11.2.3.3.x+ you benefit from flash cache during smart scans without having to pin the segment to it). Interestingly the disk throughput is 0 MB per second for the high priority group (which looks quite wrong to me, I need to debug the script). The low priority consumer group apparently makes use of spinning disk, although the combined throughput is lower.

After the 25 sessions from the high priority resource consumer group finished, it looked better for the sessions in the low priority group:

Database: DBM01
Utilization:     Small=0%    Large=88%
Flash Cache:     IOPS=44351
Disk Throughput: MBPS=0
Small I/O's:     IOPS=3.2    Avg qtime=0.0ms
Large I/O's:     IOPS=1080    Avg qtime=830ms
        Consumer Group: _ORACLE_BACKGROUND_GROUP_
        Utilization:     Small=0%    Large=0%
        Flash Cache:     IOPS=8.4
        Disk Throughput: MBPS=0
        Small I/O's:     IOPS=3.2  
        Large I/O's:     IOPS=0.0
        Consumer Group: LOWPRIO_GROUP
        Utilization:     Small=0%    Large=88%
        Flash Cache:     IOPS=44342
        Disk Throughput: MBPS=0
        Small I/O's:     IOPS=0.0
        Large I/O's:     IOPS=1080

You can no longer see that there are other sessions than background and LOWPRIO_GROUP active. The number of IOPS from Flash Cache went up significantly to the extent that disk isn’t reported to be used anymore. I need to verify that-it doesn’t look quite right to me either.

Summary

IO Resource Manager sadly is a much underused tool and few Exadata users seem to know it is active as soon as you implement a DBRM plan (the objective parameter notwithstanding). I am hoping that this post shows that using DBRM/IORM on Exadata is worth investigating time and effort to as it is a key enabler for database consolidation and more fancy topics such as DBaaS. And you don’t get something comparable on anything else that runs the Oracle software.

Posted in 12c Release 1, Exadata | Tagged: | Leave a Comment »

A brief history of time^H^H Oracle session statistics

Posted by Martin Bach on August 6, 2014

I didn’t intend to write another blog post yesterday evening at all, but found something that was worth sharing and got me excited… And when I started writing I intended it to be a short post, too.

If you have been digging around Oracle session performance counters a little you undoubtedly noticed how their number has increased with every release, and even with every patch set. Unfortunately I don’t have a 11.1 system (or earlier) at my disposal to test, but here is a comparison of how Oracle has instrumented the database. I have already ditched my 12.1.0.1 system as well, so no comparison there either :( This is Oracle on Linux.

The script

In the following examples I am going to use a simple query to list the session statistics by their class. The decode statement is based on the official documentation set. There you find the definition of v$statname plus an explanation of the meaning of the class-column. Here is the script:

with stats as (
        select name, decode(class,
                1, 'USER',
                2, 'REDO',
                4, 'ENQUEUE',
                8, 'CACHE',
                16, 'OS',
                32, 'RAC',
                64, 'SQL',
                128, 'DEBUG',
                'NA'
        ) as decoded_class from v$statname
)
select count(decoded_class), decoded_class
 from stats
 group by rollup(decoded_class)
 order by 1
/

Oracle 11.2.0.3

11.2.0.3 is probably the most common 11g Release 2 version currently out there in the field. Or at least that’s my observation. According to MOS Doc ID 742060.1 11.2.0.3 was released on 23 September 2011 (is that really that long ago?) and already out of error correction support by the way.

Executing the above-mentioned script gives me the following result:

COUNT(DECODED_CLASS) DECODED
-------------------- -------
                   9 ENQUEUE
                  16 OS
                  25 RAC
                  32 REDO
                  47 NA
                  93 SQL
                 107 USER
                 121 CACHE
                 188 DEBUG
                 638

So there are 638 of these counters. Let’s move on to 11.2.0.4

Oracle 11.2.0.4

Oracle 11.2.0.4 is interesting as it has been released after 12.1.0.1. It is the terminal release for Oracle 11.2, and you should consider migrating to it as it is in error correction support. The patch set came out on 28 August 2013. What about the session statistics?

COUNT(DECODED_CLASS) DECODED
-------------------- -------
                   9 ENQUEUE
                  16 OS
                  25 RAC
                  34 REDO
                  48 NA
                  96 SQL
                 117 USER
                 127 CACHE
                 207 DEBUG
                 679

A few more, all within what can be expected.

Oracle 12.1.0.2

Oracle 12.1.0.2 is fresh off the press, released just a few weeks ago. Unsurprisingly the number of session statistics has been increased again. What did surprise me was the number of statistics now available for every session! Have a look at this:

COUNT(DECODED_CLASS) DECODED
-------------------- -------
                   9 ENQUEUE
                  16 OS
                  35 RAC
                  68 REDO
                  74 NA
                 130 SQL
                 130 USER
                 151 CACHE
                 565 DEBUG
                1178

That’s nearly double what you found for 11.2.0.3. Incredible, and hence this post. Comparing 11.2.0.4 with 12.1.0.2 you will notice the:

  • same number of enqueue stats
  • same number of OS stats
  • 10 additional RAC stats
  • twice the number of REDO related statistics
  • quite a few more not classified (26)
  • 34 more sql related
  • 13 more in the user-class
  • 24 additional stats in the cache-class
  • and a whopping 298 (!) in the debug class

The debug class (128) shows lots of statistics (including spare ones) for the in-memory option (IM):

SQL> select count(1), class from v$statname where name like 'IM%' group by class;

  COUNT(1)      CLASS
---------- ----------
       211        128

Happy troubleshooting! Reminds me to look into the IM-option in more detail.

Posted in 11g Release 2, 12c Release 1 | Tagged: , , | 2 Comments »

A first look at RAC 12c (part I)

Posted by Martin Bach on August 4, 2014

I have recently upgraded my RAC 12.1.0.1.3 system to RAC 12.1.0.2 including the RDBMS installation. Currently I am updating my skills with information relevant to what I would normally have called 12c Release 2 (so that would also answer the question: when is 12c Release 2 coming out?). Then I realised I haven’t posted a first look at RAC 12c post yet-so here it comes.

There are a few things that aren’t specifically mentioned in the new features guide that caught my eye. First of all, RAC 12 does a few really cool things. Have a look at the srvctl command output:

[oracle@rac12node1 ~]$ srvctl
Usage: srvctl <command> <object> [<options>]
    commands: enable|disable|export|import|start|stop|relocate|status|add|remove|modify|getenv|setenv|unsetenv|
       config|convert|update|upgrade|downgrade|predict
    objects: database|instance|service|nodeapps|vip|network|asm|diskgroup|listener|srvpool|server|scan|scan_listener|
        oc4j|home|filesystem|gns|cvu|havip|exportfs|rhpserver|rhpclient|mgmtdb|mgmtlsnr|volume|mountfs
For detailed help on each command and object and its options use:
  srvctl <command> -help [-compatible] or
  srvctl <command> <object> -help [-compatible]
[oracle@rac12node1 ~]$

Quite a few more than with 11.2.0.3:

[oracle@rac112node1 ~]$ srvctl
Usage: srvctl <command> <object> [<options>]
 commands: enable|disable|start|stop|relocate|status|add|remove|modify|getenv|setenv|unsetenv|config|convert|upgrade
 objects: database|instance|service|nodeapps|vip|network|asm|diskgroup|listener|srvpool|server|scan|scan_listener|oc4j|home|filesystem|gns|cvu
For detailed help on each command and object and its options use:
 srvctl <command> -h or
 srvctl <command> <object> -h

I will detail the meaning of some of these later in this post or another one to follow.

Evaluation and Prediction

When you are working with policy managed databases RAC 12c already gave you a “what if” option in form of the -eval flag. If for example you wanted to grow your server pool from 2 to 3 nodes:

[oracle@rac12node1 ~]$ srvctl modify srvpool -serverpool pool1 -max 3 -eval -verbose
Database two will be started on node rac12node3
Server rac12node3 will be moved from pool Free to pool pool1
[oracle@rac12node1 ~]$

Now you will be able to predict a resource failure as well:

[oracle@rac12node1 ~]$ srvctl predict -h

The SRVCTL predict command evaluates the consequences of resource failure.

Usage: srvctl predict database -db <database_name> [-verbose]
Usage: srvctl predict service -db <database_name> -service <service_name> [-verbose]
Usage: srvctl predict asm [-node <node_name>] [-verbose]
Usage: srvctl predict diskgroup -diskgroup <diskgroup_name> [-verbose]
Usage: srvctl predict filesystem -device <volume_device> [-verbose]
Usage: srvctl predict vip -vip <vip_name> [-verbose]
Usage: srvctl predict network [-netnum <network_number>] [-verbose]
Usage: srvctl predict listener -listener <listener_name> [-verbose]
Usage: srvctl predict scan -scannumber <scan_ordinal_number> [-netnum <network_number>] [-verbose]
Usage: srvctl predict scan_listener -scannumber <scan_ordinal_number> [-netnum <network_number>] [-verbose]
Usage: srvctl predict oc4j [-verbose]

So what would happen if a disk group failed?

[oracle@rac12node1 ~]$ srvctl predict diskgroup -diskgroup DATA -verbose
Resource ora.DATA.dg will be stopped
Resource ora.DATA.ORAHOMEVOL.advm will be stopped
[oracle@rac12node1 ~]$

What it doesn’t do at this stage seems to be an assessment of cascading further problems. If +DATA went down, it would pretty much drag the whole cluster with it, too.

Status

Interestingly you can see a lot more detail with 12.1.0.2 than previously. Here is an example of a policy-managed RAC One Node database:

[oracle@rac12node1 ~]$ srvctl config database -d RONNCDB
Database unique name: RONNCDB
Database name: RONNCDB
Oracle home: /u01/app/oracle/product/12.1.0.2/dbhome_1
Oracle user: oracle
Spfile: +DATA/RONNCDB/PARAMETERFILE/spfile.319.854718651
Password file: +DATA/RONNCDB/PASSWORD/pwdronncdb.290.854718263
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: ronpool1
Disk Groups: RECO,DATA
Mount point paths:
Services: NCDB
Type: RACOneNode
Online relocation timeout: 30
Instance name prefix: RONNCDB
Candidate servers:
OSDBA group: dba
OSOPER group:
Database instances:
Database is policy managed

Did you spot the OSDBA and OSOPER group mappings in the output? DBCA by default creates the password file and server parameter file into ASM since 12.1.0.1.

You can get a lot more status information in 12.1.0.2 then previously, especially when compared to 11.2:

[oracle@rac12node1 ~]$ srvctl status -h

The SRVCTL status command displays the current state of the object.

Usage: srvctl status database {-db <db_unique_name> [-serverpool <serverpool_name>] | -serverpool <serverpool_name> | -thisversion | -thishome} [-force] [-verbose]
Usage: srvctl status instance -db <db_unique_name> {-node <node_name> | -instance <inst_name_list>} [-force] [-verbose]
Usage: srvctl status service {-db <db_unique_name> [-service  "<service_name_list>"] | -serverpool <serverpool_name> [-db <db_unique_name>]} [-force] [-verbose]
Usage: srvctl status nodeapps [-node <node_name>]
Usage: srvctl status vip {-node <node_name> | -vip <vip_name>} [-verbose]
Usage: srvctl status listener [-listener <lsnr_name>] [-node <node_name>] [-verbose]
Usage: srvctl status asm [-proxy] [-node <node_name>] [-detail] [-verbose]
Usage: srvctl status scan [[-netnum <network_number>] [-scannumber <scan_ordinal_number>] | -all] [-verbose]
Usage: srvctl status scan_listener [[-netnum <network_number>] [-scannumber <scan_ordinal_number>] | -all] [-verbose]
Usage: srvctl status srvpool [-serverpool <pool_name>] [-detail]
Usage: srvctl status server -servers "<server_list>" [-detail]
Usage: srvctl status oc4j [-node <node_name>] [-verbose]
Usage: srvctl status rhpserver
Usage: srvctl status rhpclient
Usage: srvctl status home -oraclehome <oracle_home> -statefile <state_file> -node <node_name>
Usage: srvctl status filesystem [-device <volume_device>] [-verbose]
Usage: srvctl status volume [-device <volume_device>] [-volume <volume_name>] [-diskgroup <group_name>] [-node <node_list> | -all]
Usage: srvctl status diskgroup -diskgroup <dg_name> [-node "<node_list>"] [-detail] [-verbose]
Usage: srvctl status cvu [-node <node_name>]
Usage: srvctl status gns [-node <node_name>] [-verbose]
Usage: srvctl status mgmtdb [-verbose]
Usage: srvctl status mgmtlsnr [-verbose]
Usage: srvctl status exportfs [-name <expfs_name> |-id <havip id>]
Usage: srvctl status havip [-id <id>]
Usage: srvctl status mountfs -name <mountfs_name>
For detailed help on each command and object and its options use:
  srvctl <command> <object> -help [-compatible]

RAC 12.1.0.2 adds a nifty few little flags: thisversion and thishome to srvctl status database. That works really well where you have multiple versions of Oracle on the same machine (think consolidation):

[oracle@rac12node1 ~]$ srvctl status database -thisversion
Database unique name: RONNCDB
Instance RONNCDB_1 is running on node rac12node4
Online relocation: INACTIVE

Database unique name: TWO
Instance TWO_1 is running on node rac12node1
Instance TWO_2 is running on node rac12node2

Verbose!

Some commands are actually more verbose when you specify the -verbose flag:

[oracle@rac12node1 ~]$ srvctl status database -d RONNCDB -verbose
Instance RONNCDB_1 is running on node rac12node4 with online services NCDB. Instance status: Open.
Online relocation: INACTIVE
[oracle@rac12node1 ~]$ srvctl status database -d RONNCDB
Instance RONNCDB_1 is running on node rac12node4
Online relocation: INACTIVE
[oracle@rac12node1 ~]$

But that’s not new in 12.1.0.2 I believe.

Interesting changes for database logging

The database itself will also tell you more about memory allocation:

**********************************************************************
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)
 Per process system memlock (soft) limit = 64K
Thu Jul 31 13:34:58 2014
 Expected per process system memlock (soft) limit to lock
 SHARED GLOBAL AREA (SGA) into memory: 1538M
Thu Jul 31 13:34:58 2014
 Available system pagesizes:
  4K, 2048K
 Supported system pagesize(s):
  PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)
        4K       Configured               3          393219        NONE
     2048K                0             769               0        NONE

RECOMMENDATION:
 1. For optimal performance, configure system with expected number
 of pages for every supported system pagesize prior to the next
 instance restart operation.
 2. Increase per process memlock (soft) limit to at least 1538MB
 to lock 100% of SHARED GLOBAL AREA (SGA) pages into physical memory

As you can see I am not using large pages here at all, which I did for demonstration purposes only. I don’t see any reason not to use large pages on a 64bit system these days. I’m curious to see whether the AIX port supports all the AIX page sizes here.

End of part I

This has already turned into a longer post than I expected it to be when I started writing. I think I’ll continue the series in a couple of weeks when I’m finding the time.

Posted in 12c Release 1, RAC | Leave a Comment »

How to get insights into the Linux Kernel

Posted by Martin Bach on August 1, 2014

This is probably as much a note-to-self as it can possibly be. Recently I have enjoyed some more in-depth research about how the Linux kernel works. To that extent I started fairly low-level. Theoretically speaking, you need to understand the hardware-software interface first before you can understand the upper levels. But in practice you get by with less knowledge. But if you are truly interested in how computers work you might want to consider reading up on some background. Some very knowledgable people I deeply respect have recommended books by David A. Patterson and John L. Hennessy. I have these two:

  • Computer Organization and Design, Fifth Edition: The Hardware/Software Interface
  • Computer Architecture, Fifth Edition: A Quantitative Approach

I think I found a few references to the above books in James Morle’s recent blog article about the true cost of licensing the in-memory database option and he definitely refers to the second book in his Sane SAN paper. I complemented these books with The Linux Programming Interface: A Linux and UNIX System Programming Handbook to get an overview of the Linux API. Oh and Linux from Scratch is a great resource too!

The Foundation is set

But now-what next? The Linux kernel evolves rather quickly, and don’t be fooled by version numbers. The “enterprise” kernels keep a rather conservative, static version number scheme. Remember 2.6.18? The kernel with RHEL 5.10 has little in common with the one released years and years ago with RHEL 5.0. SuSE seems to be more aggressive, naming kernels differently. A good discussion of the pros and cons for that approach can be found on LWN: http://lwn.net/Articles/486304/ Long story short: the Linux kernel developers keep pushing the limits with the “upstream” or “vanilla” kernel. You can follow the development on the LKML or Linux Kernel Mailing List. But that list is busy… The distribution vendors in turn take a stable version of the kernel and add features they need. That includes back-porting as well, which is why it’s so hard to see what’s going on with a kernel internally. But there are exceptions.

The inner workings

Apologies to all SuSE and Red Hat geeks: I haven’t been able to find a web-repository for the kernel code! If you know of one and have the URL, let me know and I’ll add it here. I don’t want to sound biased but it simply happens to be that I know Oracle Linux best.

Now to really dive into the internals and implementation you need to look at the source code. When browsing the code it helps to understand the C-programming language. And maybe some Assembler. I would love to know more about Assembler than I do but I don’t believe it’s strictly speaking necessary.

Oracle publishes the kernel code at the GIT repositories on  oss.oracle.com:

Oracle also provides patches for Red Hat kernels in project Red Patch. If I understand things correctly then Red Hat provides changes to the kernel in a massive tarball with the patches already applied. Previously it appears to have shipped the kernel + patches, which caused some controversy.

The Linux Cross Reference gives you insights into the upstream kernel.

NB: Kernel documentation can be found in the Documentation subdirectory. This is very useful stuff!

Now why would you want to do this?

My use case! I wanted to find out if/how I could do NFS over RDMA. When in doubt, use an Internet search engine and common sense. In this case: use the kernel documentation and sure enough, NFS-RDMA seems possible.

https://www.kernel.org/doc/Documentation/filesystems/nfs/nfs-rdma.txt

The link suggests a few module names and pre-requisites on enabling NFA-RDMA. The nfs-utils package must be version 1.1.2 or later, and the kernel NFS server must be built with RDMA support. Using the kernel source RPM you can check the options being used for compiling the kernel. Normally you’d use make menuconfig or an equivalent to enable/disable options or to build them as modules (refer to the excellent Linux From Scratch). Except that you don’t do that with the enterprise distributions of course. Building kernels for fun is off limits on these. If you have a problem with the Linux kernel (like a buggy kernel module), your vendor provides the fix, not the Linux engineer. But I digress… Each subtree in the kernel has a Kconfig file that lists the configuration option and meaning.

For the purpose of NFS-RDMA Infiniband support must be enabled (no brainer), but also IPoIB and then the RDMA support for NFS (“sunrpc”).

Back to the source RPM: it installs a file called .config in /usr/src/kernels/nameAndVersion/ listing all the build options. Grepping for RDMA in the file shows the following for UEK 3:

[root@rac12node1 3.8.13-35.3.3.el6uek.x86_64]# grep -i rdma .config
CONFIG_RDS_RDMA=m
CONFIG_NET_9P_RDMA=m
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
# CONFIG_INFINIBAND_OCRDMA is not set
CONFIG_SUNRPC_XPRT_RDMA_CLIENT=m
# CONFIG_SUNRPC_XPRT_RDMA_CLIENT_ALLPHYSICAL is not set
CONFIG_SUNRPC_XPRT_RDMA_SERVER=m

And here is the same for UEK 2:

[root@server1 2.6.39-400.17.1.el6uek.x86_64]# grep -i rdma .config
CONFIG_RDS_RDMA=m
CONFIG_NET_9P_RDMA=m
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
CONFIG_SUNRPC_XPRT_RDMA=m

So that looks promising, the letter “m” stands for “module”. But what do these options mean? The Kconfig file to the rescue again, but I first have to find the correct one. This example is for UEK 2:

[root@server1 2.6.39-400.17.1.el6uek.x86_64]# for file in $(rpm -qil kernel-uek-devel | grep Kconfig );
> do grep -i SUNRPC_XPRT_RDMA $file /dev/null;
> done
/usr/src/kernels/2.6.39-400.17.1.el6uek.x86_64/net/sunrpc/Kconfig:config SUNRPC_XPRT_RDMA

Found you! Notice that I’m adding /dev/null to the grep command to get the file name where grep found a match. Looking at the file just found:

config SUNRPC_XPRT_RDMA
        tristate
        depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS && EXPERIMENTAL
        default SUNRPC && INFINIBAND
        help
          This option allows the NFS client and server to support
          an RDMA-enabled transport.

          To compile RPC client RDMA transport support as a module,
          choose M here: the module will be called xprtrdma.

          If unsure, say N.

All that remained to be done was to check if these other configurationvariables (INFINIBAND, INFINIBAND_ADDR_TRANS etc) were set in the top level .config file and they were.

Posted in Linux | Tagged: , , | 1 Comment »

Upgrading clustered Grid Infrastructure to 12.1.0.2 from 12.1.0.1.3

Posted by Martin Bach on July 29, 2014

Oracle 12.1.0.2 is out, after lots of announcements the product has finally been released. I had just extended my 12.1.0.1.3 cluster to 3 nodes and was about to apply the July PSU when I saw the news. So why not try and upgrade to the brand new thing?

What struck me at first was the list of new features … Oracle’s patching strategy has really changed over time. I remember the days when Oracle didn’t usually add additional features into point releases. Have a look at the new 12.1.0.2 features and that would possibly qualify to be 12c Release 2…

In summary the upgrade process is actually remarkably simple, and hasn’t changed much since earlier versions of the software. Here are the steps in chronological order.

./runInstaller

I don’t know how often I have type ./ruinInstaller instead of runInstaller, but here you go. This is the first wizard screen after splash screen has disappeared.

GI 12.1.0.2-001

Naturally I went for the upgrade of my cluster. Before launching the installer though I made sure that everything was in working order by means of cluvfy. On to the next screen:

GI 002

always install English only. Troubleshooting Oracle in a different language (especially if I don’t speak or understand) is really hard so I avoid it in the first place.

Over to the screen that follows and oops-my SYSDG disk group (containing OCR and voting files) is too small. Bugger. In the end I added 3 new 10GB LUNs and dropped the old ones. But it took me a couple of hours to do so. Worse: it wasn’t even needed, but proved to be a good learning exercise. The requirement to have that much free space is most likely caused by the management repository and related infrastructure.

GI 003 error

Back to this screen everything is in best order, the print screen has been taken just prior to the change to the next. Note the button to skip the updates on unreachable nodes. Not sure if I wanted to do that though.

GI 003

I haven’t got OEM agents on the servers (yet) so I’m skipping the registration for now. You can always do that later.

GI 004

This screen is familiar; I am keeping my choices from the initial installation. Grid Infrastructure is owned by Oracle despite the ASMDBA and ASMADMIN groups by the way.

GI 005

On the screen below you define where on the file system you want to install Grid Infrastructure. Remember that for clustered deployments the ORACLE_HOME cannot be in the path of the ORACLE_BASE. For this to work you have to jump to the command line and create the directory on all servers and grant ownership to the GI owner account (oracle in this case, could be grid as well).

GI 006

Since I like to be in control I don’t allow Oracle to run the root scripts. I didn’t in 12.1.0.1 either:

GI 007

In that screen you notice the familiar checking of requirements.

GI 008

In my case there were only a few new ones shown here. This is a lab server so I don’t plan on using swap, but the kernel parameter “panic_on_oops” is new. I also didn’t set the reverse path filtering which I corrected before continuing. Interestingly the installer points out that there is a change in the asm_diskstring with its implications.

One thing I haven’t recorded here (because I am using Oracle Linux 6.5 with UEK3) is the requirement for using a 2.6.39 kernel – that sounds like UEK2 to me.

Update: my system is Oracle Linux 6.5, not Red Hat. See Sigrid’s comments below: for Red Hat Linux there doesn’t seem to be a similar requirement to use UEK 2, which matches the documentation (Installation guide for Grid Infrastructure/Linux).

Another interesting case was that the kernel_core pattern wasn’t equal on all nodes. Turned out that 2 nodes had the package abrt installed, and the other two didn’t. Once the packages were installed on all nodes, the warning went away.

GI 009

Unfortunately I didn’t take a print screen of the summary in case you wonder where that is. I went straight into the installation phase:

GI 010

At the end of which you are prompted to run the upgrade scripts. Remember to run them in screen and pay attention to the order you run them in.

GI 011

The output from the last node is shown here:

[root@rac12node3 ~]# /u01/app/12.1.0.2/grid/rootupgrade.sh
Performing root user operation.

The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME=  /u01/app/12.1.0.2/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The file "oraenv" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]:
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/12.1.0.2/grid/crs/install/crsconfig_params
2014/07/26 16:15:51 CLSRSC-4015: Performing install or upgrade action for Oracle
Trace File Analyzer (TFA) Collector.

2014/07/26 16:19:58 CLSRSC-4003: Successfully patched Oracle Trace File Analyzer (TFA) Collector.

2014/07/26 16:20:02 CLSRSC-464: Starting retrieval of the cluster configuration data

2014/07/26 16:20:51 CLSRSC-465: Retrieval of the cluster configuration data has successfully completed.

2014/07/26 16:20:51 CLSRSC-363: User ignored prerequisites during installation

ASM configuration upgraded in local node successfully.

2014/07/26 16:21:16 CLSRSC-466: Starting shutdown of the current Oracle Grid Infrastructure stack

2014/07/26 16:22:51 CLSRSC-467: Shutdown of the current Oracle Grid Infrastructure stack has successfully
completed.

OLR initialization - successful
2014/07/26 16:26:53 CLSRSC-329: Replacing Clusterware entries in file 'oracle-ohasd.conf'

CRS-4133: Oracle High Availability Services has been stopped.
CRS-4123: Oracle High Availability Services has been started.
2014/07/26 16:34:34 CLSRSC-343: Successfully started Oracle Clusterware stack

clscfg: EXISTING configuration version 5 detected.
clscfg: version 5 is 12c Release 1.
Successfully taken the backup of node specific configuration in OCR.
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
2014/07/26 16:35:55 CLSRSC-478: Setting Oracle Clusterware active version on the last node to be upgraded

2014/07/26 16:35:55 CLSRSC-482: Running command: '/u01/app/12.1.0.2/grid/bin/crsctl set crs activeversion'

Started to upgrade the Oracle Clusterware. This operation may take a few minutes.
Started to upgrade the CSS.
The CSS was successfully upgraded.
Started to upgrade Oracle ASM.
Started to upgrade the CRS.
The CRS was successfully upgraded.
Successfully upgraded the Oracle Clusterware.
Oracle Clusterware operating version was successfully set to 12.1.0.2.0
2014/07/26 16:38:51 CLSRSC-479: Successfully set Oracle Clusterware active version

2014/07/26 16:39:13 CLSRSC-476: Finishing upgrade of resource types

2014/07/26 16:39:26 CLSRSC-482: Running command: 'upgrade model  -s 12.1.0.1.0 -d 12.1.0.2.0 -p last'

2014/07/26 16:39:26 CLSRSC-477: Successfully completed upgrade of resource types

2014/07/26 16:40:17 CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded

Did you notice that TFA has been added? Trace File Analyzer is another of these cool things to play with, it was available with 11.2.0.4 and as an add-on to 12.1.0.1.

Result!

Back to OUI to complete the upgrade. After which cluvfy performs a final check and I’m done. Prove it worked:

[oracle@rac12node1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Thu Jul 26 17:13:02 2014

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> select banner from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
PL/SQL Release 12.1.0.2.0 - Production
CORE	12.1.0.2.0	Production
TNS for Linux: Version 12.1.0.2.0 - Production
NLSRTL Version 12.1.0.2.0 - Production

SQL>

In another post I’ll detail the upgrade for my databases. I am particularly interested about the unplug/plug way of migrating…

Posted in 12c Release 1, KVM, Linux, RAC | Tagged: , | 6 Comments »

Implications of threaded_execution = true in 12c

Posted by Martin Bach on July 21, 2014

I had an interesting discussion as part of my latest presentation at the UKOUG RAC CIA & Database Combined SIG. Part of my talk was about the implications of the new threaded execution model in Oracle.

Since “we do not use Windows” (except for gaming) I can’t compare the Windows thread model to the new 12c implementation on UNIX. There are however interesting implications when switching to the new model, some of which I’d like to demonstrate here. First of all, threaded execution is not enabled by default. With 12.1.0.1.3 on top of Oracle Restart you get the either all or a subset of the following background and auxiliary processes for a CDB:

[oracle@london ~]$ ps -eLf | egrep -i "UID|$ORACLE_SID"
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
oracle    6041     1  6041  0    1 08:56 ?        00:00:00 ora_pmon_CDB1
oracle    6043     1  6043  0    1 08:56 ?        00:00:00 ora_psp0_CDB1
oracle    6045     1  6045  2    1 08:56 ?        00:00:08 ora_vktm_CDB1
oracle    6049     1  6049  0    1 08:56 ?        00:00:00 ora_gen0_CDB1
oracle    6051     1  6051  0    1 08:56 ?        00:00:00 ora_mman_CDB1
oracle    6055     1  6055  0    1 08:56 ?        00:00:00 ora_diag_CDB1
oracle    6057     1  6057  0    1 08:56 ?        00:00:00 ora_dbrm_CDB1
oracle    6059     1  6059  0    1 08:56 ?        00:00:00 ora_dia0_CDB1
oracle    6061     1  6061  0    1 08:56 ?        00:00:00 ora_dbw0_CDB1
oracle    6063     1  6063  0    1 08:56 ?        00:00:00 ora_lgwr_CDB1
oracle    6065     1  6065  0    1 08:56 ?        00:00:00 ora_ckpt_CDB1
oracle    6067     1  6067  0    1 08:56 ?        00:00:00 ora_lg00_CDB1
oracle    6069     1  6069  0    1 08:56 ?        00:00:00 ora_lg01_CDB1
oracle    6071     1  6071  0    1 08:56 ?        00:00:00 ora_smon_CDB1
oracle    6073     1  6073  0    1 08:56 ?        00:00:00 ora_reco_CDB1
oracle    6075     1  6075  0    1 08:56 ?        00:00:00 ora_lreg_CDB1
oracle    6077     1  6077  0    1 08:56 ?        00:00:00 ora_rbal_CDB1
oracle    6079     1  6079  0    1 08:56 ?        00:00:00 ora_asmb_CDB1
oracle    6081     1  6081  0    1 08:56 ?        00:00:00 ora_mmon_CDB1
oracle    6083     1  6083  0    1 08:56 ?        00:00:00 ora_mmnl_CDB1
grid      6086     1  6086  0    1 08:56 ?        00:00:00 oracle+ASM_asmb_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    6087     1  6087  0    1 08:56 ?        00:00:00 ora_d000_CDB1
oracle    6089     1  6089  0    1 08:56 ?        00:00:00 ora_s000_CDB1
oracle    6092     1  6092  0    1 08:56 ?        00:00:00 ora_mark_CDB1
oracle    6094     1  6094  0    1 08:56 ?        00:00:00 ora_ocf0_CDB1
oracle    6109     1  6109  0    1 08:56 ?        00:00:00 ora_o000_CDB1
grid      6111     1  6111  0    1 08:56 ?        00:00:00 oracle+ASM_ocf0_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      6113     1  6113  0    1 08:56 ?        00:00:00 oracle+ASM_o000_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    6118     1  6118  0    1 08:56 ?        00:00:00 ora_nss2_CDB1
oracle    6148     1  6148  0    1 08:56 ?        00:00:01 ora_p000_CDB1
oracle    6150     1  6150  0    1 08:56 ?        00:00:02 ora_p001_CDB1
oracle    6154     1  6154  0    1 08:56 ?        00:00:00 ora_o001_CDB1
grid      6156     1  6156  0    1 08:56 ?        00:00:00 oracle+ASM_o001_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    6158     1  6158  0    1 08:56 ?        00:00:00 ora_tmon_CDB1
oracle    6160     1  6160  0    1 08:56 ?        00:00:00 ora_arc0_CDB1
oracle    6162     1  6162  0    1 08:56 ?        00:00:00 ora_arc1_CDB1
oracle    6164     1  6164  0    1 08:56 ?        00:00:00 ora_arc2_CDB1
oracle    6166     1  6166  0    1 08:56 ?        00:00:00 ora_arc3_CDB1
oracle    6169     1  6169  0    1 08:57 ?        00:00:00 ora_tt00_CDB1
oracle    6171     1  6171  0    1 08:57 ?        00:00:00 ora_smco_CDB1
oracle    6176     1  6176  0    1 08:57 ?        00:00:00 ora_aqpc_CDB1
oracle    6180     1  6180  0    1 08:57 ?        00:00:00 ora_p002_CDB1
oracle    6182     1  6182  0    1 08:57 ?        00:00:00 ora_p003_CDB1
oracle    6184     1  6184  0    1 08:57 ?        00:00:00 ora_p004_CDB1
oracle    6186     1  6186  0    1 08:57 ?        00:00:00 ora_p005_CDB1
oracle    6188     1  6188  0    1 08:57 ?        00:00:00 ora_p006_CDB1
oracle    6190     1  6190  0    1 08:57 ?        00:00:00 ora_p007_CDB1
oracle    6192     1  6192  0    1 08:57 ?        00:00:00 ora_qm02_CDB1
oracle    6194     1  6194  0    1 08:57 ?        00:00:00 ora_qm00_CDB1
oracle    6196     1  6196  0    1 08:57 ?        00:00:00 ora_q002_CDB1
oracle    6200     1  6200  0    1 08:57 ?        00:00:00 ora_q004_CDB1
oracle    6202     1  6202  0    1 08:57 ?        00:00:00 ora_q005_CDB1
oracle    6204     1  6204  0    1 08:57 ?        00:00:00 ora_q006_CDB1
oracle    6206     1  6206  0    1 08:57 ?        00:00:00 ora_q007_CDB1
oracle    6208     1  6208  0    1 08:57 ?        00:00:00 ora_q008_CDB1
oracle    6210     1  6210  0    1 08:57 ?        00:00:00 ora_q009_CDB1
oracle    6212     1  6212  0    1 08:57 ?        00:00:00 ora_q00a_CDB1
oracle    6292     1  6292  0    1 08:57 ?        00:00:00 ora_w000_CDB1
oracle    6318     1  6318  0    1 08:57 ?        00:00:00 ora_cjq0_CDB1
oracle    6320     1  6320  0    1 08:57 ?        00:00:00 oracleCDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    6322     1  6322  0    1 08:57 ?        00:00:01 oracleCDB1 (LOCAL=NO)
oracle    6324     1  6324  0    1 08:57 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    6326     1  6326  0    1 08:57 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    6407     1  6407  0    1 08:57 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    6438     1  6438  0    1 08:57 ?        00:00:00 ora_p008_CDB1
oracle    6440     1  6440  0    1 08:57 ?        00:00:00 ora_p009_CDB1
oracle    6442     1  6442  0    1 08:57 ?        00:00:01 oracleCDB1 (LOCAL=NO)
oracle    6490     1  6490  0    1 08:58 ?        00:00:00 ora_p00a_CDB1
oracle    6492     1  6492  0    1 08:58 ?        00:00:00 ora_p00b_CDB1
oracle    6640  5653  6640  0    1 09:01 pts/0    00:00:00 egrep -i UID|CDB1

Quite a few more than in 9i it seems. Most of the background processes are documented in the database reference. As you can see, there are only processes (-> NLWP = 1), no threads. If you wondered about the strange output, bear with me, there is a reason I chose those flags to the ps command.

The change with the threaded model

Let’s enable threaded execution. It’s simple:

SYS@CDB$ROOT> alter system set threaded_execution=true scope=spfile;

System altered.

SYS@CDB$ROOT> shutdown immediate

Database closed.
Database dismounted.
ORACLE instance shut down.
SYS@CDB$ROOT> startup
ORA-01017: invalid username/password; logon denied

There’s the first surprise. Not to worry, it’s just that OS level authentication doesn’t work anymore, but you can continue by providing your username and password:

SYS@CDB$ROOT> conn sys as sysdba
Enter password:
Connected.
SYS@CDB$ROOT> alter database mount;

Database altered.

SYS@CDB$ROOT> alter database open;

Database altered.

SYS@CDB$ROOT>

What does that mean for our processes? First let’s use the old way of checking for Oracle processes using just ps -ef:

[oracle@london ~]$ ps -ef | egrep -i "UID|$ORACLE_SID"
UID        PID  PPID  C STIME TTY          TIME CMD
oracle    7138     1  0 09:04 ?        00:00:00 ora_pmon_CDB1
oracle    7140     1  0 09:04 ?        00:00:00 ora_psp0_CDB1
oracle    7142     1  2 09:04 ?        00:00:05 ora_vktm_CDB1
oracle    7146     1  0 09:04 ?        00:00:01 ora_u004_CDB1
oracle    7152     1  4 09:04 ?        00:00:08 ora_u005_CDB1
oracle    7158     1  0 09:04 ?        00:00:00 ora_dbw0_CDB1
grid      7173     1  0 09:04 ?        00:00:00 oracle+ASM_asmb_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7202     1  0 09:04 ?        00:00:00 oracle+ASM_ocf0_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7226     1  0 09:04 ?        00:00:00 oracle+ASM_o000_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7229     1  0 09:04 ?        00:00:00 oracle+ASM_o001_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7401     1  0 09:06 ?        00:00:00 oracle+ASM_o002_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7407     1  0 09:06 ?        00:00:00 oracleCDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7494     1  0 09:06 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7526     1  0 09:07 ?        00:00:00 oracleCDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7573     1  1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7575     1  3 09:07 ?        00:00:02 oracleCDB1 (LOCAL=NO)
oracle    7578     1  0 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7592     1  0 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7594     1  0 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7596     1  0 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7691  5653  0 09:08 pts/0    00:00:00 egrep -i UID|CDB1
[oracle@london ~]$

The list is considerably shorter, and you will notice some strange process names: u004 and u005. If you squint you see that PMON, PSP0, VKTM and DBW0 are still processes. I am checking for threads (or light weight processes as they are also called) in the next step:

[oracle@london ~]$ ps -eLf | egrep -i "UID|$ORACLE_SID"
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
oracle    7138     1  7138  0    1 09:04 ?        00:00:00 ora_pmon_CDB1
oracle    7140     1  7140  0    1 09:04 ?        00:00:00 ora_psp0_CDB1
oracle    7142     1  7142  2    1 09:04 ?        00:00:11 ora_vktm_CDB1
oracle    7146     1  7146  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7147  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7148  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7149  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7155  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7159  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7160  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7161  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7162  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7163  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7165  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7166  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7167  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7152     1  7152  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7153  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7154  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7156  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7164  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7168  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7169  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7170  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7172  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7174  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7175  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7224  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7227  0   46 09:04 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7249  0   46 09:05 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7361  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7362  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7365  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7366  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7367  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7396  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7397  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7404  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7405  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7416  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7418  0   46 09:06 ?        00:00:01 ora_u005_CDB1
oracle    7152     1  7419  0   46 09:06 ?        00:00:02 ora_u005_CDB1
oracle    7152     1  7420  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7421  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7422  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7423  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7424  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7425  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7426  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7427  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7428  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7430  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7431  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7432  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7433  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7434  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7435  0   46 09:06 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7576  0   46 09:07 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7618  0   46 09:07 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7619  0   46 09:07 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7696  0   46 09:08 ?        00:00:00 ora_u005_CDB1
oracle    7152     1  7697  0   46 09:08 ?        00:00:00 ora_u005_CDB1
oracle    7158     1  7158  0    1 09:04 ?        00:00:00 ora_dbw0_CDB1
grid      7173     1  7173  0    1 09:04 ?        00:00:00 oracle+ASM_asmb_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7226     1  7226  0    1 09:04 ?        00:00:00 oracle+ASM_o000_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7229     1  7229  0    1 09:04 ?        00:00:00 oracle+ASM_o001_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      7401     1  7401  0    1 09:06 ?        00:00:00 oracle+ASM_o002_cdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7494     1  7494  0    1 09:06 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7526     1  7526  0    1 09:07 ?        00:00:00 oracleCDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7573     1  7573  0    1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7575     1  7575  1    1 09:07 ?        00:00:02 oracleCDB1 (LOCAL=NO)
oracle    7578     1  7578  0    1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7592     1  7592  0    1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7594     1  7594  0    1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7596     1  7596  0    1 09:07 ?        00:00:00 oracleCDB1 (LOCAL=NO)
oracle    7810     1  7810  0    1 09:10 ?        00:00:00 oracleCDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    7840  5653  7840  0    1 09:11 pts/0    00:00:00 egrep -i UID|CDB1

This is where it’s interesting. First let’s match this to the database, focusing on the background processes.

SYS@CDB$ROOT> select pname, pid, sosid, spid, stid, execution_type
  2  from v$process where background = 1
  3  order by pname
  4  /

PNAME        PID SOSID                    SPID                     STID                     EXECUTION_
----- ---------- ------------------------ ------------------------ ------------------------ ----------
AQPC          43 7152_7416                7152                     7416                     THREAD
ARC0          35 7152_7362                7152                     7362                     THREAD
ARC1          36 7152_7365                7152                     7365                     THREAD
ARC2          37 7152_7366                7152                     7366                     THREAD
ARC3          38 7152_7367                7152                     7367                     THREAD
ASMB          22 7146_7167                7146                     7167                     THREAD
CJQ0          68 7152_7576                7152                     7576                     THREAD
CKPT          15 7146_7160                7146                     7160                     THREAD
DBRM          11 7146_7155                7146                     7155                     THREAD
DBW0          13 7158                     7158                     7158                     PROCESS
DIA0          12 7152_7156                7152                     7156                     THREAD
DIAG           9 7152_7154                7152                     7154                     THREAD
GEN0           5 7146_7148                7146                     7148                     THREAD
LG00          16 7146_7161                7146                     7161                     THREAD
LG01          18 7146_7163                7146                     7163                     THREAD
LGWR          14 7146_7159                7146                     7159                     THREAD
LREG          20 7146_7165                7146                     7165                     THREAD
MARK          28 7152_7175                7152                     7175                     THREAD
MMAN           7 7146_7149                7146                     7149                     THREAD
MMNL          24 7152_7169                7152                     7169                     THREAD
MMON          23 7152_7168                7152                     7168                     THREAD
NSS2          33 7152_7249                7152                     7249                     THREAD
O002          40 7152_7397                7152                     7397                     THREAD
PMON           2 7138                     7138                     7138                     PROCESS
PSP0           3 7140                     7140                     7140                     PROCESS
Q002          54 7152_7428                7152                     7428                     THREAD
Q004          56 7152_7430                7152                     7430                     THREAD
Q005          57 7152_7431                7152                     7431                     THREAD
Q006          58 7152_7432                7152                     7432                     THREAD
Q007          59 7152_7433                7152                     7433                     THREAD
Q008          60 7152_7434                7152                     7434                     THREAD
Q009          61 7152_7435                7152                     7435                     THREAD
QM00          53 7152_7427                7152                     7427                     THREAD
QM02          44 7152_7426                7152                     7426                     THREAD
RBAL          21 7146_7166                7146                     7166                     THREAD
RECO          19 7152_7164                7152                     7164                     THREAD
SCMN           6 7146_7146                7146                     7146                     THREAD
SCMN          10 7152_7152                7152                     7152                     THREAD
SMCO          41 7152_7404                7152                     7404                     THREAD
SMON          17 7146_7162                7146                     7162                     THREAD
TMON          34 7152_7361                7152                     7361                     THREAD
TT00          39 7152_7396                7152                     7396                     THREAD
VKTM           4 7142                     7142                     7142                     PROCESS
W000          42 7152_7405                7152                     7405                     THREAD

44 rows selected.

Here you clearly see which one of the background processes is a thread, and which isn’t. The threads will have multiple STIDs per SPID, or thread IDs to the process:

SYS@CDB$ROOT> select count(spid),spid,execution_type from v$process where background = 1 group by spid, execution_type;

COUNT(SPID) SPID                     EXECUTION_
----------- ------------------------ ----------
          1 7158                     PROCESS
         12 7146                     THREAD
          1 7138                     PROCESS
          1 7140                     PROCESS
          1 7142                     PROCESS
         29 7152                     THREAD

6 rows selected.

Circling back to the OS level for SPID 7146-can you see the threads there as well? You can:

[oracle@london ~]$ ps -eLf | egrep -i "UID|7146"
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
oracle    7146     1  7146  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7147  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7148  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7149  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7155  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7159  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7160  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7161  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7162  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7163  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7165  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7166  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    7146     1  7167  0   13 09:04 ?        00:00:00 ora_u004_CDB1
oracle    8156  6876  8156  0    1 09:19 pts/2    00:00:00 egrep -i UID|7146
[oracle@london ~]$

But what about user sessions?

Let’s connect to the database (a PDB in this case) as a user.

[oracle@london ~]$ sqlplus martin@localhost/pdb4
...
SQL*Plus: Release 12.1.0.1.0 Production on Fri Jul 18 09:22:39 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Fri Jul 18 2014 09:21:27 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

MARTIN@PDB4> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
280
MARTIN@PDB4>

I can see the session from the CDB root:

SYS@CDB$ROOT> select s.username, s.sid, s.serial#, s.con_id, p.spid, p.sosid, p.stid, p.execution_type
  2  from v$session s, v$process p
  3  where s.sid = 280 and s.paddr = p.addr;

USERNAME          SID    SERIAL#     CON_ID SPID                     SOSID                    STID                     EXECUTION_
---------- ---------- ---------- ---------- ------------------------ ------------------------ ------------------------ ----------
MARTIN            280        407          4 8672                     8672                     8672                     PROCESS

SYS@CDB$ROOT>

Interestingly – as the output shows – user processes seem to remain UNIX processes (there is a twist to that too, bear with me). Remember that this process has been created using a net service and the EZConnect syntax. What about my session that bypasses the net*8 layer? I connected using sqlplus sys as sysdba:

SYS@CDB$ROOT> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
------------------------------------------------------------------------------------------------------------------------------------------------------
249

SYS@CDB$ROOT> select s.username, s.sid, s.serial#, s.con_id, p.spid, p.sosid, p.stid, p.execution_type
  2  from v$session s, v$process p
  3  where s.sid = 249 and s.paddr = p.addr;

USERNAME          SID    SERIAL#     CON_ID SPID                     SOSID                    STID                     EXECUTION_
---------- ---------- ---------- ---------- ------------------------ ------------------------ ------------------------ ----------
SYS               249        685          1 7152                     7152_8696                8696                     THREAD

SYS@CDB$ROOT>

Now that’s a thread. Surprised? Let’s test on a non CDB. First I try to connect by bypassing the listener:

[oracle@london ~]$ sqlplus martin

SQL*Plus: Release 12.1.0.1.0 Production on Fri Jul 18 10:34:58 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Enter password:
Last Successful login time: Fri Jul 18 2014 10:33:25 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

MARTIN@NCDB> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
249

MARTIN@NCDB>
-- in a different session
SYS@NCDB> select s.username, s.sid, s.serial#, s.con_id, p.spid, p.sosid, p.stid, p.execution_type
  2  from v$session s, v$process p
  3  where s.sid = 249 and s.paddr = p.addr;

USERNAME                              SID    SERIAL#     CON_ID SPID                     SOSID                    STID                     EXECUTION_
------------------------------ ---------- ---------- ---------- ------------------------ ------------------------ ------------------------ ----------
MARTIN                                249         13          0 14634                    14634_15337              15337                    THREAD

This session is a thread. Now using the listener:

[oracle@london ~]$ tnsping ncdb

TNS Ping Utility for Linux: Version 12.1.0.1.0 - Production on 18-JUL-2014 10:36:33

Copyright (c) 1997, 2013, Oracle.  All rights reserved.

Used parameter files:

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = london.example.com)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = NCDB)))
OK (0 msec)
[oracle@london ~]$ sqlplus martin@ncdb

SQL*Plus: Release 12.1.0.1.0 Production on Fri Jul 18 10:36:36 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Enter password:
Last Successful login time: Fri Jul 18 2014 10:36:18 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

MARTIN@NCDB> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
249

And what’s the guess?

SYS@NCDB> r
  1  select s.username, s.sid, s.serial#, s.con_id, p.spid, p.sosid, p.stid, p.execution_type
  2  from v$session s, v$process p
  3* where s.sid = 249 and s.paddr = p.addr

USERNAME                              SID    SERIAL#     CON_ID SPID                     SOSID                    STID                     EXECUTION_
------------------------------ ---------- ---------- ---------- ------------------------ ------------------------ ------------------------ ----------
MARTIN                                249         17          0 15507                    15507                    15507                    PROCESS

You were right-a process. But so far I haven’t shown you my listener.ora file. So far in the testing it didn’t use DEDICATED_THROUGH_BROKER_listener = ON. Let’s set this, and reload the listener. Here is the complete example:

[oracle@london ~]$ sqlplus martin@ncdb

SQL*Plus: Release 12.1.0.1.0 Production on Fri Jul 18 10:45:18 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Enter password:
Last Successful login time: Fri Jul 18 2014 10:36:38 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

MARTIN@NCDB> select sys_context('userenv','sid') from dual;

SYS_CONTEXT('USERENV','SID')
--------------------------------------------------------------------------------
14

-- and in another session

SYS@NCDB> select s.username, s.sid, s.serial#, s.con_id, p.spid, p.sosid, p.stid, p.execution_type
  2  from v$session s, v$process p
  3  where s.sid = 14 and s.paddr = p.addr;

USERNAME               SID    SERIAL#     CON_ID SPID                     SOSID                    STID                     EXECUTION_
--------------- ---------- ---------- ---------- ------------------------ ------------------------ ------------------------ ----------
MARTIN                  14         25          0 14634                    14634_16287              16287                    THREAD

The listener parameter DEDICATED_THROUGH_BROKER_<listener_name> allows you to create user sessions as threads. And now to wrap up, how do you kill that session? On my Oracle Linux 6.5 system I killed the LWP:

[oracle@london ~]$ ps -eLf | egrep -i "UID|16287"
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
oracle   14634     1 16287  0   32 10:45 ?        00:00:00 ora_u005_NCDB
oracle   16531 16503 16531  1    1 10:48 pts/4    00:00:00 egrep -i UID|16287
[oracle@london ~]$ kill -9 16287

Which terminates my session:

MARTIN@NCDB> select user from dual
  2  /
select user from dual
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 14634 Thread ID: 16287
Session ID: 14 Serial number: 25

My database stayed up and running. Please do NEVER EVER kill the SPID when threaded_execution is enabled!!! You might bring the database down.

Disclaimer

As with everything on this blog, don’t take it literally-test it! Just because it worked for me on my VM with the specific set of packages does not mean this is applicable for other Linux or even Solaris/AIX!

Posted in 12c Release 1, Linux | Tagged: , | 2 Comments »

 
Follow

Get every new post delivered to your Inbox.

Join 2,368 other followers