Little things worth knowing: automatic generation of extended statistics in 12c

If you are migrating to Oracle 12c I hope you find this post useful. I came across this feature when researching what’s new with Oracle 12c (and yes, I still find lots of new features I hadn’t noticed before). This one is a bit hidden away in section 2.2.4.3 Automatic Column Group Detection of the 12c New Features Guide. And it’s a lot more complex than I first thought! In this first post I’ll try to show the generation of extended statistics in 12c. I am planning another post to explain how the rest of the adaptive optimisations that are new with 12c fit into the picture.

What is the motivation?

In Oracle versions before 12c you needed to be on the lookout for correlated columns yourself. Extended statistics on groups of columns allow the optimiser to come up with better cardinality estimates when columns in a table are correlated. Instead of trying to explain the concept in depth myself, I’d like to link to a post written by Maria Colgan. I strongly recommend you have a look at that post if you haven’t used extended statistics much.

When putting my post together I referred to the same SH.CUSTOMERS table as Maria. The queries I am using are based on another excellent post by Maria Colgan that builds on the foundations of the one I just referenced. In fact I really needed a data set with correlated columns, but I couldn’t come up with an easy-to-follow example until I found that blog entry. I suggest you have a look at that post first as it explains a nifty tool for helping you find candidates for extended statistics in 11.2. Oracle 12c takes this concept a step further, as you will see.

My test case

My test environment is 12.1.0.2.3 on Exadata (but that shouldn’t matter), and this is how it goes. First I create a table in my schema which is an exact replica of the table in the SH schema.

SQL> create table martin.customers as select * from sh.customers;

Table created.

The next step is to tell Oracle to monitor column usage. I have to accelerate this a bit; left to its own devices, Oracle records this over the course of the day anyway.

SQL> exec dbms_stats.seed_col_usage(null,null,300)

PL/SQL procedure successfully completed.

By calling seed_col_usage() with NULL for the SQLSET and OWNER arguments I ask Oracle to record column usage for the whole system. The PL/SQL API documentation reads:

This procedure also records group of columns. Extensions for the recorded group of columns can be created using the CREATE_EXTENDED_STATS Function procedure. If sqlset_name and owner_name are NULL, it records the column (group) usage information for the statements executed in the system in next time_limit seconds.

That sounds like exactly what is needed. To start with a level playing field I gather statistics again using dbms_stats.gather_table_stats(user, 'customers').
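
For completeness, that call in SQL*Plus looks like this (all preferences at their defaults, as shown further down):

```
SQL> exec dbms_stats.gather_table_stats(user, 'CUSTOMERS')

PL/SQL procedure successfully completed.
```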

Let there be queries

The next step is to run a few queries. The cardinality estimates are not correct: the optimiser does not “know” that L.A. can only be in California in this example. Pay attention to the number of rows returned and the cardinality estimates:

SQL> select count(*) from customers
  2  where cust_city = 'Los Angeles'
  3  and cust_state_province = 'CA'
  4  and country_id = 52790;

  COUNT(*)
----------
       932

Elapsed: 00:00:00.03
SQL> select * from table(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID  ap71092cqnj1y, child number 0
-------------------------------------
select count(*) from customers where cust_city = 'Los Angeles' and
cust_state_province = 'CA' and country_id = 52790

Plan hash value: 296924608

----------------------------------------------------------------------------------------
| Id  | Operation                  | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |           |       |       |   423 (100)|          |
|   1 |  SORT AGGREGATE            |           |     1 |    26 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL| CUSTOMERS |     1 |    26 |   423   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))
       filter(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1 because of parallel threshold

27 rows selected.

Following advice I have been given by people who know a lot more about SQL tuning than I will ever comprehend, I gather execution statistics and then display actual and expected rows:

SQL> select /*+ gather_plan_statistics */ count(*) from customers
  2  where cust_city = 'Los Angeles'
  3  and cust_state_province = 'CA'
  4  and country_id = 52790;

  COUNT(*)
----------
       932

SQL> select * from table(dbms_xplan.display_cursor(format=>'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID  0qmjk7qm3zwkw, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from customers  where
cust_city = 'Los Angeles'  and cust_state_province = 'CA'  and
country_id = 52790

Plan hash value: 296924608

--------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |           |      1 |        |      1 |00:00:00.01 |    1521 |
|   1 |  SORT AGGREGATE            |           |      1 |      1 |      1 |00:00:00.01 |    1521 |
|*  2 |   TABLE ACCESS STORAGE FULL| CUSTOMERS |      1 |      1 |    932 |00:00:00.01 |    1521 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))
       filter(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1 because of parallel threshold

The number of actual rows in plan line 2 is quite different from what the optimiser expected it to be.
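
This is the classic independence assumption at work: without column group statistics the optimiser simply multiplies the selectivities of the individual columns. As a back-of-the-envelope illustration (the per-column NDVs below are assumptions based on the standard SH.CUSTOMERS data set, not taken from the output above):

```
estimated rows = num_rows * 1/NDV(cust_city) * 1/NDV(cust_state_province) * 1/NDV(country_id)
               = 55500    * 1/620            * 1/145                      * 1/19
               ~ 0.03, rounded up to 1
```

With correlated columns the real selectivity is nowhere near that small, hence the gap between E-Rows and A-Rows.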

I also executed the next example from Maria’s blog post:

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------

SQL_ID  5h284vjm2xtp9, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ country_id, cust_state_province,
count(*) from customers group by country_id, cust_state_province

Plan hash value: 1577413243

-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |           |      1 |        |    145 |00:00:00.02 |    1521 |       |       |          |
|   1 |  HASH GROUP BY             |           |      1 |   1949 |    145 |00:00:00.02 |    1521 |  1015K|  1015K| 1371K (0)|
|   2 |   TABLE ACCESS STORAGE FULL| CUSTOMERS |      1 |  55500 |  55500 |00:00:00.02 |    1521 |  1025K|  1025K|          |
-----------------------------------------------------------------------------------------------------------------------------

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1 because of parallel threshold

Again the cardinality estimates differ, this time in plan line 1: instead of the 1949 rows the optimiser expects, only 145 are returned. This is a simple example, but quite often incorrect cardinality estimates lead to sub-optimal plan generation (although 12c has another new feature that tries to mitigate the effects of these incorrect estimates as well).

Column Groups

Meanwhile, the column usage monitor has picked up information based on the “workload”. You can see this in sys.col_group_usage$:

SQL> select u.*, o.object_name, object_type
  2  from sys.col_group_usage$ u, dba_objects o
  3  where u.obj# = o.data_object_id
  4  and owner = 'MARTIN';

      OBJ# COLS                           TIMESTAMP      FLAGS OBJECT_NAME                    OBJECT_TYPE
---------- ------------------------------ --------- ---------- ------------------------------ -----------------------
    111118 11,13                          26-JUN-15         36 CUSTOMERS                      TABLE
    111118 9,11,13                        26-JUN-15         33 CUSTOMERS                      TABLE

This is of course not the supported way to view this information; there’s an API for that.

SQL> select dbms_stats.report_col_usage(user, 'CUSTOMERS') from dual;

DBMS_STATS.REPORT_COL_USAGE(USER,'CUSTOMERS')
----------------------------------------------------------------------------------------------
LEGEND:
.......

EQ         : Used in single table EQuality predicate
RANGE      : Used in single table RANGE predicate
LIKE       : Used in single table LIKE predicate
NULL       : Used in single table is (not) NULL predicate
EQ_JOIN    : Used in EQuality JOIN predicate
NONEQ_JOIN : Used in NON EQuality JOIN predicate
FILTER     : Used in single table FILTER predicate
JOIN       : Used in JOIN predicate
GROUP_BY   : Used in GROUP BY expression
...............................................................................

###############################################################################

COLUMN USAGE REPORT FOR MARTIN.CUSTOMERS
........................................

1. COUNTRY_ID                          : EQ
2. CUST_CITY                           : EQ
3. CUST_STATE_PROVINCE                 : EQ
4. (CUST_CITY, CUST_STATE_PROVINCE,
    COUNTRY_ID)                        : FILTER
5. (CUST_STATE_PROVINCE, COUNTRY_ID)   : GROUP_BY
###############################################################################

So there is a potential candidate for extended stats in line 5. So far, nothing new: in versions before 12c you would now have to create the column groups manually. Again, please refer to Maria’s post for an example.
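
As a sketch of what that manual step looks like, this uses the documented dbms_stats.create_extended_stats function with the candidate from line 5 of the report above (the function returns the generated extension name):

```
SQL> select dbms_stats.create_extended_stats(
  2           user, 'CUSTOMERS', '(CUST_STATE_PROVINCE, COUNTRY_ID)')
  3  from dual;
```

A subsequent call to dbms_stats.gather_table_stats then collects statistics, and possibly a histogram, for the new extension.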

The difference in 12c

In 12c you don’t need to create these column groups manually. Consider this sequence of events:

SQL> select * from dba_stat_extensions where owner = 'MARTIN';

no rows selected

Elapsed: 00:00:00.28

SQL> exec dbms_stats.gather_table_stats(user,'CUSTOMERS')

PL/SQL procedure successfully completed.

SQL> select * from user_stat_extensions;

TABLE_NAME                     EXTENSION_NAME                 EXTENSION                                          CREATO DRO
------------------------------ ------------------------------ -------------------------------------------------- ------ ---
CUSTOMERS                      SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ ("CUST_STATE_PROVINCE","COUNTRY_ID")               USER   YES
CUSTOMERS                      SYS_STUMZ$C3AIHLPBROI#SKA58H_N ("CUST_CITY","CUST_STATE_PROVINCE","COUNTRY_ID")   USER   YES

2 rows selected.

SQL> SELECT column_name,
  2    num_distinct,
  3    num_buckets,
  4    histogram
  5  FROM dba_tab_col_statistics
  6  WHERE table_name = 'CUSTOMERS'
  7  AND owner        = 'MARTIN'
  8  AND column_name LIKE 'SYS%';

COLUMN_NAME                    NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------------------------ ------------ ----------- ---------------
SYS_STUMZ$C3AIHLPBROI#SKA58H_N          620         254 HYBRID
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_          145           1 NONE

2 rows selected.

So in the above output you can see that there were no extended stats on the CUSTOMERS table before gathering stats. There is nothing special about the stats gathering command; the preferences are the defaults (output reformatted here for readability):

SQL> l
  1  declare
  2    type prefs_t is table of varchar2(150);
  3    prefs prefs_t := prefs_t(
  4      'AUTOSTATS_TARGET','CASCADE','CONCURRENT','DEGREE','ESTIMATE_PERCENT',
  5      'METHOD_OPT','NO_INVALIDATE','GRANULARITY','PUBLISH','INCREMENTAL',
  6      'INCREMENTAL_STALENESS','INCREMENTAL_LEVEL','STALE_PERCENT',
  7      'GLOBAL_TEMP_TABLE_STATS','TABLE_CACHED_BLOCKS','OPTIONS');
  8    prefs_value varchar2(100);
  9  begin
 10    for p in prefs.first .. prefs.last loop
 11      select dbms_stats.get_prefs(prefs(p), user,'CUSTOMERS') into prefs_value from dual;
 12      dbms_output.put_line(prefs(p) || ': ' || prefs_value);
 13    end loop;
 14* end;
SQL> /
AUTOSTATS_TARGET:        AUTO
CASCADE:                 DBMS_STATS.AUTO_CASCADE
CONCURRENT:              MANUAL
DEGREE:                  NULL
ESTIMATE_PERCENT:        DBMS_STATS.AUTO_SAMPLE_SIZE
METHOD_OPT:              FOR ALL COLUMNS SIZE AUTO
NO_INVALIDATE:           DBMS_STATS.AUTO_INVALIDATE
GRANULARITY:             AUTO
PUBLISH:                 TRUE
INCREMENTAL:             FALSE
INCREMENTAL_STALENESS:
INCREMENTAL_LEVEL:       PARTITION
STALE_PERCENT:           10
GLOBAL_TEMP_TABLE_STATS: SESSION
TABLE_CACHED_BLOCKS:     1
OPTIONS:                 GATHER

PL/SQL procedure successfully completed.

Did you see the extended stats have histograms? There is even one of the new fancy “hybrid” ones. The effect is noticeable:

SQL> select /*+ gather_plan_statistics */ count(*) from customers
  2   where cust_city = 'Los Angeles'
  3   and cust_state_province = 'CA'
  4   and country_id = 52790;

SQL> select * from table(dbms_xplan.display_cursor(format => 'allstats last'));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------
SQL_ID  0qmjk7qm3zwkw, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from customers  where
cust_city = 'Los Angeles'  and cust_state_province = 'CA'  and
country_id = 52790

Plan hash value: 296924608

--------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |           |      1 |        |      1 |00:00:00.01 |    1521 |
|   1 |  SORT AGGREGATE            |           |      1 |      1 |      1 |00:00:00.01 |    1521 |
|*  2 |   TABLE ACCESS STORAGE FULL| CUSTOMERS |      1 |    969 |    932 |00:00:00.01 |    1521 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))
       filter(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - 1 Sql Plan Directive used for this statement

The cardinality estimate is a lot closer now. What surprised me was the SQL Plan Directive referenced. But this is material for another blog post.

Responses

  1. Any idea how we turn off this default/automatic behaviour (which may be desirable, just like turning off the default method_opt behaviour)?
    Is it just the case that if you don’t want it, don’t seed the col usage? Think it must be.
    There’s no related preference, right?

    Anyway… fake/bad example coming up but hey… so Oracle does all this behind the scenes and then I suddenly decide that I didn’t mean for country_id to be a number but I wanted a varchar2.

    So I do:

    alter table customers modify country_id varchar2(30);
    

    And I get:

    ORA-54033: column to be modified is used in a virtual column expression
    

    And I’m thinking “hey, that’s annoying”
    :)

    1. Particularly annoying if your extended stats are not the same in all environments and you run a release script through all non-prod environments and then it fails in prod because of differences in extended stats.

      Again.. possibly fake/bad example – everyone runs releases against physical duplicates of prod at some point of the release cycle right?!?!? :)

    2. Yep, that’s a bit painful, especially if you think “hang on, I didn’t put these in…” :P

      I am currently researching part 2 that has interesting implications as well, I am hoping to write that up soon.

      Martin

  2. Is column usage monitoring enabled by default in 12c or do we still have to enable it explicitly (exec dbms_stats.seed_col_usage(null,null,300);)?

    Regards, Markus

    1. Short answer is in MOS Doc ID 1964223.1, and the Database SQL Tuning Guide long answer hopefully in a blog post quite soon :)

  3. Hi Martin,
    I’ve been looking into this too recently; it seems to be causing problems for Streams. When the hidden columns are added, the table DDL is then different and we get a new version of the MVDD that Streams uses. This for some reason seems to ‘break’ propagation (silently), although errors are present in server trace files.

    I was looking for a way to disable it and I assume locking table stats might be the only way, which isn’t that great a solution.

    Cheers,
    Rich
