If you are migrating to Oracle 12c I hope you might find this post useful. I came across this feature when researching what’s new with Oracle 12c (and yes, I still find lots of new ones I hadn’t noticed before). This one is a bit hidden away in section 2.2.4.3 Automatic Column Group Detection of the 12c New Features Guide. And it’s a lot more complex than I first thought! In this first post I’ll try to show the generation of extended statistics in 12c. I am planning another post to explain how the rest of the adaptive optimisations that are new with 12c fit into the picture.
What is the motivation?
In Oracle versions before 12c you needed to be on the lookout for candidates for correlated columns yourself. Extended statistics on groups of columns allow the optimiser to come up with better cardinality estimates if columns in a table are correlated. Instead of trying to explain the concept in depth myself, I’d like to link to a post written by Maria Colgan. I strongly recommend you have a look at it if you haven’t used extended statistics much.
When putting my post together I referred to the same SH.CUSTOMERS table as Maria. The queries I am using are based on another excellent post by Maria Colgan that builds on the foundations of the one I just referenced. In fact I really needed a data set with correlated columns, but I couldn’t come up with an example that was easy to follow until I found that blog entry. I suggest you have a look at that post first as it explains a nifty tool for helping you find candidates for extended statistics in 11.2. Oracle 12c takes this concept a step further, as you will see.
My test case
My test environment is 12.1.0.2.3 on Exadata (but that shouldn’t matter), and this is how it goes. First I create a table in my schema which is an exact replica of the table in the SH schema.
SQL> create table martin.customers as select * from sh.customers;

Table created.
The next step is to tell Oracle to monitor column usage. I have to accelerate this a bit; Oracle does this over the course of the day anyway.
SQL> exec dbms_stats.seed_col_usage(null,null,300)

PL/SQL procedure successfully completed.
By calling seed_col_usage() without a SQLSET and OWNER I can record column usage for the next 300 seconds. The PL/SQL API documentation reads:
This procedure also records group of columns. Extensions for the recorded group of columns can be created using the CREATE_EXTENDED_STATS Function procedure. If sqlset_name and owner_name are NULL, it records the column (group) usage information for the statements executed in the system in next time_limit seconds.
That sounds like what is needed. To start with a level playing field I gather statistics again using dbms_stats.gather_table_stats(user, 'customers').
Let there be queries
The next step is to run a few queries. Cardinality estimates are not correct: the optimiser does not “know” that L.A. can only be in California in this example. Pay attention to the number of rows returned and the cardinality estimates:
SQL> select count(*) from customers
  2  where cust_city = 'Los Angeles'
  3  and cust_state_province = 'CA'
  4  and country_id = 52790;

  COUNT(*)
----------
       932

Elapsed: 00:00:00.03

SQL> select * from table(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID  ap71092cqnj1y, child number 0
-------------------------------------
select count(*) from customers where cust_city = 'Los Angeles' and
cust_state_province = 'CA' and country_id = 52790

Plan hash value: 296924608

----------------------------------------------------------------------------------------
| Id  | Operation                  | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |           |       |       |   423 (100)|          |
|   1 |  SORT AGGREGATE            |           |     1 |    26 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL| CUSTOMERS |     1 |    26 |   423   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))
       filter(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1 because of parallel threshold

27 rows selected.
Following advice I have been given by people who know a lot more about SQL tuning than I will ever comprehend, I gather execution statistics and then display actual and expected rows:
SQL> select /*+ gather_plan_statistics */ count(*) from customers
  2  where cust_city = 'Los Angeles'
  3  and cust_state_province = 'CA'
  4  and country_id = 52790;

  COUNT(*)
----------
       932

SQL> select * from table(dbms_xplan.display_cursor(format=>'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID  0qmjk7qm3zwkw, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from customers where
cust_city = 'Los Angeles' and cust_state_province = 'CA' and
country_id = 52790

Plan hash value: 296924608

--------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |           |      1 |        |      1 |00:00:00.01 |    1521 |
|   1 |  SORT AGGREGATE            |           |      1 |      1 |      1 |00:00:00.01 |    1521 |
|*  2 |   TABLE ACCESS STORAGE FULL| CUSTOMERS |      1 |      1 |    932 |00:00:00.01 |    1521 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))
       filter(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1 because of parallel threshold
The number of actual rows in plan line 2 (932) is quite different from the single row the optimiser expected. Without extended statistics the optimiser multiplies the selectivities of the three predicates as if the columns were independent, which drives the estimate down to well under one row.
I also executed the next example from Maria’s blog post:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  5h284vjm2xtp9, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ country_id, cust_state_province,
count(*) from customers group by country_id, cust_state_province

Plan hash value: 1577413243

-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |           |      1 |        |    145 |00:00:00.02 |    1521 |       |       |          |
|   1 |  HASH GROUP BY             |           |      1 |   1949 |    145 |00:00:00.02 |    1521 |  1015K|  1015K| 1371K (0)|
|   2 |   TABLE ACCESS STORAGE FULL| CUSTOMERS |      1 |  55500 |  55500 |00:00:00.02 |    1521 |  1025K|  1025K|          |
-----------------------------------------------------------------------------------------------------------------------------

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1 because of parallel threshold
Again cardinality estimates differ in plan line 1: instead of the 1949 rows the optimiser expected, only 145 are returned. This is a simple example, but quite often incorrect cardinality estimates lead to sub-optimal plan generation (although 12c has another new feature that tries to mitigate the effects of these incorrect cardinality estimates as well).
Column Groups
In the meantime the column usage monitor has picked up information based on the “workload”. You can see this in sys.col_group_usage$:
SQL> select u.*, o.object_name, object_type
  2  from sys.col_group_usage$ u, dba_objects o
  3  where u.obj# = o.data_object_id
  4  and owner = 'MARTIN';

      OBJ# COLS                           TIMESTAMP      FLAGS OBJECT_NAME                    OBJECT_TYPE
---------- ------------------------------ --------- ---------- ------------------------------ -----------------------
    111118 11,13                          26-JUN-15         36 CUSTOMERS                      TABLE
    111118 9,11,13                        26-JUN-15         33 CUSTOMERS                      TABLE
This is of course not the intended way to view this information; there’s an API for that.
SQL> select dbms_stats.report_col_usage(user, 'CUSTOMERS') from dual;

DBMS_STATS.REPORT_COL_USAGE(USER,'CUSTOMERS')
----------------------------------------------------------------------------------------------
LEGEND:
.......

EQ         : Used in single table EQuality predicate
RANGE      : Used in single table RANGE predicate
LIKE       : Used in single table LIKE predicate
NULL       : Used in single table is (not) NULL predicate
EQ_JOIN    : Used in EQuality JOIN predicate
NONEQ_JOIN : Used in NON EQuality JOIN predicate
FILTER     : Used in single table FILTER predicate
JOIN       : Used in JOIN predicate
GROUP_BY   : Used in GROUP BY expression
...............................................................................

###############################################################################

COLUMN USAGE REPORT FOR MARTIN.CUSTOMERS
........................................

1. COUNTRY_ID                            : EQ
2. CUST_CITY                             : EQ
3. CUST_STATE_PROVINCE                   : EQ
4. (CUST_CITY, CUST_STATE_PROVINCE,
    COUNTRY_ID)                          : FILTER
5. (CUST_STATE_PROVINCE, COUNTRY_ID)     : GROUP_BY
###############################################################################
So there is a potential candidate for extended stats in line 5. So far, nothing new. In versions before 12c you would now have to create the column groups manually. Again, please refer to Maria’s post for an example.
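For completeness, this is roughly what acting on the report looked like pre-12c: create the extension manually via DBMS_STATS.CREATE_EXTENDED_STATS and then re-gather statistics. A minimal sketch only (the generated SYS_STU… extension name returned by the function will differ on your system):

```sql
-- Create a column group for the correlated columns flagged in the report,
-- then re-gather so the new extension gets statistics
select dbms_stats.create_extended_stats(user, 'CUSTOMERS',
         '(CUST_STATE_PROVINCE, COUNTRY_ID)')
from dual;

exec dbms_stats.gather_table_stats(user, 'CUSTOMERS')
```

The function returns the system-generated name of the virtual column backing the extension; you can also feed the whole report into DBMS_STATS.CREATE_EXTENDED_STATS via the seed/report workflow described in Maria’s post.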
The difference in 12c
In 12c you don’t need to create these column groups manually. Consider this sequence of events:
SQL> select * from dba_stat_extensions where owner = 'MARTIN';

no rows selected

Elapsed: 00:00:00.28

SQL> exec dbms_stats.gather_table_stats(user,'CUSTOMERS')

PL/SQL procedure successfully completed.

SQL> select * from user_stat_extensions;

TABLE_NAME                     EXTENSION_NAME                 EXTENSION                                          CREATO DRO
------------------------------ ------------------------------ -------------------------------------------------- ------ ---
CUSTOMERS                      SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ ("CUST_STATE_PROVINCE","COUNTRY_ID")               USER   YES
CUSTOMERS                      SYS_STUMZ$C3AIHLPBROI#SKA58H_N ("CUST_CITY","CUST_STATE_PROVINCE","COUNTRY_ID")   USER   YES

2 rows selected.

SQL> SELECT column_name,
  2         num_distinct,
  3         num_buckets,
  4         histogram
  5    FROM dba_tab_col_statistics
  6   WHERE table_name = 'CUSTOMERS'
  7     AND owner = 'MARTIN'
  8     AND column_name LIKE 'SYS%';

COLUMN_NAME                    NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------------------------ ------------ ----------- ---------------
SYS_STUMZ$C3AIHLPBROI#SKA58H_N          620         254 HYBRID
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_          145           1 NONE

2 rows selected.
In the above output you can see that there were no extended stats on the CUSTOMERS table before gathering stats. There is nothing special about the stats gathering command; the preferences are the defaults (reformatted here for readability):
SQL> l
  1  declare
  2    type prefs_t is table of varchar2(150);
  3    prefs prefs_t := prefs_t(
  4      'AUTOSTATS_TARGET','CASCADE','CONCURRENT','DEGREE','ESTIMATE_PERCENT',
  5      'METHOD_OPT','NO_INVALIDATE','GRANULARITY','PUBLISH','INCREMENTAL',
  6      'INCREMENTAL_STALENESS','INCREMENTAL_LEVEL','STALE_PERCENT',
  7      'GLOBAL_TEMP_TABLE_STATS','TABLE_CACHED_BLOCKS','OPTIONS');
  8    prefs_value varchar2(100);
  9  begin
 10    for p in prefs.first .. prefs.last loop
 11      select dbms_stats.get_prefs(prefs(p), user,'CUSTOMERS') into prefs_value from dual;
 12      dbms_output.put_line(prefs(p) || ': ' || prefs_value);
 13    end loop;
 14* end;
SQL> /
AUTOSTATS_TARGET: AUTO
CASCADE: DBMS_STATS.AUTO_CASCADE
CONCURRENT: MANUAL
DEGREE: NULL
ESTIMATE_PERCENT: DBMS_STATS.AUTO_SAMPLE_SIZE
METHOD_OPT: FOR ALL COLUMNS SIZE AUTO
NO_INVALIDATE: DBMS_STATS.AUTO_INVALIDATE
GRANULARITY: AUTO
PUBLISH: TRUE
INCREMENTAL: FALSE
INCREMENTAL_STALENESS:
INCREMENTAL_LEVEL: PARTITION
STALE_PERCENT: 10
GLOBAL_TEMP_TABLE_STATS: SESSION
TABLE_CACHED_BLOCKS: 1
OPTIONS: GATHER

PL/SQL procedure successfully completed.
Did you see the extended stats have histograms? There is even one of the new fancy “hybrid” ones. The effect is noticeable:
SQL> select /*+ gather_plan_statistics */ count(*) from customers
  2  where cust_city = 'Los Angeles'
  3  and cust_state_province = 'CA'
  4  and country_id = 52790;

SQL> select * from table(dbms_xplan.display_cursor(format => 'allstats last'));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------
SQL_ID  0qmjk7qm3zwkw, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from customers where
cust_city = 'Los Angeles' and cust_state_province = 'CA' and
country_id = 52790

Plan hash value: 296924608

--------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |           |      1 |        |      1 |00:00:00.01 |    1521 |
|   1 |  SORT AGGREGATE            |           |      1 |      1 |      1 |00:00:00.01 |    1521 |
|*  2 |   TABLE ACCESS STORAGE FULL| CUSTOMERS |      1 |    969 |    932 |00:00:00.01 |    1521 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))
       filter(("CUST_CITY"='Los Angeles' AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790))

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - 1 Sql Plan Directive used for this statement
The cardinality estimate is a lot closer now. What surprised me was the SQL Plan Directive referenced. But this is material for another blog post.
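If you are curious which directive was involved, the directives recorded against a table can be listed from the DBA_SQL_PLAN_DIRECTIVES and DBA_SQL_PLAN_DIR_OBJECTS views. A sketch only; directives are persisted to the SYSAUX tablespace periodically, so flushing them first may be necessary to see recent activity:

```sql
-- Force the in-memory directives to disk, then list those on my tables
SQL> exec dbms_spd.flush_sql_plan_directive

SQL> select d.directive_id, o.object_name, d.type, d.state, d.reason
  2  from dba_sql_plan_directives d, dba_sql_plan_dir_objects o
  3  where d.directive_id = o.directive_id
  4  and o.owner = 'MARTIN';
```

On my system this shows a DYNAMIC_SAMPLING directive against CUSTOMERS, but the details really belong in the follow-up post.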
Responses
Any idea how we turn off this default / automatic behaviour (which may be desirable, just like turning off default method_opt behaviour)?
Is it just the case that if you don’t want it, don’t seed the col usage? Think it must be.
There’s no related preference right?
Anyway… fake/bad example coming up but hey… so Oracle does all this behind the scenes and then I suddenly decide that I didn’t mean for country_id to be a number but I wanted a varchar2.
So I do:
And I get:
And I’m thinking “hey, that’s annoying”
:)
Particularly annoying if your extended stats are not the same in all environments and you run a release script through all non-prod environments and then it fails in prod because of differences in extended stats.
Again.. possibly fake/bad example – everyone runs releases against physical duplicates of prod at some point of the release cycle right?!?!? :)
Yep, that’s a bit painful, especially if you think “hang on, I didn’t put these in…” :P
I am currently researching part 2 that has interesting implications as well, I am hoping to write that up soon.
Martin
Is column usage monitoring enabled by default in 12c or do we still have to enable it explicitly (exec dbms_stats.seed_col_usage(null,null,300);)?
Regards, Markus
Short answer is in MOS Doc ID 1964223.1, and the Database SQL Tuning Guide long answer hopefully in a blog post quite soon :)
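In the meantime, an auto-created column group can be dropped like a manually created one via DBMS_STATS.DROP_EXTENDED_STATS. A sketch only; the extension definition must match what DBA_STAT_EXTENSIONS reports, and note the extension may well come back at the next stats gathering while the underlying directive is still active:

```sql
-- Remove the automatically created column group on CUSTOMERS
exec dbms_stats.drop_extended_stats(user, 'CUSTOMERS', -
       '(CUST_STATE_PROVINCE, COUNTRY_ID)')
```

Locking the table statistics, as Rich mentions below, is the blunter alternative.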
Hi Martin,
I’ve been looking in this too recently – it seems to be causing problems for streams. When the hidden columns are added the table ddl is then different and we get a new version of the MVDD that streams uses – this for some reason seems to ‘break’ propagation (silently) – although errors are present in server trace files.
I was looking for a way to disable it and i assume locking table stats might be the only way – which isn’t that great a solution
Cheers,
Rich