Category Archives: 19c

Generating Table DDL in Oracle Database

Generating table DDL is a common requirement. Unfortunately it’s not quite common enough for me to remember the syntax by heart, so this post serves as a reference to myself on how to do it. Hopefully it saves you a few minutes, too.

I used Oracle SQL Developer Command Line (SQLcl) version 22.1.1.0, build 22.1.1.131.0820, for this post, connecting to an Oracle 19c database running on Linux. The DDL command you are reading about later is not new to version 22.1; that just so happens to be the most current version at the time of writing.

SQLcl can be downloaded from Oracle’s website.

SQLcl 22.1 comes with a lot more cool stuff; Jeff Smith wrote about the details on his blog.

DBMS_METADATA

Table DDL (and a lot of other DDL for that matter) can be generated using DBMS_METADATA. This package has been part of the database for quite some time and is documented in the PL/SQL Packages and Types Guide. In cases where SQLcl is unavailable it’s still a perfectly viable option to combine calls to DBMS_METADATA.set_transform_param() with DBMS_METADATA.get_ddl(). If you are in this position you might find the sections covering these two subprograms in the PL/SQL Packages and Types Guide useful.

Tables 107-23 and 107-25 in the DBMS_METADATA chapter of that guide are key when it comes to understanding the options SQLcl offers. More on that later.
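If you do have to fall back to DBMS_METADATA, a minimal sketch might look like the following. It assumes you are connected as the table owner; the transform parameters shown mirror the SQLcl options discussed later in this post.

set long 1000000 longchunksize 1000000 pagesize 0

BEGIN
    -- apply session-level transforms before fetching the DDL
    dbms_metadata.set_transform_param(dbms_metadata.session_transform, 'SQLTERMINATOR', true);
    dbms_metadata.set_transform_param(dbms_metadata.session_transform, 'SEGMENT_ATTRIBUTES', false);
    dbms_metadata.set_transform_param(dbms_metadata.session_transform, 'STORAGE', false);
END;
/

SELECT dbms_metadata.get_ddl('TABLE', 'ORDERS') FROM dual;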

SQLcl is a LOT easier to use than DBMS_METADATA

SQLcl provides a shortcut to using the package: rather than calling DBMS_METADATA.GET_DDL() yourself you can make use of the ddl command:

SQL> help ddl
DDL
---

DDL generates the code to reconstruct the object listed.  Use the type option
for materialized views. Use the save options to save the DDL to a file.

DDL [<object_name> [<type>] [SAVE <filename>]]

Let’s use the command with the ORDERS table Swingbench provides:

SQL> show user
USER is "SOE"
SQL> ddl orders

  CREATE TABLE "SOE"."ORDERS" 
   (	"ORDER_ID" NUMBER(12,0) CONSTRAINT "ORDER_ORDER_ID_NN" NOT NULL ENABLE, 
	"ORDER_DATE" TIMESTAMP (6) WITH LOCAL TIME ZONE CONSTRAINT "ORDER_DATE_NN" NOT NULL ENABLE, 
	"ORDER_MODE" VARCHAR2(8), 
	"CUSTOMER_ID" NUMBER(12,0) CONSTRAINT "ORDER_CUSTOMER_ID_NN" NOT NULL ENABLE, 
	"ORDER_STATUS" NUMBER(2,0), 
	"ORDER_TOTAL" NUMBER(8,2), 
	"SALES_REP_ID" NUMBER(6,0), 
	"PROMOTION_ID" NUMBER(6,0), 
	"WAREHOUSE_ID" NUMBER(6,0), 
	"DELIVERY_TYPE" VARCHAR2(15), 
	"COST_OF_DELIVERY" NUMBER(6,0), 
	"WAIT_TILL_ALL_AVAILABLE" VARCHAR2(15), 
	"DELIVERY_ADDRESS_ID" NUMBER(12,0), 
	"CUSTOMER_CLASS" VARCHAR2(30), 
	"CARD_ID" NUMBER(12,0), 
	"INVOICE_ADDRESS_ID" NUMBER(12,0), 
	 CONSTRAINT "ORDER_MODE_LOV" CHECK (order_mode in ('direct','online')) DEFERRABLE ENABLE NOVALIDATE, 
	 CONSTRAINT "ORDER_TOTAL_MIN" CHECK (order_total >= 0) DEFERRABLE ENABLE NOVALIDATE, 
	 CONSTRAINT "ORDERS_CUSTOMER_ID_FK" FOREIGN KEY ("CUSTOMER_ID")
	  REFERENCES "SOE"."CUSTOMERS" ("CUSTOMER_ID") ON DELETE SET NULL ENABLE NOVALIDATE
   ) SEGMENT CREATION IMMEDIATE 
  PCTFREE 10 PCTUSED 40 INITRANS 16 MAXTRANS 255 
 NOCOMPRESS LOGGING
  STORAGE(INITIAL 8388608 NEXT 8388608 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "SOE_TBS" ;
  CREATE UNIQUE INDEX "SOE"."ORDER_PK" ON "SOE"."ORDERS" ("ORDER_ID") REVERSE 
  PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS NOLOGGING 
  STORAGE(INITIAL 1048576 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "SOE_TBS" ;
ALTER TABLE "SOE"."ORDERS" ADD CONSTRAINT "ORDER_PK" PRIMARY KEY ("ORDER_ID")
  USING INDEX "SOE"."ORDER_PK"  ENABLE NOVALIDATE;

  CREATE INDEX "SOE"."ORD_CUSTOMER_IX" ON "SOE"."ORDERS" ("CUSTOMER_ID") REVERSE 
  PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS NOLOGGING 
  STORAGE(INITIAL 1048576 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "SOE_TBS" ;

  CREATE INDEX "SOE"."ORD_ORDER_DATE_IX" ON "SOE"."ORDERS" ("ORDER_DATE") REVERSE 
  PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS NOLOGGING 
  STORAGE(INITIAL 1048576 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "SOE_TBS" ;

  CREATE INDEX "SOE"."ORD_SALES_REP_IX" ON "SOE"."ORDERS" ("SALES_REP_ID") REVERSE 
  PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS NOLOGGING 
  STORAGE(INITIAL 1048576 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "SOE_TBS" ;

  CREATE INDEX "SOE"."ORD_WAREHOUSE_IX" ON "SOE"."ORDERS" ("WAREHOUSE_ID", "ORDER_STATUS") 
  PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS NOLOGGING 
  STORAGE(INITIAL 1048576 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "SOE_TBS" ; 

Not only does the ddl command print the table DDL, it also includes the primary key as well as the table’s indexes! If you just want to recreate the table in a different (pluggable) database using the same schema, then you’re done here. Optionally store the DDL in a file and put it into version control.
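Based on the syntax shown in the help output above, saving the generated DDL straight to a file should be possible via the SAVE keyword (the file name below is my own choice):

SQL> ddl orders save orders_ddl.sql

The resulting file can then be added to version control.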

Customising Output

Sometimes, however, you might want to customise the output. That’s possible with SQLcl, too: the set ddl command can be used to that effect:

SQL> help set ddl
SET DDL
  SET DDL [[ PRETTY | SQLTERMINATOR | CONSTRAINTS | REF_CONSTRAINTS |
          CONSTRAINTS_AS_ALTER|OID | SIZE_BYTE_KEYWORD | PARTITIONING |
          SEGMENT_ATTRIBUTES | STORAGE | TABLESPACE | SPECIFICATION |
          BODY | FORCE | INSERT | |INHERIT | RESET] {on|off}
         ] | ON | OFF ]

Table 107-23 in the DBMS_METADATA package documentation describes the meaning of each of these. SQLTERMINATOR, for example, defines whether a SQL terminator such as the semicolon or slash should be added after each statement. It defaults to FALSE in DBMS_METADATA, and to TRUE in SQLcl. The current settings can be viewed using the show ddl command:

SQL> show ddl
STORAGE : ON
INHERIT : ON
EMIT_SCHEMA : ON
SQLTERMINATOR : ON
OID : ON
SPECIFICATION : ON
TABLESPACE : ON
SIZE_BYTE_KEYWORD : ON
PRETTY : ON
REF_CONSTRAINTS : ON
FORCE : ON
PARTITIONING : ON
CONSTRAINTS : ON
INSERT : ON
BODY : ON
CONSTRAINTS_AS_ALTER : ON
SEGMENT_ATTRIBUTES : ON

I don’t need the storage attributes, I don’t want SQLcl to emit the schema, and I don’t need the segment attributes either. Although turning off SEGMENT_ATTRIBUTES disables STORAGE, too, I set this flag explicitly as a note to self.

SQL> set ddl STORAGE off
DDL Option STORAGE off
SQL> set ddl EMIT_SCHEMA off
DDL Option EMIT_SCHEMA off
SQL> set ddl SEGMENT_ATTRIBUTES off
DDL Option SEGMENT_ATTRIBUTES off

With the defaults changed to match my use case the output is reduced quite a bit:

SQL> ddl orders

  CREATE TABLE "ORDERS" 
   (	"ORDER_ID" NUMBER(12,0) CONSTRAINT "ORDER_ORDER_ID_NN" NOT NULL ENABLE, 
	"ORDER_DATE" TIMESTAMP (6) WITH LOCAL TIME ZONE CONSTRAINT "ORDER_DATE_NN" NOT NULL ENABLE, 
	"ORDER_MODE" VARCHAR2(8), 
	"CUSTOMER_ID" NUMBER(12,0) CONSTRAINT "ORDER_CUSTOMER_ID_NN" NOT NULL ENABLE, 
	"ORDER_STATUS" NUMBER(2,0), 
	"ORDER_TOTAL" NUMBER(8,2), 
	"SALES_REP_ID" NUMBER(6,0), 
	"PROMOTION_ID" NUMBER(6,0), 
	"WAREHOUSE_ID" NUMBER(6,0), 
	"DELIVERY_TYPE" VARCHAR2(15), 
	"COST_OF_DELIVERY" NUMBER(6,0), 
	"WAIT_TILL_ALL_AVAILABLE" VARCHAR2(15), 
	"DELIVERY_ADDRESS_ID" NUMBER(12,0), 
	"CUSTOMER_CLASS" VARCHAR2(30), 
	"CARD_ID" NUMBER(12,0), 
	"INVOICE_ADDRESS_ID" NUMBER(12,0), 
	 CONSTRAINT "ORDER_MODE_LOV" CHECK (order_mode in ('direct','online')) DEFERRABLE ENABLE NOVALIDATE, 
	 CONSTRAINT "ORDER_TOTAL_MIN" CHECK (order_total >= 0) DEFERRABLE ENABLE NOVALIDATE, 
	 CONSTRAINT "ORDERS_CUSTOMER_ID_FK" FOREIGN KEY ("CUSTOMER_ID")
	  REFERENCES "CUSTOMERS" ("CUSTOMER_ID") ON DELETE SET NULL ENABLE NOVALIDATE
   ) ;
  CREATE UNIQUE INDEX "ORDER_PK" ON "ORDERS" ("ORDER_ID") REVERSE 
  ;
ALTER TABLE "ORDERS" ADD CONSTRAINT "ORDER_PK" PRIMARY KEY ("ORDER_ID")
  USING INDEX "ORDER_PK"  ENABLE NOVALIDATE;

  CREATE INDEX "ORD_CUSTOMER_IX" ON "ORDERS" ("CUSTOMER_ID") REVERSE 
  ;

  CREATE INDEX "ORD_ORDER_DATE_IX" ON "ORDERS" ("ORDER_DATE") REVERSE 
  ;

  CREATE INDEX "ORD_SALES_REP_IX" ON "ORDERS" ("SALES_REP_ID") REVERSE 
  ;

  CREATE INDEX "ORD_WAREHOUSE_IX" ON "ORDERS" ("WAREHOUSE_ID", "ORDER_STATUS") 
  ;
SQL> 

And this is exactly what I wanted ;) Instead of hard-coding all the specifics of the storage clause and segment management I can run this script on a different system with different tablespaces and other settings. The downside is that I’m now responsible for providing useful defaults.

Summary

SQLcl provides a super convenient shortcut to using DBMS_METADATA. There are many use cases for this feature: from creating DDL for use with version control systems to data migrations, the possibilities are almost endless.


DOAG 2021 gems: DBMS_XPLAN.COMPARE_PLANS

The most excellent #DOAG2021 conference ended last week. I attended quite a few presentations and took lots of notes. I particularly enjoyed Connor McDonald‘s presentation about 25 years of tips and techniques. One of these tips prompted this blog post ;)

It turns out I had only noticed a change to DBMS_XPLAN in passing: its functionality was extended in 19c, allowing you to compare execution plans. Needless to say, I wanted to try DBMS_XPLAN.COMPARE_PLANS in my lab. The 19c Packages and Types documentation defines the call as follows:

DBMS_XPLAN.COMPARE_PLANS(
   reference_plan    IN generic_plan_object,
   compare_plan_list IN plan_object_list,
   type              IN VARCHAR2 := 'TEXT',
   level             IN VARCHAR2 := 'TYPICAL',
   section           IN VARCHAR2 := 'ALL')  
 RETURN CLOB;

The meaning of the first two parameters isn’t immediately obvious, so back to the documentation again. A generic plan object contains a single “SQL plan” whereas a plan object list is an array of these. Different options exist to reference the “SQL plan”: you can grab one from the plan table, the cursor cache and many other sources.

I’m sure there are license considerations to be taken into account here, so be careful which option you choose! Actually this applies to this entire blog (apologies if I have said it before): if you want to follow along, please ensure you are license compliant.

The idea as I see it is for you to pick a reference plan from a supported source and compare it with one or more other plans. OK, I think I have enough to get started.

Running Queries

As always I’m using Swingbench and its Order Entry schema as an example. I specifically went for it as it has enough complexity to create some larger execution plans, yet not so much that the example becomes impossible to follow. I’m running Oracle 19c (19.12.0) Enterprise Edition on Oracle Linux 8.4, by the way.

I managed to come up with what I think is a suitable compromise for this post:

SQL> !cat query.sql
set timing on echo on

SELECT /*+ gather_plan_statistics */
    o.order_id,
    SUM(oi.unit_price * oi.quantity) AS revenue,
    p.category_id,
    o.order_date
FROM
         orders o
    JOIN order_items oi ON ( o.order_id = oi.order_id )
    JOIN products    p ON ( p.product_id = oi.product_id )
WHERE
    o.order_date BETWEEN TIMESTAMP '2007-01-01 13:00:00' AND TIMESTAMP '2007-01-01 13:30:00'
GROUP BY
    o.order_id,
    p.category_id,
    o.order_date
ORDER BY
    o.order_id;
    
set timing off echo off

I’m joining PRODUCTS (a view), ORDERS and ORDER_ITEMS before applying a filter predicate. The filter requires the use of a timestamp literal as ORDERS.ORDER_DATE is a TIMESTAMP(6) WITH LOCAL TIME ZONE.

Let’s execute the query:

@query

...

  45740079       8082           3 01-JAN-07 01.00.00.000000 PM
  45740079       5960          83 01-JAN-07 01.00.00.000000 PM

3552 rows selected.

Elapsed: 00:00:23.37
SQL> 
SQL> set timing off echo off

I was a little surprised about the elapsed time as it seemed a little long. Both tables are partitioned using the hash partitioning scheme offered by oewizard. Here are some other stats worth knowing (one way of obtaining these figures from the data dictionary is sketched right after the list):

  • ORDERS: 45,924,841 rows occupying 5632 MB of disk space (without indices)
  • ORDER_ITEMS: 232,170,100 rows for 16128 MB of disk space (again without indices)
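A minimal sketch of how such figures can be pulled from the data dictionary, assuming optimizer statistics are reasonably current (the segment sizes include table partitions but exclude indexes and LOBs):

SELECT
    t.table_name,
    t.num_rows,
    ROUND(s.bytes / 1024 / 1024) AS size_mb
FROM
    user_tables t
    JOIN (
        -- sum all table/table partition segments per table
        SELECT segment_name, SUM(bytes) AS bytes
        FROM   user_segments
        WHERE  segment_type LIKE 'TABLE%'
        GROUP  BY segment_name
    ) s ON s.segment_name = t.table_name
WHERE
    t.table_name IN ('ORDERS', 'ORDER_ITEMS');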

So let’s try and work out why the query took quite some time to complete. I’m using Tanel Poder’s excellent tpt-oracle scripts for this. Immediately after the statement finishes executing I use x.sql:

SQL> @x
Display execution plan for last statement for this session from library cache...

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  gtr9hy6x492p7, child number 0
-------------------------------------
SELECT /*+ gather_plan_statistics */     o.order_id,
SUM(oi.unit_price * oi.quantity) AS revenue,     p.category_id,
o.order_date FROM          orders o     JOIN order_items oi ON (
o.order_id = oi.order_id )     JOIN products    p ON ( p.product_id =
oi.product_id ) WHERE     o.order_date BETWEEN TIMESTAMP '2007-01-01
13:00:00' AND TIMESTAMP '2007-01-01 13:30:00' GROUP BY     o.order_id,
   p.category_id,     o.order_date ORDER BY     o.order_id

Plan hash value: 1832779287

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                | Starts | E-Rows |E-Bytes| Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                     |      1 |        |       |   198K(100)|       |       |   3552 |00:00:23.30 |     717K|    714K|       |       |          |
|   1 |  SORT GROUP BY                          |                     |      1 |   2328 |   104K|   198K  (1)|       |       |   3552 |00:00:23.30 |     717K|    714K|   372K|   372K|  330K (0)|
|*  2 |   FILTER                                |                     |      1 |        |       |            |       |       |   3648 |00:00:13.75 |     717K|    714K|       |       |          |
|*  3 |    HASH JOIN RIGHT OUTER                |                     |      1 |   2328 |   104K|   198K  (1)|       |       |   3648 |00:00:13.75 |     717K|    714K|  2546K|  2546K| 1622K (0)|
|   4 |     INDEX FAST FULL SCAN                | PRD_DESC_PK         |      1 |   1000 |  4000 |     3   (0)|       |       |   1000 |00:00:00.01 |      17 |      2 |       |       |          |
|*  5 |     HASH JOIN                           |                     |      1 |   2328 | 97776 |   198K  (1)|       |       |   3648 |00:00:11.13 |     717K|    714K|  1995K|  1995K| 1674K (0)|
|   6 |      TABLE ACCESS FULL                  | PRODUCT_INFORMATION |      1 |   1000 |  8000 |     9   (0)|       |       |   1000 |00:00:00.01 |      30 |     10 |       |       |          |
|   7 |      NESTED LOOPS                       |                     |      1 |   2328 | 79152 |   198K  (1)|       |       |   3648 |00:00:11.13 |     717K|    714K|       |       |          |
|   8 |       NESTED LOOPS                      |                     |      1 |   2328 | 79152 |   198K  (1)|       |       |   3648 |00:00:13.73 |     717K|    714K|       |       |          |
|   9 |        PARTITION HASH ALL               |                     |      1 |    463 |  7871 |   196K  (1)|     1 |    32 |    768 |00:00:19.29 |     714K|    714K|       |       |          |
|* 10 |         TABLE ACCESS FULL               | ORDERS              |     32 |    463 |  7871 |   196K  (1)|     1 |    32 |    768 |00:00:19.82 |     714K|    714K|       |       |          |
|* 11 |        INDEX RANGE SCAN                 | ITEM_ORDER_IX       |    768 |      5 |       |     3   (0)|       |       |   3648 |00:00:00.04 |    2322 |     42 |       |       |          |
|  12 |       TABLE ACCESS BY GLOBAL INDEX ROWID| ORDER_ITEMS         |   3648 |      5 |    85 |     5   (0)| ROWID | ROWID |   3648 |00:00:00.03 |     801 |     30 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TIMESTAMP' 2007-01-01 13:30:00.000000000'>=TIMESTAMP' 2007-01-01 13:00:00.000000000')
   3 - access("D"."PRODUCT_ID"="I"."PRODUCT_ID")
   5 - access("I"."PRODUCT_ID"="OI"."PRODUCT_ID")
  10 - filter(("O"."ORDER_DATE"<=TIMESTAMP' 2007-01-01 13:30:00.000000000' AND "O"."ORDER_DATE">=TIMESTAMP' 2007-01-01 13:00:00.000000000'))
  11 - access("O"."ORDER_ID"="OI"."ORDER_ID")


39 rows selected.

Hmmm, that’s a bit strange: instead of a full scan on ORDERS (line 10) I would have expected the use of an index. I know there is one ;) And here is proof:

SQL> @ind order_date
Display indexes where table or index name matches %order_date%...

TABLE_OWNER          TABLE_NAME                     INDEX_NAME                     POS# COLUMN_NAME                    DSC
-------------------- ------------------------------ ------------------------------ ---- ------------------------------ ----
SOE                  ORDERS                         ORD_ORDER_DATE_IX                 1 ORDER_DATE


INDEX_OWNER          TABLE_NAME                     INDEX_NAME                     IDXTYPE    UNIQ STATUS   PART TEMP  H     LFBLKS           NDK   NUM_ROWS       CLUF LAST_ANALYZED       DEGREE VISIBILIT
-------------------- ------------------------------ ------------------------------ ---------- ---- -------- ---- ---- -- ---------- ------------- ---------- ---------- ------------------- ------ ---------
SOE                  ORDERS                         ORD_ORDER_DATE_IX              NORMAL/REV NO   VALID    NO   N     4     148859        632130   46338862   46298397 2021-11-23 07:06:01 1      VISIBLE
SQL> 

Oh hang on a sec: ORDER_DATE does have an index, but it’s a reverse key index. This has a few implications, as explained by Richard Foote; let’s see if a “regular” index makes a difference.

Note that changing the index type might very well cause issues unrelated to this particular query. There is almost certainly a reason why the index was created as a reverse key index, so by “fixing” this issue you can end up introducing another. Or multiple others.

– Lesson learned the hard way after stuff broke

By the way, further executions of the query didn’t change the elapsed time; they occasionally resulted in the creation of an additional child cursor thanks to statistics feedback.

Let’s recreate the index as a non-reverse index:

SQL> alter index ORD_ORDER_DATE_IX rebuild noreverse parallel 8;

Index altered.

SQL> alter index ORD_ORDER_DATE_IX noparallel;

Index altered.

Running the query again gives me a different result:

SQL> @query

...

  45623157       5700         141 01-JAN-07 01.00.00.000000 PM
  45623157       5455         192 01-JAN-07 01.00.00.000000 PM
  45740079       8082           3 01-JAN-07 01.00.00.000000 PM
  45740079       5960          83 01-JAN-07 01.00.00.000000 PM

3552 rows selected.

Elapsed: 00:00:00.06
SQL> 

Right, so there is a difference in elapsed time :) Let’s check if the index was used:

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  gtr9hy6x492p7, child number 1
-------------------------------------
SELECT /*+ gather_plan_statistics */     o.order_id,
SUM(oi.unit_price * oi.quantity) AS revenue,     p.category_id,
o.order_date FROM          orders o     JOIN order_items oi ON (
o.order_id = oi.order_id )     JOIN products    p ON ( p.product_id =
oi.product_id ) WHERE     o.order_date BETWEEN TIMESTAMP '2007-01-01
13:00:00' AND TIMESTAMP '2007-01-01 13:30:00' GROUP BY     o.order_id,
   p.category_id,     o.order_date ORDER BY     o.order_id

Plan hash value: 4255998723

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                        | Name                | Starts | E-Rows |E-Bytes| Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                 |                     |      1 |        |       |  2801 (100)|       |       |   3552 |00:00:00.03 |    3945 |    772 |       |       |          |
|   1 |  SORT GROUP BY                                   |                     |      1 |   2328 |   104K|  2801   (1)|       |       |   3552 |00:00:00.03 |    3945 |    772 |   302K|   302K|  268K (0)|
|*  2 |   FILTER                                         |                     |      1 |        |       |            |       |       |   3648 |00:00:00.01 |    3945 |    772 |       |       |          |
|*  3 |    HASH JOIN RIGHT OUTER                         |                     |      1 |   2328 |   104K|  2800   (1)|       |       |   3648 |00:00:00.01 |    3945 |    772 |  2546K|  2546K|  353K (0)|
|   4 |     INDEX FAST FULL SCAN                         | PRD_DESC_PK         |      1 |   1000 |  4000 |     3   (0)|       |       |   1000 |00:00:00.01 |      17 |      0 |       |       |          |
|*  5 |     HASH JOIN                                    |                     |      1 |   2328 | 97776 |  2797   (1)|       |       |   3648 |00:00:00.01 |    3928 |    772 |  1995K|  1995K|  353K (0)|
|   6 |      TABLE ACCESS FULL                           | PRODUCT_INFORMATION |      1 |   1000 |  8000 |     9   (0)|       |       |   1000 |00:00:00.01 |      30 |      0 |       |       |          |
|   7 |      NESTED LOOPS                                |                     |      1 |   2328 | 79152 |  2788   (1)|       |       |   3648 |00:00:00.01 |    3898 |    772 |       |       |          |
|   8 |       NESTED LOOPS                               |                     |      1 |   2328 | 79152 |  2788   (1)|       |       |   3648 |00:00:00.01 |    3097 |    772 |       |       |          |
|   9 |        TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| ORDERS              |      1 |    463 |  7871 |   472   (0)| ROWID | ROWID |    768 |00:00:00.41 |     775 |    772 |       |       |          |
|* 10 |         INDEX RANGE SCAN                         | ORD_ORDER_DATE_IX   |      1 |    467 |       |     5   (0)|       |       |    768 |00:00:00.01 |       8 |      5 |       |       |          |
|* 11 |        INDEX RANGE SCAN                          | ITEM_ORDER_IX       |    768 |      5 |       |     3   (0)|       |       |   3648 |00:00:00.01 |    2322 |      0 |       |       |          |
|  12 |       TABLE ACCESS BY GLOBAL INDEX ROWID         | ORDER_ITEMS         |   3648 |      5 |    85 |     5   (0)| ROWID | ROWID |   3648 |00:00:00.01 |     801 |      0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TIMESTAMP' 2007-01-01 13:30:00.000000000'>=TIMESTAMP' 2007-01-01 13:00:00.000000000')
   3 - access("D"."PRODUCT_ID"="I"."PRODUCT_ID")
   5 - access("I"."PRODUCT_ID"="OI"."PRODUCT_ID")
  10 - access("O"."ORDER_DATE">=TIMESTAMP' 2007-01-01 13:00:00.000000000' AND "O"."ORDER_DATE"<=TIMESTAMP' 2007-01-01 13:30:00.000000000')
  11 - access("O"."ORDER_ID"="OI"."ORDER_ID")


39 rows selected.

It very much looks like the change from a reverse to a non-reverse key index provided plenty of benefit for this query. If regression testing shows no problems with the other workloads on this system, there is a strong argument for putting this change into production.

SQL Plan Analysis

At this point in time V$SQL shows 3 entries for SQL_ID gtr9hy6x492p7:

  • Plan Hash Value 1832779287: original execution using reverse key index (child cursor number 0)
  • Plan Hash Value 4255998723: using the index rebuilt as non-reverse key (child cursor number 1 and 2)

Child cursor 2 uses the same Plan Hash Value despite having used statistics feedback. In this demo run a second execution against the reverse-key index didn’t trigger statistics feedback, although I have seen it happen during earlier tests.

While it’s fairly obvious to the human eye where the differences are between child cursor 0 and the others, let’s see what Oracle comes up with.

With the information provided by x.sql I can compare plans from the cursor cache. I rewrote the code example from the documentation a little; the end result, however, is the same.

var report clob

BEGIN
    :report := dbms_xplan.compare_plans(
        reference_plan => cursor_cache_object('gtr9hy6x492p7', 0), 
        compare_plan_list => plan_object_list(
            cursor_cache_object('gtr9hy6x492p7', 1), 
            cursor_cache_object('gtr9hy6x492p7', 2)
        ), 
        type => 'TEXT');
END;
/

print report

And here is the report. Tanel’s login.sql does a great job formatting the output by the way:

REPORT
-----------------------------------------------------------------------------------------------------------------

COMPARE PLANS REPORT
---------------------------------------------------------------------------------------------
  Current user           : SOE
  Total number of plans  : 3
  Number of findings     : 6
---------------------------------------------------------------------------------------------

COMPARISON DETAILS
---------------------------------------------------------------------------------------------
 Plan Number            : 1 (Reference Plan)
 Plan Found             : Yes
 Plan Source            : Cursor Cache
 SQL ID                 : gtr9hy6x492p7
 Child Number           : 0
 Plan Database Version  : 19.0.0.0
 Parsing Schema         : "SOE"
 SQL Text               : SELECT /*+ gather_plan_statistics */ o.order_id,
                        SUM(oi.unit_price * oi.quantity) AS revenue,
                        p.category_id, o.order_date FROM orders o JOIN
                        order_items oi ON ( o.order_id = oi.order_id ) JOIN
                        products p ON ( p.product_id = oi.product_id ) WHERE
                        o.order_date BETWEEN TIMESTAMP '2007-01-01 13:00:00'
                        AND TIMESTAMP '2007-01-01 13:30:00' GROUP BY
                        o.order_id, p.category_id, o.order_date ORDER BY
                        o.order_id

Plan
-----------------------------

 Plan Hash Value  : 1832779287

--------------------------------------------------------------------------------------------------------------
| Id   | Operation                                 | Name                | Rows | Bytes  | Cost   | Time     |
--------------------------------------------------------------------------------------------------------------
|    0 | SELECT STATEMENT                          |                     |      |        | 198882 |          |
|    1 |   SORT GROUP BY                           |                     | 2328 | 107088 | 198882 | 00:00:08 |
|  * 2 |    FILTER                                 |                     |      |        |        |          |
|  * 3 |     HASH JOIN RIGHT OUTER                 |                     | 2328 | 107088 | 198881 | 00:00:08 |
|    4 |      INDEX FAST FULL SCAN                 | PRD_DESC_PK         | 1000 |   4000 |      3 | 00:00:01 |
|  * 5 |      HASH JOIN                            |                     | 2328 |  97776 | 198877 | 00:00:08 |
|    6 |       TABLE ACCESS FULL                   | PRODUCT_INFORMATION | 1000 |   8000 |      9 | 00:00:01 |
|    7 |       NESTED LOOPS                        |                     | 2328 |  79152 | 198868 | 00:00:08 |
|    8 |        NESTED LOOPS                       |                     | 2328 |  79152 | 198868 | 00:00:08 |
|    9 |         PARTITION HASH ALL                |                     |  463 |   7871 | 196553 | 00:00:08 |
| * 10 |          TABLE ACCESS FULL                | ORDERS              |  463 |   7871 | 196553 | 00:00:08 |
| * 11 |         INDEX RANGE SCAN                  | ITEM_ORDER_IX       |    5 |        |      3 | 00:00:01 |
|   12 |        TABLE ACCESS BY GLOBAL INDEX ROWID | ORDER_ITEMS         |    5 |     85 |      5 | 00:00:01 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 2 - filter(TIMESTAMP' 2007-01-01 13:30:00.000000000'>=TIMESTAMP' 2007-01-01 13:00:00.000000000')
* 3 - access("D"."PRODUCT_ID"="I"."PRODUCT_ID")
* 5 - access("I"."PRODUCT_ID"="OI"."PRODUCT_ID")
* 10 - filter(("O"."ORDER_DATE"<=TIMESTAMP' 2007-01-01 13:30:00.000000000' AND "O"."ORDER_DATE">=TIMESTAMP' 2007-01-01 13:00:00.000000000'))
* 11 - access("O"."ORDER_ID"="OI"."ORDER_ID")

---------------------------------------------------------------------------------------------
 Plan Number            : 2
 Plan Found             : Yes
 Plan Source            : Cursor Cache
 SQL ID                 : gtr9hy6x492p7
 Child Number           : 1
 Plan Database Version  : 19.0.0.0
 Parsing Schema         : "SOE"
 SQL Text               : SELECT /*+ gather_plan_statistics */ o.order_id,
                        SUM(oi.unit_price * oi.quantity) AS revenue,
                        p.category_id, o.order_date FROM orders o JOIN
                        order_items oi ON ( o.order_id = oi.order_id ) JOIN
                        products p ON ( p.product_id = oi.product_id ) WHERE
                        o.order_date BETWEEN TIMESTAMP '2007-01-01 13:00:00'
                        AND TIMESTAMP '2007-01-01 13:30:00' GROUP BY
                        o.order_id, p.category_id, o.order_date ORDER BY
                        o.order_id

Plan
-----------------------------

 Plan Hash Value  : 4255998723

---------------------------------------------------------------------------------------------------------------------
| Id   | Operation                                          | Name                | Rows | Bytes  | Cost | Time     |
---------------------------------------------------------------------------------------------------------------------
|    0 | SELECT STATEMENT                                   |                     |      |        | 2801 |          |
|    1 |   SORT GROUP BY                                    |                     | 2328 | 107088 | 2801 | 00:00:01 |
|  * 2 |    FILTER                                          |                     |      |        |      |          |
|  * 3 |     HASH JOIN RIGHT OUTER                          |                     | 2328 | 107088 | 2800 | 00:00:01 |
|    4 |      INDEX FAST FULL SCAN                          | PRD_DESC_PK         | 1000 |   4000 |    3 | 00:00:01 |
|  * 5 |      HASH JOIN                                     |                     | 2328 |  97776 | 2797 | 00:00:01 |
|    6 |       TABLE ACCESS FULL                            | PRODUCT_INFORMATION | 1000 |   8000 |    9 | 00:00:01 |
|    7 |       NESTED LOOPS                                 |                     | 2328 |  79152 | 2788 | 00:00:01 |
|    8 |        NESTED LOOPS                                |                     | 2328 |  79152 | 2788 | 00:00:01 |
|    9 |         TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED | ORDERS              |  463 |   7871 |  472 | 00:00:01 |
| * 10 |          INDEX RANGE SCAN                          | ORD_ORDER_DATE_IX   |  467 |        |    5 | 00:00:01 |
| * 11 |         INDEX RANGE SCAN                           | ITEM_ORDER_IX       |    5 |        |    3 | 00:00:01 |
|   12 |        TABLE ACCESS BY GLOBAL INDEX ROWID          | ORDER_ITEMS         |    5 |     85 |    5 | 00:00:01 |
---------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 2 - filter(TIMESTAMP' 2007-01-01 13:30:00.000000000'>=TIMESTAMP' 2007-01-01 13:00:00.000000000')
* 3 - access("D"."PRODUCT_ID"="I"."PRODUCT_ID")
* 5 - access("I"."PRODUCT_ID"="OI"."PRODUCT_ID")
* 10 - access("O"."ORDER_DATE">=TIMESTAMP' 2007-01-01 13:00:00.000000000' AND "O"."ORDER_DATE"<=TIMESTAMP' 2007-01-01 13:30:00.000000000')
* 11 - access("O"."ORDER_ID"="OI"."ORDER_ID")


Comparison Results (3):
-----------------------------
 1. Query block SEL$BD98F06C, Alias "O"@"SEL$1": Some lines (id: 9) in the
    reference plan are missing in the current plan.
 2. Query block SEL$BD98F06C, Alias "O"@"SEL$1": Some lines (id: 10) in the
    current plan are missing in the reference plan.
 3. Query block SEL$BD98F06C, Alias "O"@"SEL$1": Some columns (OPTIONS, ID,
    PARENT_ID, DEPTH, PARTITION_START, PARTITION_STOP) do not match between the
    reference plan (id: 10) and the current plan (id: 9).


---------------------------------------------------------------------------------------------
 Plan Number            : 3
 Plan Found             : Yes
 Plan Source            : Cursor Cache
 SQL ID                 : gtr9hy6x492p7
 Child Number           : 2
 Plan Database Version  : 19.0.0.0
 Parsing Schema         : "SOE"
 SQL Text               : SELECT /*+ gather_plan_statistics */ o.order_id,
                        SUM(oi.unit_price * oi.quantity) AS revenue,
                        p.category_id, o.order_date FROM orders o JOIN
                        order_items oi ON ( o.order_id = oi.order_id ) JOIN
                        products p ON ( p.product_id = oi.product_id ) WHERE
                        o.order_date BETWEEN TIMESTAMP '2007-01-01 13:00:00'
                        AND TIMESTAMP '2007-01-01 13:30:00' GROUP BY
                        o.order_id, p.category_id, o.order_date ORDER BY
                        o.order_id

Plan
-----------------------------

 Plan Hash Value  : 4255998723

---------------------------------------------------------------------------------------------------------------------
| Id   | Operation                                          | Name                | Rows | Bytes  | Cost | Time     |
---------------------------------------------------------------------------------------------------------------------
|    0 | SELECT STATEMENT                                   |                     |      |        | 2801 |          |
|    1 |   SORT GROUP BY                                    |                     | 2328 | 107088 | 2801 | 00:00:01 |
|  * 2 |    FILTER                                          |                     |      |        |      |          |
|  * 3 |     HASH JOIN RIGHT OUTER                          |                     | 2328 | 107088 | 2800 | 00:00:01 |
|    4 |      INDEX FAST FULL SCAN                          | PRD_DESC_PK         | 1000 |   4000 |    3 | 00:00:01 |
|  * 5 |      HASH JOIN                                     |                     | 2328 |  97776 | 2797 | 00:00:01 |
|    6 |       TABLE ACCESS FULL                            | PRODUCT_INFORMATION | 1000 |   8000 |    9 | 00:00:01 |
|    7 |       NESTED LOOPS                                 |                     | 2328 |  79152 | 2788 | 00:00:01 |
|    8 |        NESTED LOOPS                                |                     | 2328 |  79152 | 2788 | 00:00:01 |
|    9 |         TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED | ORDERS              |  463 |   7871 |  472 | 00:00:01 |
| * 10 |          INDEX RANGE SCAN                          | ORD_ORDER_DATE_IX   |  467 |        |    5 | 00:00:01 |
| * 11 |         INDEX RANGE SCAN                           | ITEM_ORDER_IX       |    5 |        |    3 | 00:00:01 |
|   12 |        TABLE ACCESS BY GLOBAL INDEX ROWID          | ORDER_ITEMS         |    5 |     85 |    5 | 00:00:01 |
---------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 2 - filter(TIMESTAMP' 2007-01-01 13:30:00.000000000'>=TIMESTAMP' 2007-01-01 13:00:00.000000000')
* 3 - access("D"."PRODUCT_ID"="I"."PRODUCT_ID")
* 5 - access("I"."PRODUCT_ID"="OI"."PRODUCT_ID")
* 10 - access("O"."ORDER_DATE">=TIMESTAMP' 2007-01-01 13:00:00.000000000' AND "O"."ORDER_DATE"<=TIMESTAMP' 2007-01-01 13:30:00.000000000')
* 11 - access("O"."ORDER_ID"="OI"."ORDER_ID")


Notes
-----
- cardinality_feedback = yes


Comparison Results (3):
-----------------------------
 1. Query block SEL$BD98F06C, Alias "O"@"SEL$1": Some lines (id: 9) in the
    reference plan are missing in the current plan.
 2. Query block SEL$BD98F06C, Alias "O"@"SEL$1": Some lines (id: 10) in the
    current plan are missing in the reference plan.
 3. Query block SEL$BD98F06C, Alias "O"@"SEL$1": Some columns (OPTIONS, ID,
    PARENT_ID, DEPTH, PARTITION_START, PARTITION_STOP) do not match between the
    reference plan (id: 10) and the current plan (id: 9).


---------------------------------------------------------------------------------------------

How very nice! Oracle reviews the plans with regard to the reference and points out what’s different.

Summary

DBMS_XPLAN.COMPARE_PLANS looks like a great addition to the package and it helps discover differences between SQL execution plans. The report (which looks pretty in HTML, by the way) highlights differences in columns as well as rows: additional or missing lines are called out, as are different access paths, as you can see in the report above.
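If you prefer the HTML version of the report, a minimal sketch based on the call shown earlier might look like this (the spool file name is my own choice):

var report clob

BEGIN
    :report := dbms_xplan.compare_plans(
        reference_plan    => cursor_cache_object('gtr9hy6x492p7', 0),
        compare_plan_list => plan_object_list(
            cursor_cache_object('gtr9hy6x492p7', 1),
            cursor_cache_object('gtr9hy6x492p7', 2)
        ),
        type => 'HTML');
END;
/

set long 1000000 longchunksize 1000000 pagesize 0 trimspool on
spool compare_plans_report.html
print report
spool off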

I should point out again that I haven’t studied the license guide; as always, please ensure you are appropriately licensed for all technology you use.

Happy troubleshooting!

The performance story that didn’t turn out as one: a case of or-expansion and how the database mitigated the problem on its own

Recently I had a bit of time to study the effect of an optimizer query transformation, the so-called or-expansion in Oracle 19c. I thought there might be performance implications with statements using bind variables transformed in this way. My limited testing suggests that isn’t necessarily the case as the optimiser is remarkably resilient.

Still, I thought my investigation was worth writing down; I hope you agree and, better still, find the read enjoyable.

Excuse me, what exactly is an Or-Expansion?

Quoting from the SQL Tuning Guide, during an or-expansion “the optimiser transforms a query block containing top-level disjunctions into the form of a UNION ALL query that contains 2 or more branches”.

The remainder of this article sheds some light on the query transformation. By the way, there are many, many other blog articles out there covering or-expansion. Some even include an analysis of a 10053 trace! The subject is also covered in Troubleshooting Oracle Performance by Christian Antognini, a great read.

Example setup

This is one of the few times I deviated from my trusted Swingbench environment. Instead I created the following segments in my 19.12.0 database running on Oracle Linux 7.9/UEK 6.

CREATE TABLE skewed_data_distribution
    AS
        WITH generator AS (
            SELECT
                ROWNUM id
            FROM
                dual
            CONNECT BY
                ROWNUM <= 4000
        )
        SELECT
            ROWNUM                    AS id,
            CASE
                WHEN mod(ROWNUM, 100000) = 0     THEN 1
                WHEN mod(ROWNUM, 1000) = 0       THEN 2
                WHEN mod(ROWNUM, 100) = 0        THEN 3
                ELSE 4
            END                       AS skew,
            lpad('*', 150, '*')       AS pad,
            sysdate + dbms_random.value(-1000,0) as datecol
        FROM
            generator,
            generator
        WHERE
            ROWNUM <= 10e6;

CREATE INDEX i_skew_1 ON
    skewed_data_distribution (
        id
    );

CREATE INDEX i_skew_2 ON
    skewed_data_distribution (
        skew
    );

BEGIN
 dbms_stats.gather_table_stats(
  ownname => 'MARTIN', 
  tabname => 'SKEWED_DATA_DISTRIBUTION',
  method_opt => 'for all columns size auto, for columns size 254 skew',
  degree => 4
 );
END;
/

Thanks to Jonathan Lewis for elaborating on how to create sample data safely.

By the way, I didn’t enable any of the fix_controls that come with the 19.12 Release Update (RU). If you just raised an eyebrow, please have a look at Mike Dietrich’s blog for details about a potential call to DBMS_OPTIM_BUNDLE after applying an RU.

The query I’ll use for this article is this (it’s designed to trigger an or-expansion).

var the_id number
var the_skew number

WITH q AS (
    SELECT id,
        skew
    FROM
        skewed_data_distribution
    WHERE
        id = :the_id
        OR skew = :the_skew
)
SELECT /*+ gather_plan_statistics opt_param('_b_tree_bitmap_plans','false') */
    COUNT(*)
FROM
    q;

Please ignore the fact that it is a rather useless SQL statement on its own; it helped me create a test case. All I needed was a table, some indexes, a suitable data distribution, and a histogram on SKEW. Without the histogram the optimiser probably wouldn’t have considered using the index, but I’m getting ahead of myself.

When I first tested the query I didn’t get the or-expansion I wanted, but rather this:

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
Plan hash value: 3063879156

---------------------------------------------------------------------------------------------
| Id  | Operation                        | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |     1 |     9 |  4872   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE                  |          |     1 |     9 |            |          |
|   2 |   BITMAP CONVERSION COUNT        |          |  2500K|    21M|  4872   (1)| 00:00:01 |
|   3 |    BITMAP OR                     |          |       |       |            |          |
|   4 |     BITMAP CONVERSION FROM ROWIDS|          |       |       |            |          |
|*  5 |      INDEX RANGE SCAN            | I_SKEW_2 |       |       |  4860   (1)| 00:00:01 |
|   6 |     BITMAP CONVERSION FROM ROWIDS|          |       |       |            |          |
|*  7 |      INDEX RANGE SCAN            | I_SKEW_1 |       |       |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("SKEW"=TO_NUMBER(:THE_SKEW))
   7 - access("ID"=TO_NUMBER(:THE_ID))

20 rows selected.

That bitmap conversion wasn’t quite what I had in mind, hence the opt_param hint in the query text.

Data distribution

The following details about the data distribution might help you understand the rest of the article:

REM data distribution: SKEW

SELECT /*+ parallel */
    COUNT(*),
    skew
FROM
    skewed_data_distribution
GROUP BY
    skew;

  COUNT(*)       SKEW
---------- ----------
       100          1
      9900          2
     90000          3
   9900000          4

4 rows selected.

REM Histograms  

SELECT
    column_name,
    histogram,
    num_buckets,
    column_id
FROM
    user_tab_columns
WHERE
    table_name = 'SKEWED_DATA_DISTRIBUTION';

COLUMN_NAM HISTOGRAM           NUM_BUCKETS       COLUMN_ID
---------- --------------- --------------- ---------------
ID         NONE                          1               1
SKEW       FREQUENCY                     4               2
PAD        NONE                          1               3
DATECOL    NONE                          1               4

4 rows selected.

REM Histogram on SKEW

SELECT
    endpoint_number,
    endpoint_value
FROM
    user_tab_histograms
WHERE
        table_name = 'SKEWED_DATA_DISTRIBUTION'
    AND column_name = 'SKEW';

ENDPOINT_NUMBER  ENDPOINT_VALUE
--------------- ---------------
            100               1
          10000               2
         100000               3
       10000000               4

4 rows selected.

Workload

I am going to run 2 scripts a few times to simulate a query workload. The scripts differ in their bind variable values:

$ diff -y skew_unselective.sql skew_selective.sql
var the_id number						var the_id number
var the_skew number						var the_skew number

exec :the_id := 100						exec :the_id := 100
exec :the_skew := 4					      |	exec :the_skew := 1

WITH q AS (							WITH q AS (
    SELECT id,							    SELECT id,
        skew							        skew
    FROM							    FROM
        skewed_data_distribution				        skewed_data_distribution
    WHERE							    WHERE
        id = :the_id						        id = :the_id
        OR skew = :the_skew					        OR skew = :the_skew
)								)
SELECT /*+ gather_plan_statistics opt_param('_b_tree_bitmap_p	SELECT /*+ gather_plan_statistics opt_param('_b_tree_bitmap_p
    COUNT(*)							    COUNT(*)
FROM								FROM
    q;								    q;

Selective bind variables

Using my runMany.sh script I launched 20 instances of the more selective query first. The expected execution plan is as follows:

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
SQL_ID  8tjz2tqn9gtck, child number 0
-------------------------------------
WITH q AS (     SELECT id,         skew     FROM
skewed_data_distribution     WHERE         id = :the_id         OR skew
= :the_skew ) SELECT /*+ gather_plan_statistics
opt_param('_b_tree_bitmap_plans','false') */     COUNT(*) FROM     q

Plan hash value: 414689775

---------------------------------------------------------------------------...
| Id  | Operation                              | Name                     |...
---------------------------------------------------------------------------...
|   0 | SELECT STATEMENT                       |                          |...
|   1 |  SORT AGGREGATE                        |                          |...
|   2 |   VIEW                                 | VW_ORE_9774CF0C          |...
|   3 |    UNION-ALL                           |                          |...
|*  4 |     INDEX RANGE SCAN                   | I_SKEW_1                 |...
|*  5 |     TABLE ACCESS BY INDEX ROWID BATCHED| SKEWED_DATA_DISTRIBUTION |...
|*  6 |      INDEX RANGE SCAN                  | I_SKEW_2                 |...
---------------------------------------------------------------------------...

Peeked Binds (identified by position):
--------------------------------------

   1 - :1 (NUMBER): 100
   2 - :2 (NUMBER): 1

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("ID"=:THE_ID)
   5 - filter(LNNVL("ID"=:THE_ID))
   6 - access("SKEW"=:THE_SKEW)


34 rows selected.

In other words, you can witness the result of the query transformation. Also note the view VW_ORE_%: it doesn’t exist in the database; it only exists thanks to the query transformation. I think this is a prime use case for or-expansion: by splitting the OR condition in the WHERE clause into separate branches Oracle can make use of the indexes.
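To make the transformation a little more tangible, here is a hand-written equivalent of what the VW_ORE_% branches represent. This is a conceptual sketch only, not the optimiser’s internal representation; note the LNNVL() predicate in the second branch preventing rows from being counted twice, matching the filter on plan line 5:

SELECT COUNT(*)
FROM (
    SELECT id, skew
    FROM   skewed_data_distribution
    WHERE  id = :the_id
    UNION ALL
    SELECT id, skew
    FROM   skewed_data_distribution
    WHERE  skew = :the_skew
    AND    LNNVL(id = :the_id)   -- exclude rows already returned by the first branch
);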

Really unselective predicate

So here, I thought, the trouble might arise: what if, instead of a small-ish subset of the rows, Oracle has to return the majority of the table? It can be as simple as replacing the value for SKEW, going from a fairly selective predicate to a highly unselective one. This doesn’t trigger a hard parse because the query text doesn’t change.

Let’s recap. So far, I have 1 execution plan for my statement in the shared pool, covering all executions up to now (29 to be precise):

SELECT
    sql_id,
    plan_hash_value,
    child_number,
    executions,
    is_bind_aware,
    is_bind_sensitive
FROM
    v$sql
WHERE
    sql_id = '8tjz2tqn9gtck';

SQL_ID        PLAN_HASH_VALUE    CHILD_NUMBER      EXECUTIONS I I
------------- --------------- --------------- --------------- - -
8tjz2tqn9gtck       414689775               0              29 N Y

Let’s try the second query. Using runMany.sh I launched skew_unselective.sql 10 times against the database. After they completed, I checked the shared pool again:

SELECT
    sql_id,
    plan_hash_value,
    child_number,
    executions,
    is_bind_aware,
    is_bind_sensitive
FROM
    v$sql
WHERE
    sql_id = '8tjz2tqn9gtck';

SQL_ID        PLAN_HASH_VALUE    CHILD_NUMBER      EXECUTIONS I I
------------- --------------- --------------- --------------- - -
8tjz2tqn9gtck       414689775               0              39 N Y

So here is a problem: the next batch of my queries used the “wrong”, or rather suboptimal, plan. This eventually results in a longer elapsed time per query. However, during my tests (and I appreciate they are a bit limited in scope) I noticed that the next time I ran the unselective query, another child cursor appeared:

SELECT
    sql_id,
    plan_hash_value,
    child_number,
    executions,
    is_bind_aware,
    is_bind_sensitive
FROM
    v$sql
WHERE
    sql_id = '8tjz2tqn9gtck';

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS I I
------------- --------------- ------------ ---------- - -
8tjz2tqn9gtck       414689775            0         39 N Y
8tjz2tqn9gtck      1662074091            1          1 N Y

SQL> select * from dbms_xplan.display_cursor('8tjz2tqn9gtck',1);

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
SQL_ID  8tjz2tqn9gtck, child number 1
-------------------------------------
WITH q AS (     SELECT id,         skew     FROM
skewed_data_distribution     WHERE         id = :the_id         OR skew
= :the_skew ) SELECT /*+ gather_plan_statistics
opt_param('_b_tree_bitmap_plans','false') */     COUNT(*) FROM     q

Plan hash value: 1662074091

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name                     | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                          |       |       | 66204 (100)|          |
|   1 |  SORT AGGREGATE    |                          |     1 |     9 |            |          |
|*  2 |   TABLE ACCESS FULL| SKEWED_DATA_DISTRIBUTION |  9900K|    84M| 66204   (1)| 00:00:03 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("SKEW"=:THE_SKEW OR "ID"=:THE_ID))


22 rows selected.

Now that’s better! With the totally un-selective predicate it doesn’t make sense to use the index. The full scan is a far better choice. What happens when I run the selective query again?

SQL> @skew_selective

PL/SQL procedure successfully completed.

PL/SQL procedure successfully completed.

  COUNT(*)
----------
       101

Display execution plan for last statement for this session from library cache...

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------
SQL_ID  8tjz2tqn9gtck, child number 2
-------------------------------------
WITH q AS (     SELECT id,         skew     FROM
skewed_data_distribution     WHERE         id = :the_id         OR skew
= :the_skew ) SELECT /*+ gather_plan_statistics
opt_param('_b_tree_bitmap_plans','false') */     COUNT(*) FROM     q

Plan hash value: 414689775

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name                     | Starts | E-Rows |E-Bytes| Cost (%CPU)|
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                          |      1 |        |       |    10 (100)|
|   1 |  SORT AGGREGATE                        |                          |      1 |      1 |       |            |
|   2 |   VIEW                                 | VW_ORE_9774CF0C          |      1 |    101 |       |    10   (0)|
|   3 |    UNION-ALL                           |                          |      1 |        |       |            |
|*  4 |     INDEX RANGE SCAN                   | I_SKEW_1                 |      1 |      1 |     6 |     3   (0)|
|*  5 |     TABLE ACCESS BY INDEX ROWID BATCHED| SKEWED_DATA_DISTRIBUTION |      1 |    100 |   900 |     7   (0)|
|*  6 |      INDEX RANGE SCAN                  | I_SKEW_2                 |      1 |    100 |       |     3   (0)|
------------------------------------------------------------------------------------------------------------------

Peeked Binds (identified by position):
--------------------------------------

   1 - :1 (NUMBER): 100
   2 - :2 (NUMBER): 1

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("ID"=:THE_ID)
   5 - filter(LNNVL("ID"=:THE_ID))
   6 - access("SKEW"=:THE_SKEW)

SELECT
    sql_id,
    plan_hash_value,
    child_number,
    executions,
    is_bind_aware,
    is_bind_sensitive
FROM
    v$sql
WHERE
    sql_id = '8tjz2tqn9gtck';

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS I I
------------- --------------- ------------ ---------- - -
8tjz2tqn9gtck       414689775            0         39 N Y
8tjz2tqn9gtck      1662074091            1          1 N Y
8tjz2tqn9gtck       414689775            2          1 Y Y

After a few more tries the number of child cursors settled down at 4:

SQL> SELECT
    sql_id,
    plan_hash_value,
    child_number,
    executions,
    is_bind_aware,
    is_bind_sensitive
FROM
    v$sql
WHERE
    sql_id = '8tjz2tqn9gtck';

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS I I
------------- --------------- ------------ ---------- - -
8tjz2tqn9gtck       414689775            0         39 N Y
8tjz2tqn9gtck      1662074091            1          1 N Y
8tjz2tqn9gtck       414689775            2          5 Y Y
8tjz2tqn9gtck      1662074091            3          7 Y Y

Child cursors 0 and 1 haven’t seen further executions, while 2 and 3 have.
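If you want to dig a little deeper into how the later cursors became bind aware, the adaptive cursor sharing views can help. A small sketch using one of them (the exact output depends entirely on your own workload history):

SELECT *
FROM   v$sql_cs_histogram
WHERE  sql_id = '8tjz2tqn9gtck'
ORDER  BY child_number, bucket_id;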

Summary

My (admittedly limited) amount of testing suggests that it is indeed possible to use or-expansion successfully even with huge data skew and bind variables in 19.12.0. All of my database’s parameters remained at their defaults, with the exception of _b_tree_bitmap_plans, which I disabled to prevent bitmap conversion plans from appearing.

And many thanks, as always, to Tanel Poder for sharing his scripts with the community. They have been put to good use writing this post.

Deploying I/O intensive workloads in the cloud: Oracle Automatic Storage Management (ASM)

Over the past month I wrote a few posts about deploying I/O intensive workloads in the cloud. Using standard Linux tools, mainly the Logical Volume Manager (LVM), I tried to prevent certain pitfalls from occurring. Although I’m a great fan of LVM and RAID (and their combination), there are situations where LVM/Software RAID aren’t part of the best solution. This is especially true when it comes to extending a VM’s storage configuration for an Oracle Database.

Striping, Mirroring and Risk

With LVM RAID (or LVM on top of Software RAID) it is possible to stripe an Oracle database, or any other I/O intensive workload, across multiple disks. At the risk of losing the RAID device (remember that RAID 0 offers exactly zero protection from disk failure) you can gain a performance advantage. The risk can be partially mitigated by using a proven, tested and, most importantly, rehearsed technique to still meet the RTO and RPO of the database.

The trouble with LVM RAID can potentially start as soon as you add more storage to the VM. I hope I managed to demonstrate the risk of I/O hotspots in my earlier posts.

Oracle’s ASM is different from stock Linux tools, and it’s much less of a general purpose solution. Being an Oracle product it is also subject to a different license model, which rules it out for most generic use cases, or at least that’s my experience. If, however, you want to deploy an Oracle database in the cloud, it is well worth considering ASM. I don’t want to say it’s free of drawbacks (no piece of software is) but in my opinion its benefits outweigh the disadvantages when deploying a database.

For the sake of argument I’ll treat Oracle Restart and Grid Infrastructure as synonyms in this article. Oracle Restart is made up of ASM as well as a trimmed version of Oracle’s Clusterware as used in Real Application Clusters. Oracle Restart is installed into a separate Oracle Home; you usually install one database software home in addition. More on that later.

ASM vs LVM: a Question of Concepts

ASM has been around for quite some time and I like to think of it as a mature technology. In a way it is similar to LVM as you aggregate block devices (Physical Volumes in LVM) into Disk Groups (Volume Groups in LVM). Rather than creating another layer of abstraction on top of the ASM Disk Group as you do with LVM you simply point the database at a couple of Disk Groups and you are done. There is no need to maintain an equivalent of a Logical Volume or file system. A shorter code path to traverse tends to be less work. And it’s common knowledge that the fastest way to do something is not to do it in the first place. I should also point out that ASM does not perform I/O. It’s always the database session that does; otherwise ASM would never scale.
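
To illustrate just how little there is to it, here is a minimal sketch of pointing a database at the two Disk Groups. The use of Oracle Managed Files and the FRA size of 200G are assumptions for this example, not a prescription:

sqlplus -s / as sysdba <<'EOF'
-- sketch only: new data files go to +DATA, the Fast Recovery Area lives in +RECO
alter system set db_create_file_dest        = '+DATA' scope=both;
alter system set db_recovery_file_dest_size = 200G    scope=both;
alter system set db_recovery_file_dest      = '+RECO' scope=both;
EOF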

But what about protection from failure? Put very simply, in ASM you have a choice between striping and striping + mirroring. There are multiple so-called redundancy levels each with their own implications. If you are interested you can find the relevant details in Oracle’s Automatic Storage Management Administration Guide.

My Test Environment’s Setup

To keep things consistent with my previous posts I am installing Oracle Restart on my VM.Standard.E4.Flex VM in Oracle Cloud Infrastructure. Both Grid Infrastructure and database software are patched to 19.12.0, the current release at the time of writing. The underlying Linux version is 8.4 with kernel 5.4.17-2102.203.6.el8uek.x86_64. I decided to use UDEV rules for device name persistence and setting permissions rather than ASMLib or ASM Filter Driver. To keep things simple and also to follow the path I chose with my previous LVM/RAID posts I’m going to create the +DATA and +RECO Disk Groups with EXTERNAL redundancy. With external redundancy failure of a single block device in an ASM Disk Group will bring the entire Disk Group down, taking the database with it: game over. This is the same as with a RAID 0 configuration.

Again, and in line with the other posts about the topic, this article doesn’t concern itself with the durability of block devices in the cloud. External Redundancy should only be considered if approved in your organisation. You are most likely also required to put additional means in place to guarantee the database’s RTO and RPO. See my earlier comments and posts for details.

My +DATA disk group is currently made up of 2 block devices, +RECO consists of just 1 device. The database lives in +DATA with the Fast Recovery Area (FRA) located on +RECO.

SQL> select dg.name dg_name, dg.type, d.name disk_name, d.os_mb, d.path
  2   from v$asm_disk d join v$asm_diskgroup dg on (d.group_number = dg.group_number);

DG_NAME    TYPE   DISK_NAME       OS_MB PATH
---------- ------ ---------- ---------- ------------------------------
RECO       EXTERN RECO_0000      511998 /dev/oracleoci/oraclevde1
DATA       EXTERN DATA_0001      511998 /dev/oracleoci/oraclevdd1
DATA       EXTERN DATA_0000      511998 /dev/oracleoci/oraclevdc1

You can see from the volume sizes this is a lab/playground environment. The concepts however are independent of disk size. Just make sure the disks you use are of the same size and performance characteristics. Terraform is the most convenient way in the cloud to ensure they are.

Performance

Just as before I’ll start the familiar Swingbench workload. It isn’t meant to benchmark the system but to see which disks are in use. As in the previous examples, Online Redo Logs aren’t multiplexed. This really is acceptable only in this scenario and shouldn’t be done for any serious deployment of the database. It helps me isolate I/O though, which is why I did it.

Before getting detailed I/O performance figures I need to check the current device mapping:

SQL> !ls -l /dev/oracleoci/oraclevd{c,d}1
lrwxrwxrwx. 1 root root 7 Sep  1 15:21 /dev/oracleoci/oraclevdc1 -> ../sdc1
lrwxrwxrwx. 1 root root 7 Sep  1 15:21 /dev/oracleoci/oraclevdd1 -> ../sdd1

Looking at the iostat output I can see both /dev/sdc and /dev/sdd actively used:

[oracle@oracle-19c-asm ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-asm)  09/01/2021      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.19    0.00    0.26    0.12    0.01   98.43

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              1.12    1.03      0.04      0.03     0.01     0.54  ...  0.10
dm-0             1.03    0.95      0.03      0.03     0.00     0.00  ...  0.08
dm-1             0.02    0.60      0.00      0.01     0.00     0.00  ...  0.01
sdb              0.87    0.51      0.04      0.00     0.00     0.12  ...  0.09
dm-2             0.86    0.63      0.04      0.00     0.00     0.00  ...  0.09
sdc            291.58    4.87     54.15      0.05     3.51     0.01  ... 22.92
sdd            289.95    4.05     53.63      0.04     3.37     0.01  ... 19.01
sde              0.13    0.00      0.00      0.00     0.00     0.00  ...  0.01
sdf              0.10    0.72      0.00      0.01     0.00     0.00  ...  0.13

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.23    0.00    7.77   23.90    0.33   63.78

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              0.00    2.40      0.00      0.05     0.00     1.20  ...  0.12
dm-0             0.00    0.60      0.00      0.00     0.00     0.00  ...  0.08
dm-1             0.00    3.00      0.00      0.05     0.00     0.00  ...  0.04
sdb              0.00    0.40      0.00      0.00     0.00     0.00  ...  0.02
dm-2             0.00    0.40      0.00      0.00     0.00     0.00  ...  0.02
sdc           24786.60   67.40    211.80      0.57  2319.60     0.00 ... 100.00
sdd           24575.40   72.00    210.01      0.55  2302.80     0.00 ...  97.70
sdf              0.00    0.40      0.00      0.00     0.00     0.00  ...  0.06

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.74    0.00    7.65   24.38    0.31   62.93

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              0.00    1.80      0.00      0.02     0.00     0.20  ...  0.04
dm-0             0.00    1.20      0.00      0.02     0.00     0.00  ...  0.02
dm-1             0.00    0.80      0.00      0.01     0.00     0.00  ...  0.02
sdc           24684.20   61.60    215.14      0.50  2844.40     0.40 ... 100.00
sdd           24399.80   68.40    212.41      0.55  2787.20     0.60 ...  95.74
sdf              0.00    0.80      0.00      0.01     0.00     0.00  ...  0.10

This should demonstrate the fact that ASM stripes data across disks. Up to this point there isn’t any visible difference in the iostat output compared to my previous posts.

Extending Storage

The main difference between LVM/RAID and ASM is yet to come: what happens if I have to add storage to the +DATA disk group? Remember that with LVM you had to add as many additional devices as you originally striped across. In other words, if you used a RAID 0 consisting of 2 block devices, you need to add another 2. With ASM you don’t have the same restriction, as you will see in a minute.

I have added another block device to the VM, named /dev/oracleoci/oraclevdf, with the exact same size and performance characteristics as the existing 2 devices. Before it can be added to the Disk Group it needs to be partitioned and the device permissions need to be in place.
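
A rough sketch of that preparation follows; the GPT/parted approach is just one way to do it, so adjust it to your environment and UDEV rules:

# create a GPT label and a single partition spanning the new device
sudo parted -s /dev/oracleoci/oraclevdf mklabel gpt mkpart primary 2048s 100%
# confirm the partition exists and that the UDEV rules applied the expected owner and permissions
ls -l /dev/oracleoci/oraclevdf1

With that done I can add the device to the Disk Group. There are many ways to do so; I’m showing the SQL interface.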

[grid@oracle-19c-asm ~]$ sqlplus / as sysasm

SQL*Plus: Release 19.0.0.0.0 - Production on Thu Sep 2 06:21:08 2021
Version 19.12.0.0.0

Copyright (c) 1982, 2021, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.12.0.0.0

SQL> alter diskgroup data add disk '/dev/oracleoci/oraclevdf1' ; 

Diskgroup altered.

SQL>

The prompt returns immediately; however an asynchronous operation is started in the background, a so-called rebalance task:

SQL> select dg.name, o.operation, o.state,o.sofar,o.est_work,o.est_minutes, o.error_code
  2   from v$asm_diskgroup dg join v$asm_operation o using (group_number)
  3  /

NAME                           OPERA STAT      SOFAR   EST_WORK EST_MINUTES ERROR_CODE
------------------------------ ----- ---- ---------- ---------- ----------- --------------------------------------------
DATA                           REBAL RUN       14608          0           0
DATA                           REBAL DONE          0          0           0
DATA                           REBAL DONE      33308      33308           0

Once completed, another disk has been added to the +DATA disk group:

SQL> select dg.name dg_name, dg.type, d.name disk_name, d.os_mb, d.path
  2   from v$asm_disk d join v$asm_diskgroup dg on (d.group_number = dg.group_number)
  3  where dg.name = 'DATA'
  4  /

DG_NAME    TYPE   DISK_NAME	  OS_MB PATH
---------- ------ ---------- ---------- ------------------------------
DATA	   EXTERN DATA_0002	 511998 /dev/oracleoci/oraclevdf1
DATA	   EXTERN DATA_0000	 511998 /dev/oracleoci/oraclevdc1
DATA	   EXTERN DATA_0001	 511998 /dev/oracleoci/oraclevdd1

SQL> 

The disk rebalance operation is, by the way, an online operation with a few tunables such as the so-called power limit: you can trade off completion time against the effect on ongoing I/O operations. For some time the maximum value of ASM’s power limit was 11 ;)
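
If completion time matters more to you than the impact on concurrent I/O, the power can be specified per operation. A minimal sketch; the power value of 4 is just an example:

sqlplus -s / as sysasm <<'EOF'
-- example only: rebalance the DATA disk group with power 4 instead of the default
alter diskgroup data rebalance power 4;
EOF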

What does that mean for our Swingbench workload? Let’s have a look at iostat while the same workload is running. Please remember that /dev/oracleoci/oraclevd[cdf]1 are part of the ASM +DATA Disk Group:

[grid@oracle-19c-asm ~]$ ls -l /dev/oracleoci/oraclevd[cdf]1
lrwxrwxrwx. 1 root root 7 Sep  2 06:30 /dev/oracleoci/oraclevdc1 -> ../sdd1
lrwxrwxrwx. 1 root root 7 Sep  2 06:30 /dev/oracleoci/oraclevdd1 -> ../sdb1
lrwxrwxrwx. 1 root root 7 Sep  2 06:35 /dev/oracleoci/oraclevdf1 -> ../sdf1

Please bear this in mind when looking at the iostat output:

[grid@oracle-19c-asm ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-asm) 	09/02/2021 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.27    0.03    0.37    0.40    0.03   98.90

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   ...  %util
sda              4.92    1.21      0.14      0.08     0.03   ...   0.26
dm-0             4.53    0.68      0.13      0.07     0.00   ...   0.23
dm-1             0.12    0.75      0.00      0.01     0.00   ...   0.02
sdb            391.83    7.36     12.15      3.60    27.41   ...   6.90
sdc              0.15    0.71      0.00      0.01     0.00   ...   0.14
sdd            396.92    8.48     12.20      3.61    28.23   ...   6.85
sdf            383.58   13.97      3.22     10.71    27.53   ...   5.92
sde              3.74    0.85      0.19      0.01     0.00   ...   0.28
dm-2             3.75    1.02      0.19      0.01     0.00   ...   0.28

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.60    0.00   12.18   26.38    1.61   52.24

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   ...  %util
sda              0.00    0.40      0.00      0.00     0.00   ...   0.06
dm-0             0.00    0.40      0.00      0.00     0.00   ...   0.06
sdb           24375.60  176.80    203.25      1.39  1635.40  ...   97.62
sdc              0.00    0.80      0.00      0.01     0.00   ...   0.14
sdd           24654.60  172.40    205.89      1.45  1689.80  ...   99.96
sdf           24807.40  201.20    207.31      1.51  1718.20  ...   97.86
sde              0.00    1.00      0.00      0.01     0.00   ...   0.04
dm-2             0.00    1.20      0.00      0.01     0.00   ...   0.04

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.22    0.00   13.05   23.61    1.55   54.57

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   ...  %util
sda              0.00    0.60      0.00      0.00     0.00   ...   0.10
dm-0             0.00    0.40      0.00      0.00     0.00   ...   0.04
dm-1             0.00    0.20      0.00      0.00     0.00   ...   0.06
sdb           24783.40  145.40    212.17      1.15  2363.20  ...   97.48
sdc              0.00    0.60      0.00      0.00     0.00   ...   0.14
sdd           24795.40  113.60    213.19      1.00  2470.80  ...   99.90
sdf           24871.00  106.00    213.34      0.97  2426.00  ...   97.00
sde              0.00    2.40      0.00      0.02     0.00   ...   0.08
dm-2             0.00    2.60      0.00      0.02     0.00   ...   0.08

You can see that all 3 disks are more or less evenly used. This is the main difference to the use of LVM RAID. Thanks to the rebalance operation all data on the disk group is redistributed across the disks in the group.

Summary

When it comes to deploying an Oracle database in an Infrastructure as a Service (IaaS) scenario Oracle’s ASM offers lots of advantages over stock Linux tools. For example, it is possible to add storage to an ASM Disk Group as and when it’s needed without over-provisioning. ASM furthermore rebalances all data in the Disk Group across all disks as part of a configuration change, as you just saw. That way it is much harder to create the I/O hotspots I often see when ASM is not in use.

In addition to ASM you also get other amenities as a side effect. For example, Oracle Restart allows you to start databases and database services automatically when the system boots up. There is no need to write systemd unit files as it’s all done behind the covers. Should your database crash for some reason, provided it can, Oracle Restart automatically brings it up again without your intervention. It also works beautifully in conjunction with Oracle’s Universal Connection Pool (UCP) and Data Guard.

The use of ASM implies direct I/O. I said earlier that ASM doesn’t maintain a file system layer when used for the Oracle database (that’s not entirely correct but true for all the databases I saw) and as a result Linux can’t cache I/O. This is considered a good thing by most in the community. Oracle has its own buffer cache after all; as long as it’s sized appropriately for your workload, double-buffering isn’t the best use of precious DRAM.

So much for the plus side, but what about the implications of using Oracle Restart? First of all, it’s another Oracle software home you need to maintain. Given the high degree of automation possible these days that shouldn’t be an issue: an Ansible playbook patching all Oracle Restart components is easy enough to write.

If your organisation mandates a separation of duties between database and storage/Linux administration your respective administrator might need to learn a new technology.

I’m sure you can think of additional downsides to using ASM, and I admit I won’t delve deeper into the subject as I’m quite biased. ASM has been one of the truly outstanding innovations for running Oracle in my opinion. The human aspect of introducing a new technology however isn’t to be underestimated, and the best technology doesn’t always win the race.

Resolving slight niggles of Enterprise Manager Express 19c

This page, should I remember I wrote it, hopefully addresses the slight niggles I have with Oracle Enterprise Manager Express. I always forget how to solve these and it takes me a minute to remember. I hope this page helps me jump start my memory. If you have any additional niggles to report please do and I’ll add them here.

OEM Express not working in Grid Infrastructure when separation of duties is enabled

In case you installed Oracle Restart (and I presume the same applies to Real Application Clusters as well) with a different account than the database software, you won’t be able to access OEM Express straight away. The most common issue I had was this:

[oracle@server3 ~]$ curl --verbose --insecure https://server3:5510/em
* About to connect() to server3 port 5510 (#0)
*   Trying 192.168.100.13...
* Connected to server3 (192.168.100.13) port 5510 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* NSS error -5938 (PR_END_OF_FILE_ERROR)
* Encountered end of file
* Closing connection 0
curl: (35) Encountered end of file

I have also seen this one (with port-forwarding in use)

$ curl --insecure --verbose https://localhost:5510/em
* Uses proxy env variable no_proxy == 'localhost,127.0.0.0/8,::1'
*   Trying 127.0.0.1:5510...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 5510 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:5510 
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:5510 

In a browser you get something along the lines of “this site can’t be reached … ERR_CONNECTION_CLOSED”.

This issue is addressed in My Oracle Support (MOS) Doc ID 1604062.1 “Troubleshooting why EM Express is not working”. Search for item 10 in the table of contents for the resolution.

Invalid Container Name when trying to connect to a PDB

When enabling OEM Express by setting the HTTPS port in CDB$ROOT you explicitly enable it for the root container only. Connecting to OEM Express using this port (and omitting the container name) provides you with information about CDB$ROOT as well as all other Pluggable Databases (PDBs).

But what if you want to connect to a specific PDB? In this regard the login screen presented by OEM Express can be a little misleading as you can’t enter a container name without some further work. Unless that work is completed you get an error (“Invalid Container Name”) even though both credentials and container name are correct.

This can be changed though. Since Oracle 12.2 it is possible to define a single, global OEM Express port for the CDB and all its PDBs, as documented in the 2 Day DBA manual. After implementing the change it is possible to log in to a specific PDB by supplying its name in the login screen. I couldn’t find the OEM-equivalent drop-down menu allowing me to switch back and forth between CDB$ROOT and the other containers, so it seems to be log-off/log-on.

By the way, in 12.1 you had to switch to each PDB for which you wanted to enable OEM Express and execute a separate call to dbms_xdb_config.sethttpsport().
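
For reference, enabling the global port looks roughly like the sketch below when connected to CDB$ROOT; the port number is only an example, so double-check the procedure names against your release’s documentation:

sqlplus -s / as sysdba <<'EOF'
-- run in CDB$ROOT: expose CDB$ROOT and all PDBs via a single EM Express port
exec dbms_xdb_config.sethttpsport(5510)
exec dbms_xdb_config.setglobalportenabled(TRUE)
EOF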

More to come

This is a living document and I’ll update it with further niggles as and when I hit them.

Deploying I/O intensive workloads in the cloud: mdadm (aka Software) RAID

The final part of my “avoiding pitfalls with Linux Logical Volume Manager” (LVM) series considers software RAID on Oracle Linux 8 as the basis for your LVM’s Physical Volume (PV). It’s still the very same VM.Standard.E4.Flex running Oracle 19.12.0 on top of Oracle Linux 8.4 with UEK6 (5.4.17-2102.203.6.el8uek.x86_64) I used for creating the earlier posts.

Previous articles in this series can be found here:

Storage Configuration

Rather than using LVM-RAID as in the previous article, the plan this time is to create a software RAID (pseudo-)device and use it as a Physical Volume. This is exactly what I had done before I learned about LVM RAID. Strictly speaking, it isn’t necessary to create a Volume Group on top of a RAID device as you can absolutely use such a device on its own. Having said that, growing a RAID 0 device doesn’t seem possible based on my limited time studying the documentation. Speaking of which: you can read more about software RAID in Red Hat Enterprise Linux 8 here.

In this post I’ll demonstrate how you could use a RAID 0 device for striping data across multiple disks. Please don’t implement the steps in this article unless software RAID is an approved solution in your organisation and you are aware of the implications. Kindly note this article does not concern itself with the durability of block devices in the cloud. In the cloud, you have a lot less control over the block devices you get, so make sure you have appropriate protection methods in place to guarantee your databases’ RTO and RPO. RAID 0 offers 0 protection from disk failure (it’s in the name ;), so as soon as you lose a disk from your software RAID, it’s game over.

Creating the RAID Device

The first step is to create the RAID device. For nostalgic reasons I named it /dev/md127; other sources name their devices /dev/md0. Not that it matters too much.

[opc@oracle-19c-fs ~]$ sudo mdadm --create /dev/md127 --level=0 \
> --raid-devices=2 /dev/oracleoci/oraclevdc1 /dev/oracleoci/oraclevdd1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md127 started.
[opc@oracle-19c-fs ~]$ 

As you can see from the output above mdadm created the device for me. If you wondered what the funny device names imply, have a look at an earlier post I wrote about device name persistence in OCI.

You can always use mdadm --detail to get all the interesting details from a RAID device:

[opc@oracle-19c-fs ~]$ sudo mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Fri Aug  6 14:15:12 2021
        Raid Level : raid0
        Array Size : 524019712 (499.74 GiB 536.60 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Fri Aug  6 14:15:12 2021
             State : clean 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 512K

Consistency Policy : none

              Name : oracle-19c-fs:127  (local to host oracle-19c-fs)
              UUID : 30dc8f99...
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
[opc@oracle-19c-fs ~]$  

This is looking good – both devices are available and no errors have occurred.

Creating oradata_vg

With the future PV available it’s time to create the Volume Group and the Logical Volumes (LV) for the database and Fast Recovery Area. I’m listing the steps here for later reference, although they are the same as in part 1 of this article.

[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 1) create the PV
[opc@oracle-19c-fs ~]$ sudo pvcreate /dev/md127
  Physical volume "/dev/md127" successfully created.

[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 2) create the VG
[opc@oracle-19c-fs ~]$ sudo vgcreate oradata_vg /dev/md127
  Volume group "oradata_vg" successfully created

[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 3) create the first LV
[opc@oracle-19c-fs ~]$ sudo lvcreate --extents 80%FREE --name oradata_lv oradata_vg 
  Logical volume "oradata_lv" created.

[opc@oracle-19c-fs ~]$ #
[opc@oracle-19c-fs ~]$ # step 4) create the second LV
[opc@oracle-19c-fs ~]$ sudo lvcreate --extents 100%FREE --name orareco_lv oradata_vg 
  Logical volume "orareco_lv" created.

The end result is 2 LVs in oradata_vg:

[opc@oracle-19c-fs ~]$ sudo lvs oradata_vg
  LV         VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  oradata_lv oradata_vg -wi-a----- 399.79g                                                    
  orareco_lv oradata_vg -wi-a----- <99.95g   

That’s it! The LVs require file systems before they can be mounted (not shown here).
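
For completeness, the remaining steps could look roughly like this. The mount point for the FRA (/u01/fra) is an assumption for this sketch, and you would normally also add both file systems to /etc/fstab so they survive a reboot:

# create XFS file systems on both Logical Volumes
sudo mkfs.xfs /dev/oradata_vg/oradata_lv
sudo mkfs.xfs /dev/oradata_vg/orareco_lv
# create the mount points and mount the file systems
sudo mkdir -p /u01/oradata /u01/fra
sudo mount /dev/oradata_vg/oradata_lv /u01/oradata
sudo mount /dev/oradata_vg/orareco_lv /u01/fra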

Trying it out

After the final touches had been applied I restored the database and started the familiar Swingbench workload to see which disks are in use. Right before I did that I made sure I’m not multiplexing control files/online redo logs into the FRA; this is for test purposes only. NOT multiplexing control files/online redo log members is probably a Bad Idea for serious Oracle deployments but ok for this scenario.

I am expecting to see both block devices making up /dev/md127 used. And sure enough, they are:

[opc@oracle-19c-fs ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs)   13/08/21        _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.23    0.01    0.35    0.57    0.01   98.83

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ...  %util
sda              2.99    0.96      0.08      0.04     0.03     0.26  ...   0.21
dm-0             2.78    0.62      0.07      0.03     0.00     0.00  ...   0.20
dm-1             0.06    0.58      0.00      0.01     0.00     0.00  ...   0.02
sdb              1.28    0.22      0.06      0.00     0.00     0.02  ...   0.13
dm-2             1.26    0.24      0.06      0.00     0.00     0.00  ...   0.13
sdc            753.52   26.38      8.37      5.64    30.91     0.29  ...   7.36
md127         1573.79   53.30     17.44     12.01     0.00     0.00  ...   0.00
sdd            758.09   26.57      8.42      5.64    31.29     0.05  ...   9.34
sde             20.53    0.00      5.11      0.00     0.00     0.00  ...   1.79
dm-3            20.51    0.00      5.11      0.00     0.00     0.00  ...   1.79
dm-4          1558.54   28.25     12.20      5.97     0.00     0.00  ...   6.56
dm-5             4.69    2.61      4.58      5.26     0.00     0.00  ...   4.15

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.08    0.00    5.32    9.48    0.13   80.99

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              0.00    3.40      0.00      0.03     0.00     0.60  ...  0.08
dm-0             0.00    2.60      0.00      0.02     0.00     0.00  ...  0.08
dm-1             0.00    1.40      0.00      0.01     0.00     0.00  ...  0.04
sdc           16865.80  284.60    140.04      2.39  1059.60     0.20 ...  92.60
md127         36008.00  564.20    281.33      4.76     0.00     0.00 ...   0.00
sdd           16978.80  279.40    141.11      2.34  1081.40     0.00 ...  99.96
dm-4          36007.80  563.00    281.33      4.73     0.00     0.00 ... 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.07    0.00    5.51   10.52    0.16   79.74

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sdb              0.00    0.80      0.00      0.01     0.00     0.20  ...  0.04
dm-2             0.00    1.00      0.00      0.01     0.00     0.00  ...  0.04
sdc           17709.80  317.80    142.87      2.51   577.40     0.00 ...  93.90
md127         36657.80  661.60    286.41      5.31     0.00     0.00 ...   0.00
sdd           17790.00  343.40    143.69      2.77   599.00     0.00 ...  99.94
dm-4          36657.80  660.20    286.41      5.28     0.00     0.00 ... 100.00

[opc@oracle-19c-fs ~]$ 

No surprises here! Except maybe that /dev/md127 was somewhat underutilised ;) I guess that’s an instrumentation bug/feature. /dev/dm-4 – showing 100% utilisation – belongs to oradata_lv:

[opc@oracle-19c-fs ~]$ ls -l /dev/mapper | egrep dm-4
lrwxrwxrwx. 1 root root       7 Aug 13 09:37 oradata_vg-oradata_lv -> ../dm-4

Extending oradata_vg

Just as with each previous example I’d like to see what happens when I run out of space and have to extend oradata_vg. For this to happen I need a couple more block devices. These have to match the existing ones in size and performance characteristics for the best result. No difference to the LVM-RAID setup I covered in the earlier article.

I created /dev/md128 in the same way as I did for the original RAID device and created a Physical Volume from it, roughly as sketched below. The names of the two additional block devices are assumptions; use whatever your VM presents.
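
# create a second RAID 0 device from the two new, equally sized partitions
sudo mdadm --create /dev/md128 --level=0 \
    --raid-devices=2 /dev/oracleoci/oraclevdf1 /dev/oracleoci/oraclevdg1
# turn the new RAID device into a Physical Volume
sudo pvcreate /dev/md128

oradata_vg looked like this prior to its extension: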

[opc@oracle-19c-fs ~]$ sudo vgs oradata_vg
  VG         #PV #LV #SN Attr   VSize   VFree
  oradata_vg   1   2   0 wz--n- 499.74g    0 

In the next step I extended the Volume Group but only after I ensured I have a proven, working backup of everything. Don’t ever make changes to the storage layer without a backup and a known, tested, proven way to recover from unforeseen issues!

[opc@oracle-19c-fs ~]$ sudo vgextend oradata_vg /dev/md128
  Volume group "oradata_vg" successfully extended
[opc@oracle-19c-fs ~]$ sudo vgs oradata_vg
  VG         #PV #LV #SN Attr   VSize   VFree  
  oradata_vg   2   2   0 wz--n- 999.48g 499.74g

The VG now shows 2 PVs and plenty of free space. So let’s add 80% of the free space to oradata_lv.

[opc@oracle-19c-fs ~]$ sudo lvresize --extents +80%FREE --resizefs /dev/mapper/oradata_vg-oradata_lv
  Size of logical volume oradata_vg/oradata_lv changed from 399.79 GiB (102347 extents) to <799.59 GiB (204695 extents).
  Logical volume oradata_vg/oradata_lv successfully resized.
meta-data=/dev/mapper/oradata_vg-oradata_lv isize=512    agcount=16, agsize=6550144 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=104802304, imaxpct=25
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=51173, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 104802304 to 209607680

The LV changes from its original size …

[opc@oracle-19c-fs ~]$ sudo lvs /dev/mapper/oradata_vg-oradata_lv
  LV         VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert                                                         
  oradata_lv oradata_vg -wi-ao---- 399.79g

to its new size:

[opc@oracle-19c-fs ~]$ sudo lvs /dev/mapper/oradata_vg-oradata_lv
  LV         VG         Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  oradata_lv oradata_vg -wi-ao---- <799.59g                                                    

The same applies to the file system as well:

[opc@oracle-19c-fs ~]$ df -h /u01/oradata
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/oradata_vg-oradata_lv  800G   38G  762G   5% /u01/oradata

Does that change performance?

Based on my experience with LVM-RAID I did not expect a change in performance as my database wasn’t yet at a stage where it required the extra space. My assumption was confirmed by iostat:

[opc@oracle-19c-fs ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs)   13/08/21        _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.98    0.01    1.44    2.35    0.03   95.18

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              2.32    0.99      0.06      0.03     0.02     0.27  ...  0.17
dm-0             2.16    0.61      0.06      0.03     0.00     0.00  ...  0.16
dm-1             0.05    0.62      0.00      0.01     0.00     0.00  ...  0.02
sdb              0.99    0.20      0.05      0.00     0.00     0.02  ...  0.11
dm-2             0.98    0.22      0.04      0.00     0.00     0.00  ...  0.11
sdc           4538.44   73.12     38.69      4.78   190.85     0.23  ... 26.27
md127         9485.50  147.14     78.09     10.13     0.00     0.00  ...  0.00
sdd           4562.89   73.73     38.90      4.79   193.25     0.04  ... 29.88
sde             15.87    0.00      3.95      0.00     0.00     0.00  ...  1.39
dm-3            15.86    0.00      3.95      0.00     0.00     0.00  ...  1.39
dm-4          9473.71  127.63     74.04      5.46     0.00     0.00  ... 27.74
dm-5             3.63    2.02      3.54      4.07     0.00     0.00  ...  3.21
sdf              0.07    0.00      0.00      0.00     0.00     0.01  ...  0.01
sdg              0.08    0.00      0.00      0.00     0.00     0.01  ...  0.00
md128            0.06    0.02      0.00      0.00     0.00     0.00  ...  0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.96    0.00    5.44    8.52    0.08   82.00

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sdc           17652.60  306.80    141.15      2.52   414.40     0.00 ...  88.78
md127         36265.40  608.00    283.35      5.01     0.00     0.00 ...   0.00
sdd           17783.60  301.20    142.17      2.43   411.60     0.00 ... 100.00
dm-4          36267.40  607.00    283.37      4.95     0.00     0.00 ... 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.20    0.00    5.45    8.82    0.14   81.38

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              0.00    1.20      0.00      0.01     0.00     0.00  ...  0.04
dm-0             0.00    1.00      0.00      0.01     0.00     0.00  ...  0.04
dm-1             0.00    0.20      0.00      0.00     0.00     0.00  ...  0.02
sdc           18145.40  332.20    143.99      2.55   284.40     0.00 ...  92.22
md127         36865.20  650.20    288.04      5.00     0.00     0.00 ...   0.00
sdd           18161.20  318.00    144.14      2.45   285.20     0.00 ...  99.98
dm-4          36863.20  649.00    288.02      4.99     0.00     0.00 ...  99.98

[opc@oracle-19c-fs ~]$ 

As long as there aren’t any database files in the “extended” part of the LV, there won’t be a change in performance. As soon as your database spills over to the “new” disks, you should see a benefit from the newly added /dev/md128.

Summary

Just as LVM-RAID does, using software RAID allows you to benefit from striping data across multiple devices. The iostat output is quite clear about the benefit: just look at the figures for /dev/sdc and /dev/sdd and how they accumulate in /dev/md127.

Using software RAID doesn’t come without risk: it’s entirely possible to lose a block device and thus the RAID device. It’s imperative you protect against this scenario in a way that matches your database’s RTO and RPO.

My main problem with the solution as detailed in this post is the lack of a re-balance feature you get with Oracle’s Automatic Storage Management (ASM). It’s still possible to have I/O hotspots after a storage space expansion.

Deploying I/O intensive workloads in the cloud: LVM RAID

I recently blogged about a potential pitfall when deploying the Oracle database on LVM (Logical Volume Manager) with its default allocation policy. I promised a few more posts detailing how to potentially mitigate the effect of linear allocation in LVM. The post was written with the same Oracle 19.12.0 database deployed to Oracle Linux 8.4 with UEK6 on a VM.Standard.E4.Flex cloud system as used for creating the previous article.

If you found this article via a search engine, there are a few more posts about this topic here:

LVM RAID

In this post I’ll demonstrate how you could use LVM RAID level 0. Please don’t implement the steps in this article unless software (LVM-)RAID is an approved solution in your organisation and you are aware of the implications. Please note this article does not concern itself with the durability of block devices in the cloud. In the cloud, you have a lot less control over the block devices you get, so make sure you have appropriate protection methods in place to guarantee your databases’ RTO and RPO.

I found a hint in the SUSE Linux Enterprise Server 15 documentation recommending the use of software RAID over LVM RAID. I’ll leave that here as I don’t have sufficient information to confirm or refute that statement. I didn’t find a comparable warning in the Red Hat 8 documentation.

Implementing LVM RAID 0

The basics of LVM RAID levels are described in lvmraid(7):

lvm(8) RAID is a way to create a Logical Volume (LV) that uses multiple physical devices to improve performance or tolerate device failures. In LVM, the physical devices are Physical Volumes (PVs) in a single Volume Group (VG).

man 7 lvmraid

This is interesting and I hadn’t really been aware of that not-really-new development. Previously I created a software RAID pseudo-device first and used it as a Physical Volume in my LVM configuration. So instead of using a block device’s partition as a PV, I used the device created by mdadm (/dev/md0 for example). Let’s try the new way!

There were no changes required to oradata_vg on my Oracle Linux 8.4 system. The Logical Volume however was created differently. After struggling with the exact syntax for a bit I ended up with this command:

[opc@oracle-19c-fs ~]$ sudo lvcreate --type raid0 --extents 511998 --name oradata_lv \
> --stripesize 1m oradata_vg

Note that RAID 0 offers exactly 0 protection against disk failure. You need to ensure you have other means in place to guarantee your database’s RTO and RPO! It took me a little while to get the syntax for LVM RAID 0 right. The optional parameter --stripesize “specifies the Size of each stripe in kilobytes. This is the amount of data that is written to one device before moving to the next.” I’m unsure if 1 MB is the right value; I probably need to experiment with this a little more.

In the next step I created the XFS file system on top of the oradata_lv and mounted the new file system in /u01/oradata for use with the database.

The output of my lvs command changed quite a bit to what it was before:

[opc@oracle-19c-fs ~]$ sudo lvs --all --options name,copy_percent,devices,attr oradata_vg
  LV                    Cpy%Sync Devices                                       Attr      
  oradata_lv                     oradata_lv_rimage_0(0),oradata_lv_rimage_1(0) rwi-aor---
  [oradata_lv_rimage_0]          /dev/sdc1(0)                                  iwi-aor---
  [oradata_lv_rimage_1]          /dev/sde1(0)                                  iwi-aor---
[opc@oracle-19c-fs ~]$

The above output is specific to LVM RAID 0, higher RAID levels feature *_rmeta images in addition to the *_rimage above. Since I’m not planning on converting from RAID 0 to a higher RAID level I don’t need to concern myself with a meta image in this configuration. See lvmraid(7) for a more thorough description of LVM Sub-Volumes.

Since RAID 0 doesn’t offer any protection from disk failure it doesn’t have to wait for any synchronisation to be completed before making the volume available.

Disk Performance LVM RAID 0

After I finished the restore of my database to the newly created LVM RAID 0 mount point I ran the same Swingbench workload as before, still using the ridiculously small SGA to force physical I/O. As in the previous article the aim wasn’t to see what the configuration is capable of; I wanted to find out more about disk utilisation.

This time iostat showed multiple busy devices:

[opc@oracle-19c-fs ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs)   06/08/21        _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.74    0.00    1.14    5.43    0.01   90.68

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ...  %util
sda              0.27    0.85      0.01      0.02     0.00     0.49  ...   0.05
dm-0             0.27    0.77      0.01      0.01     0.00     0.00  ...   0.04
dm-1             0.00    0.57      0.00      0.01     0.00     0.00  ...   0.01
sdb              0.11    0.11      0.00      0.00     0.00     0.02  ...   0.02
sdc            993.21   14.94     14.18      0.28     0.00     0.02  ...  15.28
dm-2             0.11    0.13      0.00      0.00     0.00     0.00  ...   0.02
sdd              0.25    4.95      0.24      0.35     0.00     0.01  ...   1.63
dm-3             0.25    4.95      0.24      0.35     0.00     0.00  ...   1.63
sde           1013.79  424.90     15.25      3.79     0.00     0.04  ...  25.97
dm-4           991.89   14.54     14.16      0.27     0.00     0.00  ...  15.12
dm-5           992.43   14.65     14.19      0.27     0.00     0.00  ...  15.13
dm-6          1984.31   29.19     28.35      0.54     0.00     0.00  ...  15.25

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.30    0.00    2.93   29.68    0.03   66.06

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sdc           7210.60  119.80     56.23      0.90     0.00     0.00  ... 99.60
sdd              0.00   24.60      0.00      0.10     0.00     0.00  ...  7.60
dm-3             0.00   24.60      0.00      0.10     0.00     0.00  ...  7.60
sde           7204.80  102.60     56.20      0.82     0.00     0.00  ... 99.74
dm-4          7209.20  119.60     56.22      0.90     0.00     0.00  ... 99.60
dm-5          7205.40  102.60     56.21      0.82     0.00     0.00  ... 99.76
dm-6          14414.60  222.20    112.43      1.72     0.00     0.00 ... 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.11    0.00    2.92   27.86    0.01   67.10

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sdc           6771.60  103.60     52.81      0.62     0.00     0.00  ... 99.82
sdd              0.00   61.80      0.00      0.22     0.00     0.00  ... 18.02
dm-3             0.00   62.00      0.00      0.22     0.00     0.00  ... 18.02
sde           6806.20   45.80     53.09      0.49     0.00     0.00  ... 99.94
dm-4          6771.40  103.60     52.80      0.62     0.00     0.00  ... 99.82
dm-5          6806.00   45.80     53.09      0.49     0.00     0.00  ... 99.94
dm-6          13577.40  149.40    105.89      1.10     0.00     0.00 ... 100.00

In the above output, /dev/sdc1 and /dev/sde1 are part of oradata_vg, hosting the database. I still didn’t multiplex control files and online redo logs to ensure all I/O is reported against oradata_vg. At the risk of repeating myself: not multiplexing control file/online redo log members might not be a good idea for serious Oracle deployments.

But what about /dev/dm-{4,5,6}? Why are there suddenly so many Device-Mapper devices in the above iostat output?

[opc@oracle-19c-fs ~]$ ls -l /dev/mapper | grep dm-[4-6]
lrwxrwxrwx. 1 root root       7 Aug  6 08:17 oradata_vg-oradata_lv -> ../dm-6
lrwxrwxrwx. 1 root root       7 Aug  6 08:15 oradata_vg-oradata_lv_rimage_0 -> ../dm-4
lrwxrwxrwx. 1 root root       7 Aug  6 08:15 oradata_vg-oradata_lv_rimage_1 -> ../dm-5
[opc@oracle-19c-fs ~]$ 

These match the previous output of the lvs command: Device-Mapper meta-devices 4, 5 and 6 all belong to oradata_vg. Using the iostat output it should be apparent that more than 1 block device is used by the database; striping seems to be working fine.

What happens to performance when you extend the VG?

Assuming you run out of storage on your volume group, what next? With linear allocation it’s a no brainer: ensure the presence of a backup, then add another Physical Volume to the Volume Group and resize the Logical Volume + file system: capacity is increased immediately.

With LVM RAID 0 the story is a little different. According to the Red Hat 8 documentation it is possible to run lvresize on a striped LV provided the same number of stripes as originally present is added to the Volume Group. On my system I originally used 2 block devices = 2 stripes in oradata_vg. Adding a couple more of the same size and performance characteristics allows me to resize the Logical Volume, after I ensured I had a proven and tested backup of all data depending on oradata_vg. Preparing the Volume Group could look roughly like the sketch below; the device names of the two new disks are assumptions.
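
# turn the two new partitions into Physical Volumes and add them to the Volume Group
sudo pvcreate /dev/sdf1 /dev/sdg1
sudo vgextend oradata_vg /dev/sdf1 /dev/sdg1

With the additional Physical Volumes in place the Logical Volume can be resized: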

[opc@oracle-19c-fs ~]$ sudo lvresize --extents +461996 --resizefs /dev/mapper/oradata_vg-oradata_lv
  Using stripesize of last segment 1.00 MiB                                 
  Size of logical volume oradata_vg/oradata_lv changed from 2.14 TiB (561998 extents) to <3.91 TiB (1023994 extents)
  Logical volume oradata_vg/oradata_lv successfully resized.
meta-data=/dev/mapper/oradata_vg-oradata_lv isize=512    agcount=33, agsize=16382976 blks          
         =                       sectsz=4096  attr=2, projid32bit=1                                   
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1                                 
data     =                       bsize=4096   blocks=524285952, imaxpct=5                          
         =                       sunit=1024   swidth=2048 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1                              
log      =internal log           bsize=4096   blocks=255999, version=2                             
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 524285952 to 575485952
[opc@oracle-19c-fs ~]$

It really has to be the same number of additional PVs, otherwise you get the following error:

[opc@oracle-19c-fs ~]$ sudo vgdisplay oradata_vg | grep Free               
  Free  PE / Size       255999 / <1000.00 GiB

[opc@oracle-19c-fs ~]$ sudo lvresize --extents +255998 --resizefs /dev/mapper/oradata_vg-oradata_lv
  Using stripesize of last segment 1.00 MiB
  Insufficient suitable allocatable extents for logical volume oradata_lv: 255998 more required

Even though I have been able to add additional space (see above) it doesn’t appear to make a difference in performance:

[opc@oracle-19c-fs ~]$ iostat -xmz 5 3
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs)   06/08/21        _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.72    0.00    1.19    6.03    0.01   90.05

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              0.27    0.85      0.01      0.02     0.00     0.48  ...  0.05
dm-0             0.26    0.76      0.01      0.01     0.00     0.00  ...  0.04
dm-1             0.00    0.57      0.00      0.01     0.00     0.00  ...  0.01
sdb              0.11    0.11      0.00      0.00     0.00     0.02  ...  0.02
sdc           1137.78   17.06     15.29      0.29     0.00     0.02  ... 17.33
dm-2             0.11    0.13      0.00      0.00     0.00     0.00  ...  0.02
sdd              0.24    5.62      0.24      0.34     0.00     0.01  ...  1.84
dm-3             0.24    5.63      0.24      0.34     0.00     0.00  ...  1.84
sde           1157.81  417.01     16.34      3.72     0.00     0.04  ... 27.76
dm-4          1136.49   16.67     15.27      0.28     0.00     0.00  ... 17.18
dm-5          1136.97   16.76     15.31      0.29     0.00     0.00  ... 17.19
dm-6          2273.46   33.42     30.58      0.57     0.00     0.00  ... 17.31
sdf              0.00    0.00      0.00      0.00     0.00     0.00  ...  0.00
sdg              0.00    0.00      0.00      0.00     0.00     0.00  ...  0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.11    0.00    3.18   31.19    0.01   64.51

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sda              0.00    2.40      0.00      0.02     0.00     0.40  ...  0.04
dm-0             0.00    2.60      0.00      0.01     0.00     0.00  ...  0.04
dm-1             0.00    0.20      0.00      0.00     0.00     0.00  ...  0.02
sdc           7545.40   32.40     58.83      0.28     0.00     0.00  ... 99.92
sdd              0.00   14.40      0.00      0.06     0.00     0.00  ...  4.16
dm-3             0.00   14.40      0.00      0.06     0.00     0.00  ...  4.16
sde           7519.80   52.60     58.65      0.47     0.00     0.00  ... 99.76
dm-4          7545.20   32.40     58.83      0.28     0.00     0.00  ... 99.90
dm-5          7519.80   52.60     58.65      0.47     0.00     0.00  ... 99.76
dm-6          15065.00   85.00    117.48      0.75     0.00     0.00 ... 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.62    0.00    3.06   30.02    0.01   65.29

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ... %util
sdb              0.00    0.60      0.00      0.00     0.00     0.00  ...  0.02
sdc           7192.20  124.00     56.07      0.82     0.00     0.00  ... 99.78
dm-2             0.00    0.60      0.00      0.00     0.00     0.00  ...  0.02
sdd              0.00   46.60      0.00      0.17     0.00     0.00  ... 13.50
dm-3             0.00   46.60      0.00      0.17     0.00     0.00  ... 13.46
sde           7184.40   79.60     56.03      0.70     0.00     0.00  ... 99.78
dm-4          7193.60  124.00     56.08      0.82     0.00     0.00  ... 99.78
dm-5          7183.60   79.60     56.03      0.70     0.00     0.00  ... 99.78
dm-6          14377.20  203.60    112.11      1.51     0.00     0.00 ... 100.00

[opc@oracle-19c-fs ~]$ 

As you can see only those disks that were originally part of the volume group are in use. Unlike with Oracle’s Automatic Storage Management there is no automatic rebalancing of data.

Summary

LVM RAID 0 is an exciting, relatively new feature offering striping in LVM in a different way than previously possible. Compared to the linear allocation model demonstrated in the previous article it allows proper striping across the disks in the Logical Volume. It should be noted though that RAID 0 (striping) does not offer any data protection. Failure of a single device in the RAID means all data is lost, immediately. Alternatives need to be in place to ensure your database’s RTO and RPO can be met.

Extending capacity of a LVM RAID 0 VG is possible provided you add the same number of devices (with the same size and performance characteristics) to the VG before executing the lvresize command.

The final article in this series cuts LVM out of the equation and focuses purely on Software RAID 0 and how it can be used in Oracle Linux 8.x and before.

Deploying I/O intensive workloads in the cloud: don’t fall for the LVM trap

I have been assessing the “best” storage option for Oracle databases deployed on Infrastructure as a Service (IaaS) VMs quite a few times now. As part of the discussion DBAs often pose the question whether using Linux’s Logical Volume Manager (LVM) is a good choice for data files.

It can be, depending on the circumstances, but there are a few things to consider. This short series of posts tries to highlight some of them. Should I find the time I’ll write another post about using Oracle Automatic Storage Management (ASM) and how it’s different from LVM.

Please note this article does not concern itself with the durability of block devices in the cloud. I’ll leave it at this: in the cloud, you have a lot less control over the block devices you get, so make sure you have appropriate protection methods in place to guarantee your databases’ RTO and RPO.

LVM basics: a quick refresher

The main point of LVM as I see it is to dynamically grow file systems. Instead of creating file systems on partitioned block devices, with LVM you create them on top of Logical Volumes. To work with LVM you need to aggregate one or more so-called Physical Volumes (usually partitions on a block device) into a Volume Group. Once that’s done you can carve out Logical Volumes from the Volume Group. File systems such as ext4 and xfs are created on top of the Logical Volume (LV) just as you would on block devices:

+-------------------------------------------------------------+
|        LV1         |            LV2            |    free    |
+-------------------------------------------------------------+
|                             VG1                             |
+-------------------------------------------------------------+
|        PV1         |        PV2        |         PV3        |
+-------------------------------------------------------------+

Logical Volumes can be resized as long as there is free space in the Volume Group. Should the Volume Group run out of space you can extend it, too, by adding a new Physical Volume to it.

That’s really a high-level overview, there is plenty of LVM specific documentation out there I recommend you have a look at if you want to learn more.

LVM is perfect for installing software

LVM is nothing short of perfect for installing software: running out of space is almost impossible when using LVM. I always use Logical Volumes (LVs) to install the Oracle binaries. Should I run out of space I can simply grow the LV and the file system – job done. Nowadays it is possible to resize the file system online, provided the necessary precautions (BACKUP!) are in place.
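
Growing the binary location typically boils down to a couple of commands. In this sketch the Logical Volume name (orabin_lv) and the additional partition (/dev/sdh1) are assumptions, not taken from my system:

# add a new Physical Volume to the Volume Group holding the Oracle binaries
sudo pvcreate /dev/sdh1
sudo vgextend orabin_vg /dev/sdh1
# grow the Logical Volume and the file system on top of it in one go
sudo lvresize --size +20G --resizefs /dev/orabin_vg/orabin_lv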

LVM for Oracle Data Files

The situation changes when talking about storing Oracle database data files or other I/O intensive workloads. The basic idea for using LVM with Oracle data files is the same: being able to dynamically grow a mount point if the existing storage allocation is insufficient. This is mostly a cosmetic thing; there is nothing wrong with creating a new block device and using it as the source for data files in tablespaces.

If you create a Volume Group (VG) on top of 2 block devices (for the sake of argument) and think you might benefit from a SAME (stripe and mirror everything) approach, you might be disappointed to hear that you won’t (if you use the defaults).

Consider the following example of my lab VM. I installed Oracle 19.12.0 on an OCI VM.Standard.E4.Flex instance running Oracle Linux 8.4/UEK6 with all patches up to August 03, 2021. Please note this is just an example, what I’m demonstrating next is true for systems running on-premises and the cloud.

[opc@oracle-19c-fs ~]$ sudo vgs -o +vg_attr | egrep 'ora|VG'
  VG         #PV #LV #SN Attr   VSize     VFree Attr  
  orabin_vg    1   1   0 wz--n-  <100.00g    0  wz--n-
  oradata_vg   2   1   0 wz--n-     1.95t    0  wz--n-
  orareco_vg   1   1   0 wz--n- <1000.00g    0  wz--n-

As you can see I have 3 VGs:

  • orabin_vg holds the (single instance) binaries
  • oradata_vg will be used for data files
  • orareco_vg … well you got it.

You may have noticed that oradata_vg is the only VG with 2 PVs, which is confirmed in the following output:

[opc@oracle-19c-fs ~]$ sudo pvs -o +pv_attr | egrep 'PV|oradata'
  PV         VG         Fmt  Attr PSize     PFree Attr
  /dev/sdb1  oradata_vg lvm2 a--  <1000.00g    0  a-- 
  /dev/sdd1  oradata_vg lvm2 a--  <1000.00g    0  a-- 
[opc@oracle-19c-fs ~]$ 

As soon as I start running Swingbench against a database with a vastly undersized SGA, forcing physical I/O, iostat draws a rather unpleasant picture:

[opc@oracle-19c-fs ~]$ iostat -xmz 5 
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs) 	04/08/21 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.47    0.01    4.26    6.11    0.05   75.10

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ...  %util
sda              1.63    1.12      0.05      0.03     0.03     0.42  ...   0.15
dm-0             1.59    0.95      0.04      0.02     0.00     0.00  ...   0.13
dm-1             0.02    0.58      0.00      0.01     0.00     0.00  ...   0.02
sdb             85.70 2903.19      0.94     25.58     0.00     0.22  ...  71.99
sdc              0.02    0.00      0.00      0.00     0.00     0.00  ...   0.00
sdd              5.67  287.03      0.02      2.27     0.00     0.84  ...  33.25
sde              0.50    0.12      0.02      0.00     0.00     0.03  ...   0.04
dm-2             0.49    0.15      0.02      0.00     0.00     0.00  ...   0.04
dm-3             0.01    0.00      0.00      0.00     0.00     0.00  ...   0.00
dm-4            91.35 3191.47      0.97     27.85     0.00     0.00  ...  72.02

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          20.25    0.00    6.01    8.44    0.10   65.20

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ...  %util
sda              0.00    0.20      0.00      0.00     0.00     0.00  ...   0.04
dm-0             0.00    0.20      0.00      0.00     0.00     0.00  ...   0.04
sdb              9.40 4108.80      0.08     37.29     0.00     0.20  ...  99.70
sdd              0.00    0.40      0.00      0.01     0.00     0.00  ...   0.08
dm-4             9.40 4109.80      0.08     37.29     0.00     0.00  ...  99.70

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          20.85    0.00    6.00    5.75    0.06   67.34

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ...  %util
sda              0.00    1.40      0.00      0.01     0.00     0.20  ...   0.06
dm-0             0.00    0.20      0.00      0.00     0.00     0.00  ...   0.02
dm-1             0.00    1.40      0.00      0.01     0.00     0.00  ...   0.04
sdb              9.80 4098.40      0.10     36.41     0.00     1.20  ...  99.86
sdd              0.00    1.20      0.00      0.02     0.00     0.00  ...   0.20
dm-4             9.80 4100.60      0.10     36.43     0.00     0.00  ...  99.84

^C

This is of course nowhere near peak performance for these block devices; I'm merely running a low-key workload pumping a few I/O requests here :) You may notice that the only block device really in use is /dev/sdb, the first block device in the Volume Group. /dev/dm-4 is the device-mapper representation of the Logical Volume:

[opc@oracle-19c-fs ~]$ ls -l /dev/oradata_vg/
total 0
lrwxrwxrwx. 1 root root 7 Aug  4 07:47 oradata_lv -> ../dm-4
[opc@oracle-19c-fs ~]$ 

Long story short, my database performs all of its I/O requests on a single block device. From a performance point of view this is usually undesirable. I should add that I did not multiplex online redo logs or the control file into the Fast Recovery Area, creating the perfect storm. Data placement on the Logical Volume also depends on a few other factors; however, you may well see a negative impact on performance if you use LVM for data files and stick to the defaults. You certainly won't see striping across multiple block devices, which is often needed for better performance. For reference, the oradata_vg was created using the following Ansible tasks:

- name: hostconfig | partition block devices (data files)
  parted:
    device: "{{ item }}"
    number: 1
    state: present
    align: optimal
    label: gpt
  loop: "{{ database_storage_block_devs }}"

- name: hostconfig | create an "oradata" volume group
  lvg:
    vg: oradata_vg
    pvs: "{{ database_storage_block_devs | join('1,') }}1"
    state: present

- name: hostconfig | create logical volume on top of oradata_vg
  lvol:
    vg: oradata_vg
    lv: oradata_lv
    size: 100%VG
    state: present
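
A quick way to verify the resulting layout, should you want to check your own systems, is to ask LVM for the segment type; lvcreate (and therefore the lvol module above) defaults to a linear rather than a striped allocation:

sudo lvs -o lv_name,segtype,stripes,devices oradata_vg
# expect segtype "linear" with a single stripe unless striping was requested explicitly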

What’s next?

In the next couple of articles I’ll describe possible workarounds to the issue of LVM usage and how to get more performance out of the system. Stay tuned!

Do I really have to set resource limits for Oracle in systemd unit files?

TL;DR: it is very much required to set resource limits in systemd unit files. If you’d like to learn more about potential directives to put into a unit file, please have a look at the earlier post I wrote for a more thorough explanation. I also wrote a short post about a potential unit file used to start an Oracle single-instance database (without Oracle Restart) with a couple more details.

Failing to set resource limits in systemd unit files can lead to outages.

Large pages are a good example

Without the Limit* directives added to the unit file, the database(s) might refuse to start, and large pages are a good way to demonstrate this. Switching a database to enforce the use of large pages is easy:

SQL> ALTER SYSTEM SET use_large_pages='ONLY' SCOPE=SPFILE;

System altered.

By setting this parameter to ONLY you tell Oracle not to start at all unless the entire System Global Area (SGA) fits into large pages. It also means the database won't start if your configuration is wrong, so better get it right :)

How many large pages do I need?

On my host with a single Oracle database it couldn’t be easier to work out the number of large pages required. The alert.log tells me I need to allocate 2050 large pages if I want the SGA to fit completely. If there are multiple Oracle instances per host, you need to allocate the sum of all required large pages without over-allocating them (more on that later).

On my system, 2050 large pages is comfortably below 70% of the available memory. As a rule of thumb, don't allocate more than 70% of your total available memory as large pages: the O/S, most processes, and the PGA cannot use them. Allocating too many large pages is a sure way to create instability, and it's not easy to fix, so please don't do it.
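
If you'd like a quick sanity check of these numbers, the following back-of-the-envelope calculation works on my system; the alert.log remains the authoritative source though:

grep Hugepagesize /proc/meminfo      # typically 2048 kB on x86_64
# a 4 GB SGA therefore needs roughly 4 * 1024 / 2 = 2048 large pages,
# plus a little overhead, which is how the alert.log arrives at 2050
awk '/MemTotal/ {printf "70 percent of RAM expressed in 2 MB pages: %d\n", $2 * 0.7 / 2048}' /proc/meminfo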

Again, as there are no other Oracle instances on the host, defining the appropriate number of large pages is simple. To make the change permanent I need to either change /etc/sysctl.conf (not a good idea) or create a new file /etc/sysctl.d/99-oracle-large-pages.conf (preferred) containing the following directive:

vm.nr_hugepages = 2052

I was lucky and managed to allocate the necessary number of large pages at runtime, saving myself a reboot. Either way, the output of cat /proc/sys/vm/nr_hugepages needs to show the correct number of large pages available:

$ cat /proc/sys/vm/nr_hugepages
2052
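
For reference, allocating the pages at runtime is a matter of a single sysctl call; whether the kernel can actually satisfy the request depends on how fragmented memory already is:

sudo sysctl -w vm.nr_hugepages=2052    # ask the kernel to reserve the pages right away
cat /proc/sys/vm/nr_hugepages          # verify all of them could actually be allocated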

Next I’m shutting the database down before removing LimitMEMLOCK from the unit file mentioned in my earlier post, and reloading the systemd daemon. As you might have imagined, the database doesn’t come up when the system reboots:

$ ps -ef | egrep "tns|smon"
root          37       2  0 09:30 ?        00:00:00 [netns]
oracle      1011       1  0 19:30 ?        00:00:00 /u01/app/oracle/product/19.0.0/dbhome_1/bin/tnslsnr LISTENER -inherit
vagrant     7829    7805  0 19:31 pts/0    00:00:00 grep -E --color=auto tns|smon

This shouldn’t be a surprise. The default memlock setting is 64kb, not quite enough for my 4 GB SGA:

$ whoami
vagrant
$ egrep "^Limit|Max locked memory" /proc/$$/limits  
Limit                     Soft Limit           Hard Limit           Units     
Max locked memory         65536                65536                bytes   

And sure enough, ${ORACLE_HOME}/rdbms/log/startup.log confirms the suspicion:

/u01/app/oracle/product/19.0.0/dbhome_1/bin/dbstart: Starting up database "ORCL"
Thu Jun 17 19:30:14 UTC 2021


SQL*Plus: Release 19.0.0.0.0 - Production on Thu Jun 17 19:30:15 2021
Version 19.11.0.0.0

Copyright (c) 1982, 2020, Oracle.  All rights reserved.

SQL> Connected to an idle instance.
SQL> ORA-27125: unable to create shared memory segment
Linux-x86_64 Error: 1: Operation not permitted
Additional information: 4641
Additional information: 10485760
SQL> Disconnected

/u01/app/oracle/product/19.0.0/dbhome_1/bin/dbstart: Database instance "ORCL" warm started.

Well it wasn’t really warm started, was it ;) I had 2052 large pages available for the SGA, the problem is the shell limit: oracle isn’t allowed to lock more than the default 64k of memory in large pages since systemd doesn’t use pam_limits(8).

Restoring the LimitMEMLOCK directive fixes that problem. After changing the unit file I rebooted once more and, voila:

$ ps -ef | egrep "tns|smon"
root          37       2  0 19:37 ?        00:00:00 [netns]
oracle      1016       1  0 19:37 ?        00:00:00 /u01/app/oracle/product/19.0.0/dbhome_1/bin/tnslsnr LISTENER -inherit
oracle      2761       1  0 19:37 ?        00:00:00 ora_smon_ORCL
vagrant     4438    4412  0 19:39 pts/0    00:00:00 grep -E --color=auto tns|smon

The database is up and running, and it uses large pages:

$ grep -i ^hugepages_ /proc/meminfo
HugePages_Total:    2052
HugePages_Free:        5
HugePages_Rsvd:        3
HugePages_Surp:        0

Summary

You can use systemd on Oracle Linux 8 to start your databases along with the operating system. This is great in situations where you don't want to or can't use Oracle Restart and/or Oracle Real Application Clusters. Making use of the directives listed in systemd.directives(7) allows the database administrator to set appropriate limits for processes created by systemd, as the latter doesn't read the settings in /etc/security/limits.conf and /etc/security/limits.d/*.

Using systemd to start an Oracle single instance database when booting

I don’t work with systemd frequently enough to remember its syntax and other intricacies so I thought I’d just write down how to start an Oracle 19c single instance database via systemd on Oracle Linux 8.4. Primarily so I can look it up later when I need it, but isn’t that the reason many bloggers write posts?

UPDATE 221105: the same procedure works with Oracle 21c and Oracle Linux 8.6 as well (the most current supported versions of database and O/S at the time of writing)

A little bit of background to this post

There are quite a few blog articles out there describing how to start an Oracle database when a server comes up; however, some of them still use the old SysV init model, or upstart, or a combination of both.

Since Red Hat/Oracle Linux 7, systemd has been the main component controlling the boot process (and a lot more …), so using systemd's native syntax and tooling sounds good to me. Please note that the technique you are about to read about does not apply to Oracle Restart or Oracle Real Application Clusters. Furthermore, I haven't had time to test any combination other than my lab VM running Oracle Linux 8.4 and Oracle database 19c (19.11.0 to be precise).

If you’d like to have some additional background information please have a look at Shell limits, Oracle Single Instance, and systemd in Oracle Linux 8.4. It explains my choice of location for the systemd unit file and some of the configuration directives.

As always, just because something works for me doesn't mean it works for you as well, so test, test, and test some more if you'd like to make use of (parts of) this article.

Creating the systemd unit file

Based on the sources cited in the earlier post (such as the Red Hat 8 documentation and the always-excellent oracle-base.com) I put a unit file together. It lives in /etc/systemd/system/oracle-db.service.

[Unit]
Description=a service to start databases and listener automatically
After=syslog.target network.target

[Service]
LimitNOFILE=1024:65536
LimitNPROC=2047:16384
LimitSTACK=10485760:33554432
LimitMEMLOCK=infinity

Type=forking

User=oracle
Group=oinstall

ExecStart=/u01/app/oracle/product/19.0.0/dbhome_1/bin/dbstart /u01/app/oracle/product/19.0.0/dbhome_1
ExecStop=/u01/app/oracle/product/19.0.0/dbhome_1/bin/dbshut /u01/app/oracle/product/19.0.0/dbhome_1

RemainAfterExit=True
Restart=no

[Install]
WantedBy=multi-user.target

The unit file defines the properties of the oracle-db service. Most notably:

  • It sets the necessary process limits in the Service section
  • The service type indicates a process is forked by this unit file
  • Since dbstart and dbshut exit once their work is completed, I need to tell systemd to consider the service active via the RemainAfterExit directive
  • I also don’t want the service to be restarted automatically
  • The unit file relies on Oracle’s dbstart and dbshut scripts to start and stop the databases on this host

The unit file assumes locally attached storage. Have a look at the article on oracle-base.com for an example of how to define dependencies if your database files live on NFSv3. Similarly, you might want to add dependencies on iSCSI or other network storage.
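
As a rough, untested sketch, such dependencies usually end up as additional ordering directives in the [Unit] section, for example:

[Unit]
Description=a service to start databases and listener automatically
Wants=network-online.target
After=network-online.target remote-fs.target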

The next step is to modify /etc/oratab and change the start flag for my database (there is only one, ORCL):

$ grep $ORACLE_SID /etc/oratab
ORCL:/u01/app/oracle/product/19.0.0/dbhome_1:Y

Only those entries in oratab with a Y as their start flag are considered by dbstart and dbshut.

Hint: the above is super easy to automate with Ansible and jinja2 templates :)
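
For example, one possible (untested here) way of flipping the start flag without a full jinja2 template is Ansible's lineinfile module; the regular expression assumes the entry currently ends in :N, matching my oratab entry above:

- name: hostconfig | enable autostart for ORCL in /etc/oratab
  lineinfile:
    path: /etc/oratab
    regexp: '^(ORCL:/u01/app/oracle/product/19.0.0/dbhome_1):N$'
    line: '\1:Y'
    backrefs: true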

Next you should execute sudo systemctl daemon-reload to make systemd aware of the new unit file. Once systemd knows about it, it should be enabled via systemctl enable oracle-db.service.
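
Putting that into commands (assuming the unit file location and name from above):

sudo systemctl daemon-reload
sudo systemctl enable oracle-db.service
systemctl status oracle-db.service      # optional sanity check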

Starting and stopping the database

After a reboot I found the following entry for oracle-db.service in systemd‘s journal:

Jun 17 18:34:47 server1 systemd[1]: Starting a service to start databases and listener automatically...
Jun 17 18:34:48 server1 dbstart[973]: Processing Database instance "ORCL": log file /u01/app/oracle/product/19.0.0/dbhome_1/rdbms/log/startup.log
Jun 17 18:35:04 server1 systemd[1]: Started a service to start databases and listener automatically.

That seemed to have worked. The unit file also took care of the database when the VM was shut down:

Jun 17 18:39:50 server1 systemd[1]: Stopping a service to start databases and listener automatically...
Jun 17 18:39:50 server1 dbshut[4486]: Processing Database instance "ORCL": log file /u01/app/oracle/product/19.0.0/dbhome_1/rdbms/log/shutdown.log
Jun 17 18:40:12 server1 systemd[1]: oracle-db.service: Succeeded.
Jun 17 18:40:12 server1 systemd[1]: Stopped a service to start databases and listener automatically.

It appears as if my unit file does what I want it to do, hurray.
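
For completeness, the service can also be controlled manually, and the journal is the first place to look should anything go wrong:

sudo systemctl stop oracle-db.service          # invokes dbshut via ExecStop
sudo systemctl start oracle-db.service         # invokes dbstart via ExecStart
journalctl -u oracle-db.service --since today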