Category Archives: Linux

Hybrid Columnar Compression in 12.2 – nice new feature

Oracle 12.2 introduced an interesting optimisation for Hybrid Columnar Compression (HCC). Until 12.2 you had to use direct path inserts into HCC compressed segments for the data to actually be compressed. If you didn’t use a direct path insert you would still succeed in entering data into the segment, but your newly inserted data was not HCC compressed. There is no error message or other warning telling you about this, which can lead to surprises for the unaware.

My friend and colleague Nick pointed out that the official HCC white paper states, somewhat hidden, that this requirement is no longer as strict in 12.2. I haven’t managed to find the exact document Nick referred to, but a quick search using my favourite retrieval engine unearthed the updated version for 18c.

The new feature is easy to miss: you read on page 3 that data must be loaded using data warehouse bulk loading (direct path) techniques. Examples of said techniques follow before the author continues with a discussion of querying HCC data. For me that was the end of the subject… Turns out it wasn’t: the nugget Nick unearthed was on page 4, in the last paragraph before the next section on “HCC Warehouse (Query) Compression”. Quoting literally from said white paper:

Starting with Oracle Database 12.2, HCC automatically compresses new data from SQL INSERT … SELECT statements, without the APPEND hint and array inserts from programmatic interfaces such as PL/SQL and the Oracle Call Interface (OCI).

Aha! Once you know what to look for you find the same information in the 12.2 new features guide, too. Sometimes it’s hard to see the wood for all those trees.

So there is no more need for the append hint with insert-select in 12.2? This would be great news, and it should alleviate some of the issues people run into by unknowingly using HCC without direct path inserts. And it’s good for concurrency! Much as I like white papers, there are times when a white paper has to be checked against the lab to see if it is actually true.

I did exactly this for this blog post.

12.2 Test Case

I am using the SOE.ORDERS table for this purpose, as it holds a fair bit of data. To see whether the new algorithm works I created a new, empty table ORDERS_HCC with the same structure as SOE.ORDERS. In the next step I issue an insert-select statement. If the white paper is correct, the newly inserted data will be compressed using Query High.

SQL (12.2)> show user    
USER is "SOE"
SQL (12.2)> select banner from v$version where rownum < 2;

BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

SQL (12.2)> create table orders_hcc
  2  column store compress for query high
  3  as select * from orders where 1 = 0;
                                                                                                
Table created.

SQL (12.2)> insert into orders_hcc
  2  select * from orders where rownum < 1000001;

1000000 rows created.

SQL (12.2)> commit;

Commit complete.

SQL (12.2)>

Note that I specifically omitted the /*+ append */ hint from the insert statement. Also note that the preceding CTAS statement didn’t select any rows from the source. In Oracle releases up to and including 12.1, the data in ORDERS_HCC would not be compressed at the end of this little code snippet.
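
For contrast, this is roughly what you had to do prior to 12.2 to get data compressed on its way into an HCC segment (a sketch reusing the tables from above):

-- pre-12.2: only a direct path (APPEND) insert compressed data into an HCC segment
insert /*+ append */ into orders_hcc
select * from orders where rownum < 1000001;

-- a direct path insert must be committed before the table can be queried again
commit;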

But how can you prove the white paper is right? Thankfully Oracle is well instrumented, and even more thankfully that instrumentation is not limited to performance. The Oracle-provided dbms_compression package features a function called “get_compression_type()”, which allows you to pass it a ROWID and some other information, and it’ll tell you the block’s compression algorithm.

Remember that you can change the compression algorithm for a given segment many times over. A partition in a range partitioned table can start uncompressed while being actively used, and as data gets colder, compression levels can be increased. Technically speaking the “alter table … column store” command on its own does not change the way data is stored in the segment. Data currently stored will remain in whatever state it was before changing the compression attribute. Only newly inserted data will be compressed according to the changed segment attribute.
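
To illustrate the point (a sketch, not part of the test case): changing the attribute is a quick metadata operation, whereas recompressing rows that are already in the segment requires rebuilding it, for example by moving it.

-- changes the segment attribute: affects newly inserted data only
alter table orders_hcc column store compress for archive high;

-- rebuilding the segment recompresses the existing rows as well
alter table orders_hcc move column store compress for archive high;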

Back to the example: using DBMS_COMPRESSION I would like to find out if my table data is indeed compressed for Query High after my earlier insert command. Let’s see if it is, using the first 10 rows as a rough indication.

SQL (12.2)> select dbms_compression.get_compression_type(user, 'ORDERS_HCC', rowid) ctype
  2  from ORDERS_HCC where rownum < 11;

     CTYPE
----------
	 4
	 4
	 4
	 4
	 4
	 4
	 4
	 4
	 4
	 4

10 rows selected.

Well it would appear as if these are all compressed for Query High (QH). Looking at the package definition I can see that a compression type of 4 indicates Query High.
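
If you don’t fancy remembering the numeric codes, you can translate them on the fly. The mapping below follows the constants in the DBMS_COMPRESSION package specification (a quick sketch):

select decode(dbms_compression.get_compression_type(user, 'ORDERS_HCC', rowid),
               1, 'NOCOMPRESS',
               2, 'ADVANCED',
               4, 'QUERY HIGH',
               8, 'QUERY LOW',
              16, 'ARCHIVE HIGH',
              32, 'ARCHIVE LOW',
                  'OTHER') ctype
  from orders_hcc where rownum < 11;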

So far so good, but I merely checked 10 out of 1 million rows. I’m fairly sure the rest of the table is also HCC compressed for QH, but I want proof. To remove any doubt, I can query the whole table; I’m doing this so you don’t have to. The next query will take forever (certainly more than a minute) to execute, and it is CPU bound, so please don’t do this at work. If you really feel like running a query like this, don’t do it outside the lab. You have been warned :)

SQL (12.2)> with comptypes as (
  2   select rownum rn, dbms_compression.get_compression_type(user, 'ORDERS_HCC', rowid) ctype
  3     from ORDERS_HCC
  4  )
  5      select count(*), ctype from comptypes group by ctype;

  COUNT(*)      CTYPE
---------- ----------
   1000000          4

I always start these types of queries in a screen(1) session to prevent network connection issues from interrupting my long-running task. After some time the query returns with the results you see above. The entire table is compressed with Query High.

Summary

Array-inserts into HCC segments can compress data in Oracle 12.2 even if you don’t specify the append hint. The behaviour for conventional inserts did not change. I am going to post another piece of research containing some more data later this week or next.

New option for configuring multipathing in Oracle Linux 7

Teaching is and remains the best way of picking up new things :) While updating notes for a class I came across an interesting change in the way device-mapper multipathing works in Oracle Linux 7.4.

In the past everyone, including me, used scsi_id to get the WWID of a LUN for use with dm-multipath. This is still the way Oracle documents it for Oracle Linux 7.

The utility was specified as an argument to getuid_callout in /etc/multipath.conf, as shown here:

defaults {
[...]
    getuid_callout        "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
[...]
}

This is how it was done more or less since Red Hat 5 days. The location of scsi_id has changed over time, as expertly described on Oracle Base for example.

I don’t exactly know what happened to getuid_callout, but I can’t find it in the DM Multipath documentation for Red Hat Enterprise Linux 7. It certainly is present in its counterpart for Red Hat Enterprise Linux 6. A quick search using my favourite search engine did not reveal any clues as to what happened. The closest I got to an answer was a page in the SLES 11 SP3 documentation, which indicated that you should use a different parameter than getuid_callout.

Now I was curious! If I continued using getuid_callout, would the service stop working? Or just throw a warning?

The experiment

I quickly spun up a VM on KVM running Oracle Linux 7.4, updated to the latest and greatest version of each package as of April 3rd 2018. The dm-multipath versions at that point were:

[root@server6 ~]# rpm -qa | grep -i multipath
device-mapper-multipath-0.4.9-111.el7_4.2.x86_64
device-mapper-multipath-libs-0.4.9-111.el7_4.2.x86_64
[root@server6 ~]# 

I have presented a small “LUN” to the VM using the virtual SCSI driver. I initially tried the virtio driver but found it did not export the properties back to the VM that I need later in this post. It shouldn’t matter though; all of this is a playground, and I am most interested in the concepts, not the details.
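
For reference, this is roughly how the LUN was presented to the VM; the domain name and image path are specific to my lab and purely illustrative:

# attach a disk image to the VM on the scsi bus (instead of virtio-blk)
[root@kvmhost ~]# virsh attach-disk server6 \
    /var/lib/libvirt/images/server6_asm01.img sda --targetbus scsi --persistent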

You will also notice that I am only showing snippets of the multipath.conf file. The reason for this is simple: if you don’t follow your storage vendor’s documentation detailing how to configure dm-multipath for the storage system in use, you might run into issues. Always consult your storage vendor’s official documentation!

Installing dm-multipath

I didn’t have multipathing installed initially and had to install it first. Here is the transcript:

[root@server6 ~]# yum install device-mapper-multipath

[...]

--> Running transaction check
---> Package boost-system.x86_64 0:1.53.0-27.el7 will be installed
---> Package boost-thread.x86_64 0:1.53.0-27.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================================
 Package                            Arch         Version                 Repository        Size
================================================================================================
Installing:
 device-mapper-multipath            x86_64       0.4.9-111.el7_4.2       ol7_latest       134 k
Installing for dependencies:
 boost-system                       x86_64       1.53.0-27.el7           ol7_latest        39 k
 boost-thread                       x86_64       1.53.0-27.el7           ol7_latest        57 k
 device-mapper-multipath-libs       x86_64       0.4.9-111.el7_4.2       ol7_latest       251 k
 librados2                          x86_64       1:0.94.5-2.el7          ol7_latest       1.7 M
Updating for dependencies:
 kpartx                             x86_64       0.4.9-111.el7_4.2       ol7_latest        72 k

Transaction Summary
================================================================================================
Install  1 Package  (+4 Dependent packages)
Upgrade             ( 1 Dependent package)

Total download size: 2.2 M

[...]

Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : boost-system-1.53.0-27.el7.x86_64                                            1/7 
  Installing : boost-thread-1.53.0-27.el7.x86_64                                            2/7 
  Installing : 1:librados2-0.94.5-2.el7.x86_64                                              3/7 
  Installing : device-mapper-multipath-libs-0.4.9-111.el7_4.2.x86_64                        4/7 
  Updating   : kpartx-0.4.9-111.el7_4.2.x86_64                                              5/7 
  Installing : device-mapper-multipath-0.4.9-111.el7_4.2.x86_64                             6/7 
  Cleanup    : kpartx-0.4.9-111.el7.x86_64                                                  7/7 
  Verifying  : boost-system-1.53.0-27.el7.x86_64                                            1/7 
  Verifying  : 1:librados2-0.94.5-2.el7.x86_64                                              2/7 
  Verifying  : boost-thread-1.53.0-27.el7.x86_64                                            3/7 
  Verifying  : device-mapper-multipath-0.4.9-111.el7_4.2.x86_64                             4/7 
  Verifying  : kpartx-0.4.9-111.el7_4.2.x86_64                                              5/7 
  Verifying  : device-mapper-multipath-libs-0.4.9-111.el7_4.2.x86_64                        6/7 
  Verifying  : kpartx-0.4.9-111.el7.x86_64                                                  7/7 

Installed:
  device-mapper-multipath.x86_64 0:0.4.9-111.el7_4.2                                            

Dependency Installed:
  boost-system.x86_64 0:1.53.0-27.el7                      boost-thread.x86_64 0:1.53.0-27.el7 
  device-mapper-multipath-libs.x86_64 0:0.4.9-111.el7_4.2  librados2.x86_64 1:0.94.5-2.el7     

Dependency Updated:
  kpartx.x86_64 0:0.4.9-111.el7_4.2                                                             

Complete!
[root@server6 ~]# 

First configuration attempt

Next I quickly created a config file based on what I used in the past, making sure I added getuid_callout to the defaults {} section. And sure enough, when starting the service it complained about illegal syntax:

[root@server6 ~]# systemctl status multipathd
● multipathd.service - Device-Mapper Multipath Device Controller
   Loaded: loaded (/usr/lib/systemd/system/multipathd.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-04-03 14:56:19 BST; 1min 28s ago
  Process: 586 ExecStart=/sbin/multipathd (code=exited, status=0/SUCCESS)
  Process: 568 ExecStartPre=/sbin/multipath -A (code=exited, status=0/SUCCESS)
  Process: 564 ExecStartPre=/sbin/modprobe dm-multipath (code=exited, status=0/SUCCESS)
 Main PID: 596 (multipathd)
   CGroup: /system.slice/multipathd.service
           └─596 /sbin/multipathd

Apr 03 14:56:19 server6 systemd[1]: Starting Device-Mapper Multipath Device Controller...
Apr 03 14:56:19 server6 multipath[568]: Apr 03 14:56:19 | /etc/multipath.conf line 24, invalid keyword: getuid_callout
Apr 03 14:56:19 server6 systemd[1]: Started Device-Mapper Multipath Device Controller.
Apr 03 14:56:19 server6 multipathd[596]: /etc/multipath.conf line 24, invalid keyword: getuid_callout

Note the line reading “/etc/multipath.conf line 24, invalid keyword: getuid_callout”. It didn’t seem to prevent the meta-device from being created, but I didn’t like the error message popping up all the time.

After removing the getuid_callout from my multipath.conf file to let the defaults take over, the error went away after a restart of the service:

Apr 03 15:02:54 server6 systemd[1]: Stopped Device-Mapper Multipath Device Controller.

Apr 03 15:10:43 server6 systemd[1]: Starting Device-Mapper Multipath Device Controller...
Apr 03 15:10:43 server6 systemd[1]: Started Device-Mapper Multipath Device Controller.
Apr 03 15:10:43 server6 multipathd[11593]: mpatha: load table [0 4194304 multipath 0 0 1 1 service-time 0 1 1 8:0 1]
Apr 03 15:10:43 server6 multipathd[11593]: mpatha: event checker started
Apr 03 15:10:43 server6 multipathd[11593]: path checkers start up

So I guess I have to rethink my approach to dm-multipath and device name persistence if I want to avoid this error message.

The New Way of doing things

I was once told by a good friend: “when unsure, consult the documentation”. On my Linux system, my first stop is the man page. After a quick “man multipath.conf” I learned that getuid_callout is nowhere to be found. The SUSE note mentioned a new attribute named uid_attribute, which seems to have taken its place:

uid_attribute:
The udev attribute providing a unique path identifier. Default value is ID_SERIAL

I also found this note about the “multipaths” section valuable:

This section defines the multipath topologies. They are indexed by a World Wide Identifier (wwid), which is taken to be the value of the udev attribute given by the uid_attribute keyword.

Aha! The new model is potentially much more flexible than the old one.
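
Translated into configuration, the defaults {} section now looks something like this; since ID_SERIAL is the built-in default anyway, spelling it out is for illustration only:

defaults {
[...]
    # replaces the former getuid_callout; ID_SERIAL is also the default value
    uid_attribute         "ID_SERIAL"
[...]
}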

Controlling aliases with multipath {} sections

I really like to assign alias names to LUNs mapped to my database systems. This way, I know that a device named “/dev/mapper/oracle_asm_data_001” is most likely used as the first ASM disk in disk group DATA. If I see /dev/mapper/mpatha instead, I know precisely nothing about its purpose.

Giving LUNs a name is not hard: you can do this in the multipaths {} section of /etc/multipath.conf, and I really like doing so for the reasons just mentioned. Which options do I have?

The easiest option is quite simply to change getuid_callout to uid_attribute, or to remove it altogether. Thankfully, at least in my case, the output of scsi_id and the ID_SERIAL attribute provided by udev are nearly identical:

[root@server6 ~]# /lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/sda
0QEMU_QEMU_HARDDISK_drive-scsi0-0-0-0
[root@server6 ~]# udevadm info /dev/sda | grep ID_SERIAL
E: ID_SERIAL=0QEMU_QEMU_HARDDISK_drive-scsi0-0-0-0
E: ID_SERIAL_SHORT=drive-scsi0-0-0-0
[root@server6 ~]# 

Note that I included the --replace-whitespace option in the scsi_id command, otherwise the output wouldn’t be identical. This is a concession to my lab; on real hardware I haven’t seen whitespace in WWIDs (which doesn’t mean it can’t happen). Removing the getuid_callout from the defaults {} section was actually enough to create a persistent device name.

[...]
multipaths {

  multipath {
    wwid  "0QEMU_QEMU_HARDDISK_drive-scsi0-0-0-0"
    alias "oracle_asm_data_001"
  }
}

And yes, I get my alias after restarting the multipath daemon:

Apr 03 16:40:49 server6 systemd[1]: Starting Device-Mapper Multipath Device Controller...
Apr 03 16:40:49 server6 systemd[1]: Started Device-Mapper Multipath Device Controller.
Apr 03 16:40:49 server6 multipathd[12271]: oracle_asm_data_001: load table [0 4194304 multipath 0 0 1 1 service-time 0 1 1 8:0 1]
Apr 03 16:40:49 server6 multipathd[12271]: oracle_asm_data_001: event checker started
Apr 03 16:40:49 server6 multipathd[12271]: path checkers start up

[root@server6 ~]# multipath -l
oracle_asm_data_001 (0QEMU_QEMU_HARDDISK_drive-scsi0-0-0-0) dm-3 QEMU    ,QEMU HARDDISK   
size=2.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  `- 0:0:0:0 sda 8:0 active undef unknown
[root@server6 ~]# 

There is of course no second path, and a fair bit of information is missing that you would otherwise get from an enterprise deployment, but you get the idea.

You may still need gcc when patching Oracle Database 12.2

I have previously written about changes in the Oracle 12.2 preinstall RPM and how gcc is no longer part of the dependencies list. As was pointed out to me, this shouldn’t be necessary anymore, according to the 12.2 Linux Database Installation Guide. Check the blue note for a statement indicating that gcc and gcc-c++ aren’t needed for Grid Infrastructure, nor for the RDBMS software.

I have applied patch 27100009 (January 2018 Release Update 12.2.0.1.180116) to my 2-node RAC system in the lab, and found out that this is only partially true :) You may or may not encounter the issue in your environment; see below.

Updating the preinstall RPM

Just to be sure I didn’t miss any changes to the preinstall RPM, I pulled the latest one from Oracle and checked its requirements for gcc.

[oracle@rac122pri1 ~]$ rpm -qi oracle-database-server-12cR2-preinstall-1.0-3.el7.x86_64
Name        : oracle-database-server-12cR2-preinstall
Version     : 1.0
Release     : 3.el7
Architecture: x86_64
Install Date: Fri 19 Jan 2018 03:13:18 PM GMT
Group       : Test Environment/Libraries
Size        : 56561
License     : GPLv2
Signature   : RSA/SHA256, Mon 10 Jul 2017 11:27:07 AM BST, Key ID 72f97b74ec551f03
Source RPM  : oracle-database-server-12cR2-preinstall-1.0-3.el7.src.rpm
Build Date  : Mon 10 Jul 2017 11:26:59 AM BST
Build Host  : x86-ol7-builder-02.us.oracle.com
Relocations : (not relocatable)
Vendor      : Oracle
Summary     : Sets the system for Oracle Database single instance and Real Application
              Cluster install for Oracle Linux 7
Description :
The Oracle Preinstallation RPM package installs software packages and sets system parameters 
required for Oracle Database single instance and Oracle Real Application Clusters installations 
for Oracle Linux Release 7
Files affected: /etc/sysctl.conf, /boot/grub/menu.lst OR /boot/grub2/grub.cfg
Files added: /etc/security/limits.d/oracle-database-server-12cR2-preinstall.conf
[oracle@rac122pri1 ~]$ rpm -q --requires oracle-database-server-12cR2-preinstall-1.0-3.el7.x86_64 | grep -i gcc
libgcc
[oracle@rac122pri1 ~]$ 

So gcc is definitely not among the dependencies, and my minimum install doesn’t pull in gcc (or gcc-c++, for that matter):

[oracle@rac122pri1 ~]$ rpm -qa | grep gcc
libgcc-4.8.5-16.el7_4.1.x86_64
[oracle@rac122pri1 ~]$ 

This is a 2 node cluster, consisting of nodes rac122pri1 and rac122pri2. Both use Oracle Linux 7.4 with UEK4 patched to February 12th 2018. As the names I picked for my cluster nodes suggest, this is a system running RAC 12.2.

IMPORTANT note: just because I hit the issue doesn’t mean everyone will. My lab environment is just that: my lab environment. I merely wanted to point out a way to investigate the problem and how I resolved it in my case. Your mileage may vary.

Applying the January 2018 RU

I always like to be current when it comes to Oracle, and in that spirit decided to apply the January 2018 RU to my cluster after having made sure that I have a working backup. My rule is that there is no patching without backups, ever. And by backups I mean backups of the entire stack ;)

Since this was a new installation in my lab I thought I’d give opatchauto another chance and run with it. What could possibly go wrong?
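
For the record, this is roughly how opatchauto is invoked for a Release Update (run as root; the staging location matches the logs shown further down):

[root@rac122pri1 ~]# export PATH=$PATH:/u01/app/12.2.0.1/grid/OPatch
[root@rac122pri1 ~]# opatchauto apply /u01/stage/27100009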

Node 1 went through without any issues, and opatchauto reported that it had applied the patches it wanted to apply. I don’t have the screen output anymore; however, here’s the summary after the patch completed.

[oracle@rac122pri1 ~]$ opatch lspatches -oh /u01/app/oracle/product/12.2.0.1/dbhome_1
27335416;OCW JAN 2018 RELEASE UPDATE 12.2.0.1.180116 (27335416)
27105253;Database Release Update : 12.2.0.1.180116 (27105253)

OPatch succeeded.

[oracle@rac122pri1 ~]$ opatch lspatches -oh /u01/app/12.2.0.1/grid/
27335416;OCW JAN 2018 RELEASE UPDATE 12.2.0.1.180116 (27335416)
27144050;Tomcat Release Update 12.2.0.1.0(ID:171023.0830) (27144050)
27128906;ACFS Release Update : 12.2.0.1.0 (27128906)
27105253;Database Release Update : 12.2.0.1.180116 (27105253)
26839277;DBWLM RELEASE UPDATE 12.2.0.1.0(ID:170913) (26839277)

OPatch succeeded.
[oracle@rac122pri1 ~]$

Did you notice the “Tomcat Release Update 12.2.0.1.0” in the Grid Home? So much to investigate!

Patching node 2

After the patch completed on node 1 I continued with node 2. At first everything looked quite normal, and opatch did its thing. I have made a habit of applying anything that depends on the network connection being up in a screen(1) session. That protects the work I’m doing from intermittent network trouble, and allows me to keep an eye on how things are progressing.

After a few minutes of work, opatchauto reported this (formatted for readability):

Start applying binary patch on home /u01/app/12.2.0.1/grid
Failed while applying binary patches on home /u01/app/12.2.0.1/grid

Execution of [OPatchAutoBinaryAction] patch action failed, check log for more details. Failures:
Patch Target : rac122pri2->/u01/app/12.2.0.1/grid Type[crs]
Details: [
---------------------------Patching Failed---------------------------------
Command execution failed during patching in home: /u01/app/12.2.0.1/grid, host: rac122pri2.
Command failed:  /u01/app/12.2.0.1/grid/OPatch/opatchauto  apply /u01/stage/27100009 -oh /u01/app/12.2.0.1/grid
    -target_type cluster -binary -invPtrLoc /u01/app/12.2.0.1/grid/oraInst.loc 
    -jre /u01/app/12.2.0.1/grid/OPatch/jre 
    -persistresult /u01/app/12.2.0.1/grid/OPatch/auto/dbsessioninfo/sessionresult_rac122pri2_crs.ser 
    -analyzedresult /u01/app/12.2.0.1/grid/OPatch/auto/dbsessioninfo/sessionresult_analyze_rac122pri2_crs.ser
Command failure output:
==Following patches FAILED in apply:

Patch: /u01/stage/27100009/27335416
Log: /u01/app/12.2.0.1/grid/cfgtoollogs/opatchauto/core/opatch/opatch2018-01-22_09-37-57AM_1.log
Reason: Failed during Patching: oracle.opatch.opatchsdk.OPatchException: Re-link fails on target "install_srvm".

Not exactly what you want to see during patching. But there’s no need to panic (just yet ;)). The error message points me to a log file, and on opening it the cause quickly became apparent:

[22-Jan-2018 09:40:30] [INFO]       [OPSR-TIME] Finished applying patch "27335416" to local system
[22-Jan-2018 09:40:31] [INFO]       [OPSR-TIME] Loading raw inventory
[22-Jan-2018 09:40:31] [INFO]       [OPSR-MEMORY] Loaded all components from inventory. Heap memory in use: 314 (MB)
[22-Jan-2018 09:40:31] [INFO]       [OPSR-MEMORY] Loaded all one offs from inventory. Heap memory in use: 314 (MB)
[22-Jan-2018 09:40:31] [INFO]       [OPSR-TIME] Raw inventory loaded successfully
[22-Jan-2018 09:40:31] [INFO]       [OPSR-TIME] Loading cooked inventory
[22-Jan-2018 09:40:31] [INFO]       [OPSR-MEMORY] : Loading cooked one offs. Heap memory used 314 (MB)
[22-Jan-2018 09:40:32] [INFO]       [OPSR-MEMORY] : Loaded cooked oneoffs. Heap memory used : 363 (MB)
[22-Jan-2018 09:40:32] [INFO]       [OPSR-TIME] Cooked inventory loaded successfully
[22-Jan-2018 09:40:32] [INFO]       OUI-67050:Running make for target install_srvm
[22-Jan-2018 09:40:32] [INFO]       Start invoking 'make' at Mon Jan 22 09:40:32 GMT 2018Mon Jan 22 09:40:32 GMT 2018
[22-Jan-2018 09:40:32] [INFO]       Finish invoking 'make' at Mon Jan 22 09:40:32 GMT 2018
[22-Jan-2018 09:40:32] [WARNING]    OUI-67200:Make failed to invoke "/usr/bin/make -f ins_srvm.mk install_srvm 
                                    ORACLE_HOME=/u01/app/12.2.0.1/grid"....'/bin/sh: /usr/bin/gcc: No such file or directory
                                    make: *** [/u01/app/12.2.0.1/grid/rdbms/lib/config.o] Error 127
                                    '
[22-Jan-2018 09:40:32] [INFO]       Stack Description: java.lang.RuntimeException: /bin/sh: /usr/bin/gcc: No such file or directory
                                    make: *** [/u01/app/12.2.0.1/grid/rdbms/lib/config.o] Error 127
[22-Jan-2018 09:40:32] [INFO]       StackTrace: oracle.opatch.MakeAction.apply(MakeAction.java:534)

This is actually quite simple: for some reason opatch wants to use gcc, and doesn’t find it. After installing gcc and its dependencies I resumed opatchauto, which finished patching successfully. I still haven’t understood why Oracle needed gcc on node 2 when it didn’t require it on node 1.
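
In case you find yourself in the same situation, this is roughly what fixed it for me; the package comes straight from the ol7_latest repository:

# install gcc (plus whatever dependencies yum works out) on the failing node
[root@rac122pri2 ~]# yum install -y gcc

# then resume the interrupted patching session
[root@rac122pri2 ~]# /u01/app/12.2.0.1/grid/OPatch/opatchauto resume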

Just thought I’d pass this on in case you hit the same problem. Happy patching!

OSWatcher, Tracefile Analyzer, and Oracle Restart 12.2

You are about to read the second part of this mini-series on TFA and Oracle 12.2. In the previous article I wrote about TFA and Oracle 12.2 single instance. In this short article I am going to have a look at TFA in a 12.2 Oracle Restart environment before rounding it up with an investigation into a full-blown RAC installation in part 3.

Summarising the first part I can only say that I am very happy that we now get TFA as part of the standard installation. Running it in daemon mode provides some great insights, and even if you did not upgrade the installation to “MOS-TFA”, you have a very fine tool for Oracle troubleshooting at your disposal.

Summary of the environment

My environment is largely the same as last time, except the machine name changed to server4 and I have additional storage for use with ASM.

  • It’s still Oracle Linux 7.4 with UEK4
  • Oracle 12.2.0.1.0 Grid Infrastructure was installed first
  • Followed by an installation of the Oracle 12.2.0.1.0 RDBMS EE software
  • After having created a database in ASM I applied the January 2018 GI RU
  • Finally I upgraded TFA to the current version (as downloaded from My Oracle Support DOC ID 1513912.1)

Not all of these steps are relevant for this article though.

Now where would you find TFA?

The question I had when creating this environment was essentially this: where would I find TFA? Would it be part of the Grid Home, or rather the RDBMS installation?

After having installed the binaries for Grid Infrastructure, I didn’t find a reference to roottfa.sh in the Grid Home’s root.sh script.

[oracle@server4 ~]$ cat /u01/app/oracle/product/12.2.0.1/grid/root.sh
#!/bin/sh
unset WAS_ROOTMACRO_CALL_MADE
. /u01/app/oracle/product/12.2.0.1/grid/install/utl/rootmacro.sh "$@"
. /u01/app/oracle/product/12.2.0.1/grid/install/utl/rootinstall.sh

#
# Root Actions related to network
#
/u01/app/oracle/product/12.2.0.1/grid/network/install/sqlnet/setowner.sh 

#
# Invoke standalone rootadd_rdbms.sh
#
/u01/app/oracle/product/12.2.0.1/grid/rdbms/install/rootadd_rdbms.sh

/u01/app/oracle/product/12.2.0.1/grid/rdbms/install/rootadd_filemap.sh 
/u01/app/oracle/product/12.2.0.1/grid/crs/config/rootconfig.sh $@ 
EXITCODE=$? 
if [ $EXITCODE -ne 0 ]; then
	exit $EXITCODE
fi

[oracle@server4 ~]$ 

I checked whether TFA had been configured in a different place, but there was no TFA-related process running, nor was there a systemd unit file with *tfa* in its name. So it looks like you don’t get the option to install TFA automatically as part of the Grid Infrastructure installation.
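
These are roughly the checks I ran; both came back empty on my system:

# no TFA-related process, and no systemd unit file with *tfa* in its name
[oracle@server4 ~]$ ps -ef | grep -i "[t]fa"
[oracle@server4 ~]$ systemctl list-unit-files | grep -i tfa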

Not finding TFA configured to run out of Grid Infrastructure surprised me, especially since Oracle states in chapter 4.2.2 of the Autonomous Health Framework documentation that TFA is automatically configured as part of the GI configuration, upon invoking root.sh or rootupgrade.sh. According to my testing this is true for RAC, but not for Oracle Restart.

Which kind of makes sense if you consider that in an Oracle Restart setup most users will install the database software anyway.

It doesn’t really matter, read on ;)

Installing the RDBMS software

If TFA isn’t configured automatically with Grid Infrastructure in an Oracle Restart configuration it should come as no surprise that TFA is once more installed and configured from the RDBMS home. Just as with the single instance installation I wrote about previously, you find a reference to roottfa.sh in the RDBMS home’s root.sh:

[oracle@server4 tfa]$ cat /u01/app/oracle/product/12.2.0.1/dbhome_1/root.sh 
#!/bin/sh
unset WAS_ROOTMACRO_CALL_MADE
. /u01/app/oracle/product/12.2.0.1/dbhome_1/install/utl/rootmacro.sh "$@"
. /u01/app/oracle/product/12.2.0.1/dbhome_1/install/utl/rootinstall.sh
/u01/app/oracle/product/12.2.0.1/dbhome_1/suptools/tfa/release/tfa_home/install/roottfa.sh
/u01/app/oracle/product/12.2.0.1/dbhome_1/install/root_schagent.sh

#
# Root Actions related to network
#
/u01/app/oracle/product/12.2.0.1/dbhome_1/network/install/sqlnet/setowner.sh 

#
# Invoke standalone rootadd_rdbms.sh
#
/u01/app/oracle/product/12.2.0.1/dbhome_1/rdbms/install/rootadd_rdbms.sh

/u01/app/oracle/product/12.2.0.1/dbhome_1/rdbms/install/rootadd_filemap.sh 
[oracle@server4 tfa]$ 

The procedure appears to be almost exactly the same as with single instance Oracle. I have again opted to use TFA in daemon mode, just as I did before in my post about Oracle single instance.

TFA thankfully discovers and adds both RDBMS as well as Grid Infrastructure directories (and quite a few Linux related directories, too!). You can see for yourself when running tfactl print directories.

As the final step you might want to consider upgrading TFA to the MOS version. That’s it. Happy troubleshooting!

OSWatcher, Tracefile Analyzer, and Oracle 12.2 single instance

I have previously written about TFA, OSWatcher et al. for Oracle 12.1. Since then a lot of things have happened, and I had an update for 12.2 on my to-do list for far too long. Experience teaches me that references to support notes and official documentation get out of date rather quickly, so as always: if you find anything that has changed, please let me know via the comments section and I’ll update the post.

This is going to be a 3 part mini-series to save you having to go over 42 pages of text … In this first part I’m going to have a look at single instance Oracle. In part 2 I’ll have a look at Oracle Restart environments, and finally in part 3 I’ll finish the series by looking at a 12.2 RAC system.

The environment

I am using a small VM to install Oracle 12.2.0.1.0 (initially) on Oracle Linux 7.4 with kernel UEK4. As always, my EE database binaries go into /u01/app/oracle/product/12.2.0.1/dbhome_1.

The installation/testing (and reporting) of my findings are following this approach:

  • Install the O/S
  • Install Oracle 12.2.0.1.0 EE
  • Create an EE database (not shown here)
  • Patch binaries and database to 12.2.0.1.180116
  • Upgrade TFA to 12.2.1.3.1 as downloaded from My Oracle Support DOC ID 1513912.1

These were the current versions at the time of writing.

Install Oracle 12.2.0.1.0

The first step after the O/S is provisioned is to install the Oracle software, obviously. I have noticed that TFA is part of the Oracle binaries. Towards the end of the installation process, you are prompted to execute root.sh, as normal. On my system, root.sh had the following contents:

      1 #!/bin/sh
      2 unset WAS_ROOTMACRO_CALL_MADE
      3 . /u01/app/oracle/product/12.2.0.1/dbhome_1/install/utl/rootmacro.sh "$@"
      4 . /u01/app/oracle/product/12.2.0.1/dbhome_1/install/utl/rootinstall.sh
      5 /u01/app/oracle/product/12.2.0.1/dbhome_1/suptools/tfa/release/tfa_home/install/roottfa.sh
      6 /u01/app/oracle/product/12.2.0.1/dbhome_1/install/root_schagent.sh
      7 
      8 #
      9 # Root Actions related to network
     10 #
     11 /u01/app/oracle/product/12.2.0.1/dbhome_1/network/install/sqlnet/setowner.sh
     12 
     13 #
     14 # Invoke standalone rootadd_rdbms.sh
     15 #
     16 /u01/app/oracle/product/12.2.0.1/dbhome_1/rdbms/install/rootadd_rdbms.sh
     17 
     18 /u01/app/oracle/product/12.2.0.1/dbhome_1/rdbms/install/rootadd_filemap.sh

After a few variables are set/defined by sourcing in files created during the installation, roottfa.sh is called (see line 5). It allows you to configure TFA to run as a background (daemon) process. I decided to go with that option after consulting chapter 4 of the 12.2 Autonomous Health Framework documentation and reading about the advantages of running TFA as a daemon. This may or may not be the right way to run TFA for you; the documentation is really good and helps you decide. Here is the transcript of my root.sh execution:

[root@server5 ~]#  /u01/app/oracle/product/12.2.0.1/dbhome_1/root.sh
Performing root user operation.

The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME=  /u01/app/oracle/product/12.2.0.1/dbhome_1

Enter the full pathname of the local bin directory: [/usr/local/bin]:
   Copying dbhome to /usr/local/bin ...
   Copying oraenv to /usr/local/bin ...
   Copying coraenv to /usr/local/bin ...


Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Do you want to setup Oracle Trace File Analyzer (TFA) now ? yes|[no] :
yes
Installing Oracle Trace File Analyzer (TFA).
Log File: /u01/app/oracle/product/12.2.0.1/dbhome_1/install/root_server5_2018-01-22_17-21-41-005116657.log
Finished installing Oracle Trace File Analyzer (TFA)

Once that message is shown, TFA is configured and controlled via a systemd unit file:

[root@server5 ~]# systemctl cat oracle-tfa
# /etc/systemd/system/oracle-tfa.service
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# Oracle TFA startup
#
[Unit]
Description=Oracle Trace File Analyzer
After=syslog.target
[Service]
ExecStart=/etc/init.d/init.tfa run >/dev/null 2>&1 </dev/null
Type=simple
Restart=always

[Install]
WantedBy=multi-user.target graphical.target

The service is enabled and running.
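
A quick check confirms it:

[root@server5 ~]# systemctl is-enabled oracle-tfa
enabled
[root@server5 ~]# systemctl is-active oracle-tfa
active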

After the completion of roottfa.sh, TFA resides in $ORACLE_BASE/tfa and its subdirectories. This is documented in chapter 4.2.3 of the 12.2 Autonomous Health Framework manual and has an interesting implication: if you set your environment using oraenv, you might find that you get errors invoking tfactl, such as these on my VM. I used a “minimum install” for my operating system and quite deliberately didn’t add any extra perl modules in my kickstart file. As a result, invoking tfactl after setting my environment with oraenv reveals that my system’s perl installation is missing some modules:

[oracle@server5 ~]$ . oraenv
ORACLE_SID = [NCDB] ? NCDB
The Oracle base remains unchanged with value /u01/app/oracle
[oracle@server5 ~]$ tfactl status
Can't locate Digest/MD5.pm in @INC (@INC contains: 
/usr/local/lib64/perl5 /usr/local/share/perl5 
/usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl
/usr/lib64/perl5 /usr/share/perl5 . /u01/app/oracle/tfa/server5/tfa_home/bin 
/u01/app/oracle/tfa/server5/tfa_home/bin/common /u01/app/oracle/tfa/server5/tfa_home/bin/modules 
/u01/app/oracle/tfa/server5/tfa_home/bin/common/exceptions) 
at /u01/app/oracle/tfa/server5/tfa_home/bin/common/tfactlshare.pm line 7628.
BEGIN failed--compilation aborted at /u01/app/oracle/tfa/server5/tfa_home/bin/common/tfactlshare.pm line 7628.
Compilation failed in require at /u01/app/oracle/tfa/server5/tfa_home/bin/tfactl.pl line 223.
BEGIN failed--compilation aborted at /u01/app/oracle/tfa/server5/tfa_home/bin/tfactl.pl line 223.

The output has been changed for readability (originally I was missing Data::Dumper as well). After studying the documentation (still section 4.2.3 in the aforementioned document), it turned out to be a user mistake. As I said before, once TFA is configured using roottfa.sh as part of the root.sh execution, it runs in daemon mode and, crucially, is available from $ORACLE_BASE/tfa. I found that location referred to in /etc/init.d/init.tfa as well. When I simply typed “tfactl” into my terminal window, I invoked a different “tfactl”. There is a lot more to be said about this, and I will try to do so in a different post.
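
A check along these lines shows which copy you are about to invoke (paths from my VM; yours may differ):

# the tfactl found via $PATH after oraenv lives in the RDBMS home...
[oracle@server5 ~]$ which tfactl
/u01/app/oracle/product/12.2.0.1/dbhome_1/bin/tfactl

# ...while the daemon-mode installation resides under $ORACLE_BASE/tfa
[oracle@server5 ~]$ ls /u01/app/oracle/tfa/bin/tfactl
/u01/app/oracle/tfa/bin/tfactl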

NB: the same section 4.2.3 in the documentation states that even should you not run TFA in daemon mode, you can still make use of “user mode TFA” in $ORACLE_HOME, although there are certain restrictions. I haven’t pursued that route.

Anyway, after switching to the location where TFA is actually installed ($ORACLE_BASE/tfa), all is well. It seems that running roottfa.sh creates a new “Oracle” perl:

[root@server5 ~]# find /u01/app/oracle/tfa -name perl
/u01/app/oracle/tfa/server5/tfa_home/perl
/u01/app/oracle/tfa/server5/tfa_home/perl/bin/perl

I found Digest::MD5 and Data::Dumper in /u01/app/oracle/tfa/server5/tfa_home/perl/lib/5.22.0/x86_64-linux-thread-multi.

So let’s try and get the status of the current installation from $ORACLE_BASE/tfa:

[oracle@server5 ~]$ /u01/app/oracle/tfa/bin/tfactl status

Access Denied: Only TFA Admin can run this command

Nearly there: the perl modules are no longer reported missing, and the “Oracle” perl installation appears to be in use now. But what about this error message? I read in section 4.2.4, “Securing Access to Oracle Trace File Analyzer” (still referring to the Autonomous Health Framework manual), that access to TFA is restricted. However, the RDBMS owner should have been granted access automatically.

Using the commands shown in the manual I checked permissions and it turns out that the oracle user is configured to have access to TFA.

[root@server5 ~]# /u01/app/oracle/tfa/bin/tfactl access lsusers
.---------------------------------.
|       TFA Users in server5      |
+-----------+-----------+---------+
| User Name | User Type | Status  |
+-----------+-----------+---------+
| oracle    | USER      | Allowed |
'-----------+-----------+---------'

In fact, I can query TFA’s status using the “print status” command as oracle (/u01/app/oracle/tfa/bin/tfactl print status). I compared the output of “tfactl -help” between oracle and root, and there are more options available when running as root. This might explain the above error.

What is the status now?

TFA is now set up and working, but using the base release:

[root@server5 ~]# /u01/app/oracle/tfa/bin/tfactl status

.------------------------------------------------------------------------------------------------.
| Host    | Status of TFA | PID   | Port  | Version    | Build ID             | Inventory Status |
+---------+---------------+-------+-------+------------+----------------------+------------------+
| server5 | RUNNING       | 18786 | 41482 | 12.2.1.0.0 | 12210020161122170355 | COMPLETE         |
'---------+---------------+-------+-------+------------+----------------------+------------------'

It should probably be patched to something more recent. I’ll try that in two ways: first by applying the January 2018 RU to see if the version changes. Since the standard deployment doesn’t come with OSWatcher, which I’m particularly interested in, I’ll download and apply TFA 12.2.1.3.1 next. As with all patching, I need to make sure that I have working backups which I’m comfortable restoring should anything go badly wrong.

Status after applying the January RU

A combination of opatch/datapatch later, my system is on the latest RU patchlevel:

[oracle@server5 OPatch]$ opatch lspatches
27105253;Database Release Update : 12.2.0.1.180116 (27105253)

OPatch succeeded.

However, this did not have an effect on the version of TFA in $ORACLE_BASE:

[root@server5 ~]# systemctl restart oracle-tfa
[root@server5 ~]# /u01/app/oracle/tfa/bin/tfactl status

.------------------------------------------------------------------------------------------------.
| Host    | Status of TFA | PID   | Port  | Version    | Build ID             | Inventory Status |
+---------+---------------+-------+-------+------------+----------------------+------------------+
| server5 | RUNNING       | 24042 | 37226 | 12.2.1.0.0 | 12210020161122170355 | COMPLETE         |
'---------+---------------+-------+-------+------------+----------------------+------------------'

Not quite what I expected after reading the docs: the installation of the latest RU should have updated TFA as well. But maybe I got something wrong on my end. The RU readme did not have any reference to TFA that I could find.

Yet it doesn’t matter: I wanted to have all the great support tools anyway (and they aren’t shipped with “stock TFA”), so it was time to install the latest version from MOS.

Upgrading TFA using 12.2.1.3.1 (MOS)

The patch is quite simple and well documented. If TFA is up and running in daemon mode, as in my example, the patching tool recognises that fact and patches the installation in place.
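
The upgrade itself boils down to unzipping the download and running the installer script that ships with it (file names as per the version I downloaded; check the readme of your copy):

[root@server5 stage]# unzip TFA-LINUX_v12.2.1.3.1.zip
[root@server5 stage]# ./installTFA-LINUX

After a couple of minutes on my VM, I had a new version: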

[root@server5 stage]# /u01/app/oracle/tfa/bin/tfactl status

.------------------------------------------------------------------------------------------------.
| Host    | Status of TFA | PID   | Port  | Version    | Build ID             | Inventory Status |
+---------+---------------+-------+-------+------------+----------------------+------------------+
| server5 | RUNNING       | 28105 | 39100 | 12.2.1.3.1 | 12213120171215143839 | COMPLETE         |
'---------+---------------+-------+-------+------------+----------------------+------------------'

The MOS version comes with lots of useful tools as well:

[oracle@server5 stage]$ /u01/app/oracle/tfa/bin/tfactl toolstatus

.------------------------------------------------------------------.
|                   TOOLS STATUS - HOST : server5                  |
+----------------------+--------------+--------------+-------------+
| Tool Type            | Tool         | Version      | Status      |
+----------------------+--------------+--------------+-------------+
| Development Tools    | orachk       |   12.2.0.1.3 | DEPLOYED    |
|                      | oratop       |       14.1.2 | DEPLOYED    |
+----------------------+--------------+--------------+-------------+
| Support Tools Bundle | darda        | 2.10.0.R6036 | DEPLOYED    |
|                      | oswbb        |        8.1.2 | RUNNING     |
|                      | prw          | 12.1.13.11.4 | NOT RUNNING |
+----------------------+--------------+--------------+-------------+
| TFA Utilities        | alertsummary |   12.2.1.1.0 | DEPLOYED    |
|                      | calog        |   12.2.0.1.0 | DEPLOYED    |
|                      | changes      |   12.2.1.1.0 | DEPLOYED    |
|                      | dbglevel     |   12.2.1.1.0 | DEPLOYED    |
|                      | events       |   12.2.1.1.0 | DEPLOYED    |
|                      | grep         |   12.2.1.1.0 | DEPLOYED    |
|                      | history      |   12.2.1.1.0 | DEPLOYED    |
|                      | ls           |   12.2.1.1.0 | DEPLOYED    |
|                      | managelogs   |   12.2.1.1.0 | DEPLOYED    |
|                      | menu         |   12.2.1.1.0 | DEPLOYED    |
|                      | param        |   12.2.1.1.0 | DEPLOYED    |
|                      | ps           |   12.2.1.1.0 | DEPLOYED    |
|                      | pstack       |   12.2.1.1.0 | DEPLOYED    |
|                      | summary      |   12.2.1.1.0 | DEPLOYED    |
|                      | tail         |   12.2.1.1.0 | DEPLOYED    |
|                      | triage       |   12.2.1.1.0 | DEPLOYED    |
|                      | vi           |   12.2.1.1.0 | DEPLOYED    |
'----------------------+--------------+--------------+-------------'

Note :-
  DEPLOYED    : Installed and Available - To be configured or run interactively.
  NOT RUNNING : Configured and Available - Currently turned off interactively.
  RUNNING     : Configured and Available.

[oracle@server5 stage]$ 

Since I care a lot about OSWatcher, I was very pleased to see it running.

[oracle@server5 stage]$ ps -ef | grep -i osw
oracle   28344     1  0 10:58 ?        00:00:00 /bin/sh ./OSWatcher.sh 30 48 NONE /u01/app/oracle/tfa/repository/suptools/server5/oswbb/oracle/archive
oracle   28934 28344  0 10:58 ?        00:00:00 /bin/sh ./OSWatcherFM.sh 48 /u01/app/oracle/tfa/repository/suptools/server5/oswbb/oracle/archive
oracle   30662 27252  0 11:01 pts/4    00:00:00 grep --color=auto -i osw
[oracle@server5 stage]$ 

Kindly refer to the documentation for more information about TFA. It’s quite a collection of tools, and it helps you in so many ways…

Little things worth knowing: redo transport in Data Guard 12.2 part 2

In the first part of this article I looked at a number of views and some netstat output to show how redo is transported from the primary database to its standby systems. Long story short: TT02 (“async ORL multi”) was found sending redo to CDB3 asynchronously, whilst NSS2 (“sync”) transferred redo to the synchronised target, CDB2. Contrary to what v$dataguard_process wanted me to believe, it really wasn’t LGWR sending redo over the network.

In this little article I would like to show you how the standby databases CDB2 and CDB3 receive redo and how you can map this back to the primary database, closing the loop.

How does CDB2 receive redo?

First I’m looking at CDB2, which receives redo in synchronous mode. I should be able to narrow down the communication between primary and standby by referring to the LGWR and TT02 process IDs in the CLIENT_PID column on the standby. As a quick reminder: 14986 is the PID of LGWR, 15029 belongs to NSS2, and 15252 maps to TT02. Let’s try:

SQL> select db_unique_name,database_role from v$database;

DB_UNIQUE_NAME                 DATABASE_ROLE
------------------------------ ----------------
CDB2                           PHYSICAL STANDBY

SQL> select name, pid, role, action, client_pid, client_role, sequence#, block#, dest_id
  2  from v$dataguard_process where client_pid in (14986,15029,15252);

NAME  PID        ROLE                    ACTION       CLIENT_PID CLIENT_ROLE       SEQUENCE#     BLOCK#    DEST_ID
----- ---------- ----------------------- ------------ ---------- ---------------- ---------- ---------- ----------
rfs   5517       RFS sync                IDLE              14986 log writer               95     174565          0

So it would appear the process responsible for shipping redo to “SYNC” destinations is the log writer. Actually, the output of v$dataguard_process is quite interesting, which is why I’m adding it here for the sake of completeness:

SQL> select name, pid, role, action, client_pid, client_role, sequence#, block#, dest_id
  2  from v$dataguard_process order by action;

NAME  PID        ROLE                    ACTION       CLIENT_PID CLIENT_ROLE       SEQUENCE#     BLOCK#    DEST_ID
----- ---------- ----------------------- ------------ ---------- ---------------- ---------- ---------- ----------
rfs   5517       RFS sync                IDLE              14986 log writer               95     229446          0
rfs   5350       RFS archive             IDLE              15224 archive gap               0          0          1
rfs   5346       RFS ping                IDLE              15124 gap manager              95          0          0
rfs   5354       RFS archive             IDLE              15233 archive gap               0          0          1
MRP0  5348       managed recovery        IDLE                  0 none                      0          0          0
rfs   5352       RFS archive             IDLE              15240 archive gap               0          0          1
LGWR  5207       log writer              IDLE                  0 none                      0          0          0
TT01  5259       redo transport timer    IDLE                  0 none                      0          0          0
TT00  5255       gap manager             IDLE                  0 none                      0          0          0
ARC1  5263       archive redo            IDLE                  0 none                      0          0          0
ARC2  5265       archive redo            IDLE                  0 none                      0          0          0
ARC3  5267       archive redo            IDLE                  0 none                      0          0          0
TMON  5242       redo transport monitor  IDLE                  0 none                      0          0          0
ARC0  5257       archive local           IDLE                  0 none                      0          0          0

14 rows selected.

This view tells me that LGWR is attached to the RFS sync process. But by now I know better, and the picture is similar to what I saw on the primary. Looking a little closer, I can see that, strictly speaking, the RFS process is connected to NSS2:

[root@server2 ~]# netstat -tunalp | egrep 'Active|Proto|5517'
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp6       0      0 192.168.100.22:1521     192.168.100.21:15515    ESTABLISHED 5517/oracleCDB2   

I am repeating the values for the primary here so you don’t have to go back to the previous article:

[root@server1 ~]# ps -ef | egrep 'lgwr|lg0|nss'
oracle   14986     1  0 09:58 ?        00:01:19 ora_lgwr_CDB1
oracle   14990     1  0 09:58 ?        00:00:00 ora_lg00_CDB1
oracle   14994     1  0 09:58 ?        00:00:00 ora_lg01_CDB1
oracle   15029     1  0 09:58 ?        00:00:43 ora_nss2_CDB1

[root@server1 ~]# netstat -vW --numeric-ports -ee -p -la --tcp | egrep '14986|14990|14994|15029'
tcp        0      0 server1.example.com:15515 server2.example.com:1521 ESTABLISHED oracle     16400768   15029/ora_nss2_CDB1

You will notice that port 15515 on server1 belongs to ora_nss2_CDB1.

Going back a little to v$dataguard_process: it seems a bit weird to see MRP0 as “idle” when the database is in managed recovery mode using real-time apply. Trying something else, I queried v$managed_standby, and voilà: MRP0 is said to be applying logs:

SQL> select process,pid,status,client_process,client_pid,sequence#,block# 
  2  from v$managed_standby order by status;

PROCESS   PID        STATUS       CLIENT_P CLIENT_PID                                SEQUENCE#     BLOCK#
--------- ---------- ------------ -------- ---------------------------------------- ---------- ----------
DGRD      5255       ALLOCATED    N/A      N/A                                               0          0
DGRD      5259       ALLOCATED    N/A      N/A                                               0          0
MRP0      5348       APPLYING_LOG N/A      N/A                                              95     246625
ARCH      5257       CLOSING      ARCH     5257                                             92       4096
ARCH      5263       CLOSING      ARCH     5263                                             93       2048
ARCH      5265       CLOSING      ARCH     5265                                             94     342016
ARCH      5267       CONNECTED    ARCH     5267                                              0          0
RFS       5350       IDLE         UNKNOWN  15224                                             0          0
RFS       5354       IDLE         UNKNOWN  15233                                             0          0
RFS       5352       IDLE         UNKNOWN  15240                                             0          0
RFS       5517       IDLE         LGWR     14986                                            95     246626
RFS       5346       IDLE         Archival 15124                                             0          0

12 rows selected.

I guess that’s true, as the system is in constant recovery using the standby logfiles.

And what about CDB3?

On the other hand CDB3, to which redo is shipped asynchronously, lists TT02 as its counterpart:

SQL> select db_unique_name,database_role from v$database;

DB_UNIQUE_NAME                 DATABASE_ROLE
------------------------------ ----------------
CDB3                           PHYSICAL STANDBY

SQL> select name, pid, role, action, client_pid, client_role, sequence#, block#, dest_id 
  2   from v$dataguard_process order by action;

NAME  PID                      ROLE                    ACTION       CLIENT_PID CLIENT_ROLE       SEQUENCE#     BLOCK#    DEST_ID
----- ------------------------ ----------------------- ------------ ---------- ---------------- ---------- ---------- ----------
rfs   14803                    RFS ping                IDLE              15124 gap manager              96          0          0
rfs   14809                    RFS archive             IDLE              15233 archive gap               0          0          0
rfs   14811                    RFS async               IDLE              15252 async ORL multi          96      34674          0
MRP0  11825                    managed recovery        IDLE                  0 none                      0          0          0
ARC0  11776                    archive local           IDLE                  0 none                      0          0          0
ARC2  11786                    archive redo            IDLE                  0 none                      0          0          0
TT00  11774                    gap manager             IDLE                  0 none                      0          0          0
ARC3  11788                    archive redo            IDLE                  0 none                      0          0          0
TMON  11706                    redo transport monitor  IDLE                  0 none                      0          0          0
LGWR  11676                    log writer              IDLE                  0 none                      0          0          0
ARC1  11784                    archive redo            IDLE                  0 none                      0          0          0
TT01  11778                    redo transport timer    IDLE                  0 none                      0          0          0

12 rows selected.

Unlike the case with CDB2, the local RFS process is indeed connected to TT02 on server1:

[root@server2 ~]# netstat -vW --numeric-ports -ee -p -la --tcp | egrep '^Active|^Proto|14811'
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name    
tcp6       0      0 server2.example.com:1521 server1.example.com:12936 ESTABLISHED oracle     4658198    14811/oracleCDB3    

… and on server1:

[root@server1 ~]# netstat -vW --numeric-ports -ee -p -la --tcp | egrep '^Active|^Proto|12936'
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address            State       User       Inode      PID/Program name    
tcp        0      0 server1.example.com:12936 server2.example.com:1521 ESTABLISHED oracle     15820538   15252/ora_tt02_CDB1 

This should be enough evidence, I rest my case :)

Summary

So I guess that answers my question: on my small VMs in the lab, NSSn is responsible for shipping redo to targets in “SYNC” mode, while the redo transport server processes (TTnn) ship redo to destinations defined for asynchronous transport.

Little things worth knowing: scheduler jobs and Instance Caging

While updating material for a training class about Database Resource Management I briefly wondered whether Instance Caging applies to scheduler jobs as well. The obvious answer is “yes”, but I wanted proof. I hope I found it in the end, and wanted to share how I got there.

The test environment

My system is based on the shiny new AMD Ryzen 7 1700X Eight-Core Processor, which shows up as 1s8c16t. I really like it! My 12.2.0.1 database named DEMO is patched to August 2017 (12.2.0.1.170814 to be precise). It’s difficult to test resource management, and specifically Instance Caging, on a dual-core laptop, so this machine should help!

Test setup

Apart from the database workload I am soon going to launch, the system is otherwise idle. I set cpu_count to 4 and made sure I had a resource manager plan active:

SQL> select name, value from v$parameter 
  2   where name in ('cpu_count','resource_manager_plan');

NAME                           VALUE                                             
------------------------------ ------------------------------------------------
cpu_count                      4                                                 
resource_manager_plan          SCHEDULER[0x4ABF]:DEFAULT_MAINTENANCE_PLAN        

For this test it doesn’t matter that my resource manager plan is associated with the maintenance window; all I needed was some resource manager plan. In production systems the situation is most often somewhat different, and proper resource management is crucial.
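
For completeness: enabling Instance Caging manually takes nothing more than the two statements below (a sketch; pick a plan that suits your workload, and note that both parameters are dynamic):

-- instance caging = a cpu_count limit plus any active resource manager plan
alter system set cpu_count = 4;
alter system set resource_manager_plan = 'DEFAULT_PLAN';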

The scheduler needs something to work with, and I opted for a stored procedure that needlessly burns CPU. Like this one:

SQL> create or replace procedure martin.burn_proc(
  2   pi_num_iterations number)
  3  as
  4   i number;
  5  begin
  6  for j in 1..pi_num_iterations loop
  7   i := dbms_random.random;
  8  end loop;
  9 end;
 10 /

Procedure created.

Now all I had to do was to create an anonymous block of PL/SQL scheduling a number of jobs. This little piece of code does just that:

SQL> !cat burn_sched.sql 
select to_char(sysdate,'dd.mm.yyyy hh24:mi:ss') now from dual;

show user

accept num_jobs number prompt 'how many jobs to schedule? '

prompt creating a number of scheduler jobs
declare
        i number;
begin
        for i in 1..&num_jobs loop
                dbms_scheduler.create_job(
                        job_name => 'martin.burn_proc_job' || i, 
                        job_type => 'PLSQL_BLOCK', 
                        job_action => 'begin burn_proc(1000000000); end;', 
                        start_date => systimestamp, 
                        enabled => true);
        end loop;
        dbms_lock.sleep(5);
end;
/

prompt checking for running jobs
select count(*) from dba_scheduler_running_jobs where owner = 'MARTIN';

Now it’s time to run the code!

SQL> @burn_sched

NOW
-------------------
05.11.2017 09:38:49

USER is "MARTIN"
how many jobs to schedule? 3
creating a number of scheduler jobs
old   4:        for i in 1..&num_jobs loop
new   4:        for i in 1..         3 loop

PL/SQL procedure successfully completed.

checking for running jobs

  COUNT(*)
----------
         3

With this first test I wanted to see what happens when the number of jobs stays below the value I defined for cpu_count. Using Active Session History (ASH) as the source of performance data is probably the easiest way to analyse what happened.

Careful: using Active Session History (ASH) requires a license!

Provided you are licensed to use ASH, a query such as the following provides valuable insights:

SQL> select count(*), event, session_state, session_type, username
  2  from v$active_session_history ash left join dba_users u on (ash.user_id = u.user_id)
  3  where module = 'DBMS_SCHEDULER'
  4  and sample_time > to_date('05.11.2017 09:38:49','dd.mm.yyyy hh24:mi:ss')
  5  group by event, session_state, session_type, username;

  COUNT(*) EVENT                SESSION SESSION_TY USERNAME
---------- -------------------- ------- ---------- ----------
        89                      ON CPU  FOREGROUND MARTIN

As you can see, none of the samples in ASH show any waits for CPU. Let’s increase the session count to a value that exceeds my cpu_count of 4:

SQL> @burn_sched

NOW
-------------------
05.11.2017 09:51:45

USER is "MARTIN"
how many jobs to schedule? 5
creating a number of scheduler jobs
old   4:        for i in 1..&num_jobs loop
new   4:        for i in 1..         5 loop

PL/SQL procedure successfully completed.

checking for running jobs

  COUNT(*)
----------
         5

After these 5 jobs completed, I checked ASH again to see if there was a difference:

SQL> select count(*), event, session_state, session_type, username
  2  from v$active_session_history ash left join dba_users u on (ash.user_id = u.user_id)
  3  where module = 'DBMS_SCHEDULER'
  4  and sample_time > to_date('05.11.2017 09:51:45','dd.mm.yyyy hh24:mi:ss')
  5  group by event, session_state, session_type, username;

  COUNT(*) EVENT                          SESSION SESSION_TY USERNAME
---------- ------------------------------ ------- ---------- ----------
       153                                ON CPU  FOREGROUND MARTIN
        66 resmgr:cpu quantum             WAITING FOREGROUND MARTIN

And indeed, there are a number of Resource Manager CPU waits! To me this is proof enough that scheduler jobs also fall into the category of workload that can be caged.