Martins Blog

Trying to explain complex things in simple terms

Duplicate from the standby instead of the primary in 12c

Posted by Martin Bach on April 2, 2014

This post is about 12c and an active database duplication for a standby that I did in my lab environment. Although I first encountered the problem on 12c, there is a chance you run into a similar situation with earlier releases too.

I would normally use ASM for all my databases to make my life easier, but this time I had to be mindful of the available memory on the laptop, which at 8 GB is not plentiful. So I went with a file system setup instead. After the initial preparations I was ready to launch the one-liner on the standby database:

RMAN> duplicate target database for standby from active database;

I had started the duplication process on the standby. It worked away happily for a few moments, only to come to an abrupt halt with the error message below.

...
executing command: SET NEWNAME

executing command: SET NEWNAME

Starting backup at 01-APR-14
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile copy
input datafile file number=00012 name=/u01/oradata/CDB1/valid/sysaux01.dbf
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 04/01/2014 20:34:42
ORA-17628: Oracle error 19505 returned by remote Oracle server
ORA-19505: failed to identify file ""
channel ORA_DISK_1 disabled, job failed on it will be run on another channel
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 04/01/2014 20:34:42
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script

RMAN-03009: failure of backup command on ORA_DISK_1 channel at 04/01/2014 20:34:42
ORA-17628: Oracle error 19505 returned by remote Oracle server
ORA-19505: failed to identify file ""

Huh? File 12 is part of a PDB named VALID. It was definitely available on the file system on the source, and I even had a backup on disk for it… I first thought it had to do with the PDBs all being closed on the source database, but that was not the case: after opening the PDBs read-write I still had the same error. Strange. I thought about debugging and tracing RMAN but then decided to first try the duplication from the primary. Which was a good idea, because the primary actually gave me a more meaningful error message:

Starting restore at 19.03.2014 10:17:08
using channel ORA_AUX_DISK_1

channel ORA_AUX_DISK_1: starting datafile backup set restore
channel ORA_AUX_DISK_1: using network backup set from service cdb1
channel ORA_AUX_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_AUX_DISK_1: restoring datafile 00004 to /u01/oradata/STDBY1/undotbs01.dbf
channel ORA_AUX_DISK_1: restore complete, elapsed time: 00:00:25
channel ORA_AUX_DISK_1: starting datafile backup set restore
channel ORA_AUX_DISK_1: using network backup set from service cdb1
channel ORA_AUX_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_AUX_DISK_1: restoring datafile 00005 to /u01/oradata/STDBY1/pdbseed/system01.dbf
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 03/19/2014 10:17:26
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script
ORA-19849: error while reading backup piece from service cdb1
ORA-19504: failed to create file "/u01/oradata/STDBY1/pdbseed/system01.dbf"
ORA-27040: file create error, unable to create file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 1

So I had forgotten to create the needed directories on the standby. For consistency I had created a subdirectory per PDB on the primary, but forgot to do the same on the standby host. You can see the layout here:

RMAN> report schema;

Report of database schema for database with db_unique_name CDB1

List of Permanent Datafiles
===========================
File Size(MB) Tablespace           RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1    790      SYSTEM               ***     /u01/oradata/CDB1/system01.dbf
3    750      SYSAUX               ***     /u01/oradata/CDB1/sysaux01.dbf
4    90       UNDOTBS1             ***     /u01/oradata/CDB1/undotbs01.dbf
5    250      PDB$SEED:SYSTEM      ***     /u01/oradata/CDB1/pdbseed/system01.dbf
6    5        USERS                ***     /u01/oradata/CDB1/users01.dbf
7    590      PDB$SEED:SYSAUX      ***     /u01/oradata/CDB1/pdbseed/sysaux01.dbf
11   270      VALID:SYSTEM         ***     /u01/oradata/CDB1/valid/system01.dbf
12   610      VALID:SYSAUX         ***     /u01/oradata/CDB1/valid/sysaux01.dbf
13   260      RCAT:SYSTEM          ***     /u01/oradata/CDB1/rcat/system01.dbf
14   610      RCAT:SYSAUX          ***     /u01/oradata/CDB1/rcat/sysaux01.dbf
15   100      RCAT:RMAN_CATALOG    ***     /u01/oradata/CDB1/rcat/rman_catalog01.dbf
16   50       VALID:USERS          ***     /u01/oradata/CDB1/valid/users01.dbf
17   250      DGTEST:SYSTEM        ***     /u01/oradata/CDB1/dgtest/system01.dbf
18   590      DGTEST:SYSAUX        ***     /u01/oradata/CDB1/dgtest/sysaux01.dbf

List of Temporary Files
=======================
File Size(MB) Tablespace           Maxsize(MB) Tempfile Name
---- -------- -------------------- ----------- --------------------
1    60       TEMP                 32767       /u01/oradata/CDB1/temp01.dbf
2    20       PDB$SEED:TEMP        32767       /u01/oradata/CDB1/pdbseed/pdbseed_temp01.dbf
4    20       VALID:TEMP           32767       /u01/oradata/CDB1/valid/pdbseed_temp01.dbf
5    20       RCAT:TEMP            32767       /u01/oradata/CDB1/rcat/pdbseed_temp01.dbf
6    20       DGTEST:TEMP          32767       /u01/oradata/CDB1/dgtest/dgtest_temp01.dbf
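
Creating the missing directories on the standby host is a one-liner; this is a minimal sketch assuming the standby mirrors the primary's layout under the STDBY1 db_unique_name, as in the log output above:

# create one subdirectory per PDB on the standby host before retrying the duplicate
mkdir -p /u01/oradata/STDBY1/{pdbseed,valid,rcat,dgtest}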

All right, after creating the directories I retried the duplication and it worked:

...
datafile 4 switched to datafile copy
input datafile copy RECID=5 STAMP=843771196 file name=/u01/oradata/STDBY1/undotbs01.dbf

datafile 5 switched to datafile copy
input datafile copy RECID=6 STAMP=843771198 file name=/u01/oradata/STDBY1/pdbseed/system01.dbf
datafile 6 switched to datafile copy
input datafile copy RECID=7 STAMP=843771198 file name=/u01/oradata/STDBY1/users01.dbf
datafile 7 switched to datafile copy
input datafile copy RECID=8 STAMP=843771199 file name=/u01/oradata/STDBY1/pdbseed/sysaux01.dbf
datafile 11 switched to datafile copy
input datafile copy RECID=9 STAMP=843771199 file name=/u01/oradata/STDBY1/valid/system01.dbf
datafile 12 switched to datafile copy
input datafile copy RECID=10 STAMP=843771199 file name=/u01/oradata/STDBY1/valid/sysaux01.dbf
datafile 13 switched to datafile copy
input datafile copy RECID=11 STAMP=843771199 file name=/u01/oradata/STDBY1/rcat/system01.dbf
datafile 14 switched to datafile copy
input datafile copy RECID=12 STAMP=843771200 file name=/u01/oradata/STDBY1/rcat/sysaux01.dbf
datafile 15 switched to datafile copy
input datafile copy RECID=13 STAMP=843771200 file name=/u01/oradata/STDBY1/rcat/rman_catalog01.dbf
datafile 16 switched to datafile copy
input datafile copy RECID=14 STAMP=843771200 file name=/u01/oradata/STDBY1/valid/users01.dbf
datafile 17 switched to datafile copy
input datafile copy RECID=15 STAMP=843771200 file name=/u01/oradata/STDBY1/dgtest/system01.dbf
datafile 18 switched to datafile copy
input datafile copy RECID=16 STAMP=843771200 file name=/u01/oradata/STDBY1/dgtest/sysaux01.dbf
Finished Duplicate Db at 19.03.2014 10:28:42

Summary

Although this is not a big problem, and probably known to everyone besides me, the error message on the standby shouldn't be the scary 'failed to identify file ""' followed by the process aborting. Something more meaningful would have been nice.

Posted in 12c Release 1, Linux, RMAN

Recovering a standby over the network in 12c

Posted by Martin Bach on March 17, 2014

Another one of the cool but underrated features in 12c is the possibility to recover a physical standby over the network with one line in RMAN.

Why would you need to perform this activity? Assume someone really clever created a segment with the "nologging" attribute and the database was not in force logging mode. This operation cannot be replicated by redo apply on the standby, and you are bound to have a problem. Or, in my case: I had the standby shut down in my lab environment (intentionally) and created a few PDBs on my primary. For some reason I lost an archived redo log. This would of course not happen in a production environment, but my lab VM is limited when it comes to space and I may have moved my backup to a USB disk that I didn't bring along.
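
If you suspect the nologging scenario, a quick check on the primary for unrecoverable changes, and the setting that prevents them, might look like this sketch:

-- data files touched by nologging operations (run on the primary)
select name, unrecoverable_change#, unrecoverable_time
from   v$datafile
where  unrecoverable_change# > 0;

-- and the setting that prevents the scenario in the first place
alter database force logging;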

So, long story short: you either recreate the standby (a rather blunt-force approach to the problem) or recover the standby from the SCN where the problem originated. I was hoping the new 12c "recover database from service" command in RMAN could help. It did, but not as I expected at first.

The problem with my lab

My environment is Oracle Linux 6.4 x86-64 on Virtual Box with 4 GB RAM. Oracle is 12.1.0.1.2. CDB1 is primary, CDB2 is my standby database. Both are on the same VM (I only have 8 GB total memory). CDB2 complained about a missing archived redo log:

SQL> select * from v$archive_gap;

   THREAD# LOW_SEQUENCE# HIGH_SEQUENCE#     CON_ID
---------- ------------- -------------- ----------
         1            27             27          1
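
For context, here is a minimal sketch of that command, assuming redo apply on the standby is stopped first and that cdb1 is a service resolving to the primary; the full walkthrough follows in the rest of this entry:

DGMGRL> edit database 'CDB2' set state='APPLY-OFF';

RMAN> recover database from service cdb1 using compressed backupset;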

Read the rest of this entry »

Posted in 12c Release 1, Linux, RMAN

Interesting GNS anomaly in 12.1.0.1.2 RAC

Posted by Martin Bach on March 13, 2014

I encountered an interesting anomaly with my 12.1.0.1.2 RAC cluster based on the Grid Naming System. I have written about the setup here.

No need to go back to that post; instead let me give you some background in case you are not familiar with the Grid Naming System in RAC. Most RAC installations use a Domain Name Server (DNS) to resolve names to IP addresses. With GNS, name resolution is delegated from the main DNS server to a so-called subdomain. In my case my DNS server (bind version 9) is configured on aux.example.com (192.168.100.2), and it delegates name resolution for *.gns.example.com (my cluster) to the Grid Naming System. The GNS address in use is 192.168.100.37 and must be defined in the corporate DNS (a sketch of the delegation follows after the list below). The required IP addresses have to be supplied by a DHCP server in the same range as the public network. My public network is on 192.168.100.0/24, the private networks are on 192.168.101.0/24 and 192.168.102.0/24. Which component resolves what? Here is a summary:

  • The GNS VIP is resolved by DNS. It's not in the subdomain according to the documentation. I am using gns-vip.example.com (192.168.100.37) for it.
  • The node public IPs are resolved via GNS. Oracle uses $(hostname) to assign the name; the IP address is fixed.
  • The node VIPs are resolved via GNS. The assigned name is $(hostname)-vip and the addresses are assigned by DHCP.
  • The private interconnect interfaces are resolved by GNS, usually as $(hostname)-priv. Addresses are either static or assigned by DHCP.
  • The SCAN VIPs are resolved via GNS and assigned by DHCP. And this is exactly where my problem was.
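
As promised above, the delegation in the corporate DNS might look like this; the records below are a sketch in bind 9 zone-file syntax using the names and addresses from my setup, not a copy of my actual configuration:

; in the example.com zone on aux.example.com: glue record plus subdomain delegation to GNS
gns-vip.example.com.   IN A    192.168.100.37
gns.example.com.       IN NS   gns-vip.example.com.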

The Problem

All of a sudden I had connection problems with my tomcat server connecting to the database. The address I used in tnsnames.ora was rac12scan.gns.example.com and hadn't changed at all. It simply didn't point to a valid IP half of the time. When I pinged the SCAN I was surprised to see that it listed addresses that weren't in the output of "srvctl config scan".

And sure enough, an nslookup against the SCAN returned 7 (!) addresses. There was obviously a discrepancy between what Oracle thought the SCAN was (3 IP addresses) and what GNS reported. I tried various cures, including restarting the DNS and DHCP servers (even moving the lease file on the DHCP server to a temporary location), and variations of restarting GNS and the whole cluster (crsctl stop cluster -all; crsctl start cluster -all). Nothing helped.
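
A quick way to spot the discrepancy is to compare what Clusterware has registered with what GNS actually hands out; the commands below are a sketch using the names from this cluster:

# what Clusterware has registered for the SCAN
srvctl config scan

# what GNS actually returns to clients (repeat a few times)
nslookup rac12scan.gns.example.com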

Troubleshooting

As you'd expect, name resolution only works with GNS started. If you try to "nslookup rac12scan.gns.example.com" while GNS is down, you get the NXDOMAIN result code, which does not really help. The output of "srvctl status gns" with the -list and -detail options is useful in troubleshooting the problem. I also ran "cluvfy comp gns" but that came back clean. Nevertheless, the system insisted that the SCAN was made up of 7 IP addresses, and only 4 of them had SCAN VIPs. No surprise I couldn't connect half the time. Now the question is of course: why does the GNS daemon grab so many IP addresses? So I looked at gnsd.log in $GRID_HOME/log/$(hostname -s)/gns to see what happens. Here is an excerpt from the log file:

2014-03-11 16:37:30.305: [     GNS][1009862208]main::clsgnocrOpenKeyInternal: (:CLSGN01305:) #0 of 3: OCR error 4 (PROC-4: The cluster registry key to be operated on does not exist.) is not expected 32 - no retry will be done.
2014-03-11 16:37:30.305: [     GNS][1009862208]main::clsgnocrOpenKeyInternal: (:CLSGN01307:) Pass #0: open of "SYSTEM.GNS.SUBDOMAIN.gns|dexample|dcom.rac12gns-scan1-vip.SRV" failed with error 4 Last: PROC-4: The cluster registry key to be operated on does not exist. (4) - unable to retry.
2014-03-11 16:37:30.306: [     GNS][1009862208]main::clsgnocrOpenKeyInternal: (:CLSGN01305:) #0 of 3: OCR error 4 (PROC-4: The cluster registry key to be operated on does not exist.) is not expected 32 - no retry will be done.
2014-03-11 16:37:30.306: [     GNS][1009862208]main::clsgnocrOpenKeyInternal: (:CLSGN01307:) Pass #0: open of "SYSTEM.GNS.SUBDOMAIN.gns|dexample|dcom.rac12gns-scan1-vip.TXT" failed with error 4 Last: PROC-4: The cluster registry key to be operated on does not exist. (4) - unable to retry.
2014-03-11 16:37:30.306: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12node1-vip" Type: A 192.168.100.44 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.308: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12node2-vip" Type: A 192.168.100.39 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.308: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.34 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.36 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.35 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.40 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.42 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.41 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12scan" Type: A 192.168.100.31 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.309: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12gns-scan2-vip" Type: A 192.168.100.41 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.310: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12gns-scan2-vip" Type: A 192.168.100.34 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.310: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12gns-scan3-vip" Type: A 192.168.100.40 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.310: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12gns-scan3-vip" Type: A 192.168.100.38 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.310: [     GNS][1009862208]main::clsgndhaRegisterAddresses: Name: "rac12gns-scan1-vip" Type: A 192.168.100.31 Unique: TRUE Flags: ALLOCATED
2014-03-11 16:37:30.310: [   CLSNS][1009862208]Resolve::clsns_SetTraceLevel:trace level set to 1.
2014-03-11 16:37:30.310: [   CLSNS][1009862208]Resolve::clsns_SetTraceLevel:trace level set to 1.
2014-03-11 16:37:30.311: [   CLSNS][1009862208]Resolver #0::clsns_SetTraceLevel:trace level set to 1.
2014-03-11 16:37:30.311: [   CLSNS][1009862208]Resolver #0::clsns_SetTraceLevel:trace level set to 1.
2014-03-11 16:37:30.311: [   CLSNS][1009862208]Resolver #1::clsns_SetTraceLevel:trace level set to 1.

You can see the address assignment here, and it corresponds to the DHCP leases I saw in the DHCP server host's /var/log/messages file. Notice that rac12scan grabs 7 IP addresses. This is surprising; the documentation normally states just 3 IP addresses for use with the SCAN. As you can further see, the SCAN VIPs resolve only to a subset of these: 192.168.100.36, for example, does not have a corresponding SCAN VIP, and neither do some others. Also, rac12gns-scan2-vip and rac12gns-scan3-vip appear twice. So if rac12scan now resolves to 192.168.100.36 you will get a TNS error in the form "Destination host does not exist". Annoying! But why does it do so? The messages before (PROC-4) gave me a clue.

The information about GNS VIPs seems to be stored in the OCR; an ocrdump confirmed the findings. The question remains: why does Oracle use more than 3 IP addresses for the SCAN? I can only speculate that the DHCP addresses obtained are added to the OCR, and the existing ones don't seem to be cleared out or updated correctly.
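
If you want to inspect those entries yourself, an ocrdump along these lines will do; treat it as a sketch, run as root from the Grid Infrastructure home:

# dump the OCR to a text file and look for the GNS-related keys
ocrdump /tmp/ocr_dump.txt
grep -i "SYSTEM.GNS" /tmp/ocr_dump.txt | head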

In an effort to solve the problem I dropped the SCAN listeners and the SCAN VIPs and recreated them. But this seemed to confuse the cluster even more, and name resolution didn't work at all. I tried both the subdomain delegation (nslookup rac12scan.gns.example.com) and querying GNS directly (nslookup rac12scan.gns.example.com 192.168.100.37); neither resolved anything. A restart of Clusterware didn't change the situation, so I decided to reboot the cluster nodes hoping to bring everything back to normal (yes, I should know better, this isn't Windows). After waiting for the servers to come back up there was no change. Now the system was so confused that it didn't come up with a SCAN at all. Trying to restore the OCR failed, and I couldn't even run rootcrs.pl -deconfig -force in preparation for a reinstallation of Clusterware. So I trashed the system. I don't believe there is a place for GNS in my lab after this experience.

Thanks to Marcin Przepiorowski (@pioro) who helped during the investigation.

References:

Grid Infrastructure Installation Guide 12c:

  • 1.1.3 Oracle Grid Infrastructure Network Checklist
  • 5.5.1 About Oracle Grid Infrastructure Name Resolution Options
  • 5.11 Grid Naming Service Standard Cluster Configuration Example

Posted in 12c Release 1, KVM, Linux

Runtime Load Balancing Advisory in RAC 12c-addendum

Posted by Martin Bach on February 19, 2014

A reader asked an interesting question yesterday with regard to the previous post on the subject: where did you get your service metrics from when you queried v$servicemetric, the PDB or CDB$ROOT?

I queried the PDB, but this morning I repeated the test to make sure the results are consistent, and they are. This is definitely something you'd hope for: you should not get different results from the same v$ view for a given CON_ID depending on the container in which you execute your query.

During testing I noticed something interesting though. I queried gv$servicemetric but did not limit the result to the service I wanted to test with (FCFSRV). Here is the query against gv$servicemetric while the system was idle.

select inst_id,begin_time,end_time,service_name,cpupercall,callspersec,goodness,delta,con_id
from gv$servicemetric
where con_id = 3 and service_name  in ('FCFSRV','demopdb')
order by service_name, inst_id;

   INST_ID BEGIN_TIME          END_TIME            SERVICE_NAME         CPUPERCALL CALLSPERSEC   GOODNESS      DELTA     CON_ID
---------- ------------------- ------------------- -------------------- ---------- ----------- ---------- ---------- ----------
         1 19.02.2014 10:07:16 19.02.2014 10:07:21 FCFSRV                        0           0        100        100          3
         1 19.02.2014 10:05:57 19.02.2014 10:06:56 FCFSRV                        0           0        100        100          3
         2 19.02.2014 10:05:55 19.02.2014 10:06:54 FCFSRV                        0           0        100        100          3
         2 19.02.2014 10:07:19 19.02.2014 10:07:24 FCFSRV                        0           0        100        100          3
         1 19.02.2014 10:07:16 19.02.2014 10:07:21 demopdb                       0           0          0          1          3
         1 19.02.2014 10:05:57 19.02.2014 10:06:56 demopdb                       0           0          0          1          3
         2 19.02.2014 10:07:19 19.02.2014 10:07:24 demopdb                       0           0          0          1          3
         2 19.02.2014 10:05:55 19.02.2014 10:06:54 demopdb                       0           0          0          1          3

8 rows selected.

Read the rest of this entry »

Posted in 12c Release 1, Linux, RAC

Runtime Load Balancing Advisory in RAC 12c

Posted by Martin Bach on February 18, 2014

This is a follow-up to yesterday's post about services in the new 12c database architecture. After having worked out everything I needed to know about TAF and RAC 12c in CDBs, I wanted to check how FCF works with PDBs. While investigating I found out that the Runtime Load Balancing Advisory does not seem to work as expected in some cases. But I'm getting ahead of myself. First of all, here is my test case:

  • Oracle Linux 6.4 x86-64
  • Grid Infrastructure 12.1.0.1.2, i.e. January 2014 PSU applied
  • RDBMS 12.1.0.1.2, likewise patched with the January PSU
  • A CDB with just 1 PDB for this purpose, named DEMOPDB
  • Service FCFSRV is used

Read the rest of this entry »

Posted in 12c Release 1, Linux, RAC

RAC and Pluggable Databases

Posted by Martin Bach on February 17, 2014

In preparation for the OUGN Spring Seminar, and to finally fulfil at least a part of my promise from July, I was getting ready to research RAC, PDBs and services for my demos. It turned out to be a lot more interesting than I first assumed.

RAC and Multi-Tenancy

So the first attempt to really look at how this works started with my 2 node cluster, where I created a RAC database: RAC12C, administrator managed, with instances RAC12C1 and RAC12C2. The database is registered in Clusterware. Clusterware and RDBMS are patched to the January PSU, i.e. 12.1.0.1.2.

The second step was to create a PDB for testing. It's named DEMOPDB and available on both my instances by design. By the way, PDBs do not start automatically; it is your responsibility to start them when the database starts. I used a startup trigger for this in the Consolidation Book, have a look at it to see the example. There are other ways available, as I found out.
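
For illustration, here is a minimal sketch of such a trigger; 12.1.0.1 has no PDB "save state" mechanism yet, and the version printed in the book may well differ:

create or replace trigger open_all_pdbs
after startup on database
begin
  -- open every PDB as soon as the instance starts
  execute immediate 'alter pluggable database all open';
end;
/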

As soon as the PDB is opened (in all Oracle deployment types, not limited to a RAC instance), a new service with the same name as the PDB is automatically started. As it turned out, using that service is the most reliable way to connect to the PDB:

SQL> connect user/password@single-client-access-name/pdbname

This works really well. For single instance, substitute your hostname for the SCAN. Now if you would like to implement some of the more interesting features (TAF, FAN+FCF, Application Continuity) you can create an additional service. The srvctl syntax has changed: Oracle now uses long parameter names (-service instead of -s), which doesn't really respect the UNIX/GNU convention for naming parameters (short parameter: single dash, long parameter name: double dash), but that's how it is.
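
To illustrate the long parameter names, creating and starting such an additional service could look like the sketch below; the service and PDB names come from these posts, while the preferred instance list is an assumption:

srvctl add service -db RAC12C -service FCFSRV -pdb DEMOPDB -preferred RAC12C1,RAC12C2
srvctl start service -db RAC12C -service FCFSRV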

Read the rest of this entry »

Posted in 12c Release 1, Linux, RAC

Public appearances first half of 2014

Posted by Martin Bach on February 11, 2014

I had the great fortune to have many of my abstracts accepted for upcoming conferences and other public appearances; the first ones for this year are these:

OUG Ireland

The first conference I'll be attending is in Dublin, for OUG Ireland 2014 on March 11. I'm going to help out with RAC Attack and also present on Technologies for Developing Highly Available Applications in RAC 12c at 14:50 for about 45 minutes. What can you expect? Here is the official abstract; I am looking forward to the presentation and the live demos.

Oracle offers a wealth of technologies to make your application more resilient to instance failure. This talk presents an overview of what is commonly considered in application development based on the Java programming language. You will learn about these technologies from a DBA’s point of view. Options will include TAF as a baseline followed by FCF and finally Application Continuity (incl. demos)

Exadata Workshop

At Enkitec we regularly organise Exadata workshops for those who are either interested in the technology or already using it and want to get more out of it. These workshops are scheduled quarterly, and I have already held two of them in London. The next one for Europe will take place on Thursday, March 27, again in Oracle's London City office. If you are interested in the Exadata platform and would like to discuss your deployment with other users or just bounce ideas off other people, then by all means come along (it's free of charge)! The official link to this workshop is here:

http://www.enkitec.com/education/sessions/exadata_workshop_london_27-mar-2014

I have updated the material to include the latest Oracle 12c cell software features as well as support for the new RDBMS architecture (Container Databases), and will update delegates on how this works and how Exadata supports Oracle's consolidation platform. There are many cool new features in 12c worth exploring and knowing about.

OUGN Spring Seminar

Next up is the Norwegian User Group’s Spring Meeting on April 3-4 where I have managed to secure two slots:

  • Oracle 12c features that didn’t make the marketing top 10
  • Advanced RAC programming features

The agenda is online and can be found here: http://ougnvarseminar2014.sched.org/

There are far too many great speakers to just link to my talks, feel free to browse (and register :)

For me the meeting starts on Wednesday the 2nd, when I'm flying to Oslo. This is without a shadow of a doubt a very good conference, and I can only recommend it, exactly like the others on this page. It is very well organised and definitely worth attending. There are very few other conferences where you can mingle with the presenters and chat about all sorts of things the way you can there. I have attended twice already and enjoyed it every time.

[Image: OUGN 2014 speaker badge]

Enkitec E4

Enkitec's Extreme Exadata Expo is one of the conferences with the highest technical content I can imagine. I attended last year and was really impressed by the speakers and the content. The influence of "Big Data" is clearly visible in the agenda, and like last year you can find both Exadata and "Big Data" talks.

[Image: Enkitec Extreme Exadata Expo banner]

I am quite chuffed I made it onto the agenda of the event! I'm going to present "Think Exa" together with my colleague Frits Hoogland.

In this presentation we are going to talk about setting your mindset on Exadata and overcoming pre-ASM and pre-RAC habits. Unlearning things we have been taught (or have been doing) for many years takes time, and we are hoping to give the audience an overview of what can be achieved with the platform.

While I am happy to have secured a speaker's slot, I am even more excited to attend the conference. Looking at the list of speakers who have already confirmed their attendance, it is going to be a blast. I am looking forward to seeing my colleagues from the United States, but also to catching up with new and old friends.

Looking forward to seeing you at one of these events!

Posted in Public Appearances

Data Guard transport lag in OEM 12c

Posted by Martin Bach on January 30, 2014

I have come across this phenomenon a couple of times now so I thought it was worth writing up.

Consider a scenario where you get an alert because your standby database has an apply lag. The alert is generated by OEM, and when you log in and check, there is indeed an apply lag. Even worse, the apply lag increases with every refresh of the page! I tagged this as an 11.2 problem but it's definitely not specific to that release.

Here is a screenshot of this misery:

[Screenshot: apply lag in OEM]

Now there are of course a number of possible causes:

  • There is a lag
  • You are not using Real Time Apply

The first one is easy to check: look at the redo generation rate on the primary database to see if it’s any different. Maybe you are currently loading lots of data? Maybe a batch job has been initiated that goes over a lot of data… the possibilities are nearly endless.

Another, more subtle interpretation could be that you are not using Real Time Apply. How can you check? In the broker command line interface for example:

DGMGRL> show configuration

Configuration - test

  Protection Mode: MaxPerformance
  Databases:
    pri - Primary database
      Warning: ORA-16789: standby redo logs not configured

    sby - Physical standby database
      Warning: ORA-16789: standby redo logs not configured

Fast-Start Failover: DISABLED

Configuration Status:
WARNING

The warnings about missing standby redo logs show that you cannot possibly use Real Time Apply (it needs standby redo logs). The other option is in the database itself:

SQL> select dest_id,status,database_mode,recovery_mode
  2  from v$archive_dest_status
  3  where status <> 'INACTIVE';

   DEST_ID STATUS    DATABASE_MODE   RECOVERY_MODE
---------- --------- --------------- -----------------------
         1 VALID     MOUNTED-STANDBY MANAGED
        32 VALID     UNKNOWN         IDLE

Did you notice the dest_id of 32? That's a bit of an unusual one, more on that later (you can only set log_archive_dest_n where n is between 1 and 31).

So indeed we have managed recovery active, but not using Real Time Apply. This is expressed in the database status:

DGMGRL> show database verbose sby

Database - sby

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   28 seconds
  Apply Lag:       28 seconds
  Real Time Query: OFF
  Instance(s):
    sby

A few moments later when you query the database again the lag has increased:

DGMGRL> show database verbose sby

Database - sby

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   3 minutes 22 seconds
  Apply Lag:       3 minutes 22 seconds
  Real Time Query: OFF
  Instance(s):
    sby

This is to be expected: the primary is still happily processing user requests. The cure is to add standby redo logs, as suggested in so many places and described in the Data Guard documentation. After the successful addition of SRLs the lag should disappear. A restart of managed recovery using the broker will show something along these lines on the standby:

2014-01-30 14:35:18.353000 +00:00
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE  THROUGH ALL SWITCHOVER DISCONNECT  USING CURRENT LOGFILE
Attempt to start background Managed Standby Recovery process (sby)
MRP0 started with pid=24, OS id=4854
MRP0: Background Managed Standby Recovery process started (sby)
2014-01-30 14:35:23.406000 +00:00
 started logmerger process
Managed Standby Recovery starting Real Time Apply
...
2014-01-30 14:37:12.595000 +00:00
Media Recovery Waiting for thread 1 sequence 20 (in transit)
2014-01-30 14:37:13.691000 +00:00
Recovery of Online Redo Log: Thread 1 Group 5 Seq 20 Reading mem 0
  Mem# 0: +DATA/sby/onlinelog/group_5.266.838218819

Two important bits of information are shown here: Managed Standby Recovery starting Real Time Apply and the fact that it is using the standby redo log. Sure enough, after the database is in sync with its primary and uses the log, the lag is gone:

DGMGRL> show database verbose sby

Database - sby

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   0 seconds
  Apply Lag:       0 seconds
  Real Time Query: OFF
  Instance(s):
    sby

And also in the OEM view:

[Screenshot: the OEM view with the lag gone]

Slight Variation

I have also seen this problem in OEM where the transport lag was near 0 and therefore hardly visible due to the scale of the graph. The apply lag nevertheless built up because the primary kept working while the current log had not yet shipped to the standby (again, before the implementation of standby redo logs). You saw a spike building in the OEM view until the next log switch on the primary, when the apply lag dropped to 0 for a brief moment before increasing again.

Summary

Real Time Apply is a very useful feature, especially when used together with the Maximum Availability protection mode. The real risk of not using standby redo logs (and implicitly no Real Time Apply) is that you lose data, since the current online redo log on the primary has not been copied across. If you need to activate your standby you will be some transactions short of the primary. The larger the online redo log, the larger the gap.
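
For completeness, adding the standby redo logs is a one-off task. The statements below are a minimal sketch assuming OMF, 50 MB online redo logs and a single-threaded database; create one SRL group more than there are online redo log groups per thread, on both the primary and the standby:

-- run on both primary and standby: number of ORL groups per thread, plus one
alter database add standby logfile thread 1 size 50m;
alter database add standby logfile thread 1 size 50m;
alter database add standby logfile thread 1 size 50m;
alter database add standby logfile thread 1 size 50m;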

Posted in 11g Release 2, Cloud Control

Massive tablespace fragmentation on LMT with ASSM

Posted by Martin Bach on January 27, 2014

I have been asked to investigate another interesting problem worth writing about. It deals with an issue I believed to have long since been solved: tablespace fragmentation. However, in this case it was a locally managed tablespace (LMT) with Automatic Segment Space Management (ASSM) enabled. It should be difficult to get fragmentation on one of these, but as you will see it is not impossible.

So the story started innocently enough with an ORA-01653 while shuffling around subpartitions to another tablespace:

01653, 00000, "unable to extend table %s.%s by %s in tablespace %s"
// *Cause:  Failed to allocate an extent of the required number of blocks for
//          a table segment in the tablespace indicated.
// *Action: Use ALTER TABLESPACE ADD DATAFILE statement to add one or more
//          files to the tablespace indicated.

Well, except that there was plenty of free space available: the sum(bytes) in dba_free_space for the tablespace showed in excess of 20 GB. So it could only be one thing: fragmentation. And indeed I quickly found out that the largest contiguous free block was 8 MB, which should really have been enough for most use cases… After a little digging around I noticed that all subpartitions on the tablespace had at least 9 extents, some more. The sum of bytes per subpartition ever so slightly exceeded 8 MB, which was the cause of the fragmentation. Here is an example:

SQL> select segment_name,partition_name,bytes
  2  from dba_extents
  3  where segment_name = 'T1_SUBPART'
  4  and partition_name = 'SYS_SUBP26737';

SEGMENT_NAME                   PARTITION_NAME                      BYTES
------------------------------ ------------------------------ ----------
T1_SUBPART                     SYS_SUBP26737                     1048576
T1_SUBPART                     SYS_SUBP26737                     1048576
T1_SUBPART                     SYS_SUBP26737                     1048576
T1_SUBPART                     SYS_SUBP26737                     1048576
T1_SUBPART                     SYS_SUBP26737                     1048576
T1_SUBPART                     SYS_SUBP26737                     1048576
T1_SUBPART                     SYS_SUBP26737                      131072
T1_SUBPART                     SYS_SUBP26737                      131072
T1_SUBPART                     SYS_SUBP26737                      131072
T1_SUBPART                     SYS_SUBP26737                      131072
T1_SUBPART                     SYS_SUBP26737                      196608
T1_SUBPART                     SYS_SUBP26737                     1048576
T1_SUBPART                     SYS_SUBP26737                     1048576
                                                              ----------
sum                                                              9109504

13 rows selected.

Some of the extents are nicely aligned along 1 MB boundaries, but there are a few that are only 128 KB in size. Multiply this by a few thousand and you get the picture: it's not good to have 20,000 entries in DBA_FREE_SPACE for a given tablespace!
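
Both symptoms are easy to measure; the query below is a small sketch (DESTINATION is the tablespace name from the test case further down, substitute your own):

-- largest contiguous free chunk and number of free extents in the tablespace
select max(bytes)/1024/1024 as largest_free_mb,
       count(*)             as free_extents
from   dba_free_space
where  tablespace_name = 'DESTINATION';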

The error message was thrown during the execution of an anonymous PL/SQL block that moved lots of subpartitions from one tablespace to another. There were the already mentioned few thousand subpartitions to be moved, plus a few non-partitioned tables. The process aborted halfway. The following is more or less a transcript of my troubleshooting efforts.

The test case

To reproduce the problem in our lab I created a table with lots of subpartitions, using the following SQL statement. It doesn't matter that the subpartitions are more or less empty in this case.

CREATE TABLE t1_subpart
  (
    id, t_pad, date_Created, date_completed, state, spcol
  )
  partition BY range (id) interval (1000)
  subpartition BY list (spcol) subpartition template
  (
    subpartition sp1 VALUES ('0'),
    subpartition sp2 VALUES ('1'),
    subpartition sp3 VALUES ('2'),
    subpartition sp4 VALUES ('3'),
    subpartition spdef VALUES (DEFAULT)
  )
  (
    partition p_man VALUES less than (10000)
  ) AS
WITH v1 AS
  (SELECT rownum n FROM dual CONNECT BY level <= 10000
  )
SELECT mod(rownum,1000000) id,
  rpad(rownum,100) t_pad,
  TRUNC(sysdate) - 180 + TRUNC((rownum-1)/3)/86400 date_created,
  TRUNC(sysdate) - 180 + TRUNC((rownum-1)/3)/86400 + dbms_random.value(1800, 86400)/86400 date_completed,
  CASE
    WHEN mod(rownum,100000) = 0
    THEN CAST('RARE' AS VARCHAR2(12))
    WHEN mod(rownum,10000) = 0
    THEN CAST('FAIRLY RARE' AS VARCHAR2(12))
    WHEN mod(rownum,1000) = 0
    THEN CAST('NOT RARE' AS VARCHAR2(12))
    WHEN mod(rownum,100) = 0
    THEN CAST('COMMON' AS   VARCHAR2(12))
    ELSE CAST('THE REST' AS VARCHAR2(12))
  END state,
  TO_CHAR(mod(rownum, 4)) spcol
FROM v1,
  v1
WHERE rownum <= 1e6; 

(If this looks familiar to you then you probably know this presentation by Jonathan Lewis)

Something to remember about this 11.2.0.3 database: Oracle introduced large initial extents for partitions, as explained in "Initial Extent Size of a Partition Changed to 8MB from 64KB After Upgrade to 11.2.0.2 or Later (Doc ID 1295484.1)". So having empty partitions is now a huge waste of space!

But back to the problem: is there anything that would lead you to suspect fragmentation?

SQL> select max(blocks) from dba_segments where segment_name = 'T1_SUBPART';

MAX(BLOCKS)
-----------
       1024

Elapsed: 00:00:00.15

No: all subpartitions are 8 MB, and most of them consist of a single 8 MB extent. Fewer than 15 have more than one extent:

select s.segment_name,s.partition_name, s.blocks, s.extents
from user_segments s
where s.segment_name = 'T1_SUBPART'
and extents <> 1
order by extents desc;

SEGMENT_NAME                   PARTITION_NAME                     BLOCKS    EXTENTS
------------------------------ ------------------------------ ---------- ----------
T1_SUBPART                     SYS_SUBP29210                        1024          8
T1_SUBPART                     SYS_SUBP32032                        1024          5
T1_SUBPART                     SYS_SUBP29180                        1024          5
T1_SUBPART                     SYS_SUBP28738                        1024          4
T1_SUBPART                     SYS_SUBP32624                        1024          4
T1_SUBPART                     SYS_SUBP26953                        1024          4
T1_SUBPART                     SYS_SUBP30400                        1024          4
T1_SUBPART                     SYS_SUBP30996                        1024          4
T1_SUBPART                     SYS_SUBP31591                        1024          4
T1_SUBPART                     SYS_SUBP28143                        1024          4
T1_SUBPART                     SYS_SUBP27548                        1024          4
T1_SUBPART                     SYS_SUBP29805                        1024          4

Most of the subpartitions have 1×8 MB extents, very few have more, but all sum up to 8MB:

SQL> select count(extents),extents from user_segments s
   where s.segment_name = 'T1_SUBPART'
   group by extents

COUNT(EXTENTS)    EXTENTS
-------------- ----------
          4942          1
             9          4
             2          5
             1          8

Elapsed: 00:00:00.66
SQL>

SQL> select distinct bytes from user_segments where segment_name = 'T1_SUBPART';

     BYTES
----------
   8388608

Next I tried to reproduce the problem as closely as possible. In the original case, HCC compression had been enabled by the customer on the table, after which the partitions were moved to the destination tablespace. Note that this problem has nothing to do with HCC: I have since tried to reproduce the case outside of Exadata, with the same result.

SQL> alter table t1_subpart compress for query high;

Table altered.

Next I'll show you DBA_FREE_SPACE for the tablespace:

SQL> select * from dba_free_space
  2* where tablespace_name = 'DESTINATION';

TABLESPACE_NAME                   FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
------------------------------ ---------- ---------- ---------- ---------- ------------
DESTINATION                            10        128 1072693248     130944           10
DESTINATION                            11        128 1072693248     130944           11

Elapsed: 00:00:00.01

So now I'm going to move all of the subpartitions. At first I couldn't reproduce the fragmentation at all when I moved these serially: everything stayed just as it was, and some of the subpartitions which consisted of more than one extent were actually coalesced into just one. I hadn't paid attention to the "parallel" keyword, as I thought it was just a means of speeding things up. When after the third attempt I had no luck reproducing the fragmentation, I tried with parallel, and here is how it went. First the anonymous PL/SQL block:

SQL> !cat movemove.sql
begin
 for i in (select subpartition_name from user_tab_subpartitions
           where table_name = 'T1_SUBPART' and tablespace_name <> 'DESTINATION')
 loop
  begin
   execute immediate 'alter table t1_subpart move subpartition ' || i.subpartition_name || ' tablespace DESTINATION parallel';
  exception
   when others then
    dbms_output.put_line('could not move ' || i.subpartition_name || ' due to ' || sqlerrm(sqlcode));
  end;
 end loop;
end;
/

This takes a long time, and the quality of the script is not production-worthy! In this case the anonymous block completed without issues, but that was just 1 table and not the same volume as the real system where this happened. Look at the extent distribution now:

SQL> select count(extents),extents from user_segments s
  2  where s.segment_name = 'T1_SUBPART'
  3  group by extents;

COUNT(EXTENTS)    EXTENTS
-------------- ----------
           634          1
          3959         13
           356          5
             4         48
             1          8

Also consider the impact on DBA_FREE_SPACE:

SQL> select count(1) from dba_free_Space where tablespace_name = 'DESTINATION';

  COUNT(1)
----------
     19796

Now looking at a random subpartition shows the dilemma:

SQL> select PARTITION_NAME, SEGMENT_TYPE, TABLESPACE_NAME, EXTENT_ID, BYTES, BLOCKS
  2  from dba_extents where segment_name = 'T1_SUBPART'
  3  and partition_name = 'SYS_SUBP26737'
  4  order by extent_id;

PARTITION_NAME                 SEGMENT_TYPE       TABLESPACE_NAME                 EXTENT_ID      BYTES     BLOCKS
------------------------------ ------------------ ------------------------------ ---------- ---------- ----------
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                             0     131072         16
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                             1     131072         16
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                             2     131072         16
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                             3     131072         16
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                             4     196608         24
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                             5    1048576        128
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                             6    1048576        128
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                             7    1048576        128
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                             8    1048576        128
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                             9    1048576        128
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                            10    1048576        128
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                            11    1048576        128
SYS_SUBP26737                  TABLE SUBPARTITION DESTINATION                            12    1048576        128
                                                                                            ---------- ----------
sum                                                                                            9109504       1112

So there you go; I don't really know what to say other than "d'oh". It probably doesn't hurt you too much, but it's still worth knowing that if you are consolidating thousands of segments in a previously empty tablespace you can get into trouble. If I work out why it does this I'll post a follow-up. For now the workaround is to move segments serially. Yes, I know.

Posted in 11g Release 2

Applying GI PSU 12.1.0.1.2 in the lab

Posted by Martin Bach on January 15, 2014

In my previous posts about the first RAC Grid Infrastructure patch set I documented a few issues I encountered that were worth noting. But where things work as advertised I am more than happy to document that too. In a way, the January 2014 GI PSU works as you'd hope it would (at least in my lab, for my 2 node cluster). Well, almost: if you have a non-12.1 database in your environment you might encounter this.

UPDATE: You might want to review some additional information with regards to datapatch.

Admittedly the output below is taken from an Oracle Restart (i.e. non-cluster) environment, but I can't see why this would not happen in RAC too:

[root@server1 stage]# opatchauto apply /u01/stage/12.1.0.1.2/17735306 -ocmrf /u01/stage/ocm.rsp
OPatch Automation Tool
Copyright (c) 2013, Oracle Corporation.  All rights reserved.

OPatchauto version : 12.1.0.1.2
OUI version        : 12.1.0.1.0
Running from       : /u01/app/grid/product/12.1.0.1/grid

opatchauto log file: /u01/app/grid/product/12.1.0.1/grid/cfgtoollogs/opatchauto/17735306/ -
opatch_gi_2014-02-17_20-04-54_deploy.log

Parameter Validation: Successful

System Configuration Collection failed: oracle.osysmodel.driver.crs.productdriver.ProductDriverException:
PRCD-1027 : Failed to retrieve database ora11
PRCD-1229 : An attempt to access configuration of database ora11 was rejected because its
 version 11.2.0.3.0 differs from the program version 12.1.0.1.0. Instead run the program
from /u01/app/oracle/product/11.2.0.3.

opatchauto failed with error code 2.
[root@server1 stage]#

The environment

I have a 2 node RAC cluster that hasn't previously been patched. It uses GNS for name resolution and is probably one of the more complex setups. So I was keen to test the new GI Patch Set Update. Note that a PSU is cumulative and will roll back previously applied patches first; it's not a delta as you might know from RPM.

The cluster installation using GNS was slightly more challenging than the usual RAC system so I decided not to try separation of duties. In other words, the GRID and RDBMS owner are identical: oracle.

Staging

As always you have to update OPatch before you can apply the latest patch. I did this in all 4 homes: 2 RDBMS and 2 Grid homes. Download the patch and stage it in a location of your choice by unzipping it there.

Next you need the ocm response file. I simply copied it from the other cluster.
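
If there is no response file to copy, one can be generated with the emocmrsp utility that ships with OPatch; the path and output location below are assumptions for this environment:

# generate a new OCM response file for use with opatchauto
$ORACLE_HOME/OPatch/ocm/bin/emocmrsp -no_banner -output /u01/temp/ocm.rsp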

Patch application

I started the installation of the patch on node 1 of my cluster in a screen session as always. Node 1 was responsible for the GNS resource at the time. Following the instructions and some previous experience I decided to let opatchauto do all of the work. And this time Oracle even included the “apply” keyword in the docs! And it didn’t bail out immediately with obscure error messages either…

[root@rac12node1 temp]# opatchauto apply /u01/temp/17735306 -ocmrf /u01/temp/ocm.rsp
OPatch Automation Tool
Copyright (c) 2013, Oracle Corporation.  All rights reserved.

OPatchauto version : 12.1.0.1.2
OUI version        : 12.1.0.1.0
Running from       : /u01/app/12.1.0.1/grid

opatchauto log file: /u01/app/12.1.0.1/grid/cfgtoollogs/opatchauto/17735306/opatch_gi_2014-01-15_08-54-25_deploy.log

Parameter Validation: Successful

Grid Infrastructure home:
/u01/app/12.1.0.1/grid
RAC home(s):
/u01/app/oracle/product/12.1.0.1/dbhome_1

Configuration Validation: Successful

Patch Location: /u01/temp/17735306
Grid Infrastructure Patch(es): 17077442 17303297 17552800
RAC Patch(es): 17077442 17552800

Patch Validation: Successful

Stopping RAC (/u01/app/oracle/product/12.1.0.1/dbhome_1) ... Successful
Following database(s) were stopped and will be restarted later during the session: rac12c

Applying patch(es) to "/u01/app/oracle/product/12.1.0.1/dbhome_1" ...
Patch "/u01/temp/17735306/17077442" successfully applied to "/u01/app/oracle/product/12.1.0.1/dbhome_1".
Patch "/u01/temp/17735306/17552800" successfully applied to "/u01/app/oracle/product/12.1.0.1/dbhome_1".

Stopping CRS ... Successful

Applying patch(es) to "/u01/app/12.1.0.1/grid" ...
Patch "/u01/temp/17735306/17077442" successfully applied to "/u01/app/12.1.0.1/grid".
Patch "/u01/temp/17735306/17303297" successfully applied to "/u01/app/12.1.0.1/grid".
Patch "/u01/temp/17735306/17552800" successfully applied to "/u01/app/12.1.0.1/grid".

Starting CRS ... Successful

Starting RAC (/u01/app/oracle/product/12.1.0.1/dbhome_1) ... Successful

SQL changes, if any, are applied successfully on the following database(s): RAC12C

Apply Summary:
Following patch(es) are successfully installed:
GI Home: /u01/app/12.1.0.1/grid: 17077442, 17303297, 17552800
RAC Home: /u01/app/oracle/product/12.1.0.1/dbhome_1: 17077442, 17552800

opatchauto succeeded.

Wow, I could hardly believe my eyes. I "tail"ed the logfile in a different session to see what it was doing, but the output seems less verbose these days than it was initially when "opatch auto" came out. See further down in the article for useful log locations.

A few remarks. It took quite a while for the patch to advance past "Starting CRS…". I remember staring at the screen anxiously in previous patch exercises and witnessing the CRS start sequence timing out. In this case there is a valid reason for the slowness, and it's documented in $GRID_HOME/cfgtoollogs/crsconfig/crspatch*.log: it validates ACFS and applies patches to the -MGMTDB database. In fact you should open a third session to tail the crspatch log file as it provides a wealth of information about stopping and starting Clusterware resources.

With GI PSU 12.1.0.1.1 I ran into problems with CRS which got confused about who is mastering the OCR. With 12.1.0.1.2 this didn’t happen. CRS started successfully on the last node, and even started the database instance.

Log locations

Information about what’s going on can be found in many locations…

  • $GRID_HOME/cfgtoollogs/opatchauto/17735306/opatch_gi_timestamp_deploy.log for the opatchauto process
  • $GRID_HOME/cfgtoollogs/crsconfig/crspatch_hostname_timestamp.log records stop, unlocking, start and locking of Clusterware

More log information can be found in:

  • $GRID_HOME/cfgtoollogs/opatch contains logs for the opatch commands issued against the GRID home
  • $ORACLE_HOME/cfgtoollogs/opatch/opatchtimestamp.log – records output from the individual opatch commands issued against the RDBMS home.

Note these logs are not written to in that sequence. Start with the opatchauto logs, then refer to the more specific log locations for individual troubleshooting. The complete log file names are referenced at the end of the opatchauto log.

SQLPatch

The post-installation steps require you to load SQL scripts into the database; opatchauto seems to have done that, as shown in its log file:

2014-01-15_09-32-13 :
Ignoring the dbconsole command.
2014-01-15_09-32-17 :
Executing command:
/bin/bash -c 'ORACLE_HOME=/u01/app/oracle/product/12.1.0.1/dbhome_1 \
ORACLE_SID=$(/u01/app/oracle/product/12.1.0.1/dbhome_1/bin/srvctl status instance -d RAC12C -n rac12node1 | cut -d " " -f 2) \
/u01/app/oracle/product/12.1.0.1/dbhome_1/OPatch/datapatch'
2014-01-15_09-33-45 :
Successfully executed the above command.

SQL changes, if any, are applied successfully on the following database(s): RAC12C

Except that I could not find a log file in the $ORACLE_HOME/sqlpatch/17552800 directory in the RDBMS home on the first node. The -MGMTDB had been patched; its log is in the $GRID_HOME. The registry$history view didn't show any sign of a patch either. Only after applying the PSU on the last node did Oracle patch the database.

Update

It appears that PSU 1 and 2 have issues when running datapatch. I found this in the output of my patch log:

Executing command:
/bin/bash -c 'ORACLE_HOME=/u01/app/oracle/product/12.1.0.1/dbhome_2 ORACLE_SID=$(/u01/app/oracle/product/12.1.0.1/dbhome_2/bin/srvctl sta
tus instance -d RCDB1 -n rac12node3 | cut -d " " -f 2) /u01/app/oracle/product/12.1.0.1/dbhome_2/OPatch/datapatch'
2014-03-21_11-25-36 :

COMMAND EXECUTION FAILURE :
SQL Patching tool version 12.1.0.1.0 on Fri Mar 21 11:23:26 2014
Copyright (c) 2013, Oracle.  All rights reserved.

Connecting to database...OK
Determining current state...done
For the following PDBs: CDB$ROOT
  Nothing to roll back
  The following patches will be applied: 17552800
For the following PDBs: PDB$SEED
  Nothing to roll back
  The following patches will be applied: 17552800
For the following PDBs: ACDEMOPDB
  Nothing to roll back
  The following patches will be applied: 17552800
For the following PDBs: DUBLIN
  Nothing to roll back
  The following patches will be applied: 17552800
Adding patches to installation queue...done
Installing patches...done
Validating logfiles...

ERROR:
mkdir: cannot create directory `/u01/app/grid/cfgtoollogs': Permission denied
mkdir: cannot create directory `/u01/app/grid/cfgtoollogs': Permission denied
mkdir: cannot create directory `/u01/app/grid/cfgtoollogs': Permission denied
mkdir: cannot create directory `/u01/app/grid/cfgtoollogs': Permission denied
DBD::Oracle::st execute failed: ORA-03113: end-of-file on communication channel
Process ID: 20551
Session ID: 21 Serial number: 9 (DBD ERROR: OCIStmtExecute) [for Statement "ALTER PLUGGABLE DATABASE pdb$seed
                 CLOSE IMMEDIATE INSTANCES=ALL"] at /u01/app/oracle/product/12.1.0.1/dbhome_2/sqlpatch/sqlpatch.pm line 448, <LOGFILE> line 6129.

The corresponding bug number is 17421502, and a patch exists. I downloaded the patch and applied it on all of my nodes (rolling!). After everything has come up, I re-ran datapatch and this time it seemed to have worked without an “end-of-file on communication channel”. See MOS Datapatch may fail while patching a RAC+PDB environment (Doc ID 1632818.1) for more information.


Note that you have to set your ORACLE_SID to the database to be patched for the above command to work.
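
Re-running datapatch by hand then boils down to something like this sketch; the instance name is hypothetical and depends on the node you are on:

# pick the home and the local instance of the database to be patched
export ORACLE_HOME=/u01/app/oracle/product/12.1.0.1/dbhome_2
export ORACLE_SID=RCDB13        # hypothetical instance name on this node
$ORACLE_HOME/OPatch/datapatch -verbose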

Summary

Applying the January PSU to my cluster worked flawlessly, definitely a huge improvement over the last PSU which was a bit of a disaster.

Posted in 12c Release 1, KVM, Linux

 