Martins Blog

Trying to explain complex things in simple terms

Archive for the ‘War Stories’ Category

Tale of a platform migration Solaris 10 SPARC 10.2.0.5 to Linux 11.2.0.2.6

Posted by Martin Bach on May 14, 2012

This is as much a note to myself on how to do this in the future as it is something hopefully worth reading for you. The requirement was precise as always: migrate a database from 10.2 on SPARC to 11.2 on Linux. In the process, go from Veritas to ASM and make it quick!

I like short briefings but this was too short. Since the database was reasonably large I opted for the transportable tablespace approach. However, I now think that a massively parallel impdp with network_link could have saved me quite a bit of time.

The following is by no means the complete story, but hopefully gives you an idea how to do these things. Always check, and document, then test (rinse and repeat). Only when proper signoff is received should you try such a process in production. Remember to script it and have at least one clean run of the scripts! This process is not super-quick. If you have low downtime requirements then consider Streams or better: Golden Gate for the process.

The source database was originally not on the terminal release, and due to certain problems with the Data Pump API before 10.2.0.5 the source was moved to the terminal release. The destination was 11g Release 2 patchset 1 with the April PSU applied.

Things to think about

Since I couldn’t simply go for a subset of the database with my transportable tablespace set (TTS) I had to ensure that a lot of metadata was carried across. Personally I think that TTS works best for tables and indexes!

The process of transporting/converting tablespaces is short and sweet (excluding dealing with the application):

  1. Define a self-contained set of tablespaces. In other words, the tablespaces you export from the source must not contain dictionary references to other, non-exported tablespaces. For instance, you cannot export a tablespace containing a table that has an index in another tablespace outside of the transportable set.
  2. Set the tablespaces you want to export read-only. This is an outage in production!
  3. Export the metadata associated with the tablespaces from the source.
  4. Copy tablespaces to their destination
  5. Perform the platform conversion
  6. Optionally make the source tablespaces read-write again. Thanks to Jerry for pointing this out
  7. Import tablespace metadata
  8. Make the new tablespaces read-write on the destination
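
The steps above can be sketched as a dry run that only prints the statements for review and executes nothing against a database. This is a rough outline under stated assumptions, not the scripts used in the migration: the tablespace names, the DP_DIR directory object, the tts.dmp dump file, the /stage directory and the +DATA disk group are all hypothetical, and the optional step 6 is left out.

```shell
#!/bin/sh
# Dry run only: prints the statements for review, executes nothing.
# Hypothetical names: DP_DIR (directory object), tts.dmp, the staging
# directory and the ASM disk group passed in by the caller.

tts_plan() {
    ts_list=$1     # comma-separated tablespace names
    stage_dir=$2   # where the copied data files land on the destination
    disk_group=$3  # ASM disk group on the destination
    ts_names=$(echo "$ts_list" | tr ',' ' ')

    echo "-- 1. source: verify the set is self-contained"
    echo "EXEC DBMS_TTS.TRANSPORT_SET_CHECK('$ts_list', TRUE);"
    echo "SELECT * FROM transport_set_violations;"

    echo "-- 2. source: read-only, the outage starts here"
    for ts in $ts_names; do
        echo "ALTER TABLESPACE $ts READ ONLY;"
    done

    echo "-- 3. source: export the tablespace metadata"
    echo "expdp system directory=DP_DIR dumpfile=tts.dmp transport_tablespaces=$ts_list transport_full_check=y"

    echo "-- 4. copy the data files and tts.dmp to $stage_dir on the destination"

    echo "-- 5. destination, in RMAN: convert each copied file"
    echo "CONVERT DATAFILE '$stage_dir/<file>.dbf' FROM PLATFORM 'Solaris[tm] OE (64-bit)' FORMAT '$disk_group';"

    echo "-- 7. destination: import the tablespace metadata"
    echo "impdp system directory=DP_DIR dumpfile=tts.dmp transport_datafiles='<converted file names>'"

    echo "-- 8. destination: open the new tablespaces for writing"
    for ts in $ts_names; do
        echo "ALTER TABLESPACE $ts READ WRITE;"
    done
}

tts_plan "APP_DATA,APP_IDX" /stage +DATA
```

Printing the plan first makes it easy to review and document every statement before anything touches production, which fits the "check, document, then test" advice above. The placeholders in angle brackets have to be filled in with the real file names.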

Read the rest of this entry »

Posted in 10g Release 2, 11g Release 2, Linux, War Stories | 4 Comments »

An interesting problem with ext4 on Oracle Linux 5.5

Posted by Martin Bach on November 4, 2011

I have run into an interesting problem with my Oracle Linux 5.5 installation. Naively I assumed that since ext4 has been around for a long time it would be stable. For a test I performed for a friend, I created my database files on a file system formatted with ext4 and mounted it the same way I would have mounted an ext3 file system:

$ mount | grep ext4
/dev/mapper/mpath43p1 on /u02/oradata type ext4 (rw)

Now when I tried to create a data file within a tablespace of a certain size, I got block corruption which I found very interesting. My first thought was: you must have a corruption of the file system. So I shut down all processes accessing /u02/oradata and gave the file system a thorough checking. Read the rest of this entry »

Posted in 11g Release 2, Linux, War Stories | 6 Comments »

Troubleshooting Oracle agent 12.1.0.1.0

Posted by Martin Bach on October 28, 2011

As you may have read on this blog, I recently moved from Oracle Enterprise Manager 11.1 Grid Control to the full control of the cloud: 12.1 has taken its place in the lab.

I also managed to install agents via self download (my OEM is x86 to reduce the footprint) on a 2 node 11.2.0.3 cluster: rac11203node1 and rac11203node2. After a catastrophic crash of both nodes followed by a reboot none of the agents wanted to report back to the OMS.

The difference

Oracle 12.1 has a new agent structure. In previous releases you used the agent base directory to create the AGENT_HOME; this has now changed. In 11.1 I could specify the agent base to be /u01/app/oracle/product, and OUI would deploy everything in a subdirectory it creates, called agent11g (or agent10g for 10.2.x).

Now I set the agent base to the same value and installed my agents in parallel, but found that there is no agent12c directory under the base. Instead I found these:

[oracle@rac11203node1 product]$ ls -l
total 48
drwxr-xr-x. 73 oracle oinstall  4096 Oct 27 22:40 11.2.0.3
-rw-rw-r--.  1 oracle oinstall    91 Sep 23 08:52 agentimage.properties
drwxr-xr-x.  6 oracle oinstall  4096 Oct 28 14:57 agent_inst
drwxr-xr-x.  3 oracle oinstall  4096 Oct 15 21:35 core
drwx------.  2 oracle oinstall 16384 Oct 14 21:02 lost+found
drwxr-xr-x.  8 oracle oinstall  4096 Oct 15 21:50 plugins
-rwxr-xr-x.  1 oracle oinstall   223 Oct 15 21:25 plugins.txt
-rw-r--r--.  1 oracle oinstall   298 Oct 15 21:42 plugins.txt.status
drwxr-xr-x.  5 oracle oinstall  4096 Oct 15 21:43 sbin

So it’s all a bit different. The core/ directory contains the agent binaries. The agent_inst directory contains the sysman directory. This is where all the configuration and state information is stored. In that respect the sysman directory is the same as in 11.1.

Now back to my problem: both agents that previously worked fine were reported “unavailable”. The agent information is no longer found under the “setup-agents-management agents” menu path.

For 12.1 you need to navigate to setup-agents from the drop-down menu in the upper right corner. This takes you to the overview page. OK, so I could see the agents weren’t communicating with the OMS.

On the machine I could see this:

[oracle@rac11203node1 log]$ emctl status agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version      : 12.1.0.1.0
OMS Version        : (unknown)
Protocol Version   : 12.1.0.1.0
Agent Home         : /u01/app/oracle/product/agent_inst
Agent Binaries     : /u01/app/oracle/product/core/12.1.0.1.0
Agent Process ID   : 13270
Parent Process ID  : 13215
Agent URL          : https://rac11203node1.localdomain:3872/emd/main/
Repository URL     : https://oem12oms.localdomain:4901/empbs/upload
Started at         : 2011-10-26 18:30:17
Started by user    : oracle
Last Reload        : (none)
Last successful upload                       : (none)
Last attempted upload                        : (none)
Total Megabytes of XML files uploaded so far : 0
Number of XML files pending upload           : 1,858
Size of XML files pending upload(MB)         : 8.05
Available disk space on upload filesystem    : 49.16%
Collection Status                            : Collections enabled
Last attempted heartbeat to OMS              : 2011-10-27 15:42:47
Last successful heartbeat to OMS             : (none)

---------------------------------------------------------------
Agent is Running and Ready

The settings are correct: I verified them against another agent that uploads and is otherwise fine. I have also secured the agent, and $AGENT_BASE/agent_inst/sysman/log/secure.log as well as the emctl secure agent commands reported normal, successful operation.
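
The telling field in the status output above is "Last successful heartbeat to OMS", which reads "(none)" for the broken agent. A minimal sketch of checking that field from a script follows; the function names are hypothetical, and the parsing assumes the output layout shown in this post.

```shell
#!/bin/sh
# Reads "emctl status agent" output on stdin and prints the value of the
# "Last successful heartbeat to OMS" field (nothing if the field is missing).
last_heartbeat() {
    sed -n -e 's/[[:space:]]*$//' \
           -e 's/^Last successful heartbeat to OMS[[:space:]]*:[[:space:]]*//p'
}

# Succeeds only if the agent has reported at least one successful heartbeat.
agent_has_heartbeat() {
    hb=$(last_heartbeat)
    [ -n "$hb" ] && [ "$hb" != "(none)" ]
}

# Usage on a host with the agent installed:
#   emctl status agent | agent_has_heartbeat || echo "agent never reached the OMS"
```

Note that "Agent is Running and Ready" is printed in both cases, so the final status line alone does not tell you whether the agent can talk to the OMS.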

Still the stubborn thing doesn’t want to talk to the OMS – in the agent overview page both agents are listed as “unavailable”, but not blocked. When I force an upload, I get this:

[oracle@rac11203node1 log]$ emctl upload
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
EMD upload error:full upload has failed: uploadXMLFiles skipped :: OMS version not checked yet. If this issue persists check trace files for ping to OMS related errors. (OMS_DOWN)

However it’s not down, I can reach it from another agent (which happens to be on the same box as the OMS)

[oracle@oem12oms 12.1.0.1.0]$ $ORACLE_HOME/bin/emctl status agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version      : 12.1.0.1.0
OMS Version        : 12.1.0.1.0
Protocol Version   : 12.1.0.1.0
Agent Home         : /u01/gc12.1/agent/agent_inst
Agent Binaries     : /u01/gc12.1/agent/core/12.1.0.1.0
Agent Process ID   : 2964
Parent Process ID  : 2910
Agent URL          : https://oem12oms.localdomain:3872/emd/main/
Repository URL     : https://oem12oms.localdomain:4901/empbs/upload
Started at         : 2011-10-15 21:00:37
Started by user    : oracle
Last Reload        : (none)
Last successful upload                       : 2011-10-27 15:46:38
Last attempted upload                        : 2011-10-27 15:46:38
Total Megabytes of XML files uploaded so far : 0
Number of XML files pending upload           : 0
Size of XML files pending upload(MB)         : 0
Available disk space on upload filesystem    : 49.16%
Collection Status                            : Collections enabled
Last attempted heartbeat to OMS              : 2011-10-27 15:48:34
Last successful heartbeat to OMS             : 2011-10-27 15:48:34

---------------------------------------------------------------
Agent is Running and Ready

And no, the firewall is turned off and I can connect to the upload from any machine in the network:

[oracle@rac11203node1 log]$ wget --no-check-certificate https://oem12oms.localdomain:4901/empbs/upload
--2011-10-27 15:55:46-- https://oem12oms.localdomain:4901/empbs/upload
Resolving oem12oms.localdomain... 192.168.99.28
Connecting to oem12oms.localdomain|192.168.99.28|:4901... connected.
WARNING: cannot verify oem12oms.localdomain’s certificate, issued by “/O=EnterpriseManager on oem12oms.localdomain/OU=EnterpriseManager on oem12oms.localdomain/L=EnterpriseManager on oem12oms.localdomain/ST=CA/C=US/CN=oem12oms.localdomain”:
Self-signed certificate encountered.
HTTP request sent, awaiting response... 200 OK
Length: 314 [text/html]
Saving to: “upload.1”

100%[======================================>] 314 --.-K/s in 0s

2011-10-27 15:55:46 (5.19 MB/s) - “upload.1” saved [314/314]

The agent complains about this in gcagent.log:

2011-10-27 15:56:08,947 [37:3F09CD9C] WARN – improper ping interval (EM_PING_NOTIF_RESPONSE: BACKOFF::180000)
2011-10-27 15:56:18,471 [167:E3E93C4C] WARN – improper ping interval (EM_PING_NOTIF_RESPONSE: BACKOFF::180000)
2011-10-27 15:56:18,472 [167:E3E93C4C] WARN – Ping protocol error
o.s.gcagent.ping.PingProtocolException [OMS sent an invalid response: “BACKOFF::180000”]

At least someone in Oracle has some humour when it comes to this.
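
To get a quick overview of how often the agent logs these complaints, the log can be summarised with standard tools. This is only a sketch: the function name is hypothetical, and the patterns are taken from the log lines quoted above rather than from any documented log format.

```shell
#!/bin/sh
# Summarises ping-related warnings in the agent log passed as $1.
summarise_ping_warnings() {
    log=$1
    # Number of lines reporting a ping protocol error
    echo "protocol errors: $(grep -c 'Ping protocol error' "$log")"
    # Distinct responses the OMS sent back, with a count for each
    sed -n 's/.*EM_PING_NOTIF_RESPONSE: \([A-Z]*::[0-9]*\).*/\1/p' "$log" | sort | uniq -c
}

# Usage, with the log directory named in the emctl output:
#   summarise_ping_warnings /u01/app/oracle/product/agent_inst/sysman/log/gcagent.log
```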

The Solution

Now I dug around a lot more and finally managed to get to the conclusion. It was actually a twofold problem. The first agent was simply blocked. After finding a way to unblock it, it worked happily.

The second agent was a bit more trouble. I unblocked it as well from the agent page in OEM, which failed. As it turned out the agent was shut down. And it didn’t start either:

[oracle@rac11203node2 12.1.0.1.0]$ emctl start agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation.  All rights reserved.
Starting agent ............. failed.
Target Metadata Loader failed at Startup
Consult the log files in: /u01/app/oracle/product/agent_inst/sysman/log

I checked the logs and found this interesting bit of information:

2011-10-24 21:35:21,387 [1:3305B9] INFO - Plugin oracle.sysman.oh is now active
2011-10-24 21:35:21,393 [1:3305B9] INFO - Plugin oracle.sysman.db is now active
2011-10-24 21:35:21,396 [1:3305B9] WARN - Agent failed to Startup for Target Metadata Loader in step 2
oracle.sysman.gcagent.metadata.MetadataLoadingException: The targets.xml file is empty
at oracle.sysman.gcagent.metadata.MetadataManager$Loader.validateMetadataFile(MetadataManager.java:799)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.processMDFile(MetadataManager.java:1733)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.readRegistry(MetadataManager.java:1695)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.load(MetadataManager.java:1641)
at oracle.sysman.gcagent.metadata.MetadataManager.load(MetadataManager.java:282)
at oracle.sysman.gcagent.metadata.MetadataManager.runStartupStep(MetadataManager.java:450)
at oracle.sysman.gcagent.metadata.MetadataManager.tmNotifier(MetadataManager.java:337)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.invokeNotifier(TMComponentSvc.java:876)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.invokeInitializationStep(TMComponentSvc.java:959)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.doInitializationStep(TMComponentSvc.java:800)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.notifierDriver(TMComponentSvc.java:740)
at oracle.sysman.gcagent.tmmain.TMMain.startup(TMMain.java:215)
at oracle.sysman.gcagent.tmmain.TMMain.agentMain(TMMain.java:458)
at oracle.sysman.gcagent.tmmain.TMMain.main(TMMain.java:447)
2011-10-24 21:35:21,397 [1:3305B9] INFO - Agent exiting with exit code 55
2011-10-24 21:35:21,398 [31:F9C26A76:Shutdown] INFO - *jetty*: Shutdown hook executing
2011-10-24 21:35:21,399 [31:F9C26A76] INFO - *jetty*: Graceful shutdown SslSelectChannelConnector@0.0.0.0:3872
2011-10-24 21:35:21,399 [31:F9C26A76] INFO - *jetty*: Graceful shutdown ContextHandler@14d964af@14d964af/emd/lifecycle/main,null

I have yet to find the reason for the empty targets.xml file, but sure enough it existed with a length of 0 bytes.

Simple enough I thought, all I need to do is run agentca to repopulate the file. Unfortunately I couldn’t find it.

[oracle@rac11203node2 emd]$ find /u01/app/oracle/product/ -name "agentca*"
[oracle@rac11203node2 emd]$

This was a bit of a letdown. Then I decided to create a new targets.xml file and try a resynchronisation of the agent. This is a well hidden menu item so I decided to show it here:

The only element that went into targets.xml was “<targets />”. This was sufficient to start the agent, which is a requirement for the resynchronisation to succeed. I was quite amazed that this succeeded, but it did.

This was very encouraging, and both agents are now working properly.
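
The check for the empty file and the seeding of a minimal targets.xml described above can be sketched as a small shell function. The function name and the backup suffix are hypothetical, and the full path in the usage comment is an assumption based on the emd working directory shown in the prompts.

```shell
#!/bin/sh
# If the given targets.xml exists but is zero bytes long, keep a copy of the
# empty file and write the single empty element mentioned in the post.
# Files that already have content are never touched.
seed_targets_xml() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "no such file: $f" >&2
        return 2
    fi
    if [ -s "$f" ]; then
        echo "left untouched, file is not empty: $f"
        return 0
    fi
    cp -p "$f" "$f.empty.bak"
    echo '<targets />' > "$f"
    echo "seeded: $f"
}

# Usage (path is an assumption):
#   seed_targets_xml /u01/app/oracle/product/agent_inst/sysman/emd/targets.xml
```

Once the file is seeded, the agent can be started with emctl start agent and then resynchronised from the console, as described above.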

Posted in Grid Control, Linux, War Stories | 2 Comments »

Troubleshooting Grid Infrastructure startup

Posted by Martin Bach on March 17, 2011

This has been an interesting story today when one of my blades decided to reboot after an EXT3 journal error. The hard facts first:

  • Oracle Linux 5.5 with kernel 2.6.18-194.11.4.0.1.el5
  • Oracle 11.2.0.2 RAC
  • Bonded NICs for private and public networks.
  • Private bondn device defined on a VLAN
  • BL685-G6 with 128G RAM

First I noticed the node had problems when I tried to get all databases configured on the cluster. I got the dreaded “cannot communicate with the CRSD”

[oracle@node1.example.com] $ srvctl config database
PRCR-1119 : Failed to look up CRS resources of database type
PRCR-1115 : Failed to find entities of type resource that match filters (TYPE ==ora.database.type) and contain attributes DB_UNIQUE_NAME,ORACLE_HOME,VERSION
Cannot communicate with crsd

Not too great, especially since everything worked when I left yesterday. What could have gone wrong? Read the rest of this entry »

Posted in 11g Release 2, Automatic Storage Management, Linux, War Stories | 8 Comments »

Troubleshooting ora.net1.network on an 8 node cluster

Posted by Martin Bach on February 11, 2011

It seems I am doing a lot of fixing broken stuff recently. So this time I have been asked to repair a broken 8 node RAC cluster on OEL 5.5 with Oracle RAC 11.2.0.2. The system has been moved into a different, more secure network, and its firewalls prevented all access to the machines except for ILO. Another way of “security through obscurity”. The new network didn’t allow any clients to connect to any of the 8 RAC nodes, which means that it is actually quite expensive kit to sit idle. The cluster is not in production, it’s still being built to specification, but this accessibility problem has been a holdup to the project for a little while now. Yesterday was a breakthrough: the netops team found an error in their configuration and for the first time the hosts could be accessed via ssh. Unfortunately for me that access is possible via audited gateways using PowerBroker to which I don’t have access. Read the rest of this entry »

Posted in 11g Release 2, Linux, RAC, RAC Book, War Stories | Leave a Comment »

Oracle Support-final update to SR

Posted by Martin Bach on February 10, 2011

Just had a really pleasant exchange with Oracle support. I was after a way to purge the repository database of an OEM 11.1 Grid Control installation without having to blow it all away. Unfortunately, there is no such option. However, what I liked was this final update from the support member:

Generic Note
————————
Martin,

From sunny Colorado – blue sky and SNOW! – I do wish we could have provided a better option.

But I do want to thank you so much for your kindness and patience. You are the best kind of customer to work with. That means a lot, in these challenging jobs.

Very best,
Thom

The whole SR was well and competently managed by Thom, and at no time did he come up with techniques to buy more time by asking for irrelevant log files or similar. I wish more support staff were like him.

Posted in War Stories | Leave a Comment »

UKOUG RAC&HA SIG September 2010

Posted by Martin Bach on September 21, 2010

Just a quick one to announce that I’ll present at said event. Here’s the short synopsis of my talk:

Upgrading to Oracle Real Application Cluster 11.2

With the end of premier support in sight in mid 2011, many businesses are starting to look at possible upgrade paths. With the majority of RAC systems deployed on Oracle 10g, there is a strong demand to upgrade these systems to 11.2. The presentation focuses on different upgrade paths, including Grid Infrastructure and the RDBMS. Alternative approaches to upgrading the software will be discussed as well. Experience from migrations performed at a large financial institution rounds off the presentation.

Posted in 11g Release 2, Public Appearances, War Stories | 1 Comment »

Patching CRS to 10.2.0.5

Posted by Martin Bach on May 26, 2010

A small blog post about patching CRS 10.2.0.1 to 10.2.0.5 instead of applying the latest PSU on top of 10.2.0.4.

So it finally happened! Oracle has released the terminal patchset for 10g Release 2. Long overdue! So the task at hand today is to upgrade CRS to 10.2.0.5.

Patch

Download patch 8202632 to a convenient location and unzip it. The patch is quite large, again, about 1.2G in size for Linux 64bit RHEL5.5, my platform.

[oracle@crs102node1 10205]$ ls -lh p8202632_10205_Linux-x86-64.zip
-rw-r--r-- 1 1000 users 1.2G May 25 16:48 p8202632_10205_Linux-x86-64.zip

The first thing that caught my attention was the built-in support for SLES 11 and RHEL 6. This is interesting as RHEL 6 is still in beta, whereas SLES 11 is out for production use. This means no more hacking the oraparam.ini file, which is good (if you are on one of these platforms).

[oracle@crs102node1 10205]$ ./Disk1/runInstaller
Starting Oracle Universal Installer...

Checking installer requirements...

Checking operating system version: must be redhat-3, SuSE-9, SuSE-10, redhat-4, redhat-5,
redhat-6, UnitedLinux-1.0, asianux-1, asianux-2, asianux-3, enterprise-4, enterprise-5 or SuSE-11
 Passed

All installer requirements met.

Preparing to launch Oracle Universal Installer from /tmp/OraInstall2010-05-26_10-14-13AM.
Please wait ...[oracle@crs102node1 10205]$ Oracle Universal Installer, Version 10.2.0.5.0 Production
Copyright (C) 1999, 2010, Oracle. All rights reserved.

Read the rest of this entry »

Posted in Linux, RAC, War Stories | 3 Comments »

tnsping segmentation fault with 10.2 client

Posted by Martin Bach on May 20, 2010

The tale of a very interesting problem with a segmentation fault on RHEL 5.3 64bit when invoking tnsping. I initially thought the box the client was installed on (a development virtual machine) was seriously ill but it turned out to be something else altogether.

Here is the initial problem. One of the developers contacted me saying that he couldn’t connect to one of the databases. Sure enough, sqlplus wouldn’t connect:

[oracle@dev-vm-001 tns]$ sqlplus a/b@devone

SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 20 15:03:47 2010
Copyright (c) 1982, 2005, Oracle.  All rights reserved.

ERROR:
ORA-12154: TNS:could not resolve the connect identifier specified

Enter user-name: 

I thought that this was simple enough: they probably haven’t defined the database in the tnsnames.ora file. Checking the file revealed that I was wrong!

[oracle@dev-vm-001 tns]$ cat /u01/app/oracle/product/10.2.0/client_1/network/admin/tnsnames.ora 
DEVONE =
 (DESCRIPTION =
   (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = dev1db)(PORT = 1521)))
   (CONNECT_DATA =
    (SERVICE_NAME = DEVONE)))

So then maybe there is no connectivity to dev1db? tnsping would tell me.

[oracle@dev-vm-001 tns]$ tnsping devone

TNS Ping Utility for Linux: Version 10.2.0.1.0 - Production on 20-MAY-2010 14:59:10

Copyright (c) 1997, 2005, Oracle.  All rights reserved.

Segmentation fault

Read the rest of this entry »

Posted in Linux, War Stories | 1 Comment »

ORA-604 when starting the database

Posted by Martin Bach on April 28, 2010

Interesting scenario this morning with a development database. There is no specific monitoring in place for development systems so a user phoned us up stating that the database was inaccessible. The last lines in the alert.log showed a problem with a datafile not readable:

Wed Apr 28 10:18:08 2010
Errors in file /u01/app/oracle/admin/devone/bdump/devone_pmon_3416.trc:
ORA-00376: file 2 cannot be read at this time
ORA-01110: data file 2: '+DATB/DEVONE/datafile/undotbs1.356.703506859'

This is a 10.2.0.4.1 single instance database (a clone from our production RAC cluster) running on RHEL 5.4 64bit with ASMLib. Read the rest of this entry »

Posted in Oracle, War Stories | Leave a Comment »