Category Archives: War Stories

Tale of a platform migration Solaris 10 SPARC 10.2.0.5 to Linux 11.2.0.2.6

This is as much a note to myself on how to do this in the future as it is something hopefully worth reading for you. The requirement was precise as always: migrate a database from 10.2 on SPARC to 11.2 on Linux. In the process, go from Veritas to ASM and make it quick!

I like short briefings but this was too short. Since the database was reasonably large I opted for the transportable tablespace approach. However, I now think that a massively parallel impdp with network_link could have saved me quite a bit of time.

The following is by no means the complete story, but hopefully gives you an idea of how to do these things. Always check, document, then test (rinse and repeat). Only when proper signoff is received should you try such a process in production. Remember to script it and have at least one clean run of the scripts! This process is not super-quick; if you have low downtime requirements then consider Streams or, better, GoldenGate for the process.

The source database was originally not on the terminal release, and due to certain problems with the Data Pump API before 10.2.0.5 the source was moved to the terminal release. The destination was 11g Release 2 patchset 1 with the April PSU applied.

Things to think about

Since I couldn’t simply go for a subset of the database with my transportable tablespace set (TTS) I had to ensure that a lot of metadata was carried across. Personally I think that TTS works best for tables and indexes!

The process of transporting/converting tablespaces is short and sweet (excluding dealing with the application):

  1. Define a self-contained set of tablespaces. In other words, the tablespaces you export from the source must not contain dictionary references to other, non-exported tablespaces. For instance, you cannot export a tablespace containing a table that has an index in another tablespace outside of the transportable set.
  2. Set the tablespaces you want to export read-only. This is an outage in production!
  3. Export the metadata associated with the tablespaces from the source.
  4. Copy the tablespaces to their destination.
  5. Perform the platform conversion.
  6. Optionally make the source tablespaces read-write again. Thanks to Jerry for pointing this out.
  7. Import the tablespace metadata.
  8. Make the new tablespaces read-write in the destination.
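
The self-containment rule from step 1 can be illustrated with a toy model. This is only a sketch in Python, not Oracle's own check; on a real database the DBMS_TTS.TRANSPORT_SET_CHECK procedure and the TRANSPORT_SET_VIOLATIONS view do this job. The object and tablespace names are made up, and the sketch takes the strict reading where a dependency crossing the set boundary in either direction counts as a violation.

```python
def find_violations(dependencies, transport_set):
    """Return the dependencies that cross the boundary of the transport set.

    dependencies: iterable of tuples
        (object_name, object_tablespace, referenced_name, referenced_tablespace)
    transport_set: iterable of tablespace names to be transported
    """
    ts = {name.upper() for name in transport_set}
    violations = []
    for obj, obj_ts, ref, ref_ts in dependencies:
        obj_inside = obj_ts.upper() in ts
        ref_inside = ref_ts.upper() in ts
        # One side inside the set and the other outside: not self-contained.
        if obj_inside != ref_inside:
            violations.append((obj, obj_ts, ref, ref_ts))
    return violations


# Hypothetical data: two indexes on a table stored in APP_DATA.
deps = [
    ("ORDERS_PK", "APP_INDEX", "ORDERS", "APP_DATA"),  # both inside the set
    ("AUDIT_IDX", "USERS", "ORDERS", "APP_DATA"),      # index outside the set
]
violations = find_violations(deps, ["app_data", "app_index"])
print(violations)
```

An empty result means the hypothetical set is self-contained; anything else has to be moved into the set or dropped before the export.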


An interesting problem with ext4 on Oracle Linux 5.5

I have run into an interesting problem with my Red Hat 5.5 installation. Naively I assumed that since ext4 had been around for a long time it would be stable. For a test I performed for a friend, I created my database files on a file system formatted with ext4 and mounted it the same way I would have mounted an ext3 file system:

$ mount | grep ext4
/dev/mapper/mpath43p1 on /u02/oradata type ext4 (rw)

Now when I tried to create a data file of a certain size within a tablespace, I got block corruption, which I found very interesting. My first thought was: you must have a corruption of the file system. So I shut down all processes accessing /u02/oradata and gave the file system a thorough check.

Troubleshooting Oracle agent 12.1.0.1.0

As you may have read on this blog, I recently moved from Oracle Enterprise Manager 11.1 Grid Control to the full control of the cloud: 12.1 has taken its place in the lab.

I also managed to install agents via self download (my OEM is x86 to reduce the footprint) on a 2 node 11.2.0.3 cluster: rac11203node1 and rac11203node2. After a catastrophic crash of both nodes followed by a reboot, none of the agents wanted to report back to the OMS.

The difference

Oracle 12.1 has a new agent structure. In previous releases the agent base directory was used to create the AGENT_HOME; this has now changed. In 11.1 I could specify the agent base to be /u01/app/oracle/product, and OUI would deploy everything in a subdirectory it creates, called agent11g (or agent10g for 10.2.x).

Now I set the agent base to the same value and installed my agents in parallel, but found that there is no agent12c directory under the base. Instead I found these:

[oracle@rac11203node1 product]$ ls -l
total 48
drwxr-xr-x. 73 oracle oinstall  4096 Oct 27 22:40 11.2.0.3
-rw-rw-r--.  1 oracle oinstall    91 Sep 23 08:52 agentimage.properties
drwxr-xr-x.  6 oracle oinstall  4096 Oct 28 14:57 agent_inst
drwxr-xr-x.  3 oracle oinstall  4096 Oct 15 21:35 core
drwx------.  2 oracle oinstall 16384 Oct 14 21:02 lost+found
drwxr-xr-x.  8 oracle oinstall  4096 Oct 15 21:50 plugins
-rwxr-xr-x.  1 oracle oinstall   223 Oct 15 21:25 plugins.txt
-rw-r--r--.  1 oracle oinstall   298 Oct 15 21:42 plugins.txt.status
drwxr-xr-x.  5 oracle oinstall  4096 Oct 15 21:43 sbin

So it’s all a bit different. The core/ directory contains the agent binaries. The agent_inst directory contains the sysman directory. This is where all the configuration and state information is stored. In that respect the sysman directory is the same as in 11.1.

Now back to my problem: both agents that previously worked fine were reported as “unavailable”. The agent information is no longer under Setup > Agents > Management Agents.

For 12.1 you need to navigate to Setup > Agents from the drop-down menu in the upper right corner. This takes you to the overview page. OK, so I could see the agents weren’t communicating with the OMS.

On the machine I could see this:

[oracle@rac11203node1 log]$ emctl status agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version      : 12.1.0.1.0
OMS Version        : (unknown)
Protocol Version   : 12.1.0.1.0
Agent Home         : /u01/app/oracle/product/agent_inst
Agent Binaries     : /u01/app/oracle/product/core/12.1.0.1.0
Agent Process ID   : 13270
Parent Process ID  : 13215
Agent URL          : https://rac11203node1.localdomain:3872/emd/main/
Repository URL     : https://oem12oms.localdomain:4901/empbs/upload
Started at         : 2011-10-26 18:30:17
Started by user    : oracle
Last Reload        : (none)
Last successful upload                       : (none)
Last attempted upload                        : (none)
Total Megabytes of XML files uploaded so far : 0
Number of XML files pending upload           : 1,858
Size of XML files pending upload(MB)         : 8.05
Available disk space on upload filesystem    : 49.16%
Collection Status                            : Collections enabled
Last attempted heartbeat to OMS              : 2011-10-27 15:42:47
Last successful heartbeat to OMS             : (none)

---------------------------------------------------------------
Agent is Running and Ready
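
The telltale fields in this output are "OMS Version" and "Last successful heartbeat to OMS". A minimal Python sketch of checking them in a script, assuming the output keeps the "Key : value" layout shown above; the function names are my own and the sample is a trimmed copy of the output above.

```python
def parse_emctl_status(text):
    """Turn the 'Key   : value' lines of `emctl status agent` into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def never_reached_oms(fields):
    """True if the agent has not yet completed a heartbeat to the OMS."""
    return (fields.get("OMS Version") == "(unknown)"
            or fields.get("Last successful heartbeat to OMS") == "(none)")


sample = """\
Agent Version      : 12.1.0.1.0
OMS Version        : (unknown)
Repository URL     : https://oem12oms.localdomain:4901/empbs/upload
Last attempted heartbeat to OMS              : 2011-10-27 15:42:47
Last successful heartbeat to OMS             : (none)
"""
status = parse_emctl_status(sample)
print(never_reached_oms(status))
```

The banner, separator and "Agent is Running and Ready" lines contain no colon and are simply skipped.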

The settings are correct; I verified them against another agent that uploads and otherwise works fine. I have also secured the agent, and both $AGENT_BASE/agent_inst/sysman/log/secure.log and the emctl secure agent command reported normal, successful operation.

Still the stubborn thing doesn’t want to talk to the OMS – in the agent overview page both agents are listed as “unavailable”, but not blocked. When I force an upload, I get this:

[oracle@rac11203node1 log]$ emctl upload
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
EMD upload error:full upload has failed: uploadXMLFiles skipped :: OMS version not checked yet. If this issue persists check trace files for ping to OMS related errors. (OMS_DOWN)

However, it’s not down: I can reach it from another agent (which happens to be on the same box as the OMS).

[oracle@oem12oms 12.1.0.1.0]$ $ORACLE_HOME/bin/emctl status agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version      : 12.1.0.1.0
OMS Version        : 12.1.0.1.0
Protocol Version   : 12.1.0.1.0
Agent Home         : /u01/gc12.1/agent/agent_inst
Agent Binaries     : /u01/gc12.1/agent/core/12.1.0.1.0
Agent Process ID   : 2964
Parent Process ID  : 2910
Agent URL          : https://oem12oms.localdomain:3872/emd/main/
Repository URL     : https://oem12oms.localdomain:4901/empbs/upload
Started at         : 2011-10-15 21:00:37
Started by user    : oracle
Last Reload        : (none)
Last successful upload                       : 2011-10-27 15:46:38
Last attempted upload                        : 2011-10-27 15:46:38
Total Megabytes of XML files uploaded so far : 0
Number of XML files pending upload           : 0
Size of XML files pending upload(MB)         : 0
Available disk space on upload filesystem    : 49.16%
Collection Status                            : Collections enabled
Last attempted heartbeat to OMS              : 2011-10-27 15:48:34
Last successful heartbeat to OMS             : 2011-10-27 15:48:34

---------------------------------------------------------------
Agent is Running and Ready

And no, it isn’t the firewall: that is turned off, and I can connect to the upload URL from any machine in the network:

[oracle@rac11203node1 log]$ wget --no-check-certificate https://oem12oms.localdomain:4901/empbs/upload
--2011-10-27 15:55:46-- https://oem12oms.localdomain:4901/empbs/upload
Resolving oem12oms.localdomain... 192.168.99.28
Connecting to oem12oms.localdomain|192.168.99.28|:4901... connected.
WARNING: cannot verify oem12oms.localdomain’s certificate, issued by “/O=EnterpriseManager on oem12oms.localdomain/OU=EnterpriseManager on oem12oms.localdomain/L=EnterpriseManager on oem12oms.localdomain/ST=CA/C=US/CN=oem12oms.localdomain”:
Self-signed certificate encountered.
HTTP request sent, awaiting response... 200 OK
Length: 314 [text/html]
Saving to: “upload.1”

100%[======================================>] 314 --.-K/s in 0s

2011-10-27 15:55:46 (5.19 MB/s) - “upload.1” saved [314/314]

The agent complains about this in gcagent.log:

2011-10-27 15:56:08,947 [37:3F09CD9C] WARN – improper ping interval (EM_PING_NOTIF_RESPONSE: BACKOFF::180000)
2011-10-27 15:56:18,471 [167:E3E93C4C] WARN – improper ping interval (EM_PING_NOTIF_RESPONSE: BACKOFF::180000)
2011-10-27 15:56:18,472 [167:E3E93C4C] WARN – Ping protocol error
o.s.gcagent.ping.PingProtocolException [OMS sent an invalid response: “BACKOFF::180000”]
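
To see how often the OMS answers with this response, the warnings can be pulled out of gcagent.log with a few lines of Python. This is a sketch that assumes the line layout shown above (timestamp first, then the message); the function name is hypothetical, and the number after BACKOFF:: is extracted as-is because the excerpt does not state its unit.

```python
import re

# Timestamp at the start of the line, BACKOFF::<number> somewhere after it.
BACKOFF_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} .*BACKOFF::(\d+)"
)


def find_backoffs(lines):
    """Return (timestamp, backoff_value) for each timestamped BACKOFF warning."""
    hits = []
    for line in lines:
        match = BACKOFF_RE.match(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


# The four log lines from the excerpt above.
log_lines = [
    "2011-10-27 15:56:08,947 [37:3F09CD9C] WARN – improper ping interval (EM_PING_NOTIF_RESPONSE: BACKOFF::180000)",
    "2011-10-27 15:56:18,471 [167:E3E93C4C] WARN – improper ping interval (EM_PING_NOTIF_RESPONSE: BACKOFF::180000)",
    "2011-10-27 15:56:18,472 [167:E3E93C4C] WARN – Ping protocol error",
    "o.s.gcagent.ping.PingProtocolException [OMS sent an invalid response: “BACKOFF::180000”]",
]
backoffs = find_backoffs(log_lines)
print(backoffs)
```

The exception line carries no timestamp, so it is deliberately not counted a second time.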

At least someone in Oracle has some humour when it comes to this.

The Solution

Now I dug around a lot more and finally got to the bottom of it. It was actually a two-fold problem. The first agent was simply blocked. After I found a way to unblock it, it worked happily.

The second agent was a bit more trouble. I tried to unblock it as well from the agent page in OEM, which failed. As it turned out, the agent was shut down. And it didn’t start either:

[oracle@rac11203node2 12.1.0.1.0]$ emctl start agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation.  All rights reserved.
Starting agent ............. failed.
Target Metadata Loader failed at Startup
Consult the log files in: /u01/app/oracle/product/agent_inst/sysman/log

I checked the logs and found this interesting bit of information:

2011-10-24 21:35:21,387 [1:3305B9] INFO - Plugin oracle.sysman.oh is now active
2011-10-24 21:35:21,393 [1:3305B9] INFO - Plugin oracle.sysman.db is now active
2011-10-24 21:35:21,396 [1:3305B9] WARN - Agent failed to Startup for Target Metadata Loader in step 2
oracle.sysman.gcagent.metadata.MetadataLoadingException: The targets.xml file is empty
at oracle.sysman.gcagent.metadata.MetadataManager$Loader.validateMetadataFile(MetadataManager.java:799)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.processMDFile(MetadataManager.java:1733)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.readRegistry(MetadataManager.java:1695)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.load(MetadataManager.java:1641)
at oracle.sysman.gcagent.metadata.MetadataManager.load(MetadataManager.java:282)
at oracle.sysman.gcagent.metadata.MetadataManager.runStartupStep(MetadataManager.java:450)
at oracle.sysman.gcagent.metadata.MetadataManager.tmNotifier(MetadataManager.java:337)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.invokeNotifier(TMComponentSvc.java:876)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.invokeInitializationStep(TMComponentSvc.java:959)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.doInitializationStep(TMComponentSvc.java:800)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.notifierDriver(TMComponentSvc.java:740)
at oracle.sysman.gcagent.tmmain.TMMain.startup(TMMain.java:215)
at oracle.sysman.gcagent.tmmain.TMMain.agentMain(TMMain.java:458)
at oracle.sysman.gcagent.tmmain.TMMain.main(TMMain.java:447)
2011-10-24 21:35:21,397 [1:3305B9] INFO - Agent exiting with exit code 55
2011-10-24 21:35:21,398 [31:F9C26A76:Shutdown] INFO - *jetty*: Shutdown hook executing
2011-10-24 21:35:21,399 [31:F9C26A76] INFO - *jetty*: Graceful shutdown SslSelectChannelConnector@0.0.0.0:3872
2011-10-24 21:35:21,399 [31:F9C26A76] INFO - *jetty*: Graceful shutdown ContextHandler@14d964af@14d964af/emd/lifecycle/main,null

I have yet to find the reason for the empty targets.xml file, but sure enough it existed with a length of 0 bytes.

Simple enough I thought, all I need to do is run agentca to repopulate the file. Unfortunately I couldn’t find it.

[oracle@rac11203node2 emd]$ find /u01/app/oracle/product/ -name "agentca*"
[oracle@rac11203node2 emd]$

This was a bit of a letdown. Then I decided to create a new targets.xml file and try a resynchronisation of the agent. This is a well hidden menu item so I decided to show it here:

The only element that went into targets.xml was “<targets />”. This was sufficient to start the agent, which is a requirement for the resynchronisation to succeed. I was quite amazed that this succeeded, but it did.

This was very encouraging, and both agents are now working properly.
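
The empty-file repair described above can be sketched in a few lines of Python. This is a hedged illustration, not an Oracle-supplied procedure: the function name is made up, and the demo runs against a temporary file rather than the real targets.xml (which I assume sits in the emd directory under agent_inst/sysman, going by the shell prompt above). The agent still needs the resynchronisation from OEM afterwards to get its targets back.

```python
import os
import tempfile


def reset_empty_targets(path):
    """Write a bare <targets /> element if the file exists but is 0 bytes long.

    Returns True if the file was rewritten, False if it already had content.
    """
    if os.path.getsize(path) != 0:
        return False  # never overwrite a populated targets.xml
    with open(path, "w") as handle:
        handle.write("<targets />\n")
    return True


# Demo on a throwaway file standing in for the real targets.xml.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "targets.xml")
open(demo_file, "w").close()  # create a 0-byte file, as found after the crash

first_run = reset_empty_targets(demo_file)
second_run = reset_empty_targets(demo_file)
print(first_run, second_run)
```

The size check is the important design choice: a file with real target definitions is left alone, so the function is safe to run twice.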

Troubleshooting Grid Infrastructure startup

Today brought an interesting story when one of my blades decided to reboot after an EXT3 journal error. The hard facts first:

  • Oracle Linux 5.5 with kernel 2.6.18-194.11.4.0.1.el5
  • Oracle 11.2.0.2 RAC
  • Bonded NICs for private and public networks.
  • Private bondn device defined on a VLAN
  • BL685-G6 with 128G RAM

I first noticed the node had problems when I tried to list all databases configured on the cluster. I got the dreaded “cannot communicate with the CRSD”:

[oracle@node1.example.com] $ srvctl config database
PRCR-1119 : Failed to look up CRS resources of database type
PRCR-1115 : Failed to find entities of type resource that match filters (TYPE ==ora.database.type) and contain attributes DB_UNIQUE_NAME,ORACLE_HOME,VERSION
Cannot communicate with crsd

Not too great, especially since everything worked when I left yesterday. What could have gone wrong?

Troubleshooting ora.net1.network on an 8 node cluster

It seems I am doing a lot of fixing broken stuff recently. So this time I have been asked to repair a broken 8 node RAC cluster on OEL 5.5 with Oracle RAC 11.2.0.2. The system has been moved into a different, more secure network, and its firewalls prevented all access to the machines except for ILO. Another way of “security through obscurity”. The new network didn’t allow any clients to connect to any of the 8 RAC nodes, which means that it is actually quite expensive kit to sit idle. The cluster is not in production, it’s still being built to specification, but this accessibility problem has been a holdup to the project for a little while now. Yesterday brought a breakthrough: the netops team found an error in their configuration and for the first time the hosts could be accessed via ssh. Unfortunately for me that access is possible only via audited gateways using PowerBroker, to which I don’t have access.

Oracle Support-final update to SR

Just had a really pleasant exchange with Oracle support. I was after a way to purge the repository database of an OEM 11.1 Grid Control installation without having to blow it all away. Unfortunately, there is no such option. However, what I liked was this final update from the support member:

Generic Note
————————
Martin,

From sunny Colorado – blue sky and SNOW! – I do wish we could have provided a better option.

But I do want to thank you so much for your kindness and patience. You are the best kind of customer to work with. That means a lot, in these challenging jobs.

Very best,
Thom

The whole SR was well and competently managed by Thom, and at no time did he come up with techniques to buy more time by asking for irrelevant log files or similar. I wish more support staff were like him.

UKOUG RAC&HA SIG September 2010

Just a quick one to announce that I’ll present at said event. Here’s the short synopsis of my talk:

Upgrading to Oracle Real Application Clusters 11.2

With the end of premier support in sight in mid-2011, many businesses are starting to look at possible upgrade paths. With the majority of RAC systems deployed on Oracle 10g, there is a strong demand to upgrade these systems to 11.2. The presentation focuses on different upgrade paths, including Grid Infrastructure and the RDBMS. Alternative approaches to upgrading the software will be discussed as well. Experience from migrations performed at a large financial institution rounds off the presentation.