Category Archives: Grid Control

Availability Infrastructure & Management SIG March 14th 2012

I am proud to be able to speak at the first instalment of the Availability, infrastructure and management SIG on March 14th in  the London City office.

The event is announced on the UKOUG website here:

http://www.ukoug.org/events/ukoug-availability-infrastructure-and-management-sig-meeting/

Unfortunately I will be between you and lunch! I hope that works out, and I don’t overrun.

I am going to demonstrate my (little) knowledge of Oracle Enterprise Manager 12.1. I looked at it for one of my customers and came to like it. As it is very different from the previous versions of the product, a more closely focused session seems appropriate. An Internet connection permitting, I am going to demonstrate navigation through the new interface, self update, target discovery and if time permits, I will patch a single instance HA environment (also known as Oracle Restart).

If all demos work this could be quite an entertaining sessions, questions are welcome!

Advertisements

Troubleshooting Oracle agent 12.1.0.1.0

As you may have read on this blog I recently moved from Oracle Enterprise Manager 11.1 GRID control to the full control of the cloud-12.1 has taken its place in the lab.

I also managed to install agents via self download (my OEM is x86 to reduce the footprint) on a 2 node 11.2.0.3 cluster: rac11203node1 and rac11203node2. After a catastrophic crash of both nodes followed by a reboot none of the agents wanted to report back to the OMS.

The difference

Oracle 12.1 has a new agent structure: where you used the agent base directory in previous releases to create the AGENT_HOME this now changed. In 11.1 I could specify the agent base to be /u01/app/oracle/product, and OUI would deploy everything in a subdirectory it creates, called agent11g (or agent 10g for 10.2.x).

Now I set the agent base to the same value and installed my agents in parallel, but found that there is no agent12c directory under the base. Instead I found these:

[oracle@rac11203node1 product]$ ls -l
total 48
drwxr-xr-x. 73 oracle oinstall  4096 Oct 27 22:40 11.2.0.3
-rw-rw-r--.  1 oracle oinstall    91 Sep 23 08:52 agentimage.properties
drwxr-xr-x.  6 oracle oinstall  4096 Oct 28 14:57 agent_inst
drwxr-xr-x.  3 oracle oinstall  4096 Oct 15 21:35 core
drwx------.  2 oracle oinstall 16384 Oct 14 21:02 lost+found
drwxr-xr-x.  8 oracle oinstall  4096 Oct 15 21:50 plugins
-rwxr-xr-x.  1 oracle oinstall   223 Oct 15 21:25 plugins.txt
-rw-r--r--.  1 oracle oinstall   298 Oct 15 21:42 plugins.txt.status
drwxr-xr-x.  5 oracle oinstall  4096 Oct 15 21:43 sbin

So it’s all a bit different. The core/ directory contains the agent binaries. The agent_inst directory contains the the sysman directory. This is where all the configuration and state information is stored. In that respect the sysman directory is the same as in 11.1.

Now back to my problem-both agents that previously used to work fine were reported “unavailable”. The agent information is no longer in the setup-agents-management agents.

For 12.1 you need to navigate to setup-agents from the top down menu in the upper right corner.This takes you to the overview page. OK, so I could see the agents weren’t communicating with the OMS.

On the machine I could see this:

[oracle@rac11203node1 log]$ emctl status agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version      : 12.1.0.1.0
OMS Version        : (unknown)
Protocol Version   : 12.1.0.1.0
Agent Home         : /u01/app/oracle/product/agent_inst
Agent Binaries     : /u01/app/oracle/product/core/12.1.0.1.0
Agent Process ID   : 13270
Parent Process ID  : 13215
Agent URL          : https://rac11203node1.localdomain:3872/emd/main/
Repository URL     : https://oem12oms.localdomain:4901/empbs/upload
Started at         : 2011-10-26 18:30:17
Started by user    : oracle
Last Reload        : (none)
Last successful upload                       : (none)
Last attempted upload                        : (none)
Total Megabytes of XML files uploaded so far : 0
Number of XML files pending upload           : 1,858
Size of XML files pending upload(MB)         : 8.05
Available disk space on upload filesystem    : 49.16%
Collection Status                            : Collections enabled
Last attempted heartbeat to OMS              : 2011-10-27 15:42:47
Last successful heartbeat to OMS             : (none)

---------------------------------------------------------------
Agent is Running and Ready

The settings are correct, I have verified that with another, uploading and otherwise fine agent. I have also secured the agent, and $AGENT_BASE/agent_inst/sysman/log/secure.log as well as the emctl secure agent commands reported normal, successful operation.

Still the stubborn thing doesn’t want to talk to the OMS – in the agent overview page both agents are listed as “unavailable”, but not blocked. When I force an upload, I get this:

[oracle@rac11203node1 log]$ emctl upload
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
EMD upload error:full upload has failed: uploadXMLFiles skipped :: OMS version not checked yet. If this issue persists check trace files for ping to OMS related errors. (OMS_DOWN)

However it’s not down, I can reach it from another agent (which happens to be on the same box as the OMS)

[oracle@oem12oms 12.1.0.1.0]$ $ORACLE_HOME/bin/emctl status agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version      : 12.1.0.1.0
OMS Version        : 12.1.0.1.0
Protocol Version   : 12.1.0.1.0
Agent Home         : /u01/gc12.1/agent/agent_inst
Agent Binaries     : /u01/gc12.1/agent/core/12.1.0.1.0
Agent Process ID   : 2964
Parent Process ID  : 2910
Agent URL          : https://oem12oms.localdomain:3872/emd/main/
Repository URL     : https://oem12oms.localdomain:4901/empbs/upload
Started at         : 2011-10-15 21:00:37
Started by user    : oracle
Last Reload        : (none)
Last successful upload                       : 2011-10-27 15:46:38
Last attempted upload                        : 2011-10-27 15:46:38
Total Megabytes of XML files uploaded so far : 0
Number of XML files pending upload           : 0
Size of XML files pending upload(MB)         : 0
Available disk space on upload filesystem    : 49.16%
Collection Status                            : Collections enabled
Last attempted heartbeat to OMS              : 2011-10-27 15:48:34
Last successful heartbeat to OMS             : 2011-10-27 15:48:34

---------------------------------------------------------------
Agent is Running and Ready

And no, the firewall is turned off and I can connect to the upload from any machine in the network:

[oracle@rac11203node1 log]$ wget --no-check-certificate https://oem12oms.localdomain:4901/empbs/upload
--2011-10-27 15:55:46-- https://oem12oms.localdomain:4901/empbs/upload
Resolving oem12oms.localdomain... 192.168.99.28
Connecting to oem12oms.localdomain|192.168.99.28|:4901... connected.
WARNING: cannot verify oem12oms.localdomain’s certificate, issued by “/O=EnterpriseManager on oem12oms.localdomain/OU=EnterpriseManager on oem12oms.localdomain/L=EnterpriseManager on oem12oms.localdomain/ST=CA/C=US/CN=oem12oms.localdomain”:
Self-signed certificate encountered.
HTTP request sent, awaiting response... 200 OK
Length: 314 [text/html]
Saving to: “upload.1”

100%[======================================>] 314 --.-K/s in 0s

2011-10-27 15:55:46 (5.19 MB/s) - “upload.1” saved [314/314]

The agent complains about this in gcagent.log:

2011-10-27 15:56:08,947 [37:3F09CD9C] WARN – improper ping interval (EM_PING_NOTIF_RESPONSE: BACKOFF::180000)
2011-10-27 15:56:18,471 [167:E3E93C4C] WARN – improper ping interval (EM_PING_NOTIF_RESPONSE: BACKOFF::180000)
2011-10-27 15:56:18,472 [167:E3E93C4C] WARN – Ping protocol error
o.s.gcagent.ping.PingProtocolException [OMS sent an invalid response: “BACKOFF::180000”]

At least someone in Oracle has some humour when it comes to this.

The Solution

Now I dug around a lot more and finally managed to get to the conclusion. It was actually a two fold problem. The first agent was simply blocked. After finding a way to unblock it, it worked happily.

The second agent was a bit more trouble. I unblocked it as well from the agent page in OEM, which failed. As it turned out the agent was shut down. And it didn’t start either:

[oracle@rac11203node2 12.1.0.1.0]$ emctl start agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation.  All rights reserved.
Starting agent ............. failed.
Target Metadata Loader failed at Startup
Consult the log files in: /u01/app/oracle/product/agent_inst/sysman/log

I checked the logs and found this interesting bit of information:

2011-10-24 21:35:21,387 [1:3305B9] INFO - Plugin oracle.sysman.oh is now active
2011-10-24 21:35:21,393 [1:3305B9] INFO - Plugin oracle.sysman.db is now active
2011-10-24 21:35:21,396 [1:3305B9] WARN - Agent failed to Startup for Target Metadata Loader in step 2
oracle.sysman.gcagent.metadata.MetadataLoadingException: The targets.xml file is empty
at oracle.sysman.gcagent.metadata.MetadataManager$Loader.validateMetadataFile(MetadataManager.java:799)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.processMDFile(MetadataManager.java:1733)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.readRegistry(MetadataManager.java:1695)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.load(MetadataManager.java:1641)
at oracle.sysman.gcagent.metadata.MetadataManager.load(MetadataManager.java:282)
at oracle.sysman.gcagent.metadata.MetadataManager.runStartupStep(MetadataManager.java:450)
at oracle.sysman.gcagent.metadata.MetadataManager.tmNotifier(MetadataManager.java:337)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.invokeNotifier(TMComponentSvc.java:876)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.invokeInitializationStep(TMComponentSvc.java:959)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.doInitializationStep(TMComponentSvc.java:800)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.notifierDriver(TMComponentSvc.java:740)
at oracle.sysman.gcagent.tmmain.TMMain.startup(TMMain.java:215)
at oracle.sysman.gcagent.tmmain.TMMain.agentMain(TMMain.java:458)
at oracle.sysman.gcagent.tmmain.TMMain.main(TMMain.java:447)
2011-10-24 21:35:21,397 [1:3305B9] INFO - Agent exiting with exit code 55
2011-10-24 21:35:21,398 [31:F9C26A76:Shutdown] INFO - *jetty*: Shutdown hook executing
2011-10-24 21:35:21,399 [31:F9C26A76] INFO - *jetty*: Graceful shutdown SslSelectChannelConnector@0.0.0.0:3872
2011-10-24 21:35:21,399 [31:F9C26A76] INFO - *jetty*: Graceful shutdown ContextHandler@14d964af@14d964af/emd/lifecycle/main,null

I yet have to find the reason for the empty targets.xml file but sure enough it existed with 0 byes length.

Simple enough I thought, all I need to do is run agentca to repopulate the file. Unfortunately I couldn’t find it.

[oracle@rac11203node2 emd]$ find /u01/app/oracle/product/ -name "agentca*"
[oracle@rac11203node2 emd]$

This was a bit of a let down. Then I decided to create a new targets.xml file and try a resynchronisation of the agent.This is a well hidden menu item so I dedided to show it here:

The only element that went into targets.xml was “<targets />”. This was sufficient to start the agent, which is a requirement for the resynchronisation to succeed. I was quite amazed that this succeeded, but it did:

[oracle@rac11203node2 emd]$ find /u01/app/oracle/product/ -name "agentca*"
[oracle@rac11203node2 emd]$

This was very encouraging, and both agents are now working properly.

Move the EM12c repository database

I have made a little mistake creating a RAC database for the OEM 12c repository-I now need a little more lightweight solution, especially since I’m going to do some fancy failover testing with this cluster soon! An 11.2.0.3 single instance database without ASM, that’s what I’ll have!

Now how to move the repository database? I have to admit I haven’t done this before, so the plan I came up with is:

  1. Shut down the OMS
  2. Create a backup of the database
  3. Transfer the backup to the destination host
  4. Restore database
  5. Update OEM configuration
  6. Start OMS

Continue reading

Installing Oracle Enterprise Manager 12c on OL 5.7

I have been closely involved in the upgrade discussion of my current customer’s Enterprise Managers setup from an engineering point of view. The client uses OEM extensively for monitoring, alerts generated by it are automatically forwarded to an IBM product called Netcool.

Now some of the management servers are still on 10.2.0.5 in certain regions, and for a private cloud project I was involved in an 11.1 system was needed.The big question was: wait for 12.1 or upgrade to 11.1?

So to cut a long story short I have been very keen to get to the OEM 12c beta programme, but unfortunately wasn’t able to make it. Also, I wasn’t at Open World this year which means I didn’t get to see any of the demos. You can imagine I was quite curious to get my hands on it, and when it has been released a few days ago I downloaded it to my lab machine. I created a new domU for the database-11.2.0.2 plus latest PSU and another one for the management server. I assigned 2 CPUs each, the database server got 2G of memory while the OMS received 8.Don’t take this as a recommendation though, it’s only for lab use! I wouldn’t use less than 24G of memory for a production management server, and it would obviously follow the MAA recommendations and be installed behind an enterprise grade load balancer etc. Needless to say I’d use RAC+Data Guard for the repository database.

Continue reading

Cluster callouts to create blackouts in EM

Finally I got around to providing a useful example for a cluster callout script. It is actually on the verge of taking too long-remember that scripts in the $GRID_HOME/racg/usrco/ directory should execute quickly. Before deploying this, you should definitely ensure that the script executes quickly enough-the “time” utility can help you with this. Nevertheless this has been necessary to work around a limitation of Grid Control: RAC One Node databases are not supported in GC 11.1 (I complained about that earlier).

The Problem

To work around the problem I wrote a script which can alleviate one of the arising problems: when using srvctl relocate database, another instance (usually called dbName_2) will be started to allow existing sessions to survive the failover operation if they use TAF or FAN/FCF.

This poses a big problem to Grid Control though-the second instance didn’t exist when you registered the database as a target, hence GC doesn’t know about it. Subsequently you may get paged that the database is down when in reality it is not. Receiving one of the “false positive” alarms is annoying at best at 02:00 AM in the morning. Actually, Grid Control is right in assuming that the database is down: although detected as a cluster database target, it only consists of 1 instance. If that’s down, it has to be assumed that the whole cluster database is down. In a perfect world we wouldn’t have this problem-GC was aware that the RON database moved to another node in the cluster and update its configuration accordingly. This is planned for the next major release sometime later in 2011. Apparently dbconsole has the ability to deal with such a situation. Continue reading

Automatic log gathering for Grid Control 11.1

Still debugging the OMS problem (it occasionally hangs and has to be restarted) I wrote a small shell script to help me gather all required logs for Oracle support. These are the logs I need for the SR, Niall Litchfield has written a recent blog post about other useful log locations.

The script is basic, and can possibly be extended. However it saved me a lot of time getting all the required information to one place from where I could take it and attach it to the service request. Before uploading I usually zip all files into dd-mm-yyyy-logs.nnn.zip to avoid clashing with logs already uploaded. I run the script via cron daily at 09:30. Continue reading

Quis custodiet ipsos custodies-Nagios monitoring for Grid Control

I have a strange problem with my Grid Control 11.10.1.2 Management Server in a Solaris 10 zone. When restarted, the OMS will serve requests fine for about 2 to 4 hours and then “hang”. Checking the Admin Server console I can see that there are stuck threads. The same information is also recorded in the logs.

NB: the really confusing part about Grid Control 11.1 is the use of Weblogic-you thought you knew where the Grid Control logs where? Forget about what you knew about 10.2 and enter a different dimension :)

So to be able to react quicker to a hang of the OMS (or EMGC_OMS1 to be more precise) I set up nagios to periodically poll the login page.

I’m using a VM with OEL 5.5 64bit to deploy nagios to, the requirements are very moderate. The install process is well documented in the quickstart guide-I’m using Fedora as a basis. OEL 5.5 doesn’t have nagios 3 RPMs available, so I decided to use the source downloaded from nagios.org. The tarballs you need are nagios-3.2.3.tar.gz and nagios-plugins-1.4.15.tar.gz at the time of this writing. Continue reading