Martins Blog

Trying to explain complex things in simple terms

Archive for the ‘Grid Control’ Category

Availability Infrastructure & Management SIG March 14th 2012

Posted by Martin Bach on March 3, 2012

I am proud to be able to speak at the first instalment of the Availability, Infrastructure and Management SIG on March 14th in the London City office.

The event is announced on the UKOUG website here:

http://www.ukoug.org/events/ukoug-availability-infrastructure-and-management-sig-meeting/

Unfortunately I will be between you and lunch! I hope that works out, and I don’t overrun.

I am going to demonstrate my (little) knowledge of Oracle Enterprise Manager 12.1. I looked at it for one of my customers and came to like it. As it is very different from the previous versions of the product, a more closely focused session seems appropriate. Internet connection permitting, I am going to demonstrate navigation through the new interface, self update, target discovery and, if time permits, I will patch a single instance HA environment (also known as Oracle Restart).

If all demos work this could be quite an entertaining session. Questions are welcome!

Posted in Grid Control, Linux, Public Appearances, Xen | 4 Comments »

Troubleshooting Oracle agent 12.1.0.1.0

Posted by Martin Bach on October 28, 2011

As you may have read on this blog, I recently moved from Oracle Enterprise Manager 11.1 Grid Control to the full control of the cloud: 12.1 has taken its place in the lab.

I also managed to install agents via self download (my OEM is x86 to reduce the footprint) on a two-node 11.2.0.3 cluster: rac11203node1 and rac11203node2. After a catastrophic crash of both nodes followed by a reboot, none of the agents wanted to report back to the OMS.

The difference

Oracle 12.1 has a new agent structure. In previous releases the agent base directory was used to create the AGENT_HOME; this has now changed. In 11.1 I could specify the agent base to be /u01/app/oracle/product, and OUI would deploy everything in a subdirectory it creates, called agent11g (or agent10g for 10.2.x).

Now I set the agent base to the same value and installed my agents in parallel, but found that there is no agent12c directory under the base. Instead I found these:

[oracle@rac11203node1 product]$ ls -l
total 48
drwxr-xr-x. 73 oracle oinstall  4096 Oct 27 22:40 11.2.0.3
-rw-rw-r--.  1 oracle oinstall    91 Sep 23 08:52 agentimage.properties
drwxr-xr-x.  6 oracle oinstall  4096 Oct 28 14:57 agent_inst
drwxr-xr-x.  3 oracle oinstall  4096 Oct 15 21:35 core
drwx------.  2 oracle oinstall 16384 Oct 14 21:02 lost+found
drwxr-xr-x.  8 oracle oinstall  4096 Oct 15 21:50 plugins
-rwxr-xr-x.  1 oracle oinstall   223 Oct 15 21:25 plugins.txt
-rw-r--r--.  1 oracle oinstall   298 Oct 15 21:42 plugins.txt.status
drwxr-xr-x.  5 oracle oinstall  4096 Oct 15 21:43 sbin

So it’s all a bit different. The core/ directory contains the agent binaries. The agent_inst directory contains the sysman directory. This is where all the configuration and state information is stored. In that respect the sysman directory is the same as in 11.1.

Now back to my problem: both agents that previously used to work fine were reported “unavailable”. The agent information is no longer found under Setup > Agents > Management Agents.

For 12.1 you need to navigate to Setup > Agents from the drop-down menu in the upper right corner. This takes you to the overview page. OK, so I could see the agents weren’t communicating with the OMS.

On the machine I could see this:

[oracle@rac11203node1 log]$ emctl status agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version      : 12.1.0.1.0
OMS Version        : (unknown)
Protocol Version   : 12.1.0.1.0
Agent Home         : /u01/app/oracle/product/agent_inst
Agent Binaries     : /u01/app/oracle/product/core/12.1.0.1.0
Agent Process ID   : 13270
Parent Process ID  : 13215
Agent URL          : https://rac11203node1.localdomain:3872/emd/main/
Repository URL     : https://oem12oms.localdomain:4901/empbs/upload
Started at         : 2011-10-26 18:30:17
Started by user    : oracle
Last Reload        : (none)
Last successful upload                       : (none)
Last attempted upload                        : (none)
Total Megabytes of XML files uploaded so far : 0
Number of XML files pending upload           : 1,858
Size of XML files pending upload(MB)         : 8.05
Available disk space on upload filesystem    : 49.16%
Collection Status                            : Collections enabled
Last attempted heartbeat to OMS              : 2011-10-27 15:42:47
Last successful heartbeat to OMS             : (none)

---------------------------------------------------------------
Agent is Running and Ready
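The telling detail in this output is that the agent claims to be running while "Last successful heartbeat to OMS" still says (none). A minimal sketch of a filter that picks this up, assuming the output format shown above; the function names are made up for illustration:

```shell
#!/bin/sh
# Sketch: flag an agent that runs but has never reached its OMS.
# Reads "emctl status agent" output on stdin; the field label is copied
# verbatim from the output shown above.
last_heartbeat() {
    # Print whatever follows the colon on the "Last successful heartbeat" line.
    sed -n 's/^Last successful heartbeat to OMS *: *//p'
}

agent_reaches_oms() {
    # Succeeds only when a heartbeat timestamp is present.
    hb=$(last_heartbeat)
    [ -n "$hb" ] && [ "$hb" != "(none)" ]
}

# Usage (on the agent host):
#   emctl status agent | agent_reaches_oms || echo "agent has never reached the OMS"
```

This only reads the status text; it does not replace a look at gcagent.log.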

The settings are correct: I have verified them against another agent that uploads and otherwise works fine. I have also secured the agent, and both $AGENT_BASE/agent_inst/sysman/log/secure.log and the emctl secure agent command reported normal, successful operation.

Still the stubborn thing doesn’t want to talk to the OMS – in the agent overview page both agents are listed as “unavailable”, but not blocked. When I force an upload, I get this:

[oracle@rac11203node1 log]$ emctl upload
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
EMD upload error:full upload has failed: uploadXMLFiles skipped :: OMS version not checked yet. If this issue persists check trace files for ping to OMS related errors. (OMS_DOWN)

However, it’s not down: I can reach it from another agent (which happens to be on the same box as the OMS):

[oracle@oem12oms 12.1.0.1.0]$ $ORACLE_HOME/bin/emctl status agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version      : 12.1.0.1.0
OMS Version        : 12.1.0.1.0
Protocol Version   : 12.1.0.1.0
Agent Home         : /u01/gc12.1/agent/agent_inst
Agent Binaries     : /u01/gc12.1/agent/core/12.1.0.1.0
Agent Process ID   : 2964
Parent Process ID  : 2910
Agent URL          : https://oem12oms.localdomain:3872/emd/main/
Repository URL     : https://oem12oms.localdomain:4901/empbs/upload
Started at         : 2011-10-15 21:00:37
Started by user    : oracle
Last Reload        : (none)
Last successful upload                       : 2011-10-27 15:46:38
Last attempted upload                        : 2011-10-27 15:46:38
Total Megabytes of XML files uploaded so far : 0
Number of XML files pending upload           : 0
Size of XML files pending upload(MB)         : 0
Available disk space on upload filesystem    : 49.16%
Collection Status                            : Collections enabled
Last attempted heartbeat to OMS              : 2011-10-27 15:48:34
Last successful heartbeat to OMS             : 2011-10-27 15:48:34

---------------------------------------------------------------
Agent is Running and Ready

And no, it is not the firewall: it is turned off, and I can connect to the upload URL from any machine in the network:

[oracle@rac11203node1 log]$ wget --no-check-certificate https://oem12oms.localdomain:4901/empbs/upload
--2011-10-27 15:55:46-- https://oem12oms.localdomain:4901/empbs/upload
Resolving oem12oms.localdomain... 192.168.99.28
Connecting to oem12oms.localdomain|192.168.99.28|:4901... connected.
WARNING: cannot verify oem12oms.localdomain’s certificate, issued by “/O=EnterpriseManager on oem12oms.localdomain/OU=EnterpriseManager on oem12oms.localdomain/L=EnterpriseManager on oem12oms.localdomain/ST=CA/C=US/CN=oem12oms.localdomain”:
Self-signed certificate encountered.
HTTP request sent, awaiting response... 200 OK
Length: 314 [text/html]
Saving to: “upload.1”

100%[======================================>] 314 --.-K/s in 0s

2011-10-27 15:55:46 (5.19 MB/s) - “upload.1” saved [314/314]

The agent complains about this in gcagent.log:

2011-10-27 15:56:08,947 [37:3F09CD9C] WARN – improper ping interval (EM_PING_NOTIF_RESPONSE: BACKOFF::180000)
2011-10-27 15:56:18,471 [167:E3E93C4C] WARN – improper ping interval (EM_PING_NOTIF_RESPONSE: BACKOFF::180000)
2011-10-27 15:56:18,472 [167:E3E93C4C] WARN – Ping protocol error
o.s.gcagent.ping.PingProtocolException [OMS sent an invalid response: “BACKOFF::180000”]

At least someone in Oracle has some humour when it comes to this.

The Solution

I dug around a lot more and finally got to the bottom of it. It was actually a two-fold problem. The first agent was simply blocked. After finding a way to unblock it, it worked happily.

The second agent was a bit more trouble. I tried to unblock it as well from the agent page in OEM, which failed. As it turned out the agent was shut down. And it didn’t start either:

[oracle@rac11203node2 12.1.0.1.0]$ emctl start agent
Oracle Enterprise Manager 12c Cloud Control 12.1.0.1.0
Copyright (c) 1996, 2011 Oracle Corporation.  All rights reserved.
Starting agent ............. failed.
Target Metadata Loader failed at Startup
Consult the log files in: /u01/app/oracle/product/agent_inst/sysman/log

I checked the logs and found this interesting bit of information:

2011-10-24 21:35:21,387 [1:3305B9] INFO - Plugin oracle.sysman.oh is now active
2011-10-24 21:35:21,393 [1:3305B9] INFO - Plugin oracle.sysman.db is now active
2011-10-24 21:35:21,396 [1:3305B9] WARN - Agent failed to Startup for Target Metadata Loader in step 2
oracle.sysman.gcagent.metadata.MetadataLoadingException: The targets.xml file is empty
at oracle.sysman.gcagent.metadata.MetadataManager$Loader.validateMetadataFile(MetadataManager.java:799)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.processMDFile(MetadataManager.java:1733)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.readRegistry(MetadataManager.java:1695)
at oracle.sysman.gcagent.metadata.MetadataManager$RegistryLoader.load(MetadataManager.java:1641)
at oracle.sysman.gcagent.metadata.MetadataManager.load(MetadataManager.java:282)
at oracle.sysman.gcagent.metadata.MetadataManager.runStartupStep(MetadataManager.java:450)
at oracle.sysman.gcagent.metadata.MetadataManager.tmNotifier(MetadataManager.java:337)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.invokeNotifier(TMComponentSvc.java:876)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.invokeInitializationStep(TMComponentSvc.java:959)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.doInitializationStep(TMComponentSvc.java:800)
at oracle.sysman.gcagent.tmmain.lifecycle.TMComponentSvc.notifierDriver(TMComponentSvc.java:740)
at oracle.sysman.gcagent.tmmain.TMMain.startup(TMMain.java:215)
at oracle.sysman.gcagent.tmmain.TMMain.agentMain(TMMain.java:458)
at oracle.sysman.gcagent.tmmain.TMMain.main(TMMain.java:447)
2011-10-24 21:35:21,397 [1:3305B9] INFO - Agent exiting with exit code 55
2011-10-24 21:35:21,398 [31:F9C26A76:Shutdown] INFO - *jetty*: Shutdown hook executing
2011-10-24 21:35:21,399 [31:F9C26A76] INFO - *jetty*: Graceful shutdown SslSelectChannelConnector@0.0.0.0:3872
2011-10-24 21:35:21,399 [31:F9C26A76] INFO - *jetty*: Graceful shutdown ContextHandler@14d964af@14d964af/emd/lifecycle/main,null

I have yet to find the reason for the empty targets.xml file, but sure enough it existed with a length of 0 bytes.

Simple enough I thought, all I need to do is run agentca to repopulate the file. Unfortunately I couldn’t find it.

[oracle@rac11203node2 emd]$ find /u01/app/oracle/product/ -name "agentca*"
[oracle@rac11203node2 emd]$

This was a bit of a let-down. Then I decided to create a new targets.xml file and try a resynchronisation of the agent. This is a well-hidden menu item, so I decided to point it out here.

The only element that went into targets.xml was “<targets />”. This was sufficient to start the agent, which is a requirement for the resynchronisation to succeed. I was quite amazed that this succeeded, but it did.

This was very encouraging, and both agents are now working properly.
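The placeholder step described above can be sketched roughly as follows. This is an illustration under assumptions rather than the exact commands used: the function name is made up, and the path to targets.xml has to be supplied by the caller.

```shell
#!/bin/sh
# Sketch: replace a zero-byte targets.xml with an empty <targets /> element.
# The file location is not hard-coded; pass the real path as the argument.
reset_empty_targets() {
    file=$1
    [ -e "$file" ] || { echo "no such file: $file" >&2; return 1; }
    # -s is true for a non-empty file; leave those untouched.
    if [ -s "$file" ]; then
        echo "$file is not empty, leaving it alone" >&2
        return 2
    fi
    # Keep a copy of the empty original for the post-mortem, then write the placeholder.
    cp -p "$file" "$file.empty.bak" || return 1
    printf '<targets />\n' > "$file"
}

# Usage (path is an example only):
#   reset_empty_targets /path/to/agent_inst/sysman/emd/targets.xml
```

The agent still needs to be started and resynchronised from the OMS afterwards; the placeholder file only gets it past the startup check.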

Posted in Grid Control, Linux, War Stories | 2 Comments »

Move the EM12c repository database

Posted by Martin Bach on October 17, 2011

I made a little mistake creating a RAC database for the OEM 12c repository. I now need a slightly more lightweight solution, especially since I’m going to do some fancy failover testing with this cluster soon! An 11.2.0.3 single instance database without ASM, that’s what I’ll have!

Now how to move the repository database? I have to admit I haven’t done this before, so the plan I came up with is:

  1. Shut down the OMS
  2. Create a backup of the database
  3. Transfer the backup to the destination host
  4. Restore database
  5. Update OEM configuration
  6. Start OMS
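Because each step depends on the previous one, it helps to run the plan so that it stops at the first failure. The following is only a skeleton: every step function is a placeholder that prints what it stands for, and the real commands have to be filled in for your environment.

```shell
#!/bin/sh
# Sketch: run the migration steps in order and stop at the first failure.
# Each step function is a placeholder that only prints what it stands for.
stop_oms()        { echo "1. shut down the OMS"; }
backup_db()       { echo "2. create a backup of the database"; }
transfer_backup() { echo "3. transfer the backup to the destination host"; }
restore_db()      { echo "4. restore the database"; }
update_oem()      { echo "5. update the OEM configuration"; }
start_oms()       { echo "6. start the OMS"; }

run_plan() {
    for step in "$@"; do
        # A failing step aborts the plan so later steps never run on a half-done state.
        "$step" || { echo "step '$step' failed, stopping" >&2; return 1; }
    done
}

# Usage:
#   run_plan stop_oms backup_db transfer_backup restore_db update_oem start_oms
```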

Read the rest of this entry »

Posted in Grid Control, Linux, Xen | 4 Comments »

Installing Oracle Enterprise Manager 12c on OL 5.7

Posted by Martin Bach on October 7, 2011

I have been closely involved in the upgrade discussion of my current customer’s Enterprise Manager setup from an engineering point of view. The client uses OEM extensively for monitoring; alerts generated by it are automatically forwarded to an IBM product called Netcool.

Now some of the management servers are still on 10.2.0.5 in certain regions, and for a private cloud project I was involved in an 11.1 system was needed. The big question was: wait for 12.1 or upgrade to 11.1?

So to cut a long story short, I have been very keen to get onto the OEM 12c beta programme, but unfortunately wasn’t able to make it. Also, I wasn’t at Open World this year, which means I didn’t get to see any of the demos. You can imagine I was quite curious to get my hands on it, and when it was released a few days ago I downloaded it to my lab machine. I created a new domU for the database (11.2.0.2 plus the latest PSU) and another one for the management server. I assigned 2 CPUs each; the database server got 2G of memory while the OMS received 8G. Don’t take this as a recommendation though, it’s only for lab use! I wouldn’t use less than 24G of memory for a production management server, and it would obviously follow the MAA recommendations and be installed behind an enterprise grade load balancer etc. Needless to say I’d use RAC+Data Guard for the repository database.

Read the rest of this entry »

Posted in Grid Control, Linux, Xen | 4 Comments »

Cluster callouts to create blackouts in EM

Posted by Martin Bach on March 3, 2011

Finally I got around to providing a useful example for a cluster callout script. It is actually on the verge of taking too long: remember that scripts in the $GRID_HOME/racg/usrco/ directory should execute quickly. Before deploying this, you should definitely ensure that the script executes quickly enough; the “time” utility can help you with this. Nevertheless this has been necessary to work around a limitation of Grid Control: RAC One Node databases are not supported in GC 11.1 (I complained about that earlier).
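To turn the timing advice into a repeatable check, a small wrapper can run the callout and fail when it takes too long. This is a sketch: the function name and the limit are arbitrary, and it only has whole-second resolution because it relies on date +%s, which is widely available but not guaranteed on every platform.

```shell
#!/bin/sh
# Sketch: run a command and fail when it needs more than a given number of seconds.
# Whole-second resolution only; the limit is whatever you consider "quick enough".
within_seconds() {
    limit=$1; shift
    start=$(date +%s)
    "$@"
    rc=$?
    elapsed=$(( $(date +%s) - start ))
    echo "elapsed: ${elapsed}s (limit ${limit}s), exit code $rc" >&2
    # Pass only if the command succeeded and stayed within the limit.
    [ "$rc" -eq 0 ] && [ "$elapsed" -le "$limit" ]
}

# Usage (script path is an example only):
#   within_seconds 2 /path/to/callout.sh || echo "too slow or failed"
```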

The Problem

To work around the problem I wrote a script which can alleviate one of the arising problems: when using srvctl relocate database, another instance (usually called dbName_2) will be started to allow existing sessions to survive the failover operation if they use TAF or FAN/FCF.

This poses a big problem to Grid Control though: the second instance didn’t exist when you registered the database as a target, hence GC doesn’t know about it. Subsequently you may get paged that the database is down when in reality it is not. Receiving one of these “false positive” alarms at 02:00 in the morning is annoying at best. Actually, Grid Control is right in assuming that the database is down: although detected as a cluster database target, it only consists of 1 instance. If that’s down, it has to be assumed that the whole cluster database is down. In a perfect world we wouldn’t have this problem: GC would be aware that the RON database moved to another node in the cluster and update its configuration accordingly. This is planned for the next major release sometime later in 2011. Apparently dbconsole has the ability to deal with such a situation. Read the rest of this entry »

Posted in 11g Release 2, Grid Control, Linux, RAC | Leave a Comment »

Automatic log gathering for Grid Control 11.1

Posted by Martin Bach on March 1, 2011

Still debugging the OMS problem (it occasionally hangs and has to be restarted), I wrote a small shell script to help me gather all required logs for Oracle support. These are the logs I need for the SR; Niall Litchfield has written a recent blog post about other useful log locations.

The script is basic, and can possibly be extended. However it saved me a lot of time getting all the required information to one place from where I could take it and attach it to the service request. Before uploading I usually zip all files into dd-mm-yyyy-logs.nnn.zip to avoid clashing with logs already uploaded. I run the script via cron daily at 09:30. Read the rest of this entry »
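The naming scheme is easy to automate. The sketch below picks the next unused dd-mm-yyyy-logs.nnn.zip name in a directory; treating nnn as a zero-padded three-digit counter is an assumption, and the function name is made up.

```shell
#!/bin/sh
# Sketch: pick the next unused archive name of the form dd-mm-yyyy-logs.nnn.zip.
next_archive_name() {
    dir=$1
    stamp=${2:-$(date +%d-%m-%Y)}   # optional second argument overrides today's date
    n=1
    while :; do
        name=$(printf '%s-logs.%03d.zip' "$stamp" "$n")
        # The first name that does not exist yet wins.
        [ -e "$dir/$name" ] || { echo "$name"; return 0; }
        n=$((n + 1))
    done
}

# Usage:
#   zip -r "/tmp/$(next_archive_name /tmp)" /path/to/collected/logs
```

A daily 09:30 run corresponds to a crontab line starting with `30 9 * * *`, followed by the path to the gathering script.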

Posted in Grid Control, solaris | Leave a Comment »

Quis custodiet ipsos custodes: Nagios monitoring for Grid Control

Posted by Martin Bach on February 28, 2011

I have a strange problem with my Grid Control 11.10.1.2 Management Server in a Solaris 10 zone. When restarted, the OMS will serve requests fine for about 2 to 4 hours and then “hang”. Checking the Admin Server console I can see that there are stuck threads. The same information is also recorded in the logs.

NB: the really confusing part about Grid Control 11.1 is the use of WebLogic. You thought you knew where the Grid Control logs were? Forget about what you knew about 10.2 and enter a different dimension :)

So to be able to react quicker to a hang of the OMS (or EMGC_OMS1 to be more precise) I set up nagios to periodically poll the login page.
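A polling check of this kind boils down to fetching the page with a timeout and translating the result into the exit codes Nagios expects from a plugin (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a rough sketch, not the check used here: the function names, the 30 second timeout and the URL are assumptions.

```shell
#!/bin/sh
# Sketch of a Nagios-style check for the OMS login page.
# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
status_to_nagios() {
    # Map an HTTP status code (000 when curl got no response) to a plugin result.
    case $1 in
        2??|3??) echo "OK - login page answered with HTTP $1"; return 0 ;;
        000)     echo "CRITICAL - no response from login page"; return 2 ;;
        [45]??)  echo "CRITICAL - login page answered with HTTP $1"; return 2 ;;
        *)       echo "UNKNOWN - unexpected status '$1'"; return 3 ;;
    esac
}

check_oms_login() {
    url=$1
    # -k skips certificate verification; -m 30 caps the request at 30 seconds
    # so that a hung OMS shows up as "no response" instead of blocking the check.
    code=$(curl -k -s -o /dev/null -m 30 -w '%{http_code}' "$url")
    status_to_nagios "${code:-000}"
}

# Usage (host, port and path are placeholders):
#   check_oms_login "https://omshost.example.com:PORT/em"
```

Keeping the mapping in its own function makes it possible to test the exit codes without a running OMS.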

I’m using a VM with OEL 5.5 64bit to deploy nagios to; the requirements are very moderate. The install process is well documented in the quickstart guide, and I’m using the Fedora instructions as a basis. OEL 5.5 doesn’t have nagios 3 RPMs available, so I decided to use the source downloaded from nagios.org. The tarballs you need are nagios-3.2.3.tar.gz and nagios-plugins-1.4.15.tar.gz at the time of this writing. Read the rest of this entry »

Posted in Grid Control, Linux | Leave a Comment »

GC 11.1 and Monitoring Templates

Posted by Martin Bach on February 24, 2011

Throughout the last 2 weeks I have been working (or better: tried to work) with Grid Control 11.1 as the central monitoring and deployment solution for my current project.

The plan is to use EMGC 11.1 in conjunction with an 8 node cluster to automatically deploy RAC One Node databases. Please don’t ask about RAC One Node: that wasn’t my decision, and as I understand it the previous project members only chose this as a poor compromise to keep the operations team happy(-ish).

Besides the fact that the OMS, which runs in a Solaris Zone, repeatedly “hangs” and can’t be contacted by emcli or any browser (Bug 11804553), RAC One Node is NOT SUPPORTED as a target in Grid Control 11.1. It might be supported in GC 12.1 later in 2011. But I digress.

The Requirement

The OPS team maintains their own 10.2.0.5 management servers. To allow us to perform some testing with the automatic database deployment without messing with a live OMS, it has been decided to install OEM GC 11.1 with PSU 2 locally on Solaris with a repository database on Linux. We needed GC 11.1 to support our 11.2.0.2 cluster.

After the installation of the OMS I tried to export the required management templates from the live OMS (remember it’s 10.2.0.5) and import them into 11.1 to save myself a lot of work.

Export a management template

The export function seems to have been introduced in 10.2.0.3 and it works great. All you need to do is hop onto the OMS and use “emcli” (Enterprise Manager Command Line Interface) to log on and export the template. A sample session is shown here:

  • emcli login -username=yourUserName -password=yourPassword
  • emcli export_template -name=TemplateName -target_type=TargetType -output_file=/path/to/templateName.xml

If you are unsure about template names and targets, you can connect to the repository as sysman and query mgmt_templates:

SQL> SELECT TEMPLATE_NAME,TARGET_TYPE FROM MGMT_TEMPLATES;

And so I happily exported the management templates from the 10.2.0.5 OMS.
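With more than a handful of templates it is less error-prone to generate the export calls than to type them. The sketch below turns "template name|target type" lines into emcli export_template commands and only prints them; the pipe-separated input format, the function name and the output directory are assumptions, while the flags are the ones from the sample session above.

```shell
#!/bin/sh
# Sketch: turn "template name|target type" lines into emcli export_template calls.
# Only prints the commands so they can be reviewed before being run.
print_export_commands() {
    outdir=$1
    while IFS='|' read -r name type; do
        [ -n "$name" ] || continue
        # Replace spaces in the file name; the template name itself stays quoted.
        file=$(printf '%s' "$name" | tr ' ' '_')
        printf 'emcli export_template -name="%s" -target_type="%s" -output_file="%s/%s.xml"\n' \
            "$name" "$type" "$outdir" "$file"
    done
}

# Usage:
#   printf 'Agent Template|oracle_emd\n' | print_export_commands /tmp/templates
```

The input could come from the MGMT_TEMPLATES query spooled to a text file, provided the two columns are written pipe-separated.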

The Bad News

Unfortunately, you can’t import non-11.1 templates into an 11.1 OMS. When I tried it I got the following error:

$ emcli import_template -files="emd.10205.xml"
Monitoring template file emd.10205.xml exported from 10.2.0.5.0 OMS can not be imported to 11.1.0.1.0 OMS

Bugger. Sure enough, the XML file has a version tag:

<?xml version = '1.0' encoding = 'UTF-8'?>
<MonitoringTemplate template_name="Agent Template" target_type="oracle_emd" is_public="0" oms_version="10.2.0.5.0" owner="SYSMAN" xmlns="http://www.oracle.com/DataCenter/MonitoringTemp">
...
</MonitoringTemplate>

The solution is to revert to the bad old times and manually compare source and destination. A rather laborious and tiresome way of getting information across. Don’t forget to export the completed template from 11.1 to save yourself from going through that again.
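Since the version is recorded in the oms_version attribute, it can be checked before attempting an import. This is a sketch with made-up function names; comparing only the first two version components (10.2 against 11.1) is my reading of the error message, not documented emcli behaviour.

```shell
#!/bin/sh
# Sketch: read the oms_version attribute from an exported template file.
template_oms_version() {
    # Print the first oms_version="..." value found in the file, if any.
    sed -n 's/.*oms_version="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

same_release() {
    # Compare only the first two version components, e.g. 10.2 against 11.1.
    [ "$(echo "$1" | cut -d. -f1,2)" = "$(echo "$2" | cut -d. -f1,2)" ]
}

# Usage:
#   v=$(template_oms_version emd.10205.xml)
#   same_release "$v" 11.1.0.1.0 || echo "template was exported from $v, expect the import to fail"
```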

Posted in Grid Control | Leave a Comment »

Error message of the day: OUI-25023 and the FQDN

Posted by Martin Bach on February 15, 2011

It’s been a long day with many problems around a Grid Control installation, including (but not limited to) corruption of the repository database, bugs in OUI when it comes to deinstalling the Oracle Management Server, lots of files left over by the WebLogic “uninstall.sh” script and many more. Some of the error messages were quite misleading, and OUI-25023 was just one too many. What happened?

Earlier today I was trying to install the 64bit 11.1.0.1 agent on an 8 node cluster. After an initial headache (see below) it worked ok. However, I couldn’t resist mentioning OUI-25023. Here’s the complete story.

I downloaded the 11.1 agent for Linux x86-64 as per the GC 11.1 documentation and deployed it to my freshly installed management server. The OMS is on Solaris SPARC, and Grid Control doesn’t ship with agents for other platforms. However, the security experts have locked the oracle account down on the cluster, which ruled out the “agent push” scenario. I then opted for the installation via a response file, as described in the documentation. Read the rest of this entry »

Posted in Grid Control, Linux | 1 Comment »

Running Grid Control Agent commands standalone

Posted by Martin Bach on February 14, 2011

I had an error message today from one of my grid agents which was cut short in the GUI just when it became interesting. So I thought of a way of running the command on the command line to get the full output.

This has been a little easier than I thought. I based my approach on an earlier blog article on my knowledgebase to get the perl environment variables set. I then needed to figure out where some of the libraries (perl modules ending in .pm) that the agent scripts refer to were located.

A simple “locate -i *pm | grep $ORACLE_HOME” did it. This enabled me to write a preliminary script to run an EM agent task, shown below. It expects that you have run “oraenv” previously to set the environment to the AGENT_HOME. When referring to ORACLE_HOME in the following, the AGENT_HOME is meant. It takes the full path to the script to be executed as its parameter and checks that ORACLE_HOME and $1 exist. Read the rest of this entry »
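The two pre-flight checks mentioned above could look roughly like this. It is a sketch of the checks only, with a made-up function name and messages, not the script from the full post:

```shell
#!/bin/sh
# Sketch: the pre-flight checks described above, nothing more.
preflight() {
    # ORACLE_HOME must be set, i.e. oraenv has been run for the AGENT_HOME.
    [ -n "${ORACLE_HOME:-}" ] || { echo "ORACLE_HOME is not set, run oraenv first" >&2; return 1; }
    # $1 must name the agent script to execute.
    [ -n "${1:-}" ] || { echo "usage: $0 /full/path/to/agent_script" >&2; return 1; }
    # The script must be readable before it is handed to perl.
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
}

# Usage:
#   preflight "$1" || exit 1
```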

Posted in Grid Control | Leave a Comment »