Martins Blog

Trying to explain complex things in simple terms

11.1 GC agent refuses to start

Posted by Martin Bach on February 14, 2011

This is a follow-up to my previous tale of how not to move networks for RAC. After successfully restarting the cluster as described in a previous post, I went on to install a Grid Control 11.1 system. This was to be on Solaris 10 SPARC. Why SPARC? It is not my platform of choice when it comes to Oracle software, but my customer has a huge SPARC estate and wants to make the most of it.

After the OMS had been built (hopefully I’ll find time to document this, as it can be quite tricky on SPARC) I wanted to secure the agents on my cluster against it. That worked fine for the first node:

  • emctl clearstate agent
  • emctl secure agent
  • emctl start agent

Five minutes later the agent appeared in my list of agents in the Grid Control console. With this success backing me I went to do the same on the next cluster node.

Here things were different. This is the sequence of commands I used:

$ emctl stop agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved
$

I didn’t pay too much attention to the fact that there had been no acknowledgement that the stop command had completed. I noticed something wasn’t quite right when I tried to get the agent’s status:

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
emctl stop agent
Error connecting to https://node2.example.com:3872/emd/main

Now that should have reported that the agent was down. Strange. I tried a few more commands, such as the following one to start the agent:

[agent]oracle@node2.example.com $ emctl start agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Agent is already running

That wasn’t the case: there was no agent process whatsoever in the process table. I also checked the emd.properties file. Note that in 11g the emd.properties file is in $AGENT_HOME/hostname/sysman/config/ instead of $AGENT_HOME/sysman/config as it was in 10g.
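The two checks can be sketched as small shell functions. This is only an illustration, not what I typed at the time: the function names are made up, and I am assuming the agent process shows up as emagent in the process table and that emd.properties holds the agent's listen address in an EMD_URL line.

```shell
# Hypothetical helpers; names are illustrative only.

# List agent processes, if any. The [e] bracket keeps grep from
# matching its own command line. Assumes the process is named emagent.
agent_procs() {
    ps -ef | grep '[e]magent'
}

# Print the value of EMD_URL from the emd.properties file given as $1.
emd_url() {
    sed -n 's/^EMD_URL=//p' "$1"
}
```

Calling emd_url with the path to emd.properties under the per-host directory described above prints the URL the agent is configured to listen on, which can then be compared with what the host name actually resolves to.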

Everything looked correct, and even a comparison with the first node didn’t reveal any discrepancy. So I scratched my head a little more until I found a MOS note on the subject stating that the agent cannot listen to multiple addresses. The note is for 10g only and has the rather clumsy title “Grid Control Agent Startup: "emctl start agent" Command Returns "Agent is already running" Although the Agent is Stopped” (Doc ID 1079424.1).

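Although the note states it’s for 10g and multiple NICs, it got me thinking. And indeed, the /etc/hosts file had not been updated, leaving the old cluster address in /etc/hosts while the new one was in DNS. Since nsswitch.conf lists files before dns, the stale entry in /etc/hosts took precedence: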

# grep node2 /etc/hosts
10.x.x4.42            node2.example.com node2
172.x.x.1x8          node2-priv.example.com node2-priv
# host node2.example.com
node2.example.com has address 10.x5.x8.3
[root@node2 ~]# grep ^hosts /etc/nsswitch.conf
hosts:      files dns

This also explained why the agent started on the first node: it had an updated /etc/hosts file. Why the other nodes didn’t have their hosts file updated will forever remain a mystery.
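A check like the one above can be scripted so it is easy to run on every node. The following is a minimal sketch under a few assumptions: the function names are hypothetical, the host command is available for querying DNS directly, the shell understands $(...) (bash or ksh rather than the legacy Solaris /bin/sh), and awk supports -v (on Solaris that would mean nawk or /usr/xpg4/bin/awk).

```shell
# Hypothetical helpers; names are illustrative only.

# Print the address listed for name $1 in a hosts-format file
# ($2, defaulting to /etc/hosts). Prints nothing if not found.
hosts_file_addr() {
    awk -v name="$1" '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # stop at a trailing comment
                if ($i == name) { print $1; exit }
            }
        }' "${2:-/etc/hosts}"
}

# Ask DNS directly, bypassing /etc/hosts and nsswitch.conf.
dns_addr() {
    host "$1" 2>/dev/null | awk '/has address/ { print $NF; exit }'
}

# Report whether the hosts file and DNS disagree about name $1.
check_host() {
    file_ip=$(hosts_file_addr "$1" "${2:-/etc/hosts}")
    dns_ip=$(dns_addr "$1")
    if [ -n "$file_ip" ] && [ -n "$dns_ip" ] && [ "$file_ip" != "$dns_ip" ]; then
        echo "MISMATCH $1: hosts file says $file_ip, DNS says $dns_ip"
        return 1
    fi
    echo "OK $1: hosts=${file_ip:-none} dns=${dns_ip:-none}"
}
```

Running check_host node2.example.com on the broken node would have flagged the stale entry, since the hosts file and DNS returned different addresses.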

Things changed dramatically after the hosts file had been updated:

$ emctl status agent

Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent is Not Running

Note how emctl now acknowledges that the agent is down. I successfully secured and started the agent:

$ emctl secure agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Agent is already stopped...   Done.
Securing agent...   Started.
Enter Agent Registration Password :
Securing agent...   Successful.

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent is Not Running
$ emctl start agent

Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Starting agent .............. started.

One smaller problem remained:

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 11.1.0.1.0
OMS Version       : 11.1.0.1.0
Protocol Version  : 11.1.0.0.0
Agent Home        : /u01/app/oracle/product/agent11g/node8.example.com
Agent binaries    : /u01/app/oracle/product/agent11g
Agent Process ID  : 14045
Parent Process ID : 14014
Agent URL         : https://node8.example.com:3872/emd/main
Repository URL    : https://oms.example.com:1159/em/upload
Started at        : 2011-02-14 09:59:03
Started by user   : oracle
Last Reload       : 2011-02-14 10:00:13
Last successful upload                       : 2011-02-14 10:00:19
Total Megabytes of XML files uploaded so far :    11.56
Number of XML files pending upload           :      188
Size of XML files pending upload(MB)         :    65.89
Available disk space on upload filesystem    :    60.11%
Collection Status                            : Disabled by Upload Manager
Last successful heartbeat to OMS             : 2011-02-14 10:00:17
---------------------------------------------------------------
Agent is Running and Ready

Note the “Disabled by Upload Manager” collection status. That’s because a large backlog hasn’t been transferred yet: 188 XML files (65.89 MB) are still pending upload. Let’s force an upload. I know the communication between agent and OMS is working, so that should resolve the issue.

$ emctl upload
$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 11.1.0.1.0
OMS Version       : 11.1.0.1.0
Protocol Version  : 11.1.0.0.0
Agent Home        : /u01/app/oracle/product/agent11g/node8.example.com
Agent binaries    : /u01/app/oracle/product/agent11g
Agent Process ID  : 14045
Parent Process ID : 14014
Agent URL         : https://node8.example.com:3872/emd/main
Repository URL    : https://oms.example.com:1159/em/upload
Started at        : 2011-02-14 09:59:03
Started by user   : oracle
Last Reload       : 2011-02-14 10:02:12
Last successful upload                       : 2011-02-14 10:02:53
Total Megabytes of XML files uploaded so far :    91.12
Number of XML files pending upload           :       22
Size of XML files pending upload(MB)         :     1.50
Available disk space on upload filesystem    :    60.30%
Last successful heartbeat to OMS             : 2011-02-14 10:02:19
---------------------------------------------------------------
Agent is Running and Ready
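To watch the backlog drain without reading the whole status output each time, the pending-file count can be pulled out of it. This is a rough sketch: the function name is hypothetical, and it simply assumes the status output keeps the “Number of XML files pending upload” label shown above.

```shell
# Hypothetical helper; name is illustrative only.
# Reads `emctl status agent` output on stdin and prints the number
# of XML files still pending upload.
pending_uploads() {
    awk -F: '/Number of XML files pending upload/ {
        gsub(/[ \t]/, "", $2)    # strip the padding around the number
        print $2
    }'
}
```

Used as emctl status agent | pending_uploads, it prints just the number, which makes it easy to see whether the count is falling after emctl upload.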

That’s about it. A few minutes later the agent was visible on the console. Now this only had to be repeated for the remaining 6 nodes…

NB: For the reasons shown in this article I don’t endorse duplicating host information in /etc/hosts and DNS. A resilient DNS infrastructure should always be used to store this kind of information.
