Martins Blog

Trying to explain complex things in simple terms

Cluster callouts to create blackouts in EM

Posted by Martin Bach on March 3, 2011

Finally I got around to providing a useful example for a cluster callout script. It is actually on the verge of taking too long: remember that scripts in the $GRID_HOME/racg/usrco/ directory should execute quickly. Before deploying this, you should definitely ensure that the script executes quickly enough; the “time” utility can help you with this. Nevertheless, the script has been necessary to work around a limitation of Grid Control: RAC One Node databases are not supported in GC 11.1 (I complained about that earlier).
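As a rough sketch of how such a timing check could look: the helper below wraps bash's "time" keyword around a single run of a callout. The function name time_callout, the script path and the sample event payload in the comment are assumptions made for illustration, so substitute the values your cluster actually passes.

```shell
#!/bin/bash
# Sketch only: time one run of a callout script with a hand-made event.
# time_callout is a hypothetical helper, not part of any Oracle tooling.

# Format of the timing report printed by bash's "time" keyword
# (%R = elapsed wall-clock seconds); the report goes to stderr.
TIMEFORMAT='elapsed: %R s'

time_callout() {
    local script=$1
    shift
    # run the script with the remaining arguments and report the elapsed time
    time "$script" "$@"
}

# Example call (path and payload are assumptions, adjust to your system):
# time_callout "$GRID_HOME/racg/usrco/autoBlackout.sh" SERVICEMEMBER \
#     service=ron.example.com database=ron instance=RON_2 host=node2 status=up
```

Running this by hand on a test system before linking the script into usrco gives a first idea of whether the emcli calls dominate the runtime.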

The Problem

The script alleviates one of the problems that arise: when using srvctl relocate database, another instance (usually called dbName_2) is started to allow existing sessions to survive the failover operation if they use TAF or FAN/FCF.

This poses a big problem to Grid Control though: the second instance didn’t exist when you registered the database as a target, hence GC doesn’t know about it. Subsequently you may get paged that the database is down when in reality it is not. Receiving one of these “false positive” alarms at 02:00 in the morning is annoying at best. Actually, Grid Control is right in assuming that the database is down: although detected as a cluster database target, it only consists of 1 instance. If that’s down, it has to be assumed that the whole cluster database is down. In a perfect world we wouldn’t have this problem: GC would be aware that the RON database moved to another node in the cluster and would update its configuration accordingly. This is planned for the next major release sometime later in 2011. Apparently dbconsole has the ability to deal with such a situation.

Now with the background explained, management had to weigh the possibilities: either not register the RAC One database in Grid Control and have no monitoring at all, or bite the bullet and have monitoring only when the initial instance is started on the primary node. The decision was made to have (limited) monitoring. To prevent the DBA from being woken up I developed the simple script below to automatically create a blackout in GC if the “_2” instance starts. Subsequently, the blackout is taken off when the “_1” instance starts.

Room for improvement: the script assumes that a RON database can only have a maximum of 2 member servers. If your database can run on more than 2 nodes then you should use emcli relocate_target if the _1 instance comes up on a different node from what GC expects.

The Script

My algorithm checks for cluster events, and if an instance dbName_2 starts, I create a blackout on the initial instance to prevent being paged until Oracle have come up with a better solution (we are flying blind once the 2nd database instance has started).
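The decision logic can be sketched in isolation, without any emcli calls. The function name blackout_action and the sample arguments are made up for illustration. It only looks at the instance and status fields of the event and prints which action the event would call for.

```shell
#!/bin/bash
# Sketch only: decide which blackout action a cluster event calls for.
# blackout_action is a hypothetical helper; it prints create, remove or none.

blackout_action() {
    local instance="" status="" arg
    # pick the two fields we care about out of the key=value arguments
    for arg in "$@"; do
        case $arg in
            instance=*|INSTANCE=*) instance=${arg#*=} ;;
            status=*|STATUS=*)     status=${arg#*=} ;;
        esac
    done

    # only "up" events trigger anything
    if [ "$status" != "up" ]; then
        echo none
        return 0
    fi

    case $instance in
        *_2) echo create ;;   # secondary instance started: black out
        *_1) echo remove ;;   # primary instance is back: lift the blackout
        *)   echo none ;;
    esac
}

# Example (arguments are assumptions):
# blackout_action database=ron instance=RON_2 host=node2 status=up
```

Keeping the decision separate from the emcli calls makes it possible to test the logic on a workstation that has neither Grid Infrastructure nor emcli installed.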

The script assumes that you have deployed emcli on each cluster node (or on ACFS). EMCLI is the Enterprise Manager Command Line Interface; it’s located on your OMS together with the installation instructions. This is the default location: https://oms.example.com:7799/em/console/emcli/download – 7799 is the default port for Grid Control.

Let’s have a look at the script:

#!/bin/bash

# enable debugging if needed
set -x
exec >> /tmp/autoBlackout.log 2>&1

EVENTTYPE=$1

# only SERVICEMEMBER populates instance, database and service as needed
# for the blackout section below.
if [ "$EVENTTYPE" != "SERVICEMEMBER" ]; then
 exit 0
fi

# adjust to your needs or set to the empty string if you are not using db_domain
# assumes that both database and service have the same domain
DOMAIN=example.com

# bail out if there are too many instances of this script running. Inform the
# admin via email
ME=`basename $0`
RUNNING=`ps -ef | grep -v grep | grep $ME | wc -l`
if [ $RUNNING -ge 6 ]; then
 echo Too many instances of this script running, aborting
 echo Too many instances of $ME running, aborting |
 mail -s "$RUNNING instances of $ME detected on `hostname`" admin@example.com
 exit 1
fi

# set up for emcli (emcli requires jdk 1.6)
JAVA_HOME=/shared/acfs/emcli/jdk1.6.0_24
PATH=$JAVA_HOME/bin:$PATH
EMCLI=/shared/acfs/emcli/emcli
export JAVA_HOME PATH

# turn off debugging for a moment - the below parsing of the command line
# parameters is very verbose.
set +x

# read the parameters passed to us; modified version of a script
# found at rachelp.nl
for ARGS in $*;
 do
 PROPERTY=`echo $ARGS | /bin/awk -F"=|[ ]" '{print $1}'`
 VALUE=`echo $ARGS | /bin/awk -F"=|[ ]" '{print $2}'`
 case $PROPERTY in
 VERSION|version)    VERSION=$VALUE ;;
 SERVICE|service)    SERVICE=$VALUE ;;
 DATABASE|database)    DATABASE=$VALUE ;;
 INSTANCE|instance)    INSTANCE=$VALUE ;;
 HOST|host)        HOST=$VALUE ;;
 STATUS|status)        STATUS=$VALUE ;;
 REASON|reason)        REASON=$VALUE ;;
 CARD|card)        CARDINALITY=$VALUE ;;
 TIMESTAMP|timestamp)    LOGDATE=$VALUE ;;
 ??:??:??)        LOGTIME=$PROPERTY ;;
 esac
 done

# and turn debugging on again
set -x

# targets are reported in lower case :( Someone please suggest a better
# way to get a lower case string to upper case
DATABASE=`echo $DATABASE | tr "[a-z]" "[A-Z]"`

# targets affected are rac_database and the oracle_database (instance)
# not using emcli here as it has to be quick. A rac_database target is a
# composite target, consisting of multiple oracle_database targets. In
# RAC One Node there is only one instance - see output from GC below:
# $ emcli get_targets | grep "RAC"
# 0       Down           oracle_database  RAC.example.com_RAC_1
# 0       Down           rac_database     RAC.example.com

# define what we want to black out (only ever the primary instance!)
BLACKOUT_NAME=blackout_${DATABASE}
BLACKOUT_TARGETS="$DATABASE.${DOMAIN}:rac_database;${DATABASE}.${DOMAIN}_${DATABASE}_1:oracle_database"

# create a blackout if the secondary instance is up (we only ever register the _1 instance)
# the blackout duration is indefinite; it will be stopped and lifted automatically. You may
# want to limit this to a few hours to raise visibility.
if [[ $STATUS == "up" && ${INSTANCE: -2} == "_2" ]]; then
 echo create blackout
 $EMCLI login -username=user -password=supersecretpassword
 $EMCLI create_blackout -name=${BLACKOUT_NAME} -add_targets=${BLACKOUT_TARGETS} \
 -reason="auto blackout" -schedule="frequency:once;duration:-1"
fi

# disable the blackout if instance *_1 starts
# this is where the script could be improved if the RON database can run on more
# than 2 nodes. You could use emcli relocate_target to relocate the target to another
# node
if [[ $STATUS == "up" && ${INSTANCE: -2} == "_1" ]]; then
 echo remove blackout
 $EMCLI login -username=user -password=supersecretpassword
 $EMCLI stop_blackout -name=${BLACKOUT_NAME}
 $EMCLI delete_blackout -name=${BLACKOUT_NAME}
fi

I tried to add a lot of comments to the script, which should make it easy for you to adjust it. I recommend you store it in ACFS and mount that directory on all cluster nodes. Create a symbolic link from the ACFS to $GRID_HOME/racg/usrco/ to make maintenance easier. You could enable log rotation for the logfile in /tmp if you like; otherwise keep an eye on it so it doesn’t grow to gigabytes.
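If you don't want to set up proper log rotation, a small size check could do as a stopgap. The sketch below is only a suggestion: the function name rotate_if_large, the 1 MB limit in the example and the ".old" suffix are assumptions, not something the callout framework provides.

```shell
#!/bin/bash
# Sketch only: move the log aside once it exceeds a size limit.
# rotate_if_large is a hypothetical helper; it keeps one previous copy (.old).

rotate_if_large() {
    local log=$1 max_bytes=$2 size
    # nothing to do if the log does not exist yet
    [ -f "$log" ] || return 0
    # arithmetic expansion strips any padding wc may add around the number
    size=$(( $(wc -c < "$log") ))
    if [ "$size" -gt "$max_bytes" ]; then
        mv -f "$log" "$log.old"
    fi
}

# Example: call this before the "exec >> /tmp/autoBlackout.log" line
# rotate_if_large /tmp/autoBlackout.log 1048576
```

This costs one extra wc call per event, which should be negligible compared to the emcli calls.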
