Author Archives: Martin Bach

About Martin Bach

Oracle DBA and Linux enthusiast, part time author and presenter.

Little things worth knowing: OSWatcher Analyser Dashboard

I have written a few articles about Tracefile Analyzer (TFA) in the recent past. As you may recall from these posts, a more comprehensive TFA version than the one provided with the installation media is available from My Oracle Support (MOS for short). As part of this version you get a lot of useful additional tools, including OSWatcher. I love OSWatcher, simply because it gives me insights that would be very hard to get with sar, for example. sar tends to be the lowest common denominator on most (proper) operating systems, and it's better than nothing, but please read on and let me explain why I like OSWatcher so much.

Feel free to head back to my earlier posts if you'd like more details about “stock TFA” and “MOS TFA”. These are terms I coined, by the way; you won't find them in a MOS search.

Say hello to the OSWatcher Analyser dashboard

In this article I’d like to introduce the OSWatcher Analyser dashboard to you. I think it’s probably the best way (certainly a very convenient one) to get an overview of what is going on at the operating system level. And it includes pretty pictures that are so easy to understand! OSWatcher (more precisely, oswbb) should be running on the system as part of “MOS Tracefile Analyzer”. In other words, it has to be installed.

The environment for this post

As always, before beginning the write-up, here are some details about the system I'm using. My virtual RAC system is based on KVM, and I'm running Oracle Linux 7.4 with UEK 4. The database is reasonably current at version 18.3.0.0.180717 (but that shouldn't matter, as this post isn't about the database!), and I have upgraded TFA to 18.3.3.0.0, the latest TFA version at the time of writing.

[oracle@rac18pri1 ~]$ tfactl print version
TFA Version : 18.3.3.0.0

[oracle@rac18pri1 ~]$ tfactl toolstatus

.------------------------------------------------------------------.
|                  TOOLS STATUS - HOST : rac18pri1                 |
+----------------------+--------------+--------------+-------------+
| Tool Type            | Tool         | Version      | Status      |
+----------------------+--------------+--------------+-------------+
| Development Tools    | orachk       |   12.2.0.1.3 | DEPLOYED    |
|                      | oratop       |       14.1.2 | DEPLOYED    |
+----------------------+--------------+--------------+-------------+
| Support Tools Bundle | darda        | 2.10.0.R6036 | DEPLOYED    |
|                      | oswbb        |        8.1.2 | RUNNING     |
|                      | prw          | 12.1.13.11.4 | NOT RUNNING |
+----------------------+--------------+--------------+-------------+
| TFA Utilities        | alertsummary |   12.2.1.1.0 | DEPLOYED    |
|                      | calog        |   12.2.0.1.0 | DEPLOYED    |
|                      | dbcheck      |   18.3.0.0.0 | DEPLOYED    |
|                      | dbglevel     |   12.2.1.1.0 | DEPLOYED    |
|                      | grep         |   12.2.1.1.0 | DEPLOYED    |
|                      | history      |   12.2.1.1.0 | DEPLOYED    |
|                      | ls           |   12.2.1.1.0 | DEPLOYED    |
|                      | managelogs   |   12.2.1.1.0 | DEPLOYED    |
|                      | menu         |   12.2.1.1.0 | DEPLOYED    |
|                      | param        |   12.2.1.1.0 | DEPLOYED    |
|                      | ps           |   12.2.1.1.0 | DEPLOYED    |
|                      | pstack       |   12.2.1.1.0 | DEPLOYED    |
|                      | summary      |   12.2.1.1.0 | DEPLOYED    |
|                      | tail         |   12.2.1.1.0 | DEPLOYED    |
|                      | triage       |   12.2.1.1.0 | DEPLOYED    |
|                      | vi           |   12.2.1.1.0 | DEPLOYED    |
'----------------------+--------------+--------------+-------------'

Note :-
  DEPLOYED    : Installed and Available - To be configured or run interactively.
  NOT RUNNING : Configured and Available - Currently turned off interactively.
  RUNNING     : Configured and Available.

[oracle@rac18pri1 ~]$ 

As you can see, oswbb is part of the “Support Tools Bundle” and actively running out of the box.

Apart from OSWatcher.sh it is important for OSWatcherFM.sh to run as well, as it clears the archive directory where OSWatcher stores its results. Of course that doesn’t relieve you from keeping an eye on your filesystem usage numbers :) Both of these are started by default.
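
If you want to double-check that both are indeed alive, a quick process listing does the trick; a small sketch using the script names mentioned above:

# confirm both the collector (OSWatcher.sh) and the file manager (OSWatcherFM.sh) are running
ps -ef | grep -E 'OSWatcher(FM)?\.sh' | grep -v grep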

Invoking OSWBB

Assuming there is data for analysis (e.g. OSWatcher has run for a bit), you can invoke oswbb via tfactl. It will go through a somewhat lengthy parse phase if you don't narrow your search down to a specific time frame, but eventually presents you with options:

[oracle@rac18pri1 ~]$ tfactl oswbb

Starting OSW Analyzer V8.1.2
OSWatcher Analyzer Written by Oracle Center of Expertise
Copyright (c)  2017 by Oracle Corporation

Parsing Data. Please Wait...

Scanning file headers for version and platform info...

Parsing file rac18pri1_iostat_18.10.10.0600.dat ...
Parsing file rac18pri1_iostat_18.10.10.0700.dat ...
Parsing file rac18pri1_iostat_18.10.10.0800.dat ...

[...]

Parsing file rac18pri1_ps_18.10.12.0400.dat ...
Parsing file rac18pri1_ps_18.10.16.1900.dat ...
Parsing file rac18pri1_ps_18.10.16.2000.dat ...


Parsing Completed.


Enter 1 to Display CPU Process Queue Graphs
Enter 2 to Display CPU Utilization Graphs
Enter 3 to Display CPU Other Graphs
Enter 4 to Display Memory Graphs
Enter 5 to Display Disk IO Graphs

Enter GC to Generate All CPU Gif Files
Enter GM to Generate All Memory Gif Files
Enter GD to Generate All Disk Gif Files
Enter GN to Generate All Network Gif Files

Enter L to Specify Alternate Location of Gif Directory
Enter Z to Zoom Graph Time Scale (Does not change analysis dataset)
Enter B to Returns to Baseline Graph Time Scale (Does not change analysis dataset)
Enter R to Remove Currently Displayed Graphs

Enter X to Export Parsed Data to Flat File
Enter S to Analyze Subset of Data(Changes analysis dataset including graph time scale)
Enter A to Analyze Data
Enter D to Generate DashBoard

Enter Q to Quit Program

Please Select an option:

As you can see, there are plenty of options available. In the interactive mode you are currently using, it's possible to display all sorts of charts for CPU, memory, disk, and network.

Another option, and one I like a lot, is to generate all the data in a dashboard.

Creating the dashboard

If you choose option “D”, oswbb will go ahead and create a dashboard. You are prompted to enter a name before it goes off and creates it. The location where it saves the output is not immediately obvious … In my case – using RAC – the result is stored in $ORACLE_BASE/tfa/repository/suptools/$(hostname)/oswbb/oracle/oswbb/analysis

I suppose that’s the same location in Oracle Restart and single instance, although I haven’t been able to verify that claim.
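
To see what has been generated so far, you can simply list that directory; the sketch below assumes the RAC location mentioned above and that ORACLE_BASE is set in your environment:

# list the generated analysis directories, newest last
ls -lrt $ORACLE_BASE/tfa/repository/suptools/$(hostname)/oswbb/oracle/oswbb/analysis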

Within the “analysis” directory you find all your information combined in a subdirectory. It has the same name you assigned earlier when you requested the dashboard generation. It’s probably easiest to transfer the entire directory to a workstation for analysis.

Within the top-level directory you find two items: analysis.txt and a directory containing the HTML representation of the data gathered.

[martin@host dashboard1]$ ls -l
total 14588
-rw-r--r-- 1 martin martin 14931077 Oct 16 22:07 analysis.txt
drwxr-xr-x 7 martin martin     4096 Oct 16 22:07 dashboard
[martin@host dashboard1]$

Let’s start with the latter to get a first impression.

The HTML output

I have included a couple of screenshots that you might find interesting about the system under observation. If this was a “real” system and not an undersized set of VMs I’d probably be quite nervous by just looking at this data … there is no reason for concern though, this is just my lab! And I did my best to hurt the VMs, otherwise there wouldn’t be anything to show here :)

overview

As you can see, all the main components are colour-coded. A number of important charts are presented as well. CPU looks particularly bad. You can enlarge the images by clicking on them.

The dashboard provides you with detailed information for CPU, memory, I/O and networking. I’d like to show you a few of these now.

CPU

Clicking on the red CPU button, I get more details about CPU utilisation on the system.

cpu

There are actually a few more charts, but they didn't fit on the screen. I like the fact that I am immediately presented with the critical findings at the top of the page, with more information in the form of charts. Again, clicking on a chart provides a larger picture.

Heading over to the text file I mentioned earlier (or clicking on the “details” button, not shown in the figure), I get more detail:

############################################################################
# Section 4.2: CPU UTILIZATION: PERCENT BUSY
# CPU utilization should not be high over long periods of time. The higher 
# the cpu utilization the longer it will take processes to run.  Below lists 
# the number of times (NUMBER) and percent of the number of times (PERCENT) 
# that cpu percent busy was High (>95%) or Very High (100%). Pay attention 
# to high spanning multiple snaps as this represents the number of times cpu
# percent busy remained high in back to back snapshots
                                       NUMBER  PERCENT
------------------------------------------------------
Snaps captured in archive                 5568   100.00
High (>95%)                                 48     0.86
Very High (100%)                            42     0.75
High spanning multiple snaps                24     0.43

CPU UTILIZATION: The following snaps recorded cpu utilization of 100% busy:
SnapTime                      
------------------------------
Wed Oct 10 11:08:56 BST 2018
Wed Oct 10 11:09:29 BST 2018
Wed Oct 10 20:14:54 BST 2018
[...]

Using these data points I can drill down further into the information gathered by OSWatcher in the archive directory and look at specific output.
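
For example, to look at the raw vmstat snapshots around the first of those timestamps, something along these lines works. Note that the archive path and the exact file name below are assumptions based on the naming pattern shown earlier, so adjust them to your system:

# drill into the raw vmstat data around Wed Oct 10 11:08 (path and file name are assumptions)
cd $ORACLE_BASE/tfa/repository/suptools/$(hostname)/oswbb/oracle/oswbb/archive/oswvmstat
grep -A 20 "Oct 10 11:08" rac18pri1_vmstat_18.10.10.1100.dat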

Memory

The memory button is also in bright red, so something must be wrong in that area.

memory

The system doesn't seem to be in the best shape. This is entirely my fault: I have assigned too little memory to my VMs, and here is the proof!

Changing the scope

After the first investigation using all the data, you may want to narrow the scope down a bit. Or you might already know that an issue was reported last night between 2 AM and 4 AM. Narrowing the scope down for the purpose of creating a (more detailed) dashboard can be done using option “S” in the initial menu, as shown here:

Please Select an Option:S

Specify Analysis Start Time. Valid entry between Oct 10 07:00:03 2018 and Oct 16 21:31:59 2018
Example Format To Enter Time: Oct 10 07:00:03 2018  :Oct 11 21:00:00 2018

Specify Analysis End Time. Valid entry between Oct 10 07:00:03 2018 and Oct 16 21:31:59 2018
Example Format To Enter Time: Oct 16 21:31:59 2018  :Oct 12 01:00:00 2018

Dates accepted. Verifying valid begin/end data points...

Validating times in the archive...


Recalibrating data...
Scanning file headers for version and platform info...

Parsing file rac18pri1_iostat_18.10.11.2000.dat ...

[...]

Parsing file rac18pri1_ps_18.10.12.0000.dat ...

Enter a unique analysis directory name or enter  to accept default name:

There are other ways to narrow the scope down: menu option “Z” allows you to zoom the graph time scale for interactive use without changing the analysis dataset. After zooming you can create any chart of interest (menu items 1-5), and you can reset the zoom by pressing “B”.

Summary

Did I say I really like OSWatcher? It's genuinely helpful. If you install it together with TFA you don't even need to concern yourself with writing a startup script; it'll just be started. If you have to, you can go back in time (within reason) and investigate O/S performance statistics. Unlike sar, it gives you 30-second granularity, which is a lot better than sar's rather coarse 10-minute interval.

Admittedly, sar keeps a month's worth of data, so the two tools really complement each other quite nicely.

Happy troubleshooting!

Ansible tips’n’tricks: assessing your runtime environment

One thing that I frequently need to do is test for a certain condition, and fail if it is not met. After all, I want to write those playbooks in a safe way.

Here is an example: I need to ensure that my playbook only runs on Oracle Linux 7. How can I do this? Ansible offers a shell and a command module (make sure you read the notes in the command module documentation!), so I could simply write something testing for the output of, let’s say, /etc/os-release.
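
Just to illustrate what that could look like, here is a quick ad-hoc sketch using the command module against my test host (not the approach I ended up with):

# the "messy" approach: run a command on the target and eyeball the output
ansible -i inventory.yml -m command -a 'cat /etc/os-release' server1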

This is totally possible, although I believe it’s a bit messy and there is a more elegant way requiring far less coding. Ansible maintains a whole raft of variables it gathers when you run a playbook. Here is an example (I am again using the “debug” stdout_callback as described in my earlier blog posts):

[martin@controller ansible]$ ansible-playbook -i inventory.yml test.yml

PLAY [blogpost] ****************************************************************

TASK [Gathering Facts] *********************************************************
ok: [server1]

TASK [say hello] ***************************************************************
ok: [server1] => {}

MSG:

hello world

PLAY RECAP *********************************************************************
server1                    : ok=2    changed=0    unreachable=0    failed=0   

When invoking the play “blogpost” (I have defined my host “server1” to be part of a group named “blogpost” in the local inventory file), I can see that the first task is to gather facts. This might seem unusual, since I haven't defined any such task:

[martin@controller ansible]$ cat test.yml 
---
- hosts: blogpost

  tasks:
    - name: say hello
      debug:
        msg: hello world

Nevertheless, ansible gathers these facts, and they are most useful. But which of these exist, and what are their values? It is not too hard to find out. The following command invokes the setup module (using a local inventory mapping “server1” to group “blogpost”) specifically against host “server1”:

[martin@controller ansible]$ ansible -i inventory.yml -m setup server1 | head -n 25
server1 | SUCCESS => {
    "ansible_facts": {
        "ansible_all_ipv4_addresses": [
            "192.168.abc.24"
        ],
        "ansible_all_ipv6_addresses": [
            "fe80::5054:ff:fe91:abcd"
        ],
        "ansible_apparmor": {
            "status": "disabled"
        },
        "ansible_architecture": "x86_64",
        "ansible_bios_date": "04/01/2014",
        "ansible_bios_version": "1.10.2-2.fc27",
        "ansible_cmdline": {
            "BOOT_IMAGE": "/vmlinuz-4.1.12-94.3.9.el7uek.x86_64",
            "LANG": "en_US.UTF-8",
            "crashkernel": "auto",
            "quiet": true,
            "rd.lvm.lv": "rootvg/swaplv",
            "rhgb": true,
            "ro": true,
            "root": "/dev/mapper/rootvg-rootlv"
        },
        "ansible_date_time": {

There is a lot more output! You can query pretty much everything using these built-in variables. So what about our distribution? Let’s check the output of the setup module for a string containing the word “distribution”:

[martin@controller ansible]$ ansible -i inventory.yml -m setup server1 | grep -i distribution
        "ansible_distribution": "OracleLinux", 
        "ansible_distribution_file_parsed": true, 
        "ansible_distribution_file_path": "/etc/oracle-release", 
        "ansible_distribution_file_search_string": "Oracle Linux", 
        "ansible_distribution_file_variety": "OracleLinux", 
        "ansible_distribution_major_version": "7", 
        "ansible_distribution_release": "NA", 
        "ansible_distribution_version": "7.5", 

So it looks like there is something we could use here. I can re-write my playbook now to check for Oracle Linux 7 as follows:

[martin@controller ansible]$ cat test2.yml 
---
- hosts: blogpost

  tasks:
    - name: print debuginfo
      debug:
        var: "{{ item }}" 
      with_items:
        - ansible_distribution
        - ansible_distribution_major_version

    - name: ensure we run Oracle Linux 7
      fail:
        msg: This playbook only runs on Oracle Linux 7
      when: (ansible_distribution != "OracleLinux" or ansible_distribution_major_version|int != 7)

This does work all right when running against an Oracle Linux 7 system.

[martin@controller ansible]$ ansible-playbook -i inventory.yml test2.yml

PLAY [blogpost] ****************************************************************

TASK [Gathering Facts] *********************************************************
ok: [server1]

TASK [print debuginfo] *********************************************************
ok: [server1] => (item=ansible_distribution) => {
    "ansible_distribution": "OracleLinux", 
    "item": "ansible_distribution"
}
ok: [server1] => (item=ansible_distribution_major_version) => {
    "ansible_distribution_major_version": "7", 
    "item": "ansible_distribution_major_version"
}

TASK [ensure we run Oracle Linux 7] ********************************************
skipping: [server1]

PLAY RECAP *********************************************************************
server1                    : ok=2    changed=0    unreachable=0    failed=0   

The “skipping” message indicates that the task [ensure we run Oracle Linux 7] has not fired, and we can proceed with life. Pointing this playbook at another environment not running Oracle Linux 7, the play fails:

[martin@controller ansible]$ ansible-playbook -i inventory.yml test2.yml

PLAY [blogpost] ****************************************************************

TASK [Gathering Facts] *********************************************************
ok: [linuxdev]

TASK [print debuginfo] *********************************************************
ok: [linuxdev] => (item=ansible_distribution) => {
    "ansible_distribution": "Fedora", 
    "item": "ansible_distribution"
}
ok: [linuxdev] => (item=ansible_distribution_major_version) => {
    "ansible_distribution_major_version": "28", 
    "item": "ansible_distribution_major_version"
}

TASK [ensure we run Oracle Linux 7] ********************************************
fatal: [linuxdev]: FAILED! => {
    "changed": false
}

MSG:

This playbook only runs on Oracle Linux 7


PLAY RECAP *********************************************************************
linuxdev                   : ok=2    changed=0    unreachable=0    failed=1

So there we go! This example can easily be extended to check for other things, such as a combination of UEK and Oracle Linux, and so on.

Happy scripting!

Ansible tips’n’tricks: even more output options

In my last post I wrote about the “debug” option to format Ansible output differently. I came across this setting simply by searching the usual developer forums for an alternative Ansible output option.

Having found out about the “debug” option made me curious, especially since there wasn't an awful lot of documentation available about additional alternatives. Or so I thought before writing this post; there actually is, as you will see later. So, to recap what I had so far: I noticed “skippy” in my distribution's /etc/ansible/ansible.cfg, although it is commented out; I found the “debug” option via my favourite search engine; and there is the “default” option as well.

There surely had to be more …

This wasn't quite good enough for me, and I started to wonder if there were more of these callbacks. Here is my Ansible version, in case some of these callbacks turn out to be quite recent:

[martin@controller environment]$ ansible --version | head -n 1
ansible 2.6.4

My first idea was to perform a file system search using find. I assumed that looking for “debug” or “default” would result in too many irrelevant hits, but “skippy” sounded like a unique enough search term.

$ find / -iname "*skippy*" 2> /dev/null
/usr/lib/python2.7/site-packages/ansible/plugins/callback/skippy.py
/usr/lib/python2.7/site-packages/ansible/plugins/callback/skippy.pyc
/usr/lib/python2.7/site-packages/ansible/plugins/callback/skippy.pyo
[martin@linuxdev ~]$ 

This looks like a hit! I opened the first file, skippy.py, and found a documentation section:

DOCUMENTATION = '''
    callback: skippy
    callback_type: stdout
    requirements:
      - set as main display callback
    short_description: Ansible screen output that ignores skipped status
    version_added: "2.0"
    extends_documentation_fragment:
      - default_callback
    description:
        - This callback does the same as the default except it does not output skipped host/task/item status

So there you go, more information about the purpose of this callback! Are there any others that might be useful candidates for stdout_callback? Possibly! Let’s see (still in the callback directory)…

$ grep -i 'type: stdout' *.py
actionable.py:    type: stdout
debug.py:    type: stdout
default.py:    type: stdout
dense.py:type: stdout
full_skip.py:    type: stdout
json.py:    type: stdout
minimal.py:    type: stdout
null.py:    callback_type: stdout
oneline.py:    type: stdout
selective.py:    callback_type: stdout
skippy.py:    callback_type: stdout
stderr.py:    callback_type: stdout
unixy.py:    type: stdout
yaml.py:    type: stdout

Having unearthed this list, I eventually found the relevant part of the documentation set! So instead of showing you the DOCUMENTATION variable for each of these callbacks, I'll point you to the official Ansible 2.6 documentation:

https://docs.ansible.com/ansible/2.6/plugins/callback.html#plugin-list
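
If you prefer to stay on the command line, ansible-doc can list and describe callback plugins as well; a quick sketch (I believe the -t option covers callbacks in this release):

# list all callback plugins known to the local Ansible installation
ansible-doc -t callback -l

# show the embedded documentation for a specific one, e.g. skippy
ansible-doc -t callback skippy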

Happy scripting!

Ansible tips’n’tricks: a different output option

When running ansible scripts, occasionally you wonder why a given task has failed. I found out more than once that it’s commonly a problem with the script, not the engine ;) Finding out exactly where in the script I made the mistake can be more of a challenge.

With the default ansible settings, output can be a bit hard to read. Consider this example: I do quite a bit of patching in my lab, and this almost always requires an upgrade of OPatch (d'oh!). So instead of connecting to each of my hosts and performing the same unzip command over and over again, I thought of using something else. Why not use ansible for this task? It won't get tired of copying/unzipping OPatch to all the destinations I indicate in my configuration. And it won't introduce a mistake when dealing with the fifth ORACLE_HOME on the third server…

Before replacing $ORACLE_HOME/OPatch with the new version, I want to take a backup of the current OPatch just in case. I don’t want to keep more than 1 backup around in this particular lab environment, so I decided to check for an existing backup first, before creating a new one. If one exists, I remove it. Or at least, that’s the plan.

So I was happily coding away and, with my usual trial-and-error approach, was ready to test the script I had written for the first time. Here's the result (as shown in my 80×24 terminal):

[martin@controller environment]$ ansible-playbook -i inventory.yml broken.yml

PLAY [blogpost] ****************************************************************

TASK [Gathering Facts] *********************************************************
ok: [server1]

TASK [check if there is an old backup] *****************************************
ok: [server1]

TASK [remove old OPatch backup] ************************************************
fatal: [server1]: FAILED! => {"msg": "The conditional check 'backup_present.exis
ts' failed. The error was: error while evaluating conditional (backup_present.ex
ists): 'dict object' has no attribute 'exists'\n\nThe error appears to have been
 in '/home/martin/ansible/blogpost/environment/broken.yml': line 20, column 11, 
but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe 
offending line appears to be:\n\n\n        - name: remove old OPatch backup\n  
        ^ here\n"}

PLAY RECAP *********************************************************************
server1                    : ok=2    changed=0    unreachable=0    failed=1   

[martin@controller environment]$ 

It doesn't really matter what I was trying to do here; what matters is the somewhat illegible formatting of the output. The listing above shows exactly how the error appeared in my terminal. I haven't quite understood yet why there are line breaks (\n) in the output that don't result in a carriage return on screen.

So I did a little bit of digging around and found a global setting named stdout_callback. This is usually defined in /etc/ansible/ansible.cfg, which would be bad news for developers if we couldn't override it. Thankfully you can, using $HOME/.ansible.cfg or even an ansible.cfg file in your project directory. Setting stdout_callback to “debug” reveals a much more readable version of the error:

TASK [remove old OPatch backup] ************************************************
fatal: [server1]: FAILED! => {}

MSG:

The conditional check 'backup_present.exists' failed. The error was: error while
 evaluating conditional (backup_present.exists): 'dict object' has no attribute 
'exists'

The error appears to have been in '/home/martin/ansible/blogpost/environment/bro
ken.yml': line 20, column 11, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


        - name: remove old OPatch backup
          ^ here


PLAY RECAP *********************************************************************
server1                    : ok=2    changed=0    unreachable=0    failed=1

I find this much easier to read, and by setting stdout_callback to a non-default value in my project directory I don't break anything inadvertently. It also immediately revealed that I should have been checking backup_present.stat.exists, whereas I used backup_present.exists. Pretty-printing the output helped me find the mistake much more quickly. Later on, when your script is ready to be deployed, it's probably a good idea not to use the debug callback ;)
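
For the record, here's roughly what that project-local override looks like, and, if you prefer not to touch any config file at all, the same thing can be done per run via an environment variable (a sketch):

# create/overwrite a project-local ansible.cfg that only overrides the output callback
cat > ansible.cfg <<'EOF'
[defaults]
stdout_callback = debug
EOF

# one-off alternative: set the callback for a single run only
ANSIBLE_STDOUT_CALLBACK=debug ansible-playbook -i inventory.yml broken.yml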

The ansible version in this post is 2.6.4 by the way.

Happy scripting!

Ansible tips’n’tricks: a new series

In the past few months I have spent considerable amounts of time working on new technology (new as in “new to me”), and one of the things I have found a great interest in, and more importantly use cases for, is ansible. I don't know how it is with you, but if I don't have a practical use case for a technology I find it hard to get familiar with it.

Ansible is a fantastic piece of technology with decent documentation and great community support. I didn't find writing ansible code too hard. Most of the problems I have encountered while learning how to write ansible playbooks have, in some way or another, already been encountered and solved by other users. It does require a bit of Internet research, and you will need to check whether the solution offered somewhere on the Internet is still valid with the current version of the tool. As with a lot of other software, ansible evolves at a rather quick pace and keeping up can be a bit of a challenge.

I promise not to go into detail about what ansible is; others have done so, and I'm not a great fan of repeating what cleverer people have said. Head over to the official documentation and see for yourself what ansible is. For the most part I consider it pretty straightforward and elegant. It sure beats my attempts at writing bash code ;)

Having said that, it might take a minute or so to get used to the markup language; most so-called playbooks are written in YAML, and you learn to appreciate the importance of white space in your scripts. Another key concept I personally found quite important to understand is idempotency: a script that is idempotent can safely be re-run multiple times. There are other important concepts to be aware of, and you should consult the documentation to learn more about them if you aren't yet familiar with them.

My main reason for using ansible is to automate routine tasks in my personal lab environment, and the little series I'm planning to write is primarily a “note to self” so I don't forget how to do things. As with the Oracle posts I plan on writing, I'll add a “tested/written with version x” note so that later on, when someone hits this article series via a search engine, they know what it takes to do the same.

The things I'm planning on writing about are those that cost me a bit more time on <enter your favourite developer forum here>, and I'm hoping these little snippets can help you with your automation project.

Creating a RAC 12.1 Data Guard Physical Standby environment (3b)

Huh, what is this, I hear you ask? Part 3b? Oracle 12.1? Well, there's a bit of a story to this post. Back in December 2016 I started to write a series of blog posts (part 1 | part 2 | part 3 | part 4) about how I created a standby database on RAC 12.1. For some reason I forgot to publish this part: up until now, the step where I create the broker configuration was missing. Thanks to a friend I discovered my mistake, and I decided to post this article to complete the series. There was a pretty big logical gap that needed filling :)

I also deliberately decided against making changes to my notes, so this is written in the same style as the articles preceding this one. Also bear in mind that this is Oracle 12.1!

NOTE: As always, this is just a demonstration using VMs in my lab, based on my notes. Your system is most likely different, so in real-life you might take a different approach. The techniques I am using here were suitable for me, and my own small-scale testing. I tried to make sure they are valid, but you may want to allocate more resources in your environment. Test, test, test on your own environment on test kit first!

Here is the original blog post

In a last-minute decision I decided to split part 3 in half: first the database duplication, followed by the Data Guard configuration. It really got a bit too long (even by my standards) and would have put readers off the article. I appreciate your patience though…

What’s the situation?

At the end of part 3 I had a physical standby database ready for use, including its registration in the OCR. It was looking promising, but there was more work to do.

At the moment, the standby database is out of sync with the primary, a fact that needs to be rectified by creating a Data Guard configuration. I prefer to do this on the command line. If you are so inclined you can of course push the mouse around and do it with OEM.

Step 1: Data Guard Broker pre-requisites

Before creating the configuration you should change the location of the broker configuration files to shared storage. This needs to be done on each side – primary and every standby – before you can start the broker.

SQL> alter system set dg_broker_config_file1='+DATA/NCDBA/dr1NCDBA.dat' scope=both sid='*';

System altered.

SQL> alter system set dg_broker_config_file2='+DATA/NCDBA/dr2NCDBA.dat' scope=both sid='*';

System altered.

SQL> show parameter dg_broker

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
dg_broker_config_file1               string      +DATA/NCDBA/dr1NCDBA.dat
dg_broker_config_file2               string      +DATA/NCDBA/dr2NCDBA.dat
dg_broker_start                      boolean     FALSE

SQL> alter system set dg_broker_start=true scope=both sid='*';

System altered.

SQL> 

I had to repeat the step on NCDBB, of course with adapted paths and file names.

Step 2: Creating a Data Guard Broker configuration

Once that's done, you create the configuration using dgmgrl, the Data Guard broker command-line interface. I named my configuration RACTEST and added both databases. Here is an example of my session:

[oracle@rac12pri1 ~]$ dgmgrl
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys@ncdba
Password:
Connected as SYSDBA.
DGMGRL> create configuration ractest as primary database is 'NCDBA' connect identifier is 'NCDBA';
Configuration "ractest" created with primary database "NCDBA"
DGMGRL> add database 'NCDBB' as connect identifier is 'NCDBB';
Database "NCDBB" added
DGMGRL> show configuration

Configuration - ractest

  Protection Mode: MaxPerformance
  Members:
  NCDBA - Primary database
    NCDBB - Physical standby database 

Fast-Start Failover: DISABLED

Configuration Status:
DISABLED

As a first step I connect as SYSDBA before creating the configuration. You can type help <command> in dgmgrl to remind yourself of the exact syntax. After the configuration is created you can start adding (standby) databases. In my case there is only one (physical standby) database, but you can think of environments with more than one standby, although that is less common. When specifying the connect identifier you use valid Oracle Net aliases found in the default tnsnames.ora (the one in the RDBMS home).
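
A quick way to make sure those aliases actually resolve is tnsping, run with the environment pointing at the RDBMS home (just a sanity-check sketch):

# verify that both connect identifiers resolve against the RDBMS home's tnsnames.ora
tnsping NCDBA
tnsping NCDBB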

At this point it is important to set standby_file_management to auto. Otherwise managed recovery can abort, and you will have to do extra work managing data files that could easily have been avoided. The broker syntax for this involves changing database properties (type show database verbose 'NCDBA' to list all of them). Many Data Guard broker properties are not easily mapped back to their corresponding initialisation parameters; in the case of standby_file_management it is not hard, and the documentation helps with the rest.

Eventually I enable the configuration and check its status. I have found that it can sometimes take a couple of minutes to get the SUCCESS message.

DGMGRL> edit database 'NCDBA' set property standbyfilemanagement=auto;
Property "standbyfilemanagement" updated
DGMGRL> edit database 'NCDBB' set property standbyfilemanagement=auto;
Property "standbyfilemanagement" updated
DGMGRL> enable configuration
Enabled.
DGMGRL> show configuration

Configuration - ractest

  Protection Mode: MaxPerformance
  Members:
  NCDBA - Primary database
    NCDBB - Physical standby database 

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 21 seconds ago)

There are other things you must consider in accordance with your company's (data protection/availability/business continuity/…) standards and other documentation; again, my example is rather basic, but deliberately so. It's my lab playground after all.

For the sake of completeness, here is the current configuration of NCDBA and NCDBB:

DGMGRL> show database 'NCDBA'

Database - NCDBA

  Role:               PRIMARY
  Intended State:     TRANSPORT-ON
  Instance(s):
    NCDBA1
    NCDBA2

Database Status:
SUCCESS

DGMGRL> show database 'NCDBB';

Database - NCDBB

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 1 second ago)
  Apply Lag:          0 seconds (computed 1 second ago)
  Average Apply Rate: 223.00 KByte/s
  Real Time Query:    OFF
  Instance(s):
    NCDBB1
    NCDBB2 (apply instance)

Database Status:
SUCCESS

DGMGRL> exit

Please note that the broker detected that my systems are cluster databases and lists both instances for each database. It also shows that NCDBB2 is the apply instance. I was positively surprised by the broker's ability to detect both instances automatically; I seem to remember cases in earlier releases where said instance detection occasionally required a little nudging to get it right.

This example is deliberately kept as minimalist as possible: it works. Most likely it won’t satisfy your requirements. Review your Data Guard (and any other relevant) standards and change the configuration accordingly.

Summary

… back to the present :) This post completes the series; all my notes are now transcribed to the blog. I hope you find it useful.

back to part 3 | on to part 4

RAC One node databases are relocated by opatchauto in 12.2 part 2

In a previous post I shared how I found out that RAC One Node databases are relocated online during patching, and I promised a few more tests and a look at the implications. If you aren't familiar with RAC One Node I recommend having a look at the official documentation: the Real Application Clusters Administration and Deployment Guide features it prominently in the introduction. One of the things I like to keep in mind when working with RAC One Node is this, quoting from section 1.3 in said admin and deployment guide:

Using the Single Client Access Name (SCAN) to connect to the database, clients can locate the service independently of the node on which it is running. Relocating an Oracle RAC One Node instance is therefore mostly transparent to the client, depending on the client connection. Oracle recommends to use either Application Continuity and Oracle Fast Application Notification or Transparent Application Failover to minimize the impact of a relocation on the client.

A little later, you can read this important additional piece of information in the “notes” section:

To prepare for all failure possibilities, you must add at least one Dynamic Database Service (Oracle Clusterware-managed database service) to an Oracle RAC One Node database.

If you created your RAC One Node database with the Database Configuration Assistant (dbca), you are already complying with that rule. In my case, my lab database is named RON with the mandatory service RON_SVC, and I opted for an instance prefix of “DCB”.

What I’d like to try out for this blog post is what happens to an active workload on a RAC One database during patching.

I have successfully patched multi-node RAC systems before, but that required the application to support the procedure. One of my favourite talks is named “Advanced RAC programming techniques”, in which I demonstrate the resilience of a RAC-based application to node failures using Universal Connection Pool (UCP), JDBC and Fast Connection Failover (FCF). UCP is a Fast Application Notification (FAN)-aware connection pool, allowing my application to react to cluster events such as node up/node down. The idea is to mask instance failures from users.

If all of this sounds super-cryptic, I suggest chapters 5 and 6 of the RAC Administration and Deployment Guide, where you can read more about Workload Management with Dynamic Database Services and Application Continuity in RAC.

RAC One Node is different from multi-node RAC in that it is only ever active on one node during normal operations. Online relocation, as described in the previous post, works by temporarily adding a second (destination) instance to the cluster and moving all transactions across before terminating any remaining ones after a time-out (30 minutes by default). Once the time-out is reached (or all sessions have made it across to the destination), the source instance is shut down and you are back to one active instance.
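
Outside of patching you can trigger the same mechanism yourself with srvctl; the sketch below uses my lab's database and node names and is essentially the command opatchauto issues later in this post:

# manually relocate the RAC One Node database to rac122sec1, waiting up to 30 minutes for sessions to drain
srvctl relocate database -d DCB -n rac122sec1 -w 30 -v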

The online relocation does not care too much about whether the application is compatible with the process. If your application is written for RAC, it will most likely migrate quickly from the source to the destination instance. If it isn't, well, the hard timeout might kick in and evict a few of your users. In the worst case your users need to re-connect to the database; worse still, you might have to restart the middle-tier system(s).

Sadly I haven’t seen too many applications capable of handling RAC events properly. One application that does is Swingbench, so I’ll stick with it. I configured it according to Dominic Giles’s blog post.

This post assumes that you are properly licensed to use all these features.

The environment

My setup hasn't changed from the previous post, so I'll spare you the boredom of repeating it here. Jump over to the other post for details.

Preparations

For this blog post I need to ensure that my RAC One Node database resides on the node I am about to patch. I have again followed the patch readme very carefully, and I have made sure that I have (proven, working, tested) backups of the entire stack and all the rest of it…

My database is indeed actively running on the node I am about to patch:

[oracle@rac122sec2 ~]$ srvctl status database -db DCB -detail -verbose
Instance DCB_1 is running on node rac122sec2 with online services RON_SVC. Instance status: Open.
Instance DCB_1 is connected to ASM instance +ASM2
Online relocation: INACTIVE

Before I can benefit from Application Continuity, I need to make a few changes to my application service, RON_SVC. There are quite a few sources to choose from; I went with the JDBC Developer's Guide. Here's the modification:

[oracle@rac122sec2 ~]$ srvctl modify service -db DCB -service RON_SVC -commit_outcome true \
> -failoverretry 30 -failoverdelay 10 -failovertype transaction \
> -replay_init_time 1800 -retention 86400 -notification true
[oracle@rac122sec2 ~]

Following the instructions on Dominic Giles’s blog, I also need to grant SOE the right to execute DBMS_APP_CONT.
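
For completeness, that grant is a one-liner; a sketch, run as a suitably privileged user against the RAC One Node database (environment assumed to be set accordingly):

# allow the Swingbench schema owner to execute the Application Continuity package
sqlplus / as sysdba <<'EOF'
grant execute on dbms_app_cont to soe;
exit
EOF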

Finally, I need to make changes to my Swingbench configuration file. The relevant part is shown here:

<SwingBenchConfiguration xmlns="http://www.dominicgiles.com/swingbench/config">
    <Name>"Order Entry (PLSQL) V2"</Name>
    <Comment>Version 2 of the SOE Benchmark running in the database using PL/SQL</Comment>
    <Connection>
        <UserName>soe</UserName>
        <Password></Password>
        <ConnectString>(DESCRIPTION=(TRANSPORT_CONNECT_TIMEOUT=5)
        (RETRY_COUNT=3)(FAILOVER=ON)
        (ADDRESS=(PROTOCOL=TCP)(HOST=rac122sec-scan)(PORT=1521))
        (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=RON_SVC)))</ConnectString>
        <DriverType>Oracle jdbc Driver</DriverType>
        <ConnectionPooling>
            <PooledInitialLimit>5</PooledInitialLimit>
            <PooledMinLimit>5</PooledMinLimit>
            <PooledMaxLimit>10</PooledMaxLimit>
            <PooledInactivityTimeOut>50</PooledInactivityTimeOut>
            <PooledConnectionWaitTimeout>45</PooledConnectionWaitTimeout>
            <PooledPropertyCheckInterval>10</PooledPropertyCheckInterval>
            <PooledAbandonedConnectionTimeout>120</PooledAbandonedConnectionTimeout>
        </ConnectionPooling>
        <Properties>
            <Property Key="StatementCaching">50</Property>
            <Property Key="FastFailover">true</Property>
            <Property Key="FetchSize">20</Property>
            <Property Key="OnsConfiguration">nodes=rac122sec1:6200,rac122sec2:6200</Property>
            <Property Key="FetchSize">20</Property>
            <Property Key="AppContinuityDriver">true</Property>
        </Properties>
    </Connection>
    <Load>
        <NumberOfUsers>5</NumberOfUsers>
        <MinDelay>0</MinDelay>
        <MaxDelay>0</MaxDelay>
        <InterMinDelay>50</InterMinDelay>
        <InterMaxDelay>500</InterMaxDelay>
        <QueryTimeout>120</QueryTimeout>
        <MaxTransactions>-1</MaxTransactions>
        <RunTime>0:0</RunTime>
        <LogonGroupCount>1</LogonGroupCount>
        <LogonDelay>1000</LogonDelay>
        <LogOutPostTransaction>false</LogOutPostTransaction>
        <WaitTillAllLogon>false</WaitTillAllLogon>
        <StatsCollectionStart>0:0</StatsCollectionStart>
        <StatsCollectionEnd>0:0</StatsCollectionEnd>
        <ConnectionRefresh>0</ConnectionRefresh>
        <TransactionList>
...

The connection string is actually on a single line; I have formatted it for readability in the example above. The main changes from the standard configuration file are the use of connection pooling and setting the required properties for Application Continuity.

Let’s patch!

Once all the preparations are completed, it's time to see how RAC One Node deals with an active workload undergoing an online relocation during patching. First of all I need to start the workload. I'd normally use charbench for this, but this time around I opted for the GUI, which shows performance graphs over a three-minute rolling window.

A few minutes after starting the benchmark I commenced patching. Soon thereafter, opatchauto told me that it was relocating the database.

OPatchauto session is initiated at Fri Jul 27 14:52:23 2018

...

Relocating RACOne home before patching on home /u01/app/oracle/product/12.2.0.1/dbhome_1

According to the session log, this happened at 14:54. And by the way, always keep the session log; it's invaluable!

2018-07-27 14:54:07,914 INFO  [163] com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - 
Executing command as oracle: 
 /bin/sh -c 'ORACLE_HOME=/u01/app/oracle/product/12.2.0.1/dbhome_1 
 /u01/app/oracle/product/12.2.0.1/dbhome_1/bin/srvctl relocate database -d DCB -n rac122sec1 -w 30 -v'
2018-07-27 14:54:07,914 INFO  [163] com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - 
  COMMAND Looks like this: 
    /bin/sh -c 'ORACLE_HOME=/u01/app/oracle/product/12.2.0.1/dbhome_1 
    /u01/app/oracle/product/12.2.0.1/dbhome_1/bin/srvctl relocate database -d DCB -n rac122sec1 -w 30 -v'
2018-07-27 14:55:31,736 INFO  [163] com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - 
  status: 0
2018-07-27 14:55:31,737 INFO  [163] com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - 
  Successfully executed the above command.

2018-07-27 14:55:31,737 INFO  [163] com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - 
  Output from the command:
2018-07-27 14:55:31,737 INFO  [163] com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor -
Configuration updated to two instances
Instance DCB_2 started
Services relocated
Waiting for up to 30 minutes for instance DCB_1 to stop ...
Instance DCB_1 stopped
Configuration updated to one instance

2018-07-27 14:55:31,737 INFO  [163] com.oracle.glcm.patch.auto.db.product.executor.PatchingStepExecutor - 
  Command executed successfully.

You can see the same information by querying Clusterware, although there aren’t any timestamps associated with it:

[root@rac122sec1 ~]# srvctl status database -db DCB -detail -verbose
Instance DCB_1 is running on node rac122sec2 with online services RON_SVC. Instance status: Open.
Instance DCB_1 is connected to ASM instance +ASM2
Online relocation: ACTIVE
Source instance: DCB_1 on rac122sec2
Destination instance: DCB_2 on rac122sec1
[root@rac122sec1 ~]# 

Although the online relocation timeout was set to 30 minutes, the use of modern coding techniques and connection pooling allowed for a much faster online relocation. As you can see in the log excerpt, the entire relocation completed at 2018-07-27 14:55:31. Clusterware now tells me that my database runs on node 1:

[root@rac122sec1 ~]# date;  srvctl status database -db DCB -detail -verbose
Fri 27 Jul 14:55:38 2018
Instance DCB_2 is running on node rac122sec1 with online services RON_SVC. Instance status: Open.
Instance DCB_2 is connected to ASM instance +ASM1
Online relocation: INACTIVE

While this is wicked, what are the implications for the application? Have a look at this screenshot, taken a minute after the online relocation completed.

As you can see, nothing unusual was recorded (the Events tab on the left), and I couldn't see a drop in the number of sessions connected. I noticed a slight blip in performance, but it recovered very soon thereafter.

Summary

During automatic patching of the Oracle stack, opatchauto will perform an online relocation of a RAC One Node database if it is found running on the node currently undergoing patching. If your application is developed with RAC in mind, using a FAN-aware connection pool such as UCP and supporting either Fast Connection Failover or Application Continuity, there is a good chance that patching the system will not affect your service.

While my test was successful, it is by no means representative of a production workload: my RAC One Node database has an SGA of 4 GB and there were 5 concurrent users, hardly what I see in the real world. However, that shouldn't deter you: if you are using RAC One Node, I think it's definitely worth trying to bring this kind of technology into the application.