OSWatcher as included in AHF 20.1.2 fails to start for single instance Oracle

I am about to update my posts detailing the use of Tracefile Analyzer (TFA), now that Oracle has merged it into its Autonomous Health Framework (AHF), and in the process came across an interesting observation worth blogging about up front.

After completing a fresh installation of AHF 20.1.2, the current version at the time of writing, I noticed OSWatcher didn’t start on my VM. I am running a single-instance Oracle 19.7.0 database on Oracle Linux 7.8/UEK 5; the system does not contain any traces of Grid Infrastructure.

I fully expect this problem to be transient, but until Oracle provides a fix I wanted to share my workaround. I didn’t find this problem covered in the usual sources, including My Oracle Support.

For the record, this is the version I can confirm to be affected:

[root@server4 ~]# /opt/oracle.ahf/bin/tfactl status

.-----------------------------------------------------------------------------------------------.
| Host    | Status of TFA | PID  | Port  | Version    | Build ID             | Inventory Status |
+---------+---------------+------+-------+------------+----------------------+------------------+
| server4 | RUNNING       | 9808 | 10905 | 20.1.2.0.0 | 20120020200403113404 | COMPLETE         |
'---------+---------------+------+-------+------------+----------------------+------------------'
[root@server4 ~]#  

The Problem

Right after ahf_setup completes, I can see that OSWatcher (oswbb) isn’t running:

[oracle@server4 ~]$ tfactl toolstatus

.------------------------------------------------------------------.
|                   TOOLS STATUS - HOST : server4                  |
+----------------------+--------------+--------------+-------------+
| Tool Type            | Tool         | Version      | Status      |
+----------------------+--------------+--------------+-------------+
| Development Tools    | orachk       |   19.3.0.0.0 | DEPLOYED    |
|                      | oratop       |       14.1.2 | DEPLOYED    |
+----------------------+--------------+--------------+-------------+
| Support Tools Bundle | darda        | 2.10.0.R6036 | DEPLOYED    |
|                      | oswbb        |        8.3.2 | NOT RUNNING |
|                      | prw          | 12.1.13.11.4 | NOT RUNNING |
+----------------------+--------------+--------------+-------------+
| TFA Utilities        | alertsummary |   19.3.0.0.0 | DEPLOYED    |
|                      | calog        |   19.3.0.0.0 | DEPLOYED    |
|                      | dbcheck      |   18.3.0.0.0 | DEPLOYED    |
|                      | dbglevel     |   19.3.0.0.0 | DEPLOYED    |
|                      | grep         |   19.3.0.0.0 | DEPLOYED    |
|                      | history      |   19.3.0.0.0 | DEPLOYED    |
|                      | ls           |   19.3.0.0.0 | DEPLOYED    |
|                      | managelogs   |   19.3.0.0.0 | DEPLOYED    |
|                      | menu         |   19.3.0.0.0 | DEPLOYED    |
|                      | param        |   19.3.0.0.0 | DEPLOYED    |
|                      | ps           |   19.3.0.0.0 | DEPLOYED    |
|                      | pstack       |   19.3.0.0.0 | DEPLOYED    |
|                      | summary      |   19.3.0.0.0 | DEPLOYED    |
|                      | tail         |   19.3.0.0.0 | DEPLOYED    |
|                      | triage       |   19.3.0.0.0 | DEPLOYED    |
|                      | vi           |   19.3.0.0.0 | DEPLOYED    |
'----------------------+--------------+--------------+-------------'

Note :-
  DEPLOYED    : Installed and Available - To be configured or run interactively.
  NOT RUNNING : Configured and Available - Currently turned off interactively.
  RUNNING     : Configured and Available. 

[oracle@server4 ~]$ ps -ef | grep -i osw
oracle   14567 13566  0 19:34 pts/3    00:00:00 grep --color=auto -i osw 

Please refrain from simply starting OSWatcher as root, or by any other manual means for that matter. Doing so can have undesired consequences, as outlined in the Autonomous Health Framework documentation: it can mangle file system permissions in the OSWatcher archive directory, which is really, really hard to recover from, and it can also change your configuration.

Troubleshooting

The reason for OSWatcher not starting can be found in the system’s journal:

[root@server4 ~]# systemctl status oracle-tfa
● oracle-tfa.service - Oracle Trace File Analyzer
   Loaded: loaded (/etc/systemd/system/oracle-tfa.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-04-23 19:30:29 BST; 8min ago
 Main PID: 9697 (init.tfa)

...

Apr 23 19:36:14 server4 init.tfa[9697]: . . .
Apr 23 19:36:14 server4 init.tfa[9697]: Successfully stopped TFA..
Apr 23 19:36:14 server4 init.tfa[9697]: Starting TFA..
Apr 23 19:36:15 server4 init.tfa[9697]: Starting TFA out of init, Should be running in 10 seconds
Apr 23 19:36:15 server4 init.tfa[9697]: Successfully updated jvmXmx to 128 in TFA...
Apr 23 19:36:22 server4 init.tfa[9697]: OSWatcher is already deployed at /opt/oracle.ahf/tfa/ext/oswbb
Apr 23 19:36:22 server4 init.tfa[9697]: Cannot find valid Non root user to run OSWatcher 

The last line was unexpected: “Cannot find valid Non root user to run OSWatcher”.
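On a live system you can pull the OSWatcher-related lines out of the unit’s journal with journalctl piped through grep. The sketch below replays the excerpt shown above through that same filter so it is self-contained; on the server itself you would feed it the real journal output instead:

```shell
# On the server you would run (as root):
#   journalctl -u oracle-tfa | grep -i oswatcher
# Here a captured excerpt of the journal stands in for the live output.
cat <<'EOF' | grep -i oswatcher
Apr 23 19:36:15 server4 init.tfa[9697]: Successfully updated jvmXmx to 128 in TFA...
Apr 23 19:36:22 server4 init.tfa[9697]: OSWatcher is already deployed at /opt/oracle.ahf/tfa/ext/oswbb
Apr 23 19:36:22 server4 init.tfa[9697]: Cannot find valid Non root user to run OSWatcher
EOF
```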

Digging around a bit, I noticed that TFA fails to recognise the database home before emitting the error message you just read. To cut a long story short, when starting up TFA reads a file named tfa_setup.txt containing information about the system it’s running on. In my case the file was located at $TFA_DATA_DIR/$(hostname)/tfa/tfa_setup.txt.

This is what it looked like on my system; the exact location depends on the AHF data directory you specified during installation.

[root@server4 ~]# cat /opt/oracle.ahf/data/server4/tfa/tfa_setup.txt
INSTALL_TYPE=TYPICAL
PERL=/bin/perl
TFA_BASE=/tmp
JAVA_HOME=/opt/oracle.ahf/jre
DAEMON_OWNER=root
ENV_TYPE=TYPICAL
SIM_ENV_TYPE=none
SIMATP=FALSE
NODE_TYPE=TYPICAL
CRS_ACTIVE_VERSION=
[root@server4 ~]#  

I compared the file with its counterpart on a RAC system where OSWatcher worked out of the box, and noticed the absence of an RDBMS home entry. I added it to the file:

[root@server4 ~]# cat /opt/oracle.ahf/data/server4/tfa/tfa_setup.txt
INSTALL_TYPE=TYPICAL
PERL=/bin/perl
TFA_BASE=/tmp
JAVA_HOME=/opt/oracle.ahf/jre
DAEMON_OWNER=root
ENV_TYPE=TYPICAL
SIM_ENV_TYPE=none
SIMATP=FALSE
NODE_TYPE=TYPICAL
CRS_ACTIVE_VERSION=
RDBMS_ORACLE_HOME=/u01/app/oracle/product/19.0.0/dbhome_1||
[root@server4 ~]#  
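If you need to apply the same fix on several hosts, the edit can be scripted. The sketch below is deliberately conservative: it works on a scratch copy of the file (recreated via a heredoc so the snippet is self-contained), takes a backup first, and only appends the entry when it is missing. The ORACLE_HOME path is the one from my system; substitute your own, keep the trailing pipes exactly as shown, and on a real host point setup at the tfa_setup.txt under your AHF data directory instead.

```shell
# Scratch copy of tfa_setup.txt for illustration; on a real host set
# setup to $AHF_DATA_DIR/$(hostname)/tfa/tfa_setup.txt instead.
setup=$(mktemp)
cat > "$setup" <<'EOF'
INSTALL_TYPE=TYPICAL
DAEMON_OWNER=root
NODE_TYPE=TYPICAL
CRS_ACTIVE_VERSION=
EOF

cp "$setup" "${setup}.bak"    # keep a backup before editing

# Append the RDBMS home entry only if none exists yet (idempotent);
# the path below is from my system - adjust it to your ORACLE_HOME.
grep -q '^RDBMS_ORACLE_HOME=' "$setup" || \
  echo 'RDBMS_ORACLE_HOME=/u01/app/oracle/product/19.0.0/dbhome_1||' >> "$setup"

grep '^RDBMS_ORACLE_HOME' "$setup"
```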

I then restarted the TFA service (systemctl restart oracle-tfa) and, voilà, ended up with a working OSWatcher configuration:

[oracle@server4 ~]$ tfactl toolstatus

.------------------------------------------------------------------.
|                   TOOLS STATUS - HOST : server4                  |
+----------------------+--------------+--------------+-------------+
| Tool Type            | Tool         | Version      | Status      |
+----------------------+--------------+--------------+-------------+
| Development Tools    | orachk       |   19.3.0.0.0 | DEPLOYED    |
|                      | oratop       |       14.1.2 | DEPLOYED    |
+----------------------+--------------+--------------+-------------+
| Support Tools Bundle | darda        | 2.10.0.R6036 | DEPLOYED    |
|                      | oswbb        |        8.3.2 | RUNNING     |
|                      | prw          | 12.1.13.11.4 | NOT RUNNING |
+----------------------+--------------+--------------+-------------+
| TFA Utilities        | alertsummary |   19.3.0.0.0 | DEPLOYED    |
|                      | calog        |   19.3.0.0.0 | DEPLOYED    |
|                      | dbcheck      |   18.3.0.0.0 | DEPLOYED    |
|                      | dbglevel     |   19.3.0.0.0 | DEPLOYED    |
|                      | grep         |   19.3.0.0.0 | DEPLOYED    |
|                      | history      |   19.3.0.0.0 | DEPLOYED    |
|                      | ls           |   19.3.0.0.0 | DEPLOYED    |
|                      | managelogs   |   19.3.0.0.0 | DEPLOYED    |
|                      | menu         |   19.3.0.0.0 | DEPLOYED    |
|                      | param        |   19.3.0.0.0 | DEPLOYED    |
|                      | ps           |   19.3.0.0.0 | DEPLOYED    |
|                      | pstack       |   19.3.0.0.0 | DEPLOYED    |
|                      | summary      |   19.3.0.0.0 | DEPLOYED    |
|                      | tail         |   19.3.0.0.0 | DEPLOYED    |
|                      | triage       |   19.3.0.0.0 | DEPLOYED    |
|                      | vi           |   19.3.0.0.0 | DEPLOYED    |
'----------------------+--------------+--------------+-------------'

Note :-
  DEPLOYED    : Installed and Available - To be configured or run interactively.
  NOT RUNNING : Configured and Available - Currently turned off interactively.
  RUNNING     : Configured and Available.

[oracle@server4 ~]$ ps -ef | grep -i osw
oracle   19770     1  0 19:43 ?        00:00:00 /bin/sh ./OSWatcher.sh 30 48 NONE /opt/oracle.ahf/data/repository/suptools/server4/oswbb/oracle/archive
oracle   19944 19770  0 19:44 ?        00:00:00 /bin/sh ./OSWatcherFM.sh 48 /opt/oracle.ahf/data/repository/suptools/server4/oswbb/oracle/archive
oracle   22672 19770  0 19:47 ?        00:00:00 /bin/sh ./vmsub.sh /opt/oracle.ahf/data/repository/suptools/server4/oswbb/oracle/archive/oswvmstat/server4_vmstat_20.04.23.1900.dat vmstat 1 3 1
oracle   22674 19770  0 19:47 ?        00:00:00 /bin/sh ./mpsub.sh /opt/oracle.ahf/data/repository/suptools/server4/oswbb/oracle/archive/oswmpstat/server4_mpstat_20.04.23.1900.dat mpstat -P ALL 1 2 1
oracle   22682 19770  0 19:47 ?        00:00:00 /bin/sh ./iosub.sh /opt/oracle.ahf/data/repository/suptools/server4/oswbb/oracle/archive/oswiostat/server4_iostat_20.04.23.1900.dat iostat -xk 1 3 1
oracle   22686 19770  0 19:47 ?        00:00:00 /bin/sh ./piddsub.sh /opt/oracle.ahf/data/repository/suptools/server4/oswbb/oracle/archive/oswpidstatd/server4_pidstatd_20.04.23.1900.dat  1
oracle   22692 19770  0 19:47 ?        00:00:00 /bin/sh ./pidsub.sh /opt/oracle.ahf/data/repository/suptools/server4/oswbb/oracle/archive/oswpidstat/server4_pidstat_20.04.23.1900.dat pidstat 1
oracle   22695 19770  0 19:47 ?        00:00:00 /bin/sh ./nfssub.sh /opt/oracle.ahf/data/repository/suptools/server4/oswbb/oracle/archive/oswnfsiostat/server4_nfsiostat_20.04.23.1900.dat nfsiostat 1 3 1
oracle   22705 19770  0 19:47 ?        00:00:00 /bin/sh ./xtop.sh /opt/oracle.ahf/data/repository/suptools/server4/oswbb/oracle/archive/oswtop/server4_top_20.04.23.1900.dat 1
oracle   22730 13566  0 19:47 pts/3    00:00:00 grep --color=auto -i osw
[oracle@server4 ~]$  
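For reference, the positional arguments on the OSWatcher.sh command line above decode as follows, to the best of my knowledge (the OSWatcher user guide on My Oracle Support is the authoritative source):

```shell
# ./OSWatcher.sh 30 48 NONE <archive_dir>
#   30            - snapshot interval in seconds
#   48            - hours of archive data to retain
#   NONE          - compression utility for the archive files (none here)
#   <archive_dir> - destination directory for the collected data
```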

I don’t know what went wrong with my environment, but I do know this happens consistently with my single-instance deployments. I’ll raise a service request with Oracle Support to see if they can get the problem fixed.

Happy Troubleshooting!