Using OSWatcher for system diagnostics
Posted by Martin Bach on June 14, 2016
OSWatcher is a superb tool that gathers information about your system in the background and stores it in an (optionally compressed) archive directory. As an Oracle DBA I like the analogy with Statspack: you make the tool available on the host in a location with (very important) enough available disk space and then start it. Most users add it to the startup mechanism their O/S uses (SysV init, upstart, or systemd on Linux, for example) so that it starts in the background at boot. OSWatcher will then gather a lot of the interesting O/S related statistics that you so desperately need in an “after the fact” situation. There are plenty of situations where you might want that information.
Example use cases
Say for example the system experienced an outage: a RAC node rebooted. The big question is “why?”. There might be information in the usual Grid Infrastructure logs, but they might be inconclusive and a look at the O/S can be necessary. The granularity SAR offers is often not enough: you are bound to lose detail in the 10 minute average. What if you had 30 second granularity to see what happened more or less right before the crash?
Or maybe you want to see the system’s activity during testing for a new software release? And then compare to yesterday’s test cycle? The use cases are endless, and I’m sure you can think of one, too.
ExaWatcher as the role model
In Expert Oracle Exadata (both editions) the authors covered ExaWatcher quite extensively. ExaWatcher is (not really surprisingly) OSWatcher for Exadata, or at least sort of. And it’s there by default on cells and compute nodes. I have always taken the stance that if something is implemented in Exadata then someone must have spent a few brain cells working out why it’s a good idea to deploy that tool to every Exadata in the world. That’s a big responsibility to have and you’d hope that there was some serious thinking behind decisions to deploy such a tool.
I deploy OSWatcher by default on every VM I create in my lab. It’s part of the automated build process I use. That ensures that I’ll have a log of recent O/S information whenever I need it. Again the analogy to Statspack: just like OSWatcher, Statspack isn’t installed by default and has to be set up manually. In so many cases, however, it was deployed after the problem occurred. It could be that last night’s problems don’t re-appear in the following execution cycle. It is better to have OSWatcher deployed and recording information proactively.
Deploying OSWatcher is dead simple: get it from MOS and copy the tarball to the host to monitor. In this post I am using Oracle Linux 7.1, at first without the compatibility tools for the network stack. I customarily use the minimal installation that doesn’t come with net-tools. What are the net-tools again? They are the tools we have come to love over the years:
[oracle@rac12node2 ~]$ yum info net-tools
Available Packages
Name        : net-tools
Arch        : x86_64
Version     : 2.0
Release     : 0.17.20131004git.el7
Size        : 303 k
Repo        : local
Summary     : Basic networking tools
URL         : http://sourceforge.net/projects/net-tools/
Licence     : GPLv2+
Description : The net-tools package contains basic networking tools,
            : including ifconfig, netstat, route, and others.
            : Most of them are obsolete. For replacement check iproute package.
But they are obsolete. You could argue that now is the time to get used to iproute as it’s the future, but OSWatcher on OL 7 needs net-tools, so I installed them. If you don’t, OSWatcher won’t abort, but it can’t collect the information netstat and ifconfig provide.
In case you are using RAC you need to make OSWatcher aware of the private interconnect; this is detailed in the OSWatcher user’s guide.
Once the tarball is unpacked in a location of your choice you need to start OSWatcher. The script startOSWbb.sh is responsible for starting the tool, and it can take a number of arguments that are explained both in the shell script itself and in the user guide available from MOS. The first parameter sets the snapshot interval in seconds. The second sets the archive duration in hours (i.e. how long the information should be kept), and the optional third names the compression tool to use for the raw data. Parameters 2 and 3 have direct implications for space usage. The optional fourth parameter sets a custom location for the archive directory that will contain the collected data.
I want to have 30 second snapshot intervals and retain data for 2 days, or 48 hours. Be aware that using a tool such as compress or gzip as the third argument will save space by compressing older files, but when you want to use the analyser (more on that later) you’ll have to decompress the files first. Clever use of the find(1) command can help you uncompress only the raw data files you need.
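As an illustration of the find(1) idea, here is a minimal sketch that decompresses only the gzip'ed files last modified within a given time window. The function name osw_gunzip_window is made up for this post and is not part of OSWatcher, and selecting files by modification time (via GNU find's -newermt test) is an assumption of the sketch; you could just as well match on the timestamps in the file names.

```shell
#!/bin/sh
# Hypothetical helper, not part of OSWatcher: decompress only the gzip'ed
# archive files whose modification time lies between two timestamps.
# Assumes GNU find (for -newermt) and that mtime is a good enough selector.
osw_gunzip_window() {
    archive_dir=$1
    window_start=$2
    window_end=$3
    find "$archive_dir" -type f -name '*.gz' \
        -newermt "$window_start" ! -newermt "$window_end" \
        -exec gunzip {} +
}

# Example call (path and times are placeholders):
# osw_gunzip_window archive "2015-12-16 10:00:00" "2015-12-16 11:30:00"
```

A file is last written when its final snapshot lands, so it is sensible to be generous with the end of the window.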
WARNING: Depending on the system you monitor and the amount of activity you might end up generating quite a lot of data. And by a lot I mean it! The location you choose for storing the archive directory must be independent of anything important. In other words, should the location fill up despite all the efforts you put in to prevent that from happening, it must not have an impact on availability. It is imperative to keep a keen eye on space usage. You certainly don’t want to fill up important mount points with performance information.
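One way of keeping that keen eye on space usage is a small check you can run from cron or your monitoring tool of choice. The following is only a sketch: the function names and the idea of a percentage threshold are my own assumptions, not something OSWatcher provides.

```shell
#!/bin/sh
# Hypothetical helper: print the percentage of space used on the file
# system that holds the given directory, based on POSIX "df -P" output.
osw_fs_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical check: warn and return non-zero when the file system holding
# the directory is at or above the given threshold (in percent).
osw_check_space() {
    dir=$1
    threshold_pct=$2
    used_pct=$(osw_fs_used_pct "$dir")
    if [ "$used_pct" -ge "$threshold_pct" ]; then
        echo "WARNING: $dir is on a file system that is ${used_pct}% full"
        return 1
    fi
    return 0
}

# Example call (path and threshold are placeholders):
# osw_check_space /some/mount/point/oswbb/archive 80
```

The sketch reads the second line of the df output, so it assumes the file system name contains no spaces.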
Starting OSWatcher from the command line
With that said it’s time to start the tool:
[oracle@rac12node1 oswbb]$ ./startOSWbb.sh 30 48 gzip
[oracle@rac12node1 oswbb]$ Info...Zip option IS specified.
Info...OSW will use gzip to compress files.
Setting the archive log directory to/some/mount/point/oswbb/archive
Testing for discovery of OS Utilities...
VMSTAT found on your system.
IOSTAT found on your system.
MPSTAT found on your system.
IFCONFIG found on your system.
NETSTAT found on your system.
TOP found on your system.
Warning... /proc/slabinfo not found on your system.
Testing for discovery of OS CPU COUNT
oswbb is looking for the CPU COUNT on your system
CPU COUNT will be used by oswbba to automatically look for cpu problems
CPU COUNT found on your system.
CPU COUNT = 2
Discovery completed.
Starting OSWatcher v7.3.3 on Wed Dec 16 10:02:49 GMT 2015
With SnapshotInterval = 30
With ArchiveInterval = 48
OSWatcher - Written by Carl Davis, Center of Expertise, Oracle Corporation
For questions on install/usage please go to MOS (Note:301137.1)
...
Data is stored in directory: /some/mount/point/oswbb/archive
Starting Data Collection...
oswbb heartbeat:Wed Dec 16 10:02:54 GMT 2015
Launching OSWatcher from the command line is probably the exception; most users will start OSW during the boot process as part of the multi-user runlevel.
Note that if you haven’t installed net-tools on Oracle Linux 7 you will see errors when starting OSWatcher as it can’t find netstat and ifconfig. If memory serves me right you get net-tools by default with previous releases of Oracle Linux, so this is OL 7 specific.
You will also notice a warning that /proc/slabinfo was not found. The file does exist, but its permissions have changed to 0400 and it is owned by root:root, so the oracle user cannot read it. Not having slab information is not a problem for me; I always struggle to make sense of it anyway.
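If you would rather find out about missing utilities before starting OSWatcher, a quick pre-flight check is easy to put together. The function below is a sketch of my own and not part of OSWatcher; the list of tools is simply the one reported by the discovery phase in the output above.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print every named command that cannot be
# found on the PATH, one per line. Prints nothing if all are present.
osw_missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The utilities OSWatcher looked for in the discovery output above
osw_missing_tools vmstat iostat mpstat ifconfig netstat top
```

On a minimal Oracle Linux 7 installation without net-tools I would expect this to list ifconfig and netstat.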
What it does
The OSWatcher daemon now happily monitors my system and places information into the archive directory:
[oracle@rac12node1 oswbb]$ ls -l archive/
total 40
drwxr-xr-x. 2 oracle oinstall 4096 Dec 16 10:02 oswifconfig
drwxr-xr-x. 2 oracle oinstall 4096 Dec 16 10:02 oswiostat
drwxr-xr-x. 2 oracle oinstall 4096 Dec 16 10:02 oswmeminfo
drwxr-xr-x. 2 oracle oinstall 4096 Dec 16 10:02 oswmpstat
drwxr-xr-x. 2 oracle oinstall 4096 Dec 16 10:02 oswnetstat
drwxr-xr-x. 2 oracle oinstall 4096 Dec 16 10:02 oswprvtnet
drwxr-xr-x. 2 oracle oinstall 4096 Dec 16 10:02 oswps
drwxr-xr-x. 2 oracle oinstall 4096 Dec 16 10:02 oswslabinfo
drwxr-xr-x. 2 oracle oinstall 4096 Dec 16 10:02 oswtop
drwxr-xr-x. 2 oracle oinstall 4096 Dec 16 10:02 oswvmstat
There is a directory per tool used: ifconfig, iostat, meminfo, and so on through to vmstat. Inside each directory you find that tool’s output. The file format is plain text, and each snapshot is introduced by a line starting with zzz. If you specified gzip or compress as the third parameter to the start script, all files except the current one are compressed. Here is an example for iostat taken while swingbench was running.
...
zzz ***Wed Dec 16 10:32:30 GMT 2015
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.23    0.00    7.45   76.06    0.00    4.26
Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
vda          0.00     0.00     0.00     2.00     0.00     4.00     4.00     0.24   128.00     0.00   128.00   121.00    24.20
vdb          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vdc          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vdd          0.00     0.00     2.00     1.00     1.00     0.50     1.00     0.11    36.67    33.00    44.00    36.67    11.00
vde          0.00     0.00     2.00     1.00     1.00     0.50     1.00     0.04    14.67     0.00    44.00    14.67     4.40
vdf          0.00     0.00     2.00     1.00     1.00     0.50     1.00     0.10    31.67    25.50    44.00    31.67     9.50
vdg          0.00     0.00    43.00    23.00   400.00    97.00    15.06     3.83    57.05    62.37    47.09    15.15   100.00
vdh          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
dm-0         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
dm-1         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.24     0.00     0.00     0.00     0.00    24.20
dm-2         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vdi          0.00     0.00    39.00    23.00   320.00    78.50    12.85     1.11    17.40     2.85    42.09    12.39    76.80
zzz ***Wed Dec 16 10:33:00 GMT 2015
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.58    0.00    7.94   81.48    0.00    0.00
Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
vda          0.00    17.00     0.00     1.00     0.00    72.00   144.00     0.01     1.00     0.00     1.00    11.00     1.10
vdb          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vdc          0.00     0.00    12.00     2.00   192.00    16.00    29.71     0.13     9.57     8.25    17.50     9.57    13.40
vdd          0.00     0.00     2.00     2.00     1.00     0.50     0.75     0.03    14.50     8.50    20.50     7.25     2.90
vde          0.00     0.00     2.00     2.00     1.00     0.50     0.75     0.02    13.25     2.00    24.50     6.00     2.40
vdf          0.00     0.00     2.00     2.00     1.00     0.50     0.75     0.11    34.50    23.00    46.00    27.25    10.90
vdg          0.00     0.00    32.00    31.00   272.00   123.00    12.54     4.96    80.46    90.50    70.10    15.68    98.80
vdh          0.00     0.00     0.00     1.00     0.00     0.00     0.00     0.01    37.00     0.00    37.00     8.00     0.80
dm-0         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
dm-1         0.00     0.00     0.00    19.00     0.00    76.00     8.00     0.03     0.95     0.00     0.95     0.63     1.20
dm-2         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vdi          0.00     0.00    22.00    22.00   184.00    78.50    11.93     1.28    30.45     2.00    58.91    17.23    75.80
zzz ***Wed Dec 16 10:33:30 GMT 2015
...
As you can see there are snapshots every 30 seconds, just as requested. Let’s not focus on the horrendous I/O times: this is a virtualised RAC system on a host that struggles with a CPU bottleneck, and it does not have underlying SSD for the VMs…
You can navigate the archive directory and browse the files you are interested in. They all have timestamps in their names making it easy to identify each file’s contents.
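Because every snapshot starts with a zzz line, the raw files lend themselves to quick ad-hoc processing. As a rough sketch (the function name is made up, and it assumes the iostat layout shown above where %util is the last column), the following prints one line per snapshot for a single device:

```shell
#!/bin/sh
# Hypothetical helper: for one device, print the snapshot timestamp and the
# last column (%util in the iostat output shown above) of every snapshot.
# Usage: osw_device_util <uncompressed oswiostat file> <device>
osw_device_util() {
    awk -v dev="$2" '
        # Remember the timestamp of the snapshot we are currently in
        /^zzz / { ts = $0; sub(/^zzz \*\*\*/, "", ts); next }
        # Print timestamp and last field for the requested device
        $1 == dev { print ts, $NF }
    ' "$1"
}
```

Run against the two complete snapshots shown earlier for device vdg, this would print "Wed Dec 16 10:32:30 GMT 2015 100.00" followed by "Wed Dec 16 10:33:00 GMT 2015 98.80". Compressed files need to be decompressed first.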
The OSWatcher Analyser
Looking at text files is one way of digesting information. There is another option, the OSWatcher analyser. Here is an example of its use (it requires the DISPLAY variable to be set):
[oracle@rac12node1 oswbb]$ java -jar oswbba.jar -i archive -b "Dec 16 10:25:00 2015" \
> -e "Dec 16 10:35:00 2015" -P swingbench -s
Validating times in the archive...
Scanning file headers for version and platform info...
Parsing file rac12node1_iostat_188.8.131.520.dat ...
Parsing file rac12node1_vmstat_184.108.40.2060.dat ...
Parsing file rac12node1_netstat_220.127.116.110.dat ...
Parsing file rac12node1_top_18.104.22.1680.dat ...
Parsing file rac12node1_ps_22.214.171.1240.dat ...
A new analysis file analysis/rac12node1_1450262374129.txt has been created.
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Run_Queue.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Block_Queue.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Cpu_Idle.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Cpu_System.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Cpu_User.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Cpu_Wa.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Cpu_Interrupts.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Context_Switches.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Memory_Swap.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Memory_Free.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_Memory_Page_In_Rate.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_IO_ST.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_IO_RPS.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_IO_WPS.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_IO_PB.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_IO_PBTP_1.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_IO_PBTP_2.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_IO_PBTP_3.gif
Generating file profile/rac12node1_swingbench/OSW_profile_files/OSWg_OS_IO_TPS.gif
[oracle@rac12node1 oswbb]$
In this example I asked the analyser tool to limit the search to specific time ranges and to create a “profile”. Refer to the MOS note about the OSWatcher Analyser for more information about the command line options.
The analyser will create an HTML “Profile” which is then stored in the profile directory, as you can see in the last part of the output. If you transfer this to a system with a web browser you can enjoy the graphical representation of the raw data. Very neat if you want to check at the O/S level whether anything unusual might have happened.
Note also that a new analysis file was created. Have a look at it as it can provide very valuable information. In my example the following contents were recorded:
############################################################################
# Contents Of This Report:
#
# Section 1: System Status
# Section 2: System Slowdowns
# Section 2.1: System Slowdown RCA Process Level Ordered By Impact
# Section 3: System General Findings
# Section 4: CPU Detailed Findings
# Section 4.1: CPU Run Queue:
# Section 4.2: CPU Utilization: Percent Busy
# Section 4.3: CPU Utilization: Percent Sys
# Section 5: Memory Detailed Findings
# Section 5.1: Memory: Process Swap Queue
# Section 5.2: Memory: Scan Rate
# Section 5.3 Memory: Page In:
# Section 5.4 Memory: Page Tables (Linux only):
# Section 5.5: Top 5 Memory Consuming Processes Beginning
# Section 5.6: Top 5 Memory Consuming Processes Ending
# Section 6: Disk Detailed Findings
# Section 6.1: Disk Percent Utilization Findings
# Section 6.2: Disk Service Times Findings
# Section 6.3: Disk Wait Queue Times Findings
# Section 6.4: Disk Throughput Findings
# Section 6.5: Disk Reads Per Second
# Section 6.6: Disk Writes Per Second
# Section 6.7: Disk Percent CPU waiting on I/O
# Section 7: Network Detailed Findings
# Section 7.1 Network Data Link Findings
# Section 7.2: Network IP Findings
# Section 7.3: Network UDP Findings
# Section 7.4: Network TCP Findings
# Section 8: Process Detailed Findings
# Section 8.1: PS Process Summary Ordered By Time
# Section 8.2: PS for Processes With Status = D or T Ordered By Time
# Section 8.3: PS for (Processes with CPU > 0) When System Idle CPU < 30% Ordered By Time
# Section 8.4: Top VSZ Processes Increasing Memory Per Snapshot
# Section 8.5: Top RSS Processes Increasing Memory Per Snapshot
#
############################################################################
OSWatcher is a great tool that can be used on Oracle Linux 7 and other supported platforms. It records lots of useful information and presents it in both textual and graphical format. It is invaluable in the many situations where you need to look back at the state of the O/S during a given period.
As always, read the documentation found on MOS and make sure you understand the implications of using the tool. Also make sure you test it thoroughly first. Please ensure that you have sufficient disk space for your archive directory on a mount point that cannot affect the availability of your system.