Martins Blog

Trying to explain complex things in simple terms

An introduction to collectl

Posted by Martin Bach on August 5, 2011

Some of you may have seen on twitter that I was working on understanding collectl. So why did I start with this? First of all, I was after a tool that records a lot of information on a Linux box. It can also play information back, but this is out of scope of this introduction.

In the past I have used nmon to do similar things, and still love it for what it does. Especially in conjunction with the nmon-analyzer, an Excel plug in it can create very impressive reports. How does collectl compare?

Getting collectl

Getting collectl is quite easy-get it from sourceforge: http://sourceforge.net/projects/collectl/

The project website including very good documentation is available from sourceforge as well, but uses a slightly different URL: http://collectl.sourceforge.net/

I suggest you get the archive-independent RPM and install it on your system. This is all you need to get started! The impatient could type “collectl” at the command prompt now to get some information. Let’s have a look at the output:

$ collectl
waiting for 1 second sample...
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
1   0  1163  10496    113     14     18      4      8     55      5      19
0   0  1046  10544      0      0      2      3    164    195     30      60
0   0  1279  10603    144      9    746    148     20     67     11      19
3   0  1168  10615    144      9    414     69     14     69      5      20
1   0  1121  10416    362     28    225     19     11     71      8      35
Ouch!

The “ouch” has been caused by my CTRL-c to stop the execution.

Collectl is organised to work by subsystems, the standard option is to print CPU, disk and network subsystem, aggregated.

If you don’t know what information you are after, you could use the –all flag to display aggregated information across all subsystems. Be warned that you need a large screen for all that output! For even more output, add the –verbose flag to the –all option and you need a 22” screen at least. The verbose flag prints more output, as the name suggests. For the disk subsystem you can view the difference:

$ collectl -sd -i 5 --verbose
waiting for 5 second sample...

# DISK SUMMARY (/sec)
#KBRead RMerged  Reads SizeKB  KBWrite WMerged Writes SizeKB
162     136     10     15      187      30     19      9
109      24      9     11      566     118     23     24
Ouch!
$ collectl -sd -i 5
waiting for 5 second sample...
#<----------Disks----------->
#KBRead  Reads KBWrit Writes
9865     73    190     23
Ouch!

Each subsystem can be queried individually, the default monitoring interval is 1 second. The man page for collectl lists the following subsystems:

SUMMARY SUBSYSTEMS

b - buddy info (memory fragmentation)
c - CPU
d - Disk
f - NFS V3 Data
i - Inode and File System
j - Interrupts
l - Lustre
m - Memory
n - Networks
s - Sockets
t - TCP
x - Interconnect
y - Slabs (system object caches)

As the name suggests, these sub systems provide summary information. Summaries are ok for a first overview, but don’t forget that information is aggregated and detail is lost.

From an Oracle point of view I’d probably be most interested in the CPU, disk and memory usage. If you are using RAC, network usage can also be interesting.

Detailed subsystem information is available for these (again taken from the excellen manual page):

C - CPU
D - Disk
E - Environmental data (fan, power, temp),  via ipmitool
F - NFS Data
J - Interrupts
L - Lustre OST detail OR client Filesystem detail
N - Networks
T - 65 TCP counters only available in plot format
X - Interconnect
Y - Slabs (system object caches)
Z - Processes

You can combine subsystems, and you can combine detail and summary information. Bear in mind though that this becomes a lot of information for a putty session of gnome-terminal!

In interactive mode, you might want to consider the –home flag, which does a top-like refresh and prints real time information without scrolling: very neat!

But even with the –-home option, digesting all that information visually can be a bit daunting, which leads me to my next section.

Generating graphical output

While all the textual information is all nice and good, it is difficult to visualise. Collectl can help you with that as well. All you need to do is generate a file in tab format, which is as simple as adding the –P and –f options. Since you can’t be overwhelmed with the information gathered in a file (unlike on standard out), you could use the detail switches. If you have the luxury, create the file with the information in a directory expored via samba and analyse it with Excel or other utilities. It’s possible to use gnuplot as well, but I found that a bit lacking for interactive use. The collectl-utils provide a CGI script to analyse collectl files on the host which can be convenient. Here is an example for measuring CPU, memory and all disks with a monitoring interval of 15 seconds. The file will be in “Plot” format (-P) and goes to /export/collectl/plotfiles:

$ collectl -sdmn -i 15 -P -f /export/collectl/plotfiles

Note that you can’t use the verbose flag here, and you also shouldn’t use a file name with the –f switch!

The resulting file is called hostname-yyyymmdd.tab. After renaming it to hostname-yyyymmdd.txt it can quite easily be imported using your favourite spreadsheet application. Imagine all the graphs you could produce with it! Also the header contains interesting information:

################################################################################
# Collectl:   V3.5.1-1  HiRes: 1  Options: -sdmn -i 15 -P -f /export/collectl/plotfiles
# Host:       node1 DaemonOpts:
# Distro:     Red Hat Enterprise Linux Server release 5.5 (Tikanga)  Platform:
# Date:       20110805-142647  Secs: 1312550807 TZ: +0100
# SubSys:     dmn Options: z Interval: 1 NumCPUs: 16  NumBud: 0 Flags: i
# Filters:    NfsFilt:  EnvFilt:
# HZ:         100  Arch: x86_64-linux-thread-multi PageSize: 4096
# Cpu:        AuthenticAMD Speed(MHz): 2210.190 Cores: 4  Siblings: 4
# Kernel:     2.6.18-194.el5  Memory: 65990460 kB  Swap: 16809976 kB
# NumDisks:   173 DiskNames: c0d0 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx sdy sdz sdaa sdab sdac sdad sdae sdaf sdag sdah sdai sdaj sdak sdal sdam sdan sdao sdap sdaq sdar sdas sdat sdau sdav sdaw sdax sday sdaz sdba sdbb sdbc sdbd sdbe sdbf sdbg sdbh sdbi sdbj sdbk sdbl sdbm sdbn sdbo sdbp sdbq sdbr sdbs sdbt sdbu sdbv sdbw sdbx sdby sdbz sdca sdcb sdcc sdcd sdce sdcf sdcg sdch sdci sdcj sdck sdcl sdcm sdcn sdco sdcp sdcq sdcr sdcs sdct sdcu sdcv sdcw sdcx sdcy sdcz sdda sddb sddc sddd sdde sddf sddg dm-0 dm-1 dm-2 dm-3 dm-4 dm-5 dm-6 dm-7 dm-8 dm-9 dm-10 dm-11 dm-12 dm-13 dm-14 dm-15 dm-16 dm-17 dm-18 dm-19 dm-20 dm-21 dm-22 dm-23 dm-24 dm-25 dm-26 dm-27 dm-28 dm-29 dm-30 dm-31 dm-32 dm-33 dm-34 dm-35 dm-36 dm-37 dm-38 dm-39 dm-40 dm-41 dm-42 dm-43 dm-44 dm-45 dm-46 dm-47 dm-48 dm-49 dm-50 dm-51 dm-52 dm-53 dm-54 dm-55 dm-56 dm-57 dm-58 dm-59 dm-60
# NumNets:    8 NetNames: lo: eth0: eth1: eth2: eth3: sit0: bond0: bond1:
# SCSI:       DA:0:00:00: ... DA:2:00:00:00
################################################################################

This should be enough to remind you of where you were running this test.

Run duration and interval

Use the –i flag to change the monitoring interval, this is the same as you’d do with SAR or iostat/vmstat and the like. You could then either use the –c option to count n samples, or alternatively use –R to run for n weeks, days, hours, minutes or seconds, each of which are abridged with their first letter. For example to run for 15 minutes with samples taken every 15 seconds, you’d say collectl –i 15 –R 15m.

Quick and dirty

If you need an interactive overview of what’s going on top-style, you can use the –top flag. This will print output very similar to the top command, but this time you have a lot more options to sort on. Use collectl –showtopops. This is so cool that I couldn’t help just listing the options here:

$ collectl --showtopopts
The following is a list of --top's sort types which apply to either
process or slab data.  In some cases you may be allowed to sort
by a field that is not part of the display if you so desire

TOP PROCESS SORT FIELDS

Memory
vsz    virtual memory
rss    resident (physical) memory

Time
syst   system time
usrt   user time
time   total time

I/O
rkb    KB read
wkb    KB written
iokb   total I/O KB

rkbc   KB read from pagecache
wkbc   KB written to pagecache
iokbc  total pagecacge I/O
ioall  total I/O KB (iokb+iokbc)

rsys   read system calls
wsys   write system calls
iosys  total system calls

iocncl Cancelled write bytes

Page Faults
majf   major page faults
minf   minor page faults
flt    total page faults

Miscellaneous (best when used with --procfilt)
cpu    cpu number
pid    process pid
thread total process threads (not counting main)

TOP SLAB SORT FIELDS

numobj    total number of slab objects
actobj    active slab objects
objsize   sizes of slab objects
numslab   number of slabs
objslab   number of objects in a slab
totsize   total memory sizes taken by slabs
totchg    change in memory sizes
totpct    percent change in memory sizes
name      slab names

Filtering information

Let’s say you are running multiple ASM disk groups in your system, but you are only interested in the performance of disk group DATA. The –sD flag will print all the information for all disks (LUNs) of the system. Collectl reports disks as the native devices and dm- devices. For multipathed devices you obviously want to look at the dm- device. You could use the multipath –ll command to map dm- device to WWIDs and your disks in the end. Let’s say you found out that the disks you need to look at are /dev/dm-{1,3,5,8} you could use the –dskfilt flag, which takes a perl regex. In my example, I could use the following command to check on those disks:

collectl -sD -c 1 --dskfilt "dm-(1\b|3\b|5\b|8\b)"
waiting for 1 second sample...

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
dm-1             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-3             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-5             0      0    0    0       0      0    1    1       0     0     0      0    0
dm-8             0      0    0    0       0      0    0    0       0     0     0      0    0
$

Note the “\b” boundary, which is my uneducated way to saying that the expression should match dm-1, but not 10, or anything else that extends beyond number one.

Additional filters you can apply can be found in the output of collectl –showsubopts as well as in section subsystem options in the manpage.

Summary

Used correctly, collectl is the swiss army knife for system monitoring, the level of detail which can be gathered is breathtaking. Thanks Mark Seger! And aplogies for all the good stuff I’ve been missing!

About these ads

6 Responses to “An introduction to collectl”

  1. Good article, Martin. I’ve relied heavily on collectl and promoted it widely within Oracle development during my tenure as Performance Architect in the Exadata development organization. We all love it…

    Will you be covering colplot ?

  2. Mark Seger said

    Very nice write up. You managed to cover a lot of the highlights in a fairly dense amount of space. One slight correction, and I’m guessing you just weren’t aware of this, is when you generate a tab file you don’t need to rename it to txt to load it into excel. You also didn’t mention colplot, but maybe that wasn’t your intent. Initially I used to use excel to plot the data but it was so painful I wrote colplot which is a web-based gui that lists a ton of canned plots with checkboxes and and you just point it to a directory that contains the plottable files. That makes it much easier, especially when trying to look at data across multiple systems.

    One other thing I’d suggest you check out if you haven’t already, is colmux. With it you can monitor multiple systems at once, and present the output sorted by a particular column. In fact, if you only monitor one system and look at all the processes (-sZ) or even disks (-sD), you can then display the output in ‘top-like’ format and see which disk is the slowest (or fastest) or which process is having the most page faults, using the most CPU or any other things you might like to look at. ;) It you install term-readkey, you can even change the sort column using the arrow keys. check it out, you’ll like it.

    -mark

  3. This is excellent. I normally use nmon with nmon-analyzer on AIX but this looks just as useful. Thanks Martin.

  4. [...] http://martincarstenbach.wordpress.com/2011/08/05/an-introduction-to-collectl/ [...]

  5. [...] – I have blogged about it before here and mention it for reference [...]

  6. […] itself producing numbers based on 8k reads. How do I know? Because I’m using collectl (http://martincarstenbach.wordpress.com/2011/08/05/an-introduction-to-collectl/) in another […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.

Join 2,082 other followers

%d bloggers like this: