If you look at some of the examples of how to use bpftrace – more specifically how to enable kprobes – you might wonder how to use the tool in more advanced cases. This article was prompted by my research into why biotop as part of the bcc tools initially reported 0 kb read/written on my Oracle Linux 8.4/UEK6 (5.4.17-2102.203.6.el8uek.x86_64) VM. I’ll try to show you how to work with kernel data structures in bpftrace
in a way similar to what biotop
does. In the end, it turned out I used an outdated version of biotop
.
Mandatory warning about using profilers and tracers
As with every tracer/profiler, using tools such as bpftrace
, the bcc toolset (or said other profiler/tracers) outside of lab/playground environments is strongly discouraged as it can unexpectedly change a process’s behaviour, slow it down or even crash the process under investigation This article is for educational purposes only and purely to be used for research on a lab VM where potential damage is limited to the individual experimenting with the tools. Never use this in production!
Strange: biotop reports an I/O size of 0 kb for all processes
While writing another post about LVM and its linear model of extent allocation I wanted to check which process on my system wrote to which block device. biotop
should be well suited to answer that question, yet while it did succeed in listing processes and their I/O against block devices, it fell short in reporting the amount of I/O performed in the version provided by my distribution:
[opc@oracle-19c-fs ~]$ sudo /usr/share/bcc/tools/biotop -C 5 2 Tracing... Output every 5 secs. Hit Ctrl-C to end 13:00:23 loadavg: 1.56 0.40 1.27 7/678 74843 PID COMM D MAJ MIN DISK I/O Kbytes AVGms 8080 ora_lg01_orcl W 8 16 sdb 699 0.0 0.52 74713 oracle_74713_or R 8 16 sdb 3 0.0 0.54 74476 kworker/u32:1 W 8 0 sda 1 0.0 1.18 1067 xfsaild/dm-0 W 8 0 sda 1 0.0 0.76 74711 oracle_74711_or R 8 16 sdb 7 0.0 0.53 70091 kworker/u32:3 W 8 0 sda 1 0.0 0.64 74721 oracle_74721_or R 8 16 sdb 4 0.0 0.80 74708 oracle_74708_or R 8 16 sdb 8 0.0 0.51 8074 ora_lg00_orcl W 8 16 sdb 2632 0.0 0.54 8062 ora_dbw1_orcl W 8 16 sdb 8660 0.0 0.54 8070 ora_ckpt_orcl W 8 48 sdd 1 0.0 0.90 73549 kworker/6:2 W 8 0 sda 1 0.0 1.13 74108 kworker/3:1 W 8 0 sda 1 0.0 0.79 8058 ora_dbw0_orcl W 8 16 sdb 8655 0.0 0.54 74717 oracle_74717_or R 8 16 sdb 2 0.0 0.55 74719 oracle_74719_or R 8 16 sdb 3 0.0 0.58 74715 oracle_74715_or R 8 16 sdb 3 0.0 0.51 8070 ora_ckpt_orcl W 8 16 sdb 1 0.0 0.73 74723 oracle_74723_or R 8 16 sdb 3 0.0 0.56 13:00:28 loadavg: 2.07 0.52 1.30 6/678 74843 PID COMM D MAJ MIN DISK I/O Kbytes AVGms 8080 ora_lg01_orcl W 8 16 sdb 713 0.0 0.51 2767 xfsaild/dm-2 W 8 64 sde 1 0.0 0.73 63332 kworker/5:2 W 8 64 sde 1 0.0 0.54 74713 oracle_74713_or R 8 16 sdb 1 0.0 0.61 74711 oracle_74711_or R 8 16 sdb 3 0.0 0.66 74721 oracle_74721_or R 8 16 sdb 3 0.0 0.53 74152 kworker/12:3 W 8 16 sdb 1 0.0 0.45 74708 oracle_74708_or R 8 16 sdb 3 0.0 0.54 8074 ora_lg00_orcl W 8 16 sdb 2622 0.0 0.54 8062 ora_dbw1_orcl W 8 16 sdb 8706 0.0 0.54 8070 ora_ckpt_orcl W 8 48 sdd 2 0.0 0.79 2994 xfsaild/dm-4 W 8 16 sdb 1 0.0 0.46 8058 ora_dbw0_orcl W 8 16 sdb 8715 0.0 0.54 74717 oracle_74717_or R 8 16 sdb 5 0.0 0.54 74719 oracle_74719_or R 8 16 sdb 4 0.0 0.53 74715 oracle_74715_or R 8 16 sdb 3 0.0 0.58 8070 ora_ckpt_orcl W 8 16 sdb 2 0.0 0.72 74723 oracle_74723_or R 8 16 sdb 3 0.0 0.48 Detaching...
Here you can see Swingench at work, it’s mainly the log writer and database writer busy performing I/O. I didn’t fail to notice that every single value in kbytes was 0. Yet clearly there is I/O going on:
[opc@oracle-19c-fs ~]$ iostat -xmz 5 Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs) 04/08/21 _x86_64_ (16 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 14.47 0.01 4.26 6.11 0.05 75.10 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s ... %util sda 1.63 1.12 0.05 0.03 0.03 0.42 ... 0.15 dm-0 1.59 0.95 0.04 0.02 0.00 0.00 ... 0.13 dm-1 0.02 0.58 0.00 0.01 0.00 0.00 ... 0.02 sdb 85.70 2903.19 0.94 25.58 0.00 0.22 ... 71.99 sdc 0.02 0.00 0.00 0.00 0.00 0.00 ... 0.00 sdd 5.67 287.03 0.02 2.27 0.00 0.84 ... 33.25 sde 0.50 0.12 0.02 0.00 0.00 0.03 ... 0.04 dm-2 0.49 0.15 0.02 0.00 0.00 0.00 ... 0.04 dm-3 0.01 0.00 0.00 0.00 0.00 0.00 ... 0.00 dm-4 91.35 3191.47 0.97 27.85 0.00 0.00 ... 72.02 avg-cpu: %user %nice %system %iowait %steal %idle 20.25 0.00 6.01 8.44 0.10 65.20 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s ... %util sda 0.00 0.20 0.00 0.00 0.00 0.00 ... 0.04 dm-0 0.00 0.20 0.00 0.00 0.00 0.00 ... 0.04 sdb 9.40 4108.80 0.08 37.29 0.00 0.20 ... 99.70 sdd 0.00 0.40 0.00 0.01 0.00 0.00 ... 0.08 dm-4 9.40 4109.80 0.08 37.29 0.00 0.00 ... 99.70 avg-cpu: %user %nice %system %iowait %steal %idle 20.85 0.00 6.00 5.75 0.06 67.34 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s ... %util sda 0.00 1.40 0.00 0.01 0.00 0.20 ... 0.06 dm-0 0.00 0.20 0.00 0.00 0.00 0.00 ... 0.02 dm-1 0.00 1.40 0.00 0.01 0.00 0.00 ... 0.04 sdb 9.80 4098.40 0.10 36.41 0.00 1.20 ... 99.86 sdd 0.00 1.20 0.00 0.02 0.00 0.00 ... 0.20 dm-4 9.80 4100.60 0.10 36.43 0.00 0.00 ... 99.84 ^C
Linux is writing roughly 37 MB/s to /dev/sdb
. A little investigation seemed in order.
Trying to understand biotop
The nice thing about open source software is that you can read the code. If you are unsure about the Linux storage stack like me, you might find the Linux storage stack diagram useful. If I understand the bcc/python interaction correctly (and there is a high probability that I might not!), biotop.py
written by Brendan Gregg attaches to the following kprobes:
- blk_account_io_start
- blk_start_request (if available)
- blk_mq_start_request
- blk_account_io_done
At the start of the blk-mq request (I’m using Oracle Linux 8 where blk-mq is enabled by default) the callback in biotop
reads the request data and stores it temporarily for later use. More specifically, the code uses req->__data_len
to get the size of the I/O request.
A kprobe allows me to dynamically attach a piece of code to a kernel function. To better understand the function I need to know more about it. But where do I find the function definition in the kernel code? As per the UEK readme I should probably look at branch uek6/master which is close enough to my kernel release. I found the call to blk_mq_start_request
on line 732 in block/blk-mq.c:
... /** * blk_mq_start_request - Start processing a request * @rq: Pointer to request to be started * * Function used by device drivers to notify the block layer that a request * is going to be processed now, so blk layer can do proper initializations * such as starting the timeout timer. */ void blk_mq_start_request(struct request *rq) { ...
That request pointer looks familiar, it’s the same as in biotop.py
. After a little bit of searching I found struct request
in include/linux/blkdev.h
beginning at line 127:
/* * Try to put the fields that are referenced together in the same cacheline. * * If you modify this structure, make sure to update blk_rq_init() and * especially blk_mq_rq_ctx_init() to take care of the added fields. */ struct request { struct request_queue *q; struct blk_mq_ctx *mq_ctx; struct blk_mq_hw_ctx *mq_hctx; unsigned int cmd_flags; /* op and common flags */ req_flags_t rq_flags; int tag; int internal_tag; /* the following two fields are internal, NEVER access directly */ unsigned int __data_len; /* total data len */ sector_t __sector; /* sector cursor */ struct bio *bio;
And yes, __data_len
as used in biotop.py
can be found there as well. Plus the warning to never access the variable directly ;)
Attaching to a kprobe with bpftrace
Since I’m not planning on writing code in bcc I decided to use bpftrace
for my experiment instead. It required a little bit of effort to work out how to access req->__data_len
, but it was fun. First of all, when using kprobes you don’t get the same luxury as with tracepoints and use args->some_arg
. You have to look at the function header and enumerate the arguments yourself.
Thankfully this was trivial in my case: there is only one argument to blk_mq_start_request
: struct request *rq
and therefore I need to use arg0
. If there were further arguments to blk_mq_start_request
they’d be arg1
to argN
.
Sadly though you can’t simply use arg0
in our case in a printf()
call in bpfrace
without further casting it to something useful, or you’ll get an error. If you want to access data structures like struct request
, you need to tell bpftrace
about them first. Arguments to kernel functions using primitive data types (int, long, etc) can be used straight away in printf()
. With that in mind, the following little script should do the trick:
[opc@oracle-19c-fs ~]$ cat blkmq.bt #include <linux/blkdev.h> kprobe:blk_mq_start_request { printf("comm: %s, data len: %d\n", comm, ((struct request *)arg0)->__data_len ); }
It’s actually as simple as it looks, the only issue I faced was making the struct known to bpftrace
. This is what happens:
- The
#include
directive tellsbpftrace
aboutstruct request
inlinux/blkdev.h
- The probe definition (just 1, I wanted to test the concept) is the same as in
biotop
firing onblk_mq_start_request()
- The action is a simple print command, reporting the process (
comm
is a built-in variable) and the__data_len
I shouldn’t access directly - It requires the struct definition as well as the cast to
struct request *
to get hold of the__data_len
field
Please remember the warnings about tracing/profiling and production use. Also be aware that kprobes are considered an unstable interface, the use of tracepoints is preferred.
The bpftrace
script fires far too many times; in other words it’s pretty useless apart from testing whether I can get the data length from the request pointer, but it proves the concept:
[opc@oracle-19c-fs ~]$ sudo bpftrace blkmq.bt Attaching 1 probe... comm: ora_dbw0_orcl, data len: 16384 comm: ora_dbw0_orcl, data len: 16384 comm: ora_dbw0_orcl, data len: 16384 comm: ora_dbw0_orcl, data len: 8192 comm: ora_dbw0_orcl, data len: 8192 comm: ora_dbw0_orcl, data len: 8192 comm: ora_dbw0_orcl, data len: 8192 comm: ora_dbw0_orcl, data len: 8192 comm: ora_dbw0_orcl, data len: 8192
And … voila! The probe fired, my printf()
statement showed the requested I/O size so something else must have been wrong with biotop
as shipped with bcc-tools-0.16.0-3.el8.x86_64
. And sure enough, after grabbing the latest bcc tools from github, biotop
showed proper values in the kbytes column.