Linux BPFtrace: slightly more advanced use cases involving kprobes

If you look at some of the examples of how to use bpftrace – more specifically how to enable kprobes – you might wonder how to use the tool in more advanced cases. This article was prompted by my research into why biotop as part of the bcc tools initially reported 0 kb read/written on my Oracle Linux 8.4/UEK6 (5.4.17-2102.203.6.el8uek.x86_64) VM. I’ll try to show you how to work with kernel data structures in bpftrace in a way similar to what biotop does. In the end, it turned out I used an outdated version of biotop.

Mandatory warning about using profilers and tracers

As with every tracer/profiler, using tools such as bpftrace, the bcc toolset (or said other profiler/tracers) outside of lab/playground environments is strongly discouraged as it can unexpectedly change a process’s behaviour, slow it down or even crash the process under investigation This article is for educational purposes only and purely to be used for research on a lab VM where potential damage is limited to the individual experimenting with the tools. Never use this in production!

Strange: biotop reports an I/O size of 0 kb for all processes

While writing another post about LVM and its linear model of extent allocation I wanted to check which process on my system wrote to which block device. biotop should be well suited to answer that question, yet while it did succeed in listing processes and their I/O against block devices, it fell short in reporting the amount of I/O performed in the version provided by my distribution:

[opc@oracle-19c-fs ~]$ sudo /usr/share/bcc/tools/biotop -C 5 2
Tracing... Output every 5 secs. Hit Ctrl-C to end

13:00:23 loadavg: 1.56 0.40 1.27 7/678 74843

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
8080   ora_lg01_orcl    W 8   16  sdb        699     0.0   0.52
74713  oracle_74713_or  R 8   16  sdb          3     0.0   0.54
74476  kworker/u32:1    W 8   0   sda          1     0.0   1.18
1067   xfsaild/dm-0     W 8   0   sda          1     0.0   0.76
74711  oracle_74711_or  R 8   16  sdb          7     0.0   0.53
70091  kworker/u32:3    W 8   0   sda          1     0.0   0.64
74721  oracle_74721_or  R 8   16  sdb          4     0.0   0.80
74708  oracle_74708_or  R 8   16  sdb          8     0.0   0.51
8074   ora_lg00_orcl    W 8   16  sdb       2632     0.0   0.54
8062   ora_dbw1_orcl    W 8   16  sdb       8660     0.0   0.54
8070   ora_ckpt_orcl    W 8   48  sdd          1     0.0   0.90
73549  kworker/6:2      W 8   0   sda          1     0.0   1.13
74108  kworker/3:1      W 8   0   sda          1     0.0   0.79
8058   ora_dbw0_orcl    W 8   16  sdb       8655     0.0   0.54
74717  oracle_74717_or  R 8   16  sdb          2     0.0   0.55
74719  oracle_74719_or  R 8   16  sdb          3     0.0   0.58
74715  oracle_74715_or  R 8   16  sdb          3     0.0   0.51
8070   ora_ckpt_orcl    W 8   16  sdb          1     0.0   0.73
74723  oracle_74723_or  R 8   16  sdb          3     0.0   0.56

13:00:28 loadavg: 2.07 0.52 1.30 6/678 74843

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
8080   ora_lg01_orcl    W 8   16  sdb        713     0.0   0.51
2767   xfsaild/dm-2     W 8   64  sde          1     0.0   0.73
63332  kworker/5:2      W 8   64  sde          1     0.0   0.54
74713  oracle_74713_or  R 8   16  sdb          1     0.0   0.61
74711  oracle_74711_or  R 8   16  sdb          3     0.0   0.66
74721  oracle_74721_or  R 8   16  sdb          3     0.0   0.53
74152  kworker/12:3     W 8   16  sdb          1     0.0   0.45
74708  oracle_74708_or  R 8   16  sdb          3     0.0   0.54
8074   ora_lg00_orcl    W 8   16  sdb       2622     0.0   0.54
8062   ora_dbw1_orcl    W 8   16  sdb       8706     0.0   0.54
8070   ora_ckpt_orcl    W 8   48  sdd          2     0.0   0.79
2994   xfsaild/dm-4     W 8   16  sdb          1     0.0   0.46
8058   ora_dbw0_orcl    W 8   16  sdb       8715     0.0   0.54
74717  oracle_74717_or  R 8   16  sdb          5     0.0   0.54
74719  oracle_74719_or  R 8   16  sdb          4     0.0   0.53
74715  oracle_74715_or  R 8   16  sdb          3     0.0   0.58
8070   ora_ckpt_orcl    W 8   16  sdb          2     0.0   0.72
74723  oracle_74723_or  R 8   16  sdb          3     0.0   0.48
Detaching...

Here you can see Swingench at work, it’s mainly the log writer and database writer busy performing I/O. I didn’t fail to notice that every single value in kbytes was 0. Yet clearly there is I/O going on:

[opc@oracle-19c-fs ~]$ iostat -xmz 5 
Linux 5.4.17-2102.203.6.el8uek.x86_64 (oracle-19c-fs) 	04/08/21 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.47    0.01    4.26    6.11    0.05   75.10

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ...  %util
sda              1.63    1.12      0.05      0.03     0.03     0.42  ...   0.15
dm-0             1.59    0.95      0.04      0.02     0.00     0.00  ...   0.13
dm-1             0.02    0.58      0.00      0.01     0.00     0.00  ...   0.02
sdb             85.70 2903.19      0.94     25.58     0.00     0.22  ...  71.99
sdc              0.02    0.00      0.00      0.00     0.00     0.00  ...   0.00
sdd              5.67  287.03      0.02      2.27     0.00     0.84  ...  33.25
sde              0.50    0.12      0.02      0.00     0.00     0.03  ...   0.04
dm-2             0.49    0.15      0.02      0.00     0.00     0.00  ...   0.04
dm-3             0.01    0.00      0.00      0.00     0.00     0.00  ...   0.00
dm-4            91.35 3191.47      0.97     27.85     0.00     0.00  ...  72.02

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          20.25    0.00    6.01    8.44    0.10   65.20

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ...  %util
sda              0.00    0.20      0.00      0.00     0.00     0.00  ...   0.04
dm-0             0.00    0.20      0.00      0.00     0.00     0.00  ...   0.04
sdb              9.40 4108.80      0.08     37.29     0.00     0.20  ...  99.70
sdd              0.00    0.40      0.00      0.01     0.00     0.00  ...   0.08
dm-4             9.40 4109.80      0.08     37.29     0.00     0.00  ...  99.70

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          20.85    0.00    6.00    5.75    0.06   67.34

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  ...  %util
sda              0.00    1.40      0.00      0.01     0.00     0.20  ...   0.06
dm-0             0.00    0.20      0.00      0.00     0.00     0.00  ...   0.02
dm-1             0.00    1.40      0.00      0.01     0.00     0.00  ...   0.04
sdb              9.80 4098.40      0.10     36.41     0.00     1.20  ...  99.86
sdd              0.00    1.20      0.00      0.02     0.00     0.00  ...   0.20
dm-4             9.80 4100.60      0.10     36.43     0.00     0.00  ...  99.84

^C

Linux is writing roughly 37 MB/s to /dev/sdb. A little investigation seemed in order.

Trying to understand biotop

The nice thing about open source software is that you can read the code. If you are unsure about the Linux storage stack like me, you might find the Linux storage stack diagram useful. If I understand the bcc/python interaction correctly (and there is a high probability that I might not!), biotop.py written by Brendan Gregg attaches to the following kprobes:

  • blk_account_io_start
  • blk_start_request (if available)
  • blk_mq_start_request
  • blk_account_io_done

At the start of the blk-mq request (I’m using Oracle Linux 8 where blk-mq is enabled by default) the callback in biotop reads the request data and stores it temporarily for later use. More specifically, the code uses req->__data_len to get the size of the I/O request.

A kprobe allows me to dynamically attach a piece of code to a kernel function. To better understand the function I need to know more about it. But where do I find the function definition in the kernel code? As per the UEK readme I should probably look at branch uek6/master which is close enough to my kernel release. I found the call to blk_mq_start_request on line 732 in block/blk-mq.c:

...

/**
 * blk_mq_start_request - Start processing a request
 * @rq: Pointer to request to be started
 *
 * Function used by device drivers to notify the block layer that a request
 * is going to be processed now, so blk layer can do proper initializations
 * such as starting the timeout timer.
 */
void blk_mq_start_request(struct request *rq)
{

...

That request pointer looks familiar, it’s the same as in biotop.py. After a little bit of searching I found struct request in include/linux/blkdev.h beginning at line 127:

/*
 * Try to put the fields that are referenced together in the same cacheline.
 *
 * If you modify this structure, make sure to update blk_rq_init() and
 * especially blk_mq_rq_ctx_init() to take care of the added fields.
 */
struct request {
	struct request_queue *q;
	struct blk_mq_ctx *mq_ctx;
	struct blk_mq_hw_ctx *mq_hctx;

	unsigned int cmd_flags;		/* op and common flags */
	req_flags_t rq_flags;

	int tag;
	int internal_tag;

	/* the following two fields are internal, NEVER access directly */
	unsigned int __data_len;	/* total data len */
	sector_t __sector;		/* sector cursor */

	struct bio *bio;

And yes, __data_len as used in biotop.py can be found there as well. Plus the warning to never access the variable directly ;)

Attaching to a kprobe with bpftrace

Since I’m not planning on writing code in bcc I decided to use bpftrace for my experiment instead. It required a little bit of effort to work out how to access req->__data_len, but it was fun. First of all, when using kprobes you don’t get the same luxury as with tracepoints and use args->some_arg. You have to look at the function header and enumerate the arguments yourself.

Thankfully this was trivial in my case: there is only one argument to blk_mq_start_request: struct request *rq and therefore I need to use arg0. If there were further arguments to blk_mq_start_request they’d be arg1 to argN.

Sadly though you can’t simply use arg0 in our case in a printf() call in bpfrace without further casting it to something useful, or you’ll get an error. If you want to access data structures like struct request, you need to tell bpftrace about them first. Arguments to kernel functions using primitive data types (int, long, etc) can be used straight away in printf(). With that in mind, the following little script should do the trick:

[opc@oracle-19c-fs ~]$ cat blkmq.bt 
#include <linux/blkdev.h>

kprobe:blk_mq_start_request {
	printf("comm: %s, data len: %d\n", comm, ((struct request *)arg0)->__data_len ); 
}

It’s actually as simple as it looks, the only issue I faced was making the struct known to bpftrace. This is what happens:

  • The #include directive tells bpftrace about struct request in linux/blkdev.h
  • The probe definition (just 1, I wanted to test the concept) is the same as in biotop firing on blk_mq_start_request()
  • The action is a simple print command, reporting the process (comm is a built-in variable) and the __data_len I shouldn’t access directly
  • It requires the struct definition as well as the cast to struct request * to get hold of the __data_len field

Please remember the warnings about tracing/profiling and production use. Also be aware that kprobes are considered an unstable interface, the use of tracepoints is preferred.

The bpftrace script fires far too many times; in other words it’s pretty useless apart from testing whether I can get the data length from the request pointer, but it proves the concept:

[opc@oracle-19c-fs ~]$ sudo bpftrace blkmq.bt 
Attaching 1 probe...
comm: ora_dbw0_orcl, data len: 16384
comm: ora_dbw0_orcl, data len: 16384
comm: ora_dbw0_orcl, data len: 16384
comm: ora_dbw0_orcl, data len: 8192
comm: ora_dbw0_orcl, data len: 8192
comm: ora_dbw0_orcl, data len: 8192
comm: ora_dbw0_orcl, data len: 8192
comm: ora_dbw0_orcl, data len: 8192
comm: ora_dbw0_orcl, data len: 8192

And … voila! The probe fired, my printf() statement showed the requested I/O size so something else must have been wrong with biotop as shipped with bcc-tools-0.16.0-3.el8.x86_64. And sure enough, after grabbing the latest bcc tools from github, biotop showed proper values in the kbytes column.

Advertisement