Last night while I was working on my POUG 2021 presentation I looked up what’s new with perf(1) since I last spoke about profiling and tracing Linux on-premises and in the cloud back in 2017. Unsurprisingly, quite a few things have changed.
A quick warning before we begin
As with every tracer/profiler, using tools such as perf trace(1) and strace(1) outside of lab/playground environments is discouraged as it can unexpectedly change a process's behaviour, slow it down or even crash the process under investigation. This article is for educational purposes only and is purely intended for research on a lab VM where potential damage is limited to the individual experimenting with the tools.
For these reasons you certainly don’t want to use
strace or any other ptrace(2)-based tool such as gdb(1) to figure out what’s going on in a production environment. People a lot smarter than me have written about that, for example Brendan Gregg: wow, so much syscall. It’s a great read, please have a look at the article.
Recently I came across a new option to perf trace, which seems to be a much better suited tool for this kind of research as it is far less intrusive.
Just to prove the point on Ubuntu 20.04.2 with kernel 5.8.0-63-generic, this is what it comes down to. The classic example showing the effect of
strace on processes involves running dd(1) but that’s been used so many times I thought of something else. Since I’m researching pread(2) a lot right now,
fio seems appropriate. When using the psync IO engine I should be able to record quite a few I/O-related system calls.
The following versions have been used when writing the article:
- Ubuntu 20.04.2 LTS
- Kernel 5.8.0-63-generic
- strace 5.5
- perf as per linux-tools-5.8.0-63-generic
- fio 3.16
I have to admit that fio 3.16 isn't the latest and greatest version, but it ships with Ubuntu, which makes it the convenient solution.
Creation of a baseline without profiling/tracing
The first test is to run
fio without profiling to establish a baseline:
$ fio --name=test1 --rw=randread --size=1000M --filename=/nvme/fio/test --ioengine=psync
test1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=52.5MiB/s][r=13.4k IOPS][eta 00m:00s]
test1: (groupid=0, jobs=1): err= 0: pid=12472: Wed Jul 28 21:36:45 2021
  read: IOPS=13.2k, BW=51.5MiB/s (53.9MB/s)(1000MiB/19436msec)
    clat (usec): min=62, max=2244, avg=74.47, stdev= 6.20
     lat (usec): min=62, max=2244, avg=74.63, stdev= 6.23
    clat percentiles (usec):
     |  1.00th=[   69],  5.00th=[   71], 10.00th=[   71], 20.00th=[   72],
     | 30.00th=[   73], 40.00th=[   74], 50.00th=[   75], 60.00th=[   75],
     | 70.00th=[   76], 80.00th=[   77], 90.00th=[   79], 95.00th=[   82],
     | 99.00th=[   93], 99.50th=[   94], 99.90th=[  105], 99.95th=[  118],
     | 99.99th=[  123]
   bw (  KiB/s): min=44568, max=53872, per=100.00%, avg=52684.42, stdev=1549.03, samples=38
   iops        : min=11142, max=13468, avg=13171.11, stdev=387.26, samples=38
  lat (usec)   : 100=99.84%, 250=0.15%, 500=0.01%, 750=0.01%
  lat (msec)   : 4=0.01%
  cpu          : usr=2.87%, sys=14.03%, ctx=256017, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=256000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=51.5MiB/s (53.9MB/s), 51.5MiB/s-51.5MiB/s (53.9MB/s-53.9MB/s), io=1000MiB (1049MB), run=19436-19436msec

Disk stats (read/write):
  nvme0n1: ios=255515/1, merge=0/0, ticks=16882/3, in_queue=16888, util=99.18%
$
Undoubtedly the above is one of the simplest uses of fio. I asked it to perform random reads on my NVMe device at a 4 KiB block size, with the IO engine instructed to use pread() calls to perform I/O. The following list summarises the important performance details:
- IOPS: 13,200
- Storage bandwidth: 51.5 MiB/s
- Completion time: 19436 milliseconds.
And no, this isn’t anywhere near the NVMe device’s performance but that’s not the point ;)
Interlude: the tricky bit about profiling/tracing fio worker processes
Unfortunately I can't simply trace the fio command as it creates what I'd like to call a "controller" process, for lack of a better word. The "controller" is attached to the pts (= terminal). There are further worker processes for each job. In my previous example I only ran one concurrent job, but you can ask fio to run multiple jobs in parallel by specifying --numjobs. Here is an example running the default number of jobs:
$ fio --name=test2 --rw=randread --size=1000M --filename=/nvme/fio/test --ioengine=psync
Using ps I can see 2 fio processes:
$ ps -ef | grep 'martin.*fio.*psync'
martin     15280    7841  3 22:10 pts/0    00:00:00 fio --name=test2 --rw=randread ...
martin     15294   15280 17 22:10 ?        00:00:01 fio --name=test2 --rw=randread ...
The one attached to pts/0 doesn’t really do any of the I/O work, it seems to be limited to communication with the worker(s).
The trick is to attach strace to the worker process as soon as it is started. I could have used strace -f fio ... to follow fio's child process creation, but then I'd end up with information about 2 processes in the trace when I really only wanted one.
Tracing fio execution using strace
This requires two terminal sessions: session 1 is used to start fio, session 2 hosts the strace command. For the record, this is the command I used to attach strace to the fio worker process:
$ sudo strace -cp $(pgrep -u martin fio | sort | tail -1)
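A quick note on the pgrep pipeline: since the controller is started before the worker, the worker normally carries the higher PID, and sorting the pgrep output and taking the last entry selects it. One caveat worth mentioning: a plain sort compares the PIDs as text, so sort -n is the safer choice when they differ in length. A minimal simulation using the two PIDs from the ps output above (15280 = controller, 15294 = worker):

```shell
# Simulate the pgrep output with the two PIDs seen earlier and pick the
# worker. 'sort -n' sorts numerically, which also works when the PIDs
# have a different number of digits (plain 'sort' compares as text).
printf '%s\n' 15280 15294 | sort -n | tail -1
# → 15294 (the worker's PID)
```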
The above command only works for a single worker process, as you can imagine. Let's have a look at the fio output after strace was attached to it:
$ fio --name=test2 --rw=randread --size=1000M --filename=/nvme/fio/test --ioengine=psync
test2: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=37.4MiB/s][r=9585 IOPS][eta 00m:00s]
test2: (groupid=0, jobs=1): err= 0: pid=12797: Wed Jul 28 22:36:45 2021
  read: IOPS=9457, BW=36.9MiB/s (38.7MB/s)(1000MiB/27069msec)
    clat (usec): min=72, max=226788, avg=103.92, stdev=563.06
     lat (usec): min=72, max=226789, avg=104.07, stdev=563.06
    clat percentiles (usec):
     |  1.00th=[   86],  5.00th=[   91], 10.00th=[   94], 20.00th=[   96],
     | 30.00th=[   98], 40.00th=[  100], 50.00th=[  101], 60.00th=[  103],
     | 70.00th=[  105], 80.00th=[  108], 90.00th=[  112], 95.00th=[  117],
     | 99.00th=[  133], 99.50th=[  139], 99.90th=[  172], 99.95th=[  223],
     | 99.99th=[  701]
   bw (  KiB/s): min= 1184, max=39872, per=99.98%, avg=37822.56, stdev=5116.95, samples=54
   iops        : min=  296, max= 9968, avg=9455.63, stdev=1279.23, samples=54
  lat (usec)   : 100=42.53%, 250=57.43%, 500=0.02%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 250=0.01%
  cpu          : usr=2.83%, sys=23.20%, ctx=768059, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=256000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=1000MiB (1049MB), run=27069-27069msec

Disk stats (read/write):
  nvme0n1: ios=254356/0, merge=0/0, ticks=17648/0, in_queue=17648, util=99.39%
Performance is down significantly compared to the first execution:
- 9,457 IOPS were recorded instead of 13,200
- Bandwidth dropped to 36.9 MiB/s from 51.5 MiB/s
- All work completed in 27069 instead of 19436 milliseconds.
Not quite the same dramatic reduction in performance you typically see with dd, but still significant. Did I mention that
strace slows things down?
Using perf trace
Following the same approach as just described, it is possible to perf trace the fio worker process:
$ sudo perf trace -s -p $(pgrep -u martin fio | sort | tail -1)
Here is the corresponding fio output while perf trace was busy recording what's been going on:
$ fio --name=test3 --rw=randread --size=1000M --filename=/nvme/fio/test --ioengine=psync
test3: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=45.9MiB/s][r=11.8k IOPS][eta 00m:00s]
test3: (groupid=0, jobs=1): err= 0: pid=12960: Wed Jul 28 22:40:12 2021
  read: IOPS=11.6k, BW=45.1MiB/s (47.3MB/s)(1000MiB/22160msec)
    clat (usec): min=65, max=264743, avg=84.96, stdev=523.12
     lat (usec): min=65, max=264744, avg=85.15, stdev=523.13
    clat percentiles (usec):
     |  1.00th=[   73],  5.00th=[   75], 10.00th=[   77], 20.00th=[   81],
     | 30.00th=[   82], 40.00th=[   84], 50.00th=[   85], 60.00th=[   85],
     | 70.00th=[   86], 80.00th=[   88], 90.00th=[   91], 95.00th=[   93],
     | 99.00th=[  105], 99.50th=[  113], 99.90th=[  125], 99.95th=[  131],
     | 99.99th=[  151]
   bw (  KiB/s): min=18656, max=47512, per=99.98%, avg=46200.75, stdev=4265.17, samples=44
   iops        : min= 4664, max=11878, avg=11550.18, stdev=1066.29, samples=44
  lat (usec)   : 100=98.46%, 250=1.54%, 500=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 500=0.01%
  cpu          : usr=2.93%, sys=22.58%, ctx=256053, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=256000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=45.1MiB/s (47.3MB/s), 45.1MiB/s-45.1MiB/s (47.3MB/s-47.3MB/s), io=1000MiB (1049MB), run=22160-22160msec

Disk stats (read/write):
  nvme0n1: ios=255885/0, merge=0/0, ticks=17328/0, in_queue=17329, util=99.49%
Let’s put each run into perspective:
| Metric    | Without tracing/profiling | Tracing: strace | Change vs baseline | Tracing: perf trace | Change vs baseline |
|-----------|---------------------------|-----------------|--------------------|---------------------|--------------------|
| Bandwidth | 51.5 MiB/s                | 36.9 MiB/s      | 72%                | 45.1 MiB/s          | 88%                |
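The percentages in the "change vs baseline" columns follow directly from the measured bandwidth figures; a quick awk one-liner confirms them:

```shell
# Recompute the "change vs baseline" column from the bandwidth figures
# (MiB/s) reported by the three fio runs above.
awk 'BEGIN {
  baseline = 51.5   # no tracing
  st       = 36.9   # strace attached
  pt       = 45.1   # perf trace attached
  printf "strace:     %.0f%% of baseline\n", st / baseline * 100
  printf "perf trace: %.0f%% of baseline\n", pt / baseline * 100
}'
# → strace:     72% of baseline
# → perf trace: 88% of baseline
```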
Please don't extrapolate anything from the figures in the table; I haven't run the benchmarks 10 times to get averages, I just played around a bit to get a rough understanding of the matter. The important takeaway is that tracing and profiling have an effect on performance, and can cause trouble up to and including crashing processes. See my earlier warning :)
I hope this blog post shows that tracing and profiling have an impact, as you can see in the above table. However, there are differences between the tools, with perf trace looking a lot better than strace. By the way, I investigated strace's lower-overhead mode as well, but it's not quite the same as perf trace as you have to pass the system calls you are interested in upfront. Attaching to running processes using the -p PID flag also doesn't work in my version of strace.
The measurement intrusion effect of strace reduces throughput and IOPS to 72% of what was recorded with the baseline. It also took 1.4 times longer to complete the fio benchmark. Using perf trace is better, but not perfect: there is a 12 percent drop in IOPS and bandwidth, and at the same time it takes roughly 14% longer to complete the fio benchmark.
By the way I found the effect of perf trace exacerbated in VMs. This makes sense when considering how virtualisation works. This article was written on actual physical hardware without any hypervisor in between the O/S and the hardware.
As a rule of thumb, the higher the number of system calls, the higher the effect of any tool investigating them.
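This workload makes that easy to quantify: fio issued 256,000 pread() calls (see "issued rwts" in the output above), so dividing the extra run time by the call count gives a back-of-the-envelope estimate of the cost each tracer adds per system call. This is only a rough figure derived from single runs, not a proper measurement:

```shell
# Per-syscall overhead estimate: extra run time (ms, converted to us)
# divided by the 256,000 pread() calls fio issued.
awk 'BEGIN {
  calls = 256000
  printf "strace:     ~%.0f us per traced syscall\n", (27069 - 19436) * 1000 / calls
  printf "perf trace: ~%.0f us per traced syscall\n", (22160 - 19436) * 1000 / calls
}'
# → strace:     ~30 us per traced syscall
# → perf trace: ~11 us per traced syscall
```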