Do you know perf trace? It’s an almost perfect replacement for strace

Last night while I was working on my POUG 2021 presentation I looked up what’s new with perf(1) since I last spoke about profiling and tracing Linux on-premises and in the cloud back in 2017. Unsurprisingly, quite a few things have changed.

A quick warning before we begin

As with every tracer/profiler, using tools such as perf trace(1) and strace(1) outside of lab/playground environments is discouraged as it can unexpectedly change a process’s behaviour, slow it down or even crash the process under investigation. This article is for educational purposes only and is purely intended for research on a lab VM where potential damage is limited to the individual experimenting with the tools.

For these reasons you certainly don’t want to use strace or any other ptrace(2)-based tool such as gdb(1) to figure out what’s going on in a production environment. People a lot smarter than me have written about that, for example Brendan Gregg: wow, so much syscall. It’s a great read, please have a look at the article.

Performance example

Recently I came across a perf subcommand that was new to me, perf trace, which seems much better suited for this kind of research as it appears to be far less intrusive.

Just to prove the point, this is what it comes down to on Ubuntu 20.04.2 with kernel 5.8.0-63-generic. The classic example showing the effect of strace on a process involves running dd(1), but that has been used so many times that I thought of something else. Since I’m researching pread(2) a lot right now, fio seems appropriate: when using the psync I/O engine I should be able to record quite a few I/O-related system calls.
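That assumption is easy to double-check before running the real benchmark. The quick test below is my own addition (the small 64 MiB size is arbitrary and the run is not part of the measurements that follow); strace’s counting mode should confirm that the psync engine boils down to pread64(2) on a 64-bit kernel. With -f the controller process is traced as well, but since it doesn’t issue pread64 calls the count is effectively the worker’s:

$ strace -f -c -e trace=pread64 fio --name=check --rw=randread --size=64M --filename=/nvme/fio/test --ioengine=psync > /dev/null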

The following versions have been used when writing the article:

  • Ubuntu 20.04.2 LTS
  • Kernel 5.8.0-63-generic
  • strace 5.5
  • perf as per linux-tools-5.8.0-63-generic
  • fio 3.16

I have to admit that fio 3.16 isn’t the latest and greatest version, but it ships with Ubuntu, making it the convenient choice.

Creation of a baseline without profiling/tracing

The first test is to run fio without profiling to establish a baseline:

$ fio --name=test1 --rw=randread --size=1000M --filename=/nvme/fio/test --ioengine=psync
test1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=52.5MiB/s][r=13.4k IOPS][eta 00m:00s]
test1: (groupid=0, jobs=1): err= 0: pid=12472: Wed Jul 28 21:36:45 2021
  read: IOPS=13.2k, BW=51.5MiB/s (53.9MB/s)(1000MiB/19436msec)
    clat (usec): min=62, max=2244, avg=74.47, stdev= 6.20
     lat (usec): min=62, max=2244, avg=74.63, stdev= 6.23
    clat percentiles (usec):
     |  1.00th=[   69],  5.00th=[   71], 10.00th=[   71], 20.00th=[   72],
     | 30.00th=[   73], 40.00th=[   74], 50.00th=[   75], 60.00th=[   75],
     | 70.00th=[   76], 80.00th=[   77], 90.00th=[   79], 95.00th=[   82],
     | 99.00th=[   93], 99.50th=[   94], 99.90th=[  105], 99.95th=[  118],
     | 99.99th=[  123]
   bw (  KiB/s): min=44568, max=53872, per=100.00%, avg=52684.42, stdev=1549.03, samples=38
   iops        : min=11142, max=13468, avg=13171.11, stdev=387.26, samples=38
  lat (usec)   : 100=99.84%, 250=0.15%, 500=0.01%, 750=0.01%
  lat (msec)   : 4=0.01%
  cpu          : usr=2.87%, sys=14.03%, ctx=256017, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=256000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=51.5MiB/s (53.9MB/s), 51.5MiB/s-51.5MiB/s (53.9MB/s-53.9MB/s), io=1000MiB (1049MB), run=19436-19436msec

Disk stats (read/write):
  nvme0n1: ios=255515/1, merge=0/0, ticks=16882/3, in_queue=16888, util=99.18%
$ 

Undoubtedly the above is one of the simplest uses of fio. I asked it to perform random reads against my NVMe device at a 4k block size, with the I/O engine instructed to use pread() calls to perform the I/O. The following list summarises the important performance details:

  • IOPS: 13,200
  • Storage bandwidth: 51.5 MiB/s
  • Completion time: 19,436 milliseconds.

And no, this isn’t anywhere near the NVMe device’s performance but that’s not the point ;)

Interlude: the tricky bit about profiling/tracing fio worker processes

Unfortunately I can’t simply strace the fio command as it creates what I’d like to call a “controller” process, for lack of a better word. The “controller” is attached to the pts (the pseudo-terminal). There is a further worker process for each job. In my previous example I only ran one concurrent job, but you can ask fio to run multiple jobs in parallel by specifying --numjobs. Here is an example running the default number of jobs:

$ fio --name=test2 --rw=randread --size=1000M --filename=/nvme/fio/test --ioengine=psync

Using ps I can see 2 fio processes:

$ ps -ef | grep 'martin.*fio.*psync'
martin     15280    7841  3 22:10 pts/0    00:00:00 fio --name=test2 --rw=randread ...
martin     15294   15280 17 22:10 ?        00:00:01 fio --name=test2 --rw=randread ...

The one attached to pts/0 doesn’t really do any of the I/O work; it seems to be limited to communication with the worker(s).
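If you prefer not to rely on PID ordering, the TTY column is another way to tell the two apart. The ps/awk combination below is a sketch of my own (not part of the original test runs) that prints only the PID(s) of the fio processes without a controlling terminal, i.e. the worker(s):

$ ps -u martin -o pid=,tty=,comm= | awk '$2 == "?" && $3 == "fio" { print $1 }'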

The trick is to attach strace to the worker process as soon as it is started. I could have used strace -f fio ... to follow fio’s child process creation but then I’d end up with information about 2 processes in the trace when I really only wanted one.

Tracing fio execution using strace

This requires two terminal sessions: session 1 is used to start fio, session 2 hosts the strace command. For the record, this is the command I used to attach strace to the fio worker process:

$ sudo strace -cp $(pgrep -u martin fio | sort | tail -1)

As you can imagine, the above command only works for a single worker process; a sketch for tracing multiple workers follows at the end of this section. Let’s have a look at the fio output after strace was attached to the worker:

$ fio --name=test2 --rw=randread --size=1000M --filename=/nvme/fio/test --ioengine=psync 
test2: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=37.4MiB/s][r=9585 IOPS][eta 00m:00s]
test2: (groupid=0, jobs=1): err= 0: pid=12797: Wed Jul 28 22:36:45 2021
  read: IOPS=9457, BW=36.9MiB/s (38.7MB/s)(1000MiB/27069msec)
    clat (usec): min=72, max=226788, avg=103.92, stdev=563.06
     lat (usec): min=72, max=226789, avg=104.07, stdev=563.06
    clat percentiles (usec):
     |  1.00th=[   86],  5.00th=[   91], 10.00th=[   94], 20.00th=[   96],
     | 30.00th=[   98], 40.00th=[  100], 50.00th=[  101], 60.00th=[  103],
     | 70.00th=[  105], 80.00th=[  108], 90.00th=[  112], 95.00th=[  117],
     | 99.00th=[  133], 99.50th=[  139], 99.90th=[  172], 99.95th=[  223],
     | 99.99th=[  701]
   bw (  KiB/s): min= 1184, max=39872, per=99.98%, avg=37822.56, stdev=5116.95, samples=54
   iops        : min=  296, max= 9968, avg=9455.63, stdev=1279.23, samples=54
  lat (usec)   : 100=42.53%, 250=57.43%, 500=0.02%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 250=0.01%
  cpu          : usr=2.83%, sys=23.20%, ctx=768059, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=256000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=1000MiB (1049MB), run=27069-27069msec

Disk stats (read/write):
  nvme0n1: ios=254356/0, merge=0/0, ticks=17648/0, in_queue=17648, util=99.39%

Performance is down significantly compared to the first execution:

  • 9,457 IOPS were recorded vs 13,200 in the baseline
  • Bandwidth dropped to 36.9 MiB/s from 51.5 MiB/s
  • All work completed in 27,069 instead of 19,436 milliseconds.

Not quite the same dramatic reduction in performance you typically see with dd, but still significant. Did I mention that strace slows things down?
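As promised earlier, here is a sketch of how multiple workers could be traced at once; this is my own, untested addition. strace(1) accepts a whitespace-separated PID list as the argument to -p, so dropping the lowest PID (the controller) and passing the remaining worker PIDs in one go should work:

$ sudo strace -c -p "$(pgrep -u martin fio | sort -n | tail -n +2 | tr '\n' ' ')"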

Using perf trace

Following the same approach as just described, it is possible to perf trace the fio worker:

$ sudo perf trace -s -p $(pgrep -u martin fio | sort | tail -1)

Here is the corresponding fio output (while perf trace was busy recording what’s been going on):

$ fio --name=test3 --rw=randread --size=1000M --filename=/nvme/fio/test --ioengine=psync 
test3: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=45.9MiB/s][r=11.8k IOPS][eta 00m:00s]
test3: (groupid=0, jobs=1): err= 0: pid=12960: Wed Jul 28 22:40:12 2021
  read: IOPS=11.6k, BW=45.1MiB/s (47.3MB/s)(1000MiB/22160msec)
    clat (usec): min=65, max=264743, avg=84.96, stdev=523.12
     lat (usec): min=65, max=264744, avg=85.15, stdev=523.13
    clat percentiles (usec):
     |  1.00th=[   73],  5.00th=[   75], 10.00th=[   77], 20.00th=[   81],
     | 30.00th=[   82], 40.00th=[   84], 50.00th=[   85], 60.00th=[   85],
     | 70.00th=[   86], 80.00th=[   88], 90.00th=[   91], 95.00th=[   93],
     | 99.00th=[  105], 99.50th=[  113], 99.90th=[  125], 99.95th=[  131],
     | 99.99th=[  151]
   bw (  KiB/s): min=18656, max=47512, per=99.98%, avg=46200.75, stdev=4265.17, samples=44
   iops        : min= 4664, max=11878, avg=11550.18, stdev=1066.29, samples=44
  lat (usec)   : 100=98.46%, 250=1.54%, 500=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 500=0.01%
  cpu          : usr=2.93%, sys=22.58%, ctx=256053, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=256000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=45.1MiB/s (47.3MB/s), 45.1MiB/s-45.1MiB/s (47.3MB/s-47.3MB/s), io=1000MiB (1049MB), run=22160-22160msec

Disk stats (read/write):
  nvme0n1: ios=255885/0, merge=0/0, ticks=17328/0, in_queue=17329, util=99.49%
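Before comparing the numbers, one more perf trace feature is worth mentioning: the -e option restricts tracing to a list of syscalls, which should reduce the overhead even further. A sketch, assuming (as above) that the psync engine shows up as pread64 on this 64-bit kernel; I haven’t included this variant in the comparison below:

$ sudo perf trace -s -e pread64 -p $(pgrep -u martin fio | sort | tail -1)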

Performance Summary

Let’s put each run into perspective:

Metric      Baseline    strace      vs baseline   perf trace   vs baseline
IOPS        13,200      9,457       72%           11,600       88%
Bandwidth   51.5 MiB/s  36.9 MiB/s  72%           45.1 MiB/s   88%
Run time    19,436 ms   27,069 ms   140%          22,160 ms    115%

Performance impact of strace and perf trace on a fio workload

Please don’t extrapolate anything from the figures in the table: I haven’t run the benchmarks ten times and averaged the results; I just played around a bit to get a rough understanding of the matter. The important takeaway is that tracing and profiling have an effect on performance and can cause trouble up to and including crashing processes. See my earlier warning :)

Summary

I hope this blog post has shown you that tracing and profiling have an impact, as you can see in the above table. However, there are differences between the tools: perf trace looks a lot better than strace. By the way, I also investigated strace’s --seccomp-bpf option, but it’s not quite the same as perf trace as you have to pass the system calls you are interested in upfront. Combining it with the -p PID flag to attach to running processes also doesn’t work in my version of strace.
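For reference, such a --seccomp-bpf invocation would look roughly like the sketch below (the job name test4 is hypothetical and this run is not part of the measurements above); since it cannot attach to a running worker, it has to start fio itself, together with -f and the list of syscalls to trace:

$ strace -f --seccomp-bpf -e trace=pread64 -c fio --name=test4 --rw=randread --size=1000M --filename=/nvme/fio/test --ioengine=psync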

The measurement intrusion effect of strace reduced throughput and IOPS to 72% of what was recorded with the baseline. It also took roughly 1.4 times as long to complete the fio benchmark. Using perf trace is better, but not perfect: there is a 12 percent drop in IOPS and bandwidth, and at the same time it takes 15% longer to complete the fio run.

By the way, I found the effect of perf trace to be exacerbated in VMs, which makes sense when considering how virtualisation works. This article was written on actual physical hardware, without a hypervisor between the O/S and the hardware.

As a rule of thumb, the higher the number of system calls, the higher the effect of any tool investigating them.
