Author: Wenbo Zhang (Linux Kernel Engineer of the EE team at PingCAP)
Transcreator: Tom Xiong; Editor: Tom Dewan
If you need to dynamically trace Linux process system calls, you might first consider strace. strace is simple to use and works well for issues such as “Why can't the software run on this machine?” However, if you're running a trace in a production environment, strace is NOT a good choice. It introduces a substantial amount of overhead. According to a performance test conducted by Arnaldo Carvalho de Melo, a senior software engineer at Red Hat, the process traced using strace ran 173 times slower, which is disastrous for a production environment.
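If you just need a quick look on a test machine, strace is indeed a one-liner. As a minimal sketch (with $PID standing in for the target process), the -c flag prints a per-syscall count and latency summary instead of every call, and -f follows child processes:

# Attach to a running process and summarize its system calls (Ctrl+C to stop)
strace -c -f -p $PID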
So are there any tools that excel at tracing system calls in a production environment? The answer is YES. This blog post introduces perf and traceloop, two commonly used command-line tools, to help you trace system calls in a production environment.
perf is a powerful Linux profiling tool, refined and upgraded by Linux kernel developers. In addition to common features such as analyzing Performance Monitoring Unit (PMU) hardware events and kernel events, perf provides a number of subcommands; the one this post focuses on is perf trace, which analyzes system calls. Let's look at some common uses of perf.
To see which commands made the most system calls, sample the raw_syscalls:sys_enter tracepoint (at 49 Hz in this example) and sort by command:
perf top -F 49 -e raw_syscalls:sys_enter --sort comm,dso --show-nr-samples
From the output, you can see that the kube-apiserver command had the most system calls during sampling.
To see system calls that have latencies longer than a specific duration. In the following example, this duration is 200 milliseconds:
perf trace --duration 200
From the output, you can see the process names, process IDs (PIDs), the specific system calls that exceed 200 ms, and the returned values.
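The duration filter composes with perf trace's other filters. For example, here is a sketch that watches only the read and write system calls of a single process (the syscall list is just an illustration):

perf trace -e read,write --duration 200 -p $PID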
To see a summary of the system calls a process made within a period of time and their overhead:
perf trace -p $PID -s
From the output, you can see the number of times each system call was made, the number of errors, the total latency, the average latency, and so on.
To analyze the call stacks of system calls that have high latency, record them with DWARF-based stack unwinding:
perf trace record --call-graph dwarf -p $PID -- sleep 10
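perf trace record writes its samples to a perf.data file in the current directory. One way to inspect the captured call stacks afterwards is perf script (a sketch; the exact output depends on your perf version):

# Dump the recorded samples together with their DWARF-unwound call stacks
perf script -i perf.data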
To trace a group of tasks. For example, two BPF tools are running in the background. To see their system call information, you can add them to a perf_event cgroup and then execute perf trace:
mkdir /sys/fs/cgroup/perf_event/bpftools/
echo 22542 >> /sys/fs/cgroup/perf_event/bpftools/tasks
echo 20514 >> /sys/fs/cgroup/perf_event/bpftools/tasks
perf trace -G bpftools -a -- sleep 10
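To double-check that both tasks joined the cgroup before tracing, you can read the tasks file back (the PIDs are the example ones above):

cat /sys/fs/cgroup/perf_event/bpftools/tasks
# 22542
# 20514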
Those are some of the most common uses of perf. If you'd like to know more (especially about perf-trace), see the Linux manual page. From the manual pages, you will learn that perf-trace can filter tasks based on PIDs or thread IDs (TIDs), but it has no convenient support for containers or Kubernetes (K8s) environments. Don't worry. Next, we'll discuss a tool that can easily trace system calls in containers and in K8s environments that use cgroup v2.
Traceloop provides better support for tracing Linux system calls in containers or K8s environments that use cgroup v2. You might be unfamiliar with traceloop but know the BPF Compiler Collection (BCC), whose front end is implemented in Python or C++, pretty well. The IO Visor Project, BCC's parent project, also hosts a project named gobpf that provides Golang bindings for the BCC framework. Traceloop is developed on top of gobpf for container and K8s environments. The following illustration shows the traceloop architecture:
We can further simplify this illustration into the following key procedures. Note that these procedures are implementation details, not operations to perform: the bpf_get_current_cgroup_id BPF helper gets the cgroup ID, and tasks are filtered based on that cgroup ID rather than on the PID and TID.

Note: Currently, you can get the cgroup ID only by calling the BPF helper bpf_get_current_cgroup_id, and this ID is available only in cgroup v2. Therefore, before you use traceloop, make sure that cgroup v2 is enabled in your environment.
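If you want to see this mechanism in action outside of traceloop, here is a minimal bpftrace sketch (not traceloop's actual code): bpftrace's cgroup builtin wraps bpf_get_current_cgroup_id, and cgroupid() resolves a cgroup v2 path to the ID it is compared against; the sshd path is only an example.

# Count system calls per command, but only for tasks in the given cgroup v2 path
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter /cgroup == cgroupid("/sys/fs/cgroup/system.slice/sshd.service")/ { @[comm] = count(); }'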
In the following demo (on CentOS 8 with the 4.18 kernel), traceloop dumps the traced system call information when it exits:
sudo -E ./traceloop cgroups --dump-on-exit /sys/fs/cgroup/system.slice/sshd.service
As the results show, the traceloop output is similar to that of strace or perf-trace, except for the cgroup-based task filtering. Note that CentOS 8 mounts cgroup v2 directly on the /sys/fs/cgroup path instead of on /sys/fs/cgroup/unified as Ubuntu does. Therefore, before you use traceloop, you should run mount -t cgroup2 to determine the mount information.
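For example, on a cgroup v2 system the mount check and the lookup of a service's cgroup path might look like this (the commented lines show typical output; details vary by distribution):

# Where is cgroup v2 mounted?
mount -t cgroup2
# cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

# Which cgroup does a systemd service (sshd here) belong to?
systemctl show -p ControlGroup sshd
# ControlGroup=/system.slice/sshd.service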
The team behind traceloop has integrated it with the Inspektor Gadget project, so you can run traceloop on the K8s platform using kubectl. See the demos in Inspektor Gadget's "How to use" guide and, if you like, try it on your own.
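As a rough sketch of what that looks like (the traceloop gadget's subcommands have changed across Inspektor Gadget releases, so treat these invocations as assumptions and check the project's documentation):

# List the traces Inspektor Gadget is collecting, then dump one of them
kubectl gadget traceloop list
kubectl gadget traceloop show <trace-id>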
We conducted a sysbench test in which system calls were either traced using multiple tracers (traceloop, strace, and perf-trace) or not traced. The benchmark results are as follows:
As the benchmark shows, strace caused the biggest decrease in application performance. perf-trace caused a smaller decrease, and traceloop caused the smallest.
For issues such as “Why can't the software run on this machine,” strace is still a powerful system call tracer in Linux. But to trace the latency of system calls, the BPF-based perf-trace is a better option. In containers or K8s environments that use cgroup v2, traceloop is the easiest to use.