Author: Ke'ao Yang (Software Engineer at PingCAP)
Transcreator: Tom Xiong; Editor: Tom Dewan
In a production environment, filesystem faults might occur due to various incidents such as disk failures and administrator errors. As a Chaos Engineering platform, Chaos Mesh has supported simulating I/O faults in a filesystem ever since its early versions. By simply adding an IOChaos CustomResourceDefinition (CRD), we can watch how the filesystem fails and returns errors.
However, before Chaos Mesh 1.0, this experiment was not easy and could consume a lot of resources. We needed to inject sidecar containers into the Pod through the mutating admission webhooks and rewrite the ENTRYPOINT command. Even if no fault was injected, the injected sidecar container caused a substantial amount of overhead.
Chaos Mesh 1.0 has changed all this. Now, we can use IOChaos to inject faults into a filesystem at runtime. This simplifies the process and greatly reduces system overhead. This blog post describes how we implement the IOChaos experiment without using a sidecar.
To simulate I/O faults at runtime, we need to inject faults into a filesystem after the program starts system calls (such as reads and writes) but before the call requests arrive at the target filesystem. We can do that in one of two ways:
But ChaosFS has several problems:
mnt namespace. For details, see the mount_namespaces(7) Linux manual page.
Before Chaos Mesh 1.0, we used the mutating admission webhook to implement IOChaos. This technique addressed the three problems listed above and allowed us to:
- Move the target directory (for example, from /mnt/a to /mnt/a_bak) so that we could mount ChaosFS to the target path (/mnt/a).
- Change the startup command, for example from /app to /waitfs.sh /app. The waitfs.sh script kept checking whether the filesystem was successfully mounted. If it was mounted, /app was started.
- Add a container to the Pod to run ChaosFS. This container shared a volume with the target container (for example, /mnt), and then we mounted this volume to the target directory (for example, /mnt/a). We also enabled mount propagation so that this volume's mount propagated from the ChaosFS container to the host (shared) and then from the host to the target container (slave).
These three approaches allowed us to inject I/O faults while the program was running. However, the injection was far from convenient:
- mv (rename) with mount move to move the mount point of the target volume.
- /waitfs.sh script could not properly start the program after the filesystem was mounted.
What about cracking these tough nuts without the mutating admission webhook? Let's step back and think about why we used the mutating admission webhook to add a container in which ChaosFS runs: we did that to mount the filesystem to the target container.
In fact, there is another solution. Instead of adding containers to the Pod, we can first use the setns Linux system call to modify the namespace of the current process and then use the mount call to mount ChaosFS to the target container. Suppose that the filesystem to inject is /mnt. The new injection process is as follows:
- Use setns for the current process to enter the mnt namespace of the target container.
- Use mount --move to move /mnt to /mnt_bak.
- Mount ChaosFS to /mnt and use /mnt_bak as the backend.
After the process is finished, the target container will open, read, and write the files in /mnt through ChaosFS. In this way, delays or faults are injected much more easily. However, there are still two questions to answer:
ptrace solves both questions above. We can use ptrace to replace the opened file descriptors (FD) at runtime, and also to replace the current working directory (CWD) and mmap.
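As a rough illustration of the mount sequence described earlier (enter the mnt namespace, move /mnt aside, mount ChaosFS on top), here is a minimal Python sketch that calls setns and mount through ctypes. It is not the Chaos Mesh implementation. The helper names, the _bak naming rule, and the mount_chaosfs callback are assumptions made for this example, and inject needs root privileges on Linux to actually run.

```python
import ctypes
import os

CLONE_NEWNS = 0x00020000  # mount namespace flag, from <sched.h>
MS_MOVE = 8192            # "move an existing mount" flag, from <sys/mount.h>


def mnt_ns_path(pid):
    """Path of the file that represents a process's mnt namespace."""
    return f"/proc/{pid}/ns/mnt"


def backup_path(target):
    """Hypothetical naming rule for this sketch: /mnt -> /mnt_bak."""
    return target.rstrip("/") + "_bak"


def _check(ret):
    # libc calls return 0 on success and set errno on failure.
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def inject(pid, target, mount_chaosfs):
    """Run the three steps against the container that `pid` lives in."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.setns.argtypes = [ctypes.c_int, ctypes.c_int]
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]

    # Step 1: enter the mnt namespace of the target container.
    ns_fd = os.open(mnt_ns_path(pid), os.O_RDONLY)
    try:
        _check(libc.setns(ns_fd, CLONE_NEWNS))
    finally:
        os.close(ns_fd)

    # Step 2: the equivalent of `mount --move /mnt /mnt_bak`.
    bak = backup_path(target)
    os.makedirs(bak, exist_ok=True)
    _check(libc.mount(target.encode(), bak.encode(), None, MS_MOVE, None))

    # Step 3: mount the fault-injecting filesystem on the original path.
    # `mount_chaosfs` is a hypothetical callback that stands in for ChaosFS.
    mount_chaosfs(mountpoint=target, backend=bak)
```

A call such as inject(target_pid, "/mnt", start_fs) would be issued from a privileged process outside the target container, which is the point of the approach: nothing has to be added to the Pod.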
ptrace is a powerful tool that can make the target process (tracee) run any system call or binary program. For the tracee to run the program, ptrace overwrites the memory at the address that RIP points to in the target process with the program and adds an int3 instruction to trigger a breakpoint. When the binary program stops, we need to restore the registers and memory.
Note:
In the x86_64 architecture, the RIP register (also called the instruction pointer) always points to the memory address of the next instruction to run.
To load the program into the memory space of the target process:
As a best practice, we often replace ptrace POKE_TEXT writes with process_vm_writev because if there is a huge amount of data to write, process_vm_writev performs more efficiently.
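To get a feel for the difference, recall that a single PTRACE_POKETEXT request (the POKE_TEXT write mentioned above) transfers one machine word, while process_vm_writev accepts a whole buffer. The following back-of-the-envelope sketch counts the calls needed in each case. The function names are made up for this example, and the single-call figure assumes that the kernel accepts the buffer in one transfer.

```python
import ctypes

# One PTRACE_POKETEXT request writes one word, that is, sizeof(long) bytes.
WORD_SIZE = ctypes.sizeof(ctypes.c_long)


def poke_calls_needed(nbytes, word_size=WORD_SIZE):
    """Number of word-sized writes needed to copy nbytes (rounded up)."""
    return -(-nbytes // word_size)  # ceiling division


def writev_calls_needed(nbytes):
    """process_vm_writev can take the whole buffer in one call."""
    return 1 if nbytes > 0 else 0
```

With 8-byte words, a 4 KiB payload takes 512 word-sized writes, compared with one process_vm_writev call.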
Using ptrace, we can make a process replace its own FD. Now we only need a method to make that replacement happen. This method is the dup2 system call.
Use dup2 to replace the file descriptor
The signature of the dup2 function is int dup2(int oldfd, int newfd). It creates a copy of the old FD (oldfd), and this copy has the FD number newfd. If newfd already refers to an opened file, that file's FD is automatically closed first.
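The behavior is easy to observe from Python, whose os.dup2 wraps the same system call. In this small sketch, the two temporary files and the function name are just scaffolding for the demonstration.

```python
import os
import tempfile


def demo_dup2():
    """Return what a read through newfd sees before and after dup2."""
    with tempfile.TemporaryDirectory() as d:
        path_a = os.path.join(d, "a")
        path_b = os.path.join(d, "b")
        with open(path_a, "w") as f:
            f.write("from-a")
        with open(path_b, "w") as f:
            f.write("from-b")

        oldfd = os.open(path_a, os.O_RDONLY)
        newfd = os.open(path_b, os.O_RDONLY)
        before = os.pread(newfd, 6, 0)  # newfd still refers to file b

        # dup2 closes whatever newfd referred to, then makes newfd
        # a copy of oldfd.
        os.dup2(oldfd, newfd)
        after = os.pread(newfd, 6, 0)   # the same FD number now reads file a

        os.close(oldfd)
        os.close(newfd)
        return before, after
```

The FD number held by the caller never changes. Only the file behind it does, which is why code that already holds the number keeps working after the swap.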
For example, the current process opens /var/run/__chaosfs__test__/a, whose FD is 1. To replace this opened file with /var/run/test/a, this process performs the following operations:
- Use the fcntl system call to get the OFlags (the parameter used by the open system call, such as O_WRONLY) of /var/run/__chaosfs__test__/a.
- Use the lseek system call to get the current seek position.
- Use the open system call to open /var/run/test/a with the same OFlags. Assume that the new FD is 2.
- Use lseek to set the seek position of the newly opened FD 2.
- Use dup2(2, 1) to replace FD 1 of /var/run/__chaosfs__test__/a with the newly opened FD 2.
- Close FD 2.
After the process is finished, FD 1 of the current process points to /var/run/test/a. So that we can inject faults, any subsequent operations on the target file go through the Filesystem in Userspace (FUSE). FUSE is a software interface for Unix and Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code.
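The sequence above can be tried out on the current process with a few lines of Python. This is only a sketch of the idea: replace_fd is a name chosen for this example, temporary files stand in for the two /var/run paths, and in Chaos Mesh the tracee is made to run these calls through ptrace rather than calling them voluntarily.

```python
import fcntl
import os
import tempfile


def replace_fd(fd, new_path):
    """Make `fd` refer to `new_path`, keeping its open flags and offset."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)     # 1. flags of the old file
    offset = os.lseek(fd, 0, os.SEEK_CUR)      # 2. current seek position
    # F_GETFL reports the access mode and status flags, not creation
    # flags such as O_TRUNC, so the value can be passed to open again.
    tmp_fd = os.open(new_path, flags)          # 3. open the replacement
    try:
        os.lseek(tmp_fd, offset, os.SEEK_SET)  # 4. restore the position
        os.dup2(tmp_fd, fd)                    # 5. swap it in under `fd`
    finally:
        os.close(tmp_fd)                       # 6. drop the temporary FD


def demo():
    """Read from one file, swap the FD, and keep reading from the other."""
    with tempfile.TemporaryDirectory() as d:
        old_path = os.path.join(d, "old")
        new_path = os.path.join(d, "new")
        with open(old_path, "wb") as f:
            f.write(b"0123456789")
        with open(new_path, "wb") as f:
            f.write(b"abcdefghij")

        fd = os.open(old_path, os.O_RDONLY)
        first = os.read(fd, 4)    # comes from the old file, offset moves to 4
        replace_fd(fd, new_path)
        rest = os.read(fd, 6)     # continues at offset 4 in the new file
        os.close(fd)
        return first, rest
```

Because the offset is carried over, the reader does not notice the switch: the second read picks up in the new file exactly where the first one stopped in the old file.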
The combined functionality of ptrace and dup2 makes it possible for the tracer to make the tracee replace the opened FD by itself. Now, we need to write a binary program and make the target process run it:
Note:
In the implementation above, we assume that:
- The threads of the target process are POSIX threads and share the opened files.
- When the target process creates threads using the clone function, the CLONE_FILES parameter is passed.
Therefore, Chaos Mesh only replaces the FD of the first thread in the thread group.
When the program runs, the FD is replaced at runtime.
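The assumption in the note above, that threads share their opened files, can be checked with a small experiment. CPython threads on Linux are POSIX threads, so a dup2 call made in one thread is visible in every other thread of the process. The function name and the temporary files below are made up for this sketch.

```python
import os
import tempfile
import threading


def demo_shared_fd_table():
    """Replace an FD in a worker thread and read it from the main thread."""
    with tempfile.TemporaryDirectory() as d:
        path_a = os.path.join(d, "a")
        path_b = os.path.join(d, "b")
        with open(path_a, "w") as f:
            f.write("from-a")
        with open(path_b, "w") as f:
            f.write("from-b")

        fd = os.open(path_a, os.O_RDONLY)  # opened by the main thread

        def worker():
            # Only this thread performs the replacement.
            tmp = os.open(path_b, os.O_RDONLY)
            os.dup2(tmp, fd)
            os.close(tmp)

        t = threading.Thread(target=worker)
        t.start()
        t.join()

        seen = os.pread(fd, 6, 0)  # the main thread sees the new file
        os.close(fd)
        return seen
```

This is why replacing the FD in a single thread is enough when the threads share one file descriptor table.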
The following diagram illustrates the overall I/O fault injection process:
In this diagram, each horizontal line corresponds to a thread that runs in the direction of the arrows. The Mount/Umount Filesystem and Replace FD tasks are carefully arranged in sequence. Given the process above, this arrangement makes a lot of sense.
I've discussed how we implement fault injection to simulate I/O faults at runtime (see chaos-mesh/toda). However, the current implementation is far from perfect:
If you are interested in Chaos Mesh and would like to help us improve it, you're welcome to join our Slack channel or submit your pull requests or issues to our GitHub repository.
This is the first post in a series on Chaos Mesh implementation. If you want to see how other types of fault injection are implemented, stay tuned.