The Unix process API is unreliable and unsafe

It is difficult to write software which reliably and safely manages processes on Unix; this is because of flaws in the API. I know of no Unix system which has solved these problems, but in this article I'll be specifically talking about the Linux process API.

When I say "process API", I am talking about the interface for starting processes, monitoring processes until they exit, and stopping processes that haven't yet exited.

I won't cover the various issues with fork here; those have already been discussed enough elsewhere.

So what is so bad about all the other parts of the Linux process API?

  1. It's easy for started processes to accidentally "leak", and linger on the system without anyone responsible for killing them.
  2. Conversely, it's difficult or impossible to guarantee that a buggy or malicious process has not "leaked" some processes.
  3. Rather than using secure identifiers (such as file descriptors) to refer to processes, the API uses global, reusable process identifiers (pids), which separate names from authority and make it possible to mistake one process for another.
  4. Process events are communicated to the parent process using signals, making libraries (which can't safely handle signals) unable to spawn processes, as well as inheriting all the other problems of Unix signals.

I'll explain each problem, as well as some possible solutions that are ultimately flawed. At the end of the article, I'll go over my own solution and how it solves all of these problems.

But first, some motivation. Why should we care about flaws in the Linux process API? Things work well enough right now, don't they?

Yes, there are many tools like process supervisor daemons and Linux containers which fix parts of the process API. All of these tools so far are flawed in one way or another, as I'll explain, but they're still in wide use and they are adequate solutions when used carefully.

The real problem is the significant complexity cost of working around the flaws of the process API. The aforementioned tools have many features and complex interfaces, and implementing their workarounds yourself can lead to even more complexity. At the moment, any system that tries for reliability must pay this complexity cost. But if the native process API was not so flawed, no such complex hacks would be needed, and our systems would be both simpler and more robust.

1. It's easy for processes to leak

What do I mean by a "leaked process"?

I mean a process that is running without any other entity on the system which is both:

  1. Responsible for killing the process, and
  2. Knows the identity of the process and can kill it precisely with no risk of collateral damage

As soon as a process is orphaned (that is, its parent dies), it's leaked, because only the parent of a process is able to safely kill it. Merely knowing the pid is insufficient to safely kill a process, as we'll discuss in the section on process ids. So while an orphaned process might have some other entity on the system with property 1, that entity can never satisfy property 2.

Once some processes have been leaked, you have two options:

  • Hope that they will exit on their own at some point.
  • Look at the list of processes, fire off some signals, and hope you didn't just kill the wrong thing.

Neither is particularly satisfying, so we'd hope that it's hard to make an orphaned process. But, unfortunately, it's quite easy.

A process is orphaned when its parent process exits. If process A starts process B, which starts process C, and then process B exits, process C will be orphaned.

This is as simple as:

sh -c '{ sleep inf & } &'

'sh' is our process A; it forks off another copy of itself to perform the outer '&', which is our process B; then 'sleep inf' is our process C.

The parent process B is able to robustly track the lifetime of its child process C, through the mechanisms Linux provides for parent processes, and ensure that process C exits. If and when C is orphaned, that mechanism is no longer easily usable; it can be used by the init process, but that's not typically accessible to us.

1.1. Flawed solutions

1.1.1. Make sure B always cleans up on exit and kills C

B should just always make sure to kill process C before it exits. That way no processes will be orphaned, so no processes will be leaked, and we'll be fine.

Well, what are the possible ways B might exit and need to clean up?

B might choose to exit, possibly by throwing an exception or panicking. In those cases, it's possible for B to kill process C immediately before exiting.

Or B might receive a signal. B might be signaled for conventional reasons, such as a user pressing Ctrl-C, in which case B can still clean up, as long as the programmer or runtime takes care to catch every kind of signal.

Or B might be signaled for some more unconventional reasons, such as a segmentation fault. It's still possible for B to clean up in this case, but it may be very tricky to do, and the programmer or runtime may need to take great care to make sure that the pid of C is still accessible even while handling a segfault.

Or B might receive SIGKILL. Unfortunately, this case prevents this solution from working. It's not possible for B to clean up when it receives SIGKILL, so C will be unavoidably leaked.

We might want to say, "never send SIGKILL". But that is impossible, both for a conventional reason and an ironic reason. The conventional reason is that B might have a bug, and hang, and SIGKILL might be the only way to kill it. The ironic reason is that the only way for B to clean up and exit in guaranteed finite time is for it to SIGKILL its own children, so that if they have bugs they will not just hang forever. So B would be SIGKILL'd by its own parent, implementing the same strategy.

So, in summary, it's not possible to guarantee that B cleans up and kills C when it exits, because it might be SIGKILL'd. Even in the case where B isn't SIGKILL'd, it's tricky for a complicated program to always make sure to kill off any child processes when it exits.
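
To make the bookkeeping concrete, here's a minimal C sketch of this strategy (error handling omitted, and only a couple of signals covered): it kills the child on normal exit and on catchable signals, and is simply helpless if the parent itself is SIGKILL'd.

  #include <signal.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  static pid_t child_pid = -1;

  /* Best-effort cleanup for normal exits (return from main, exit()). */
  static void kill_child_atexit(void) {
      if (child_pid > 0)
          kill(child_pid, SIGKILL);
  }

  /* Best-effort cleanup for catchable signals. SIGKILL cannot be caught,
     so this whole scheme falls apart if the parent is SIGKILL'd. */
  static void kill_child_and_die(int signo) {
      if (child_pid > 0)
          kill(child_pid, SIGKILL);
      _exit(128 + signo);
  }

  int main(void) {
      atexit(kill_child_atexit);
      struct sigaction sa;
      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = kill_child_and_die;
      sigaction(SIGTERM, &sa, NULL);
      sigaction(SIGINT, &sa, NULL);

      child_pid = fork();
      if (child_pid == 0) {
          execlp("sleep", "sleep", "inf", (char *)NULL);
          _exit(127);
      }

      /* ... B does its work; if it is SIGKILL'd here, C leaks ... */
      pause();
      return 0;
  }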

1.1.2. Use PR_SET_PDEATHSIG to kill C when B exits

We can use the Linux-specific feature PR_SET_PDEATHSIG on process C, to ensure that process C will receive SIGKILL (or another signal) whenever process B exits for any reason, including if process B exits uncleanly due to a bug or SIGKILL.
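
For concreteness, here's a minimal sketch of how B would use it when starting C (error handling omitted). Note that the prctl has to run in the child after fork, and the child has to check whether its parent already exited in the window before the prctl took effect.

  #include <signal.h>
  #include <sys/prctl.h>
  #include <unistd.h>

  int main(void) {
      pid_t parent = getpid();
      pid_t child = fork();
      if (child == 0) {
          /* Ask the kernel to SIGKILL us when our parent (B) exits. */
          prctl(PR_SET_PDEATHSIG, SIGKILL);
          /* B may already have died between fork() and prctl(); the death
             signal only fires for deaths after the prctl, so check. */
          if (getppid() != parent)
              _exit(1);
          execlp("sleep", "sleep", "inf", (char *)NULL);
          _exit(127);
      }
      /* B does its work; when it exits -- even via SIGKILL -- C is killed.
         But any grandchildren C forked are not. */
      sleep(5);
      return 0;
  }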

The issue is that this only works one level down the tree. If C forks off its own process D, the death signal will kill off C but not D.

Extending it to work over an entire tree of processes requires that the entire tree be using PR_SET_PDEATHSIG (and using it correctly). If we can make that guarantee, this technique will work. But in practice, most large systems can't make that guarantee, since they are made up of a large number of programs from many different developers. As one specific example, many applications run subcommands through Unix shells, which don't use PR_SET_PDEATHSIG.

Even in smaller systems where we control all involved programs, this technique isn't perfect, since even programs we control can always have bugs and fail to use PR_SET_PDEATHSIG. We'd prefer a guarantee that relies only on the program at the root of the tree, and doesn't require us to reason about and debug all the programs involved.

1.1.3. Always write down the pid of every process you start, or otherwise coordinate between A and B

B could make sure to always write down the pid of every process it starts, so that we can at least make an attempt to kill any orphaned processes, even if that attempt isn't robust. More generally, B could coordinate with A, and somehow tell A about every process B starts. Then A (which we might trust to be correctly implemented) can handle cleaning up the processes that B starts. This will fail if there's a bug in B, or if B is killed just after starting a process but before telling A, but perhaps it's good enough?

This has the same flaw as PR_SET_PDEATHSIG, in that it only allows for avoiding leaks at a single level. Like PR_SET_PDEATHSIG, all programs involved would need to use our mechanism. And that's infeasible in practice in any large system.

1.1.4. Make sure C always exits when B does

C should just always exit when B does. One common way to do this is for C to read from stdin or some other file descriptor and exit as soon as that file descriptor returns EOF; B then makes sure that the file descriptor is the read end of a pipe whose write end is held only by B, so the read end returns EOF as soon as the write end is closed, which happens automatically when B exits.
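
Here's a minimal sketch of that scheme, with both processes in one program for brevity. In a real system, C would be a separate program that has agreed to watch some particular file descriptor for EOF; the "lifeline" pipe here is just an illustration of the idea, not an established protocol.

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(void) {
      int lifeline[2];
      if (pipe(lifeline) == -1) { perror("pipe"); exit(1); }

      pid_t child = fork();
      if (child == 0) {
          /* "Process C": exits as soon as the lifeline reports EOF. */
          close(lifeline[1]);            /* must not hold the write end itself */
          char buf;
          while (read(lifeline[0], &buf, 1) > 0)
              ;                          /* ignore any data; wait for EOF */
          _exit(0);                      /* B is gone, so exit rather than leak */
      }

      /* "Process B": holds the only write end. Whenever B exits -- cleanly,
         on a crash, or via SIGKILL -- the kernel closes it, and C sees EOF. */
      close(lifeline[0]);
      sleep(2);                          /* B's "work" */
      return 0;
  }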

Unfortunately, most Unix programs don't follow any kind of protocol like this; programs which read from stdin until EOF effectively do, but stdin is often used for other things, like reading from regular files or communicating with other processes in a pipeline.

Again, like with PR_SET_PDEATHSIG, every process in the system would need to be careful to only start children which follow this scheme, and that's infeasible in practice in a large system.

1.1.5. A should run B inside a container

If A runs B inside a Linux container technology, such as a Docker container, then no matter how many processes B starts, A will be able to terminate them all by just stopping the container, and we'll be fine.

Ignoring the other merits of containers, if we're trying to solve the problem of "it is too easy for processes to leak", containers have three main flaws.

  1. It's not easy to run a container. Python has a "subprocess.run" function in its standard library, for starting a subprocess. Python has no "container.run" function in its standard library, to start a subprocess inside a container, and in the current container landscape that seems unlikely to change.

    Shell scripts make starting processes trivial, but it's almost unthinkable that, say, bash, would integrate functionality for starting containers, so that every process is started in a container. Leaving aside the issues of which container technology to use, it would be quite complex to implement.

  2. Containers require root or running inside a user namespace. The root requirement obviously can't be satisfied by most users. Fortunately, it's possible to start a container without being root by using user namespaces. Unfortunately, user namespaces introduce a number of quirks, such as breaking gdb (by breaking ptrace), so they also can't be used by most users.
  3. It's pretty heavyweight to require literally every child process to run in a separate container. Robust usage of pid namespaces (the relevant part of Linux containers) requires that we start up an init process for each pid namespace, separate from the other processes running in the container. This init process will do nothing but increase the load on the system, and it will prevent us from directly monitoring the started processes.

So, running every child process in a separate container isn't a viable solution. We still have no way to easily prevent child processes from leaking.

1.1.6. Use process groups or controlling terminals

Process groups and controlling terminals are two features which can be used to terminate a group of processes. Such a group of processes is usually called a "job", since Unix shells use these features and use that terminology. When processes start children, the children start out in the same job, and they can all be terminated at once. So if process A puts process B in a job, process A can prevent process C from leaking by terminating the job.
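
Here's a minimal sketch of the process-group variant (controlling terminals work similarly): A puts B into a fresh process group, and later signals the whole group at once by passing a negative pid to kill.

  #include <signal.h>
  #include <unistd.h>

  int main(void) {
      pid_t b = fork();
      if (b == 0) {
          /* "Process B": become the leader of a fresh process group. Any
             children B forks (process C, etc.) start out in this group. */
          setpgid(0, 0);
          execl("/bin/sh", "sh", "-c", "sleep inf & sleep inf", (char *)NULL);
          _exit(127);
      }
      /* "Process A": set the group from this side too, the standard idiom
         from job-control shells to avoid a race with the child's exec. */
      setpgid(b, b);

      sleep(1);
      /* Kill B and everything still in B's process group, in one shot. */
      kill(-b, SIGTERM);
      return 0;
  }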

Unfortunately, neither of these job mechanisms is nestable. If a process puts itself or its children into a new process group or gives itself a new controlling terminal, it completely replaces the old process group or controlling terminal. So that process will no longer be terminated when its original job is terminated!

In other words, if process A puts process B in a job, and process B then puts process C in a job and neglects to terminate process C, process C will no longer be in the job that process A knows about, so process C will leak!

So, ironically, if a child process tries to use these features to prevent its own child processes from leaking, it can inadvertently cause them to leak. This is certainly unsuitable.

1.1.7. Use Windows 8 nested job objects

Windows 8 added support for nested job objects. Child processes (and all their transitive children) can be associated with a job, and they will all be terminated when the owner of the job exits (or deliberately kills them). Child processes can create their own jobs and assign their own children to those jobs, without interfering with or being aware of their parent job.

Unfortunately, we're using Linux, not Windows. :)

2. It's impossible to prevent malicious process leaks

What's a "malicious process leak"?

Well, if a "process leak" is a process existing on the system without someone knowing to kill it, a "malicious process leak" is a process existing on the system and actively evading being killed.

A process can fork repeatedly to make a thousand copies of itself, or just fork constantly at all times, leaving the previous processes to immediately exit, so that its pid is constantly changing and the latest copy can't be identified and sent a signal. A "fork bomb" is one example of an attack of this kind.

But note that this doesn't have to be the result of an attack; simple buggy code can cause this. If you ever program using fork(), you could easily start forking repeatedly just from a bug.

We would like to be able to put in a moderate amount of work to completely prevent this kind of issue. Unfortunately, completely preventing this from happening is very difficult, maybe even impossible.

2.1. Flawed solutions

2.1.1. Run your possibly-malicious process inside a container or a virtual machine

If we run our possibly-malicious process inside a container or virtual machine, then no matter how much it forks and exits, we will be able to terminate the process by just stopping the container (or virtual machine).

This will actually work, to a degree. Most of our earlier concerns (it's too hard and too heavyweight) no longer apply, because in this section we're happy to have any means at all to prevent the attack.

However, it still requires root access (or the use of unprivileged user namespaces) to run a container or a virtual machine. So this solution is not truly general purpose; we can't use this routinely, every time we create a child process, because our application certainly should not run with root access in the normal case, nor can everything run in an unprivileged user namespace.

We can partially get around the need for root access by having a privileged daemon start processes on our behalf inside a container.

systemd, for example, with its systemd-run API, allows us to request that systemd start up a process for us on the fly. systemd runs every process in a separate cgroup (which is the underlying container mechanism that we would use), so it can protect against the malicious process leak problem.

But having someone else start a process on our behalf breaks a lot of traditional Unix features. For example, we can't easily have our child process inherit stdin/stdout/stderr from us, nor will it inherit environment variables or any ulimits we've placed on ourself. The shell, among other applications, is completely dependent on these features.

Also, this privileged daemon centralizes all the processes we start on the system. We can't, say, set up a truly isolated environment for development or integration testing, because we'll still have to go through the central daemon.

So as a general-purpose mechanism, this is not workable, but it can work in certain constrained scenarios.

2.1.2. Limit the number of processes that can exist on the system

What if we limit the number of processes that can exist on the system? Then as the process keeps forking, it will eventually exhaust the available process space and stop, and in that frozen moment of tranquility, an already-started process would be able to kill it.

The number of processes that can exist is actually already limited; there's a maximum pid, and we can't have any more processes than that. The issue is that as processes exit, possibly due to being killed by us, their space is usually freed up, and new processes can be created.

So if the malicious process just keeps forking, it can fill up the space left by previous processes exiting, and this doesn't help us. Stricter limits on the number of processes can prevent fork bombs, but not more general attacks.
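
For reference, one way to impose such a stricter limit is RLIMIT_NPROC, which caps the number of processes a real user ID may have. Here's a minimal sketch (the program name is made up, and error handling is omitted); as just noted, this blunts a fork bomb but doesn't stop a process that keeps forking and exiting while staying under the limit.

  #include <sys/resource.h>
  #include <unistd.h>

  int main(void) {
      pid_t child = fork();
      if (child == 0) {
          /* Cap the number of processes this *user* may have before handing
             control to the untrusted program. */
          struct rlimit lim = { .rlim_cur = 64, .rlim_max = 64 };
          setrlimit(RLIMIT_NPROC, &lim);
          execlp("possibly-malicious-program", "possibly-malicious-program",
                 (char *)NULL);
          _exit(127);
      }
      /* Parent: monitor and wait for the child as usual. */
      return 0;
  }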

However, if we could prevent space from being freed up as processes exit, the space that the malicious process has to operate in would shrink and shrink, until finally it is no longer able to fork any more, and we can kill the last copy. Preventing the reuse of process space while under possible attack can be done using a technique that I'll discuss at the end of this article. It's a key part of a robust solution to the process leaking problem.

3. Processes have global, reusable IDs

A process is identified using its 'pid'. A pid is an integer, typically in the range 1 to 32768 (the default value of /proc/sys/kernel/pid_max on Linux), which is assigned to the process at startup from the pool of currently unused pids, and which is returned to that pool once the process has exited and been reaped.

There is a single pool of process IDs on the system. If enough processes are started and exit, a process ID will be reused.

Pids are mainly used to send signals to processes with the "kill" system call (which is used for any kind of signal, not just lethal ones).

Typically, a long-lived process (a "daemon") would write its own pid into a file, called a "pidfile". Then other processes could send signals to the daemon by reading that pidfile and using "kill".

But there is absolutely no guarantee that when you "kill", you are sending a signal to the right process. If the daemon has exited, and enough processes have started and stopped since then, the pid in the daemon's pidfile might point to a completely unrelated process. You might send a fatal signal to something critically important instead of the daemon you meant to send it to!

You might try checking the command that the pid is running, or other details about the process, before killing the process. But that's no guarantee either - between the time you check and the time you signal, the process may have died and been replaced.
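
Here's the classic pidfile pattern as a sketch, with the race spelled out in comments; the pidfile path is made up for illustration.

  #include <signal.h>
  #include <stdio.h>
  #include <sys/types.h>

  int main(void) {
      /* Hypothetical pidfile path, for illustration only. */
      FILE *f = fopen("/var/run/mydaemon.pid", "r");
      if (!f) return 1;
      long pid;
      if (fscanf(f, "%ld", &pid) != 1) return 1;
      fclose(f);

      /* We could peek at /proc/<pid>/comm here to "verify" the target, but
         it proves nothing: between any such check and the kill() below, the
         daemon may exit and the pid may be handed to a new, unrelated
         process. The signal then lands on the wrong victim. */
      kill((pid_t)pid, SIGTERM);
      return 0;
  }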

Fundamentally, any usage of a pid (other than for your own child processes) is vulnerable to a time-of-check-to-time-of-use (TOCTOU) race condition. Since pids are the only way to identify a process, this means any interaction with processes other than your own child processes is inherently racy.

3.1. Flawed solutions

3.1.1. Don't reuse pids, use a UUID instead

The kernel could identify processes with some kind of truly globally unique identifier. Then users wouldn't have race conditions when they try to kill them. Note that users can't implement this solution in userspace by tagging the processes with a UUID; it must be a kernel-provided UUID to avoid TOCTOU issues.

This would work, but it would be difficult to retrofit onto an existing Unix system: many applications assume that pids fit in a 16-bit or 32-bit integer.

We would also pay an efficiency cost, just because of handling a larger identifier. It would be unusual for an operating system to provide references to its internal structures with UUIDs, when it can use more efficient smaller identifiers and provide security through other means.

3.1.2. Only send signals to your own child processes

When process A starts process B, and then process B exits, process A is notified. Furthermore, process B leaves a "zombie process" behind after it exits, which consumes the pid until process A explicitly acts to get rid of the zombie process. These two features allow process A to know exactly when it is safe to send signals to B's pid. So if A stays running for as long as B is running, and only A sends signals to B, we can have signals without races.
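
Concretely, this is the race-free pattern (a minimal sketch, error handling omitted): signal first, reap second; in between, the zombie keeps the pid reserved for us.

  #include <signal.h>
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      pid_t b = fork();
      if (b == 0) {
          execlp("sleep", "sleep", "inf", (char *)NULL);
          _exit(127);
      }

      /* Because we have not yet waited on b, the kernel keeps that pid
         reserved for our child (as a zombie if it has already exited), so
         this kill() cannot hit an unrelated process. */
      kill(b, SIGTERM);

      int status;
      waitpid(b, &status, 0);   /* reap; only now may the pid be reused */
      if (WIFSIGNALED(status))
          printf("child killed by signal %d\n", WTERMSIG(status));
      return 0;
  }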

This works, and is an excellent replacement for pidfiles. But it doesn't work in all situations.

What if process A exits unexpectedly? Then we are back in the situation of not being able to kill process B without a race condition. Furthermore, what if we genuinely want process B to outlive process A? This is the case whenever we are starting a long-lived process (a daemon), for example.

To support this, instead of forking off a process, process A could send a request to a long-lived supervisor daemon to start process B, as the supervisor daemon's own child. The authors of supervisor daemons such as systemd or supervisord often urge software developers not to fork off their own long-lived processes; instead, say the supervisor daemon authors, we should request that the supervisor daemon fork off long-lived processes on our behalf.

Unfortunately, that has the same issues as discussed in the section on preventing malicious process leaks, where we considered having a privileged daemon create containers on our behalf. We can't easily have our child process inherit stdin/stdout/stderr from us, nor will it inherit environment variables or any ulimits we've placed on ourself. The shell, among other applications, is completely dependent on these features. And this daemon centralizes the processes we start on the system, so it's difficult to set up isolated test or development environments.

Furthermore, even if we have a supervisor daemon starting processes on our behalf, this leaves a static parent-child hierarchy which cannot change. The supervisor daemon cannot, for example, start a new version of itself to upgrade, without careful use of exec, as all of its child processes will stop being its children. Nor can process A initially start up process B as process A's child, and then later decide that process B should live past process A's exit.

What we need is a way to send signals without races, without forcing a specific parent-child hierarchy. If we can make the parent-child hierarchy more flexible, it would work well. We will use this technique in combination with others as part of a full solution at the end of this article.

3.2. Correct solutions

3.2.1. Use pidfd

pidfd was added to Linux in 2019, and solves the issues with using pids to identify processes. Read the LWN summary for more information.

It allows us to break the rigid parent-child hierarchy, and replace it with a more flexible supervision hierarchy, based on passing file descriptors around.
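
A minimal sketch of signaling through a pidfd instead of a raw pid. I go through syscall(2) here since glibc wrappers for these calls are relatively recent; this assumes headers new enough to define SYS_pidfd_open (kernel 5.3+) and SYS_pidfd_send_signal (kernel 5.1+).

  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      pid_t child = fork();
      if (child == 0) {
          execlp("sleep", "sleep", "inf", (char *)NULL);
          _exit(127);
      }

      /* Get a file descriptor referring to this specific process. Even if
         the numeric pid is later reused, the pidfd still refers to *our*
         child (or to nothing), never to a stranger. */
      int pidfd = syscall(SYS_pidfd_open, child, 0);
      if (pidfd == -1) { perror("pidfd_open"); return 1; }

      /* Signal through the pidfd instead of through the raw pid. */
      if (syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0) == -1)
          perror("pidfd_send_signal");

      waitpid(child, NULL, 0);   /* reap as usual */
      return 0;
  }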

4. Process exit is communicated through signals

Process exit is communicated to the parent of a process by SIGCHLD. If process A starts process B, and then process B exits, process A will be sent the SIGCHLD signal.

Signals are delivered to the entire process, and only one signal handler can be registered for each signal.

So if the main function in process A registers a signal handler for SIGCHLD, and library L1 in process A starts a process B, when process B exits, the signal handler of the main function in process A will receive a notification of the exit of a child it never started, and the library will never be told that its child has exited.

Conversely, if the library L1 registers the signal handler, and the main function or even another library L2 starts a process B, then only L1 will be notified when the process exits.

In general, only one part of the program can directly receive signals. That one part of the program then must forward the signal around to whatever other components desire to receive signals. If a library has no interface for receiving signal information, like glibc, then it can't use child processes. This is a major inconvenience for both the library developer and the user.
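
To make the conflict concrete, here's a small sketch: the application installs a typical reap-everything SIGCHLD handler, and a child forked elsewhere (imagine it's inside a library) gets reaped out from under its owner, whose own waitpid then fails.

  #include <signal.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /* The application's SIGCHLD handler: a very common pattern that reaps
     *every* child, including children it knows nothing about. */
  static void on_sigchld(int signo) {
      (void)signo;
      while (waitpid(-1, NULL, WNOHANG) > 0)
          ;   /* exit statuses are thrown away here */
  }

  int main(void) {
      struct sigaction sa;
      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = on_sigchld;
      sa.sa_flags = SA_RESTART;
      sigaction(SIGCHLD, &sa, NULL);

      /* Imagine this fork happens deep inside some library. */
      pid_t library_child = fork();
      if (library_child == 0) {
          execlp("true", "true", (char *)NULL);
          _exit(127);
      }

      sleep(1);   /* by now the handler above has already reaped the child */

      /* When the library tries to collect its own child's status, the pid
         no longer exists and waitpid() fails with ECHILD. */
      if (waitpid(library_child, NULL, 0) == -1)
          perror("library's waitpid");
      return 0;
  }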

4.1. Flawed solutions

4.1.1. Use signalfd

While signalfd is certainly a great help in dealing with signals on Linux, it doesn't actually help deal with the problem of libraries receiving SIGCHLD. You could use signalfd to wait for the SIGCHLD signals, but you still then need to forward the signals to each library.
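
For reference, here's roughly what the signalfd approach looks like (a minimal sketch): the signal becomes a file descriptor you can read, but it is still a single process-wide resource that someone must own and demultiplex on behalf of every library.

  #include <signal.h>
  #include <stdio.h>
  #include <sys/signalfd.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      /* SIGCHLD must be blocked so it is queued to the signalfd instead of
         being delivered asynchronously. This is process-wide state: whoever
         owns this fd effectively owns SIGCHLD for the whole program. */
      sigset_t mask;
      sigemptyset(&mask);
      sigaddset(&mask, SIGCHLD);
      sigprocmask(SIG_BLOCK, &mask, NULL);
      int sfd = signalfd(-1, &mask, SFD_CLOEXEC);

      pid_t child = fork();
      if (child == 0) {
          execlp("true", "true", (char *)NULL);
          _exit(127);
      }

      struct signalfd_siginfo si;
      read(sfd, &si, sizeof(si));       /* blocks until a SIGCHLD arrives */
      printf("got SIGCHLD for pid %d\n", (int)si.ssi_pid);
      waitpid(child, NULL, 0);          /* still our job to reap */
      return 0;
  }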

4.1.2. Chain signal handlers

Can't we just have one library's signal handler call the next library's signal handler?

Rather than explain it in this article, I refer the reader to here, where it's explained that signal handler chaining can't be done robustly. Libraries have high standards for working, even in strange scenarios!

4.1.3. Create a standard library for starting children and have everyone use it

The issue is that multiple libraries want to handle the task of starting and monitoring children. Can't we just agree on a single standard library that abstracts over SIGCHLD, and have everyone use it? We can provide a file descriptor interface, which is increasingly standard on Linux, and is easy for libraries to use and monitor.

Unfortunately, it would be near impossible to get every other library that wants to use subprocesses or wants to listen for SIGCHLD to use this single standard library.

There are already plenty of libraries which provide wrappers around SIGCHLD/fork/exec, and plenty of code that depends on them. We can't just have a flag day and switch everything over to a new library all at once. This becomes even more tricky in high-level languages, because most languages already come with a higher-level API around spawning processes.

Still, the idea of providing a file descriptor interface for starting and monitoring children is a good one. File descriptors can easily be integrated into an event loop. And a file descriptor can be monitored by a library without interfering with the rest of the program, using a library's own private event loop or other mechanisms. We just need a way to provide that interface that does not interfere with other libraries in the same process.

4.2. Correct solutions

4.2.1. Use pidfd

pidfd allows a library to launch a process and get a pidfd back, then use that pidfd to monitor the process.

A file descriptor can be easily integrated into an event loop, as mentioned in the previous section.
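
A minimal sketch of that: the pidfd becomes readable when the process exits, so a library can wait for its own child in a private poll/epoll loop without installing any SIGCHLD handler. As before, this assumes headers and a kernel new enough for pidfd_open (5.3+).

  #define _GNU_SOURCE
  #include <poll.h>
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      pid_t child = fork();
      if (child == 0) {
          execlp("sleep", "sleep", "1", (char *)NULL);
          _exit(127);
      }

      int pidfd = syscall(SYS_pidfd_open, child, 0);

      /* The pidfd becomes readable once the child exits, so it can sit in
         any event loop private to a library, with no signal handler
         installed anywhere in the process. */
      struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
      poll(&pfd, 1, -1);

      /* Reap our own child by its specific pid (on Linux 5.4+ one can
         instead reap via the fd itself with waitid(P_PIDFD, pidfd, ...)). */
      int status;
      waitpid(child, &status, 0);
      printf("child exited, status %d\n", status);
      return 0;
  }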

5. How to fix all these problems

pidfd is a great solution to the third and fourth problems, but it doesn't solve the first two problems.

I know of only one existing solution that fixes all these problems without sacrificing flexibility or generality.

Use the C utility supervise to start your processes; for Python, you can use its associated Python library.

Essentially, we delegate the problem of starting and monitoring child processes to a small helper program: supervise. And it abstracts away the fixes for these problems behind a nicer (but still low-level) interface. For a high-level interface, one can use the Python library.

5.1. Problem: It's easy for processes to leak

Solution: supervise kills all your descendant processes when you exit.

supervise is passed a file descriptor to read instructions from on startup, and monitors that fd throughout its (short and simple) lifetime. When the parent process exits, the fd will be closed, supervise will be notified, supervise will kill the descendant processes, and then supervise will also exit.

But because process lifetime is tied to the lifetime of a fd, it's still easy to create long-lived processes if you wish; just make sure the fd outlives your own process.

supervise is able to find all descendant processes by using PR_SET_CHILD_SUBREAPER, a Linux-specific feature. If process A starts process B which starts process C, and process B exits, then if process A has set PR_SET_CHILD_SUBREAPER then process A will become the new parent of process C. This allows supervise to safely find and kill all descendant processes.
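
This isn't supervise's actual code, but here's a minimal illustration of the kernel feature it relies on: a subreaper gets orphaned descendants reparented to it, so it can see and reap a grandchild it never forked.

  #include <stdio.h>
  #include <sys/prctl.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      /* Become a "subreaper": when any descendant is orphaned, it is
         reparented to us instead of to init. */
      prctl(PR_SET_CHILD_SUBREAPER, 1);

      pid_t b = fork();
      if (b == 0) {
          /* "B" starts "C" (a short-lived sleep) and exits immediately;
             C is orphaned, and the kernel reparents it to us. */
          execl("/bin/sh", "sh", "-c", "sleep 1 &", (char *)NULL);
          _exit(127);
      }

      /* We reap B, and then -- although we never forked it -- we also reap
         C when it exits, because it has become our child. A supervisor
         built on this can find, signal, and wait for every descendant. */
      int reaped;
      while ((reaped = waitpid(-1, NULL, 0)) > 0)
          printf("reaped pid %d\n", reaped);
      return 0;
  }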

5.2. Problem: It's impossible to prevent malicious process leaks

Solution: supervise kills all your descendant processes when you exit, securely and in a guaranteed-to-terminate way.

It does this using the technique mentioned in the "Limit the number of processes that can exist on the system" section. If we don't free up pid space as a malicious process forks and exits, eventually the pid space will be exhausted and the malicious process can be cornered and killed.

5.3. Problem: Processes have global, reusable IDs

Solution: supervise gives you a file descriptor interface to signaling a process.

To signal the process, you just write to the file descriptor. File descriptors are local and unforgeable, so it's not possible for the file descriptor to suddenly start pointing at a different instance of supervise, wrapping a different process.

All the descendant processes of supervise will at some point become its direct children, thanks to PR_SET_CHILD_SUBREAPER, so it can safely send them all signals using "kill" and cause them to exit; a supervision hierarchy can thus be maintained without forcing any specific organization.

And just like all file descriptors, the supervise file descriptors can be inherited by children or passed over Unix sockets. This allows a supervision hierarchy to be rearranged at runtime, rather than forcing a static parent-child hierarchy.

5.4. Problem: Process exit is communicated through signals

Solution: supervise gives you a file descriptor interface to monitor a process for exit.

In addition to the file descriptor that supervise reads instructions from, supervise also accepts a file descriptor to write status changes to. You can read and monitor this file descriptor to receive notification of process status changes.

6. How to really fix all these problems in the long term

Of course, supervise is not a long-term solution. Running an additional helper process for every real process you start is an annoying, if slight, inconvenience and performance loss. The correct long-term solution is to actually get this functionality into the Linux kernel.

pidfd is the obvious basis for a solution, since it solves two of these four problems. We just need to add a way to fix the process leaking issues.

The best way to approach this would be to tie the lifetime of the process to the lifetime of the pidfds pointing to it, as pdfork does when PD_DAEMON is not set. This could be done with a new CLONE_TERMINATE_ON_CLOSE flag. Then, when all the pidfds pointing to a process are closed, the process is killed. This happens naturally when the parent of the process (and any other process that the parent sent the pidfd to) loses interest in the process.

This still allows the process to fork off and leak processes of its own; that could be addressed by various means, perhaps by using a seccomp filter to force the process to only start processes using CLONE_TERMINATE_ON_CLOSE, or by adding a new prctl that is an inheritable variant of PR_SET_PDEATHSIG. Using a pid namespace is a major hassle at the moment since it requires a dedicated init process, but it could also form the basis of a solution if that requirement was removed.

Hopefully Linux gets these features soon. In the meantime, supervise provides equivalent functionality in userspace.

Created: 2023-10-12 Thu 09:27
