Tips for concurrent programming

Here are some tips for concurrent programming.

Use communication mechanisms which have built-in failure detection

You should use a communication mechanism that supports notifying you when the other side has failed. Such as pipes and sockets, which will return an EPIPE or other error if you read or write when there's no-one on the other side.

Not having a built-in failure detection system can cause deadlocks and issues, on both the sender and receiver side.

Consider, as an example of the wrong thing, "writer creates a file in a directory, reader waits for it to show up". This gives the writer no indication if the reader is dead and will never receive the file, and the reader no indication if the writer is dead and will never send the file.

Here are a few examples of communication mechanisms with failure detection:

Pipes
Socketpairs
Unix domain sockets
TCP connection sockets

Don't add timeouts

An arbitrary timeout on an operation is usually wrong; you'll spontaneously fail in a different network or load environment, even when you might otherwise have succeeded.

A timeout is a way to avoid nontermination, not a way to detect failures. You should detect failures by using communication mechanisms with built-in failure detection, not with timeouts. Timeout-based failure detection will increase the latency of your system and decrease its ability to respond to failures, because you have to wait for the timeout to know a failure has happened.

The right reason to use a timeout is because otherwise your program might never terminate. And you only need one such timeout, at the top-level or in the user interface, to make sure that your program does in fact terminate. So don't add more timeouts on your own; whoever is running your program or calling your service is the one responsible for timing out.

Some docs on this:

Google calls it "deadline propagation".
The trio concurrency library docs talk about it.

Prefer completion interfaces to readiness interfaces

Prefer interfaces where you send a request and don't get a response until the operation is complete. This is sometimes called "completion notification", or "the completion model for async", and it's easier to use and easier to get right. See discussions of readiness vs completion models.

For example, suppose you're trying to allocate some resource, and there are no resources available when you send the request. In a completion interface, you just send the request, and wait however long (minutes, hours) that it takes for the response to come back with an allocated resource.

The alternative is a readiness interface; where you wait for the right state for the operation, and only then do it. For example, you wait until you receive a notification telling you that there are resources available, and then you send a separate request to allocate that resource.

A readiness interface requires the implementer to spend less resources on tracking outstanding requests, but it's harder for the user to use correctly. In particular, in a completion interface, waiting and operating are coupled together, so it's impossible to wait for the "wrong state" for a given operation, or wait for the state in an incorrect way. In readiness interfaces, it's all too easy to wait on the wrong thing, as I'll discuss in the next section.

Some completion interfaces:

the "read" and "write" syscalls
using socket activation when starting up processes, so clients can immediately send requests
IOCP on Windows

Prefer readiness interfaces to sleeping (or nothing)

In a readiness interface, you wait for the right state for an operation before performing it.

A surprisingly common misdesign is to initiate some preparations, then to simply sleep for some number of seconds, and assume that everything is in the right state once you wake up. This will result in a program which is both slow, because it waits longer than it needs to, and buggy, because sometimes it doesn't wait long enough and causes failures.

Another surprisingly common misdesign is to not sleep or wait at all, but to initiate some preparations and then to just assume that everything is ready immediately afterward. This is like sleeping, but for a random amount of time which depends on how long your code takes to run. If you actually instantaneously performed the operation after initiating the request, you'd always fail, but because there's some small delay, you might get lucky 99.99% of the time. Eventually, however, under load or with bad scheduling, you'll get unlucky and things will break.

Instead, you should wait for things to be in the correct state. You'll need to receive some kind of notification about the state of various entities involved in whatever operation you're performing. If you're writing client-side code, don't be afraid of adding support for these kinds of status updates on the server-side. Again, make sure you use communication mechanisms with built-in failure detection.

Some readiness interfaces:

the server can only perform certain requests when they're allowed; the client has to wait for the server to notify that the requests are allowed to be able to send those requests without encountering failures
starting up daemons and waiting for them to be ready
waiting for O_NONBLOCK fds to be readable

Some more tips on readiness interfaces:

When using readiness interfaces, don't wait for the wrong thing

Sometimes you can be performing operations on one thing, and the status updates you're getting are for a different thing, and it's easy to confuse them.

Remember that communication is never instant; just because service X is ready, or has seen some event, or something, doesn't mean that service Y has.

And remember that some services forward operations to other services; you need status updates based on the state of the service you're actually interacting with, not just the proxy.

When implementing readiness interfaces, don't send ready-notifications too soon

Make sure that when you say "X is ready", that means any requests for X which come instantaneously at that point will succeed. Don't say "X is ready" until that's true.

It's always better to delay a ready-notification, than to issue it too soon. If you issue it too soon, the ready-notification is useless to your client: They can't send an operation immediately, because it might still fail. They have to resort to sleeping - exactly what we were trying to avoid.

Prefer to propagate failures up

It is not your responsibility, nor should you even try, to paper over every possible failure.

Be willing to just push a failure up instead of handling it. In doing so, you're including that failure as part of the (likely informal) specification of your component.

There are many ways to propagate failures up:

Allow exceptions to propagate rather than catching them
Respond to requests from your users with errors
Crash the service, so the process supervisor sees the process exit non-zero

This makes your system more responsive to failures and avoids getting things stuck in a failed state.

Each component should handle only one or a few kinds of failures; the rest should pass through and be propagated upstream towards the user.

When handling a failure, don't just retry

When you do handle a failure rather than propagating it upstream, handle it right.

Do something to fix the failure before retrying. Inside the body of your retry loop, do something like:

waiting for more data to be available, before trying to read again
waiting for more events to arrive, before running more event handlers
waiting for the service to be ready again, before sending another request
counterfactually releasing the lock, before trying again to take the lock

Don't busy loop; that is, don't just wrap a loop around:

do nothing and try again
sleep for an arbitrary amount of seconds, before sending another request

If you don't do anything to fix the failure, it may never be fixed. Worse, it may be fixed by others purely accidentally, which means your retries will mask serious failures and create unnecessary delays.

And busy looping wastes resources. Without busy looping, an idle process can be paused and paged out, so that it consumes no CPU time, memory, or energy until the operation it's waiting for completes.

And don't limit your number of retries. Either your fix will work eventually, or it's not a fix at all. Putting a limit on the number of retries is just setting an arbitrary timeout.

If a certain component can't fix the failure, then it shouldn't be retrying at all; it should immediately propagate the failure upstream instead. Eventually the failure will reach a level where it can be dealt with properly.