A communication mechanism without built-in failure detection can cause deadlocks and other hard-to-diagnose problems on both the sender and the receiver side.
Consider, as an example of the wrong thing, "writer creates a file in a directory, reader waits for it to show up". This gives the writer no indication if the reader is dead and will never receive the file, and the reader no indication if the writer is dead and will never send the file.
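By contrast, a pipe reports both kinds of failure. The following is a minimal sketch, assuming a POSIX-like system; the function names are made up for illustration, and a dead peer is simulated by closing its end of the pipe, which is what the kernel does when a process exits.

```python
import os

def reader_sees_writer_death() -> bool:
    """The reader gets EOF once every write end of the pipe is closed."""
    r, w = os.pipe()
    os.close(w)                 # simulate the writer dying
    data = os.read(r, 1024)     # returns immediately with b"" (EOF)
    os.close(r)
    return data == b""

def writer_sees_reader_death() -> bool:
    """The writer gets an error once every read end of the pipe is closed."""
    r, w = os.pipe()
    os.close(r)                 # simulate the reader dying
    try:
        os.write(w, b"hello")
    except BrokenPipeError:     # EPIPE; Python ignores SIGPIPE by default
        return True
    finally:
        os.close(w)
    return False
```

Neither side has to guess: the reader learns about a dead writer from the EOF, and the writer learns about a dead reader from the error.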
Here are a few examples of communication mechanisms with failure detection:
A timeout is a way to avoid nontermination, not a way to detect failures. You should detect failures by using communication mechanisms with built-in failure detection, not with timeouts. Timeout-based failure detection will increase the latency of your system and decrease its ability to respond to failures, because you have to wait for the timeout to know a failure has happened.
The right reason to use a timeout is that otherwise your program might never terminate. And you only need one such timeout, at the top level or in the user interface, to make sure that your program does in fact terminate. So don't add more timeouts on your own; whoever is running your program or calling your service is the one responsible for timing out.
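As a rough illustration of "one timeout, at the top level", here is a sketch using Python's `asyncio.wait_for`. The names `wait_for_resource` and `main` and the `already_available` flag are hypothetical; the point is only that the inner code waits without a timeout and the outermost caller decides when to give up.

```python
import asyncio

async def wait_for_resource(available: asyncio.Event) -> str:
    # Inner code: no timeout here, it simply waits for the event.
    await available.wait()
    return "resource"

async def main(deadline_seconds: float, already_available: bool = False):
    available = asyncio.Event()
    if already_available:
        available.set()
    try:
        # The single timeout lives at the top level, chosen by the caller.
        return await asyncio.wait_for(
            wait_for_resource(available), timeout=deadline_seconds
        )
    except asyncio.TimeoutError:
        # Reached only when the deadline expires; wait_for cancels the inner wait.
        return None
```

Running `asyncio.run(main(0.05))` gives up after the deadline, while the inner coroutine never had to know a deadline existed.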
Some docs on this:
For example, suppose you're trying to allocate some resource, and there are no resources available when you send the request. In a completion interface, you just send the request and wait however long it takes (minutes, hours) for the response to come back with an allocated resource.
The alternative is a readiness interface, where you wait for the right state for the operation, and only then perform it. For example, you wait until you receive a notification telling you that there are resources available, and then you send a separate request to allocate that resource.
A readiness interface requires the implementer to spend fewer resources on tracking outstanding requests, but it's harder for the user to use correctly. In particular, in a completion interface, waiting and operating are coupled together, so it's impossible to wait for the "wrong state" for a given operation, or wait for the state in an incorrect way. In readiness interfaces, it's all too easy to wait on the wrong thing, as I'll discuss in the next section.
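To make the contrast concrete, here is a small sketch of a hypothetical in-process resource pool that exposes both styles. The `Pool` class and all of its method names are invented for this example and are not taken from any real library.

```python
import threading

class Pool:
    """Hypothetical resource pool exposing both interface styles."""

    def __init__(self, count: int):
        self._free = count
        self._cond = threading.Condition()

    def release(self) -> None:
        with self._cond:
            self._free += 1
            self._cond.notify_all()

    # Completion interface: one call that waits and then allocates.
    def allocate(self) -> None:
        with self._cond:
            while self._free == 0:
                self._cond.wait()
            self._free -= 1

    # Readiness interface: waiting and operating are separate calls.
    def wait_available(self) -> None:
        with self._cond:
            while self._free == 0:
                self._cond.wait()

    def try_allocate(self) -> bool:
        with self._cond:
            if self._free == 0:
                return False    # someone else got there first
            self._free -= 1
            return True

def allocate_via_readiness(pool: Pool) -> None:
    # The caller has to loop: being "available" a moment ago
    # does not guarantee the allocation will succeed now.
    while True:
        pool.wait_available()
        if pool.try_allocate():
            return
```

With `allocate` there is nothing for the caller to get wrong. With the readiness pair, forgetting the loop in `allocate_via_readiness` or waiting on some other condition produces a bug that only shows up under contention.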
Some completion interfaces:
A surprisingly common misdesign is to initiate some preparations, then to simply sleep for some number of seconds, and assume that everything is in the right state once you wake up. This will result in a program which is both slow, because it waits longer than it needs to, and buggy, because sometimes it doesn't wait long enough and causes failures.
Another surprisingly common misdesign is to not sleep or wait at all, but to initiate some preparations and then to just assume that everything is ready immediately afterward. This is like sleeping, but for a random amount of time which depends on how long your code takes to run. If you actually instantaneously performed the operation after initiating the request, you'd always fail, but because there's some small delay, you might get lucky 99.99% of the time. Eventually, however, under load or with bad scheduling, you'll get unlucky and things will break.
Instead, you should wait for things to be in the correct state. You'll need to receive some kind of notification about the state of various entities involved in whatever operation you're performing. If you're writing client-side code, don't be afraid of adding support for these kinds of status updates on the server-side. Again, make sure you use communication mechanisms with built-in failure detection.
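Here is one way that could look, as a sketch rather than a recipe. A hypothetical `start_service` runs some preparation in a background thread and reports readiness over a pipe; `wait_until_ready` blocks until it sees either the ready byte or EOF. The names and the one-byte protocol are assumptions made for this example. With a real child process the kernel would close the write end when the process dies; the thread here closes it explicitly to imitate that.

```python
import os
import threading

def start_service(prepare) -> int:
    """Run prepare() in the background; return the read end of a status pipe."""
    r, w = os.pipe()

    def run():
        try:
            prepare()
            os.write(w, b"R")   # ready-notification, sent only after preparation
        except Exception:
            pass                # closing w without writing signals failure
        finally:
            os.close(w)

    threading.Thread(target=run, daemon=True).start()
    return r

def wait_until_ready(r: int) -> None:
    """Block until the service is ready; raise if it died first."""
    try:
        data = os.read(r, 1)    # b"R" when ready, b"" (EOF) if the writer is gone
    finally:
        os.close(r)
    if data != b"R":
        raise RuntimeError("service exited before becoming ready")
```

The caller waits exactly as long as the preparation takes, with no sleep, and finds out immediately if the preparation failed.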
Some readiness interfaces:
Some more tips on readiness interfaces:
Remember that communication is never instant: just because service X is ready, or has seen some event, doesn't mean that service Y has.
And remember that some services forward operations to other services; you need status updates based on the state of the service you're actually interacting with, not just the proxy.
It's always better to delay a ready-notification than to issue it too soon. If you issue it too soon, the ready-notification is useless to your client: they can't send an operation immediately, because it might still fail. They have to resort to sleeping, which is exactly what we were trying to avoid.
Be willing to just push a failure up instead of handling it. In doing so, you're including that failure as part of the (likely informal) specification of your component.
There are many ways to propagate failures up:
Each component should handle only one or a few kinds of failures; the rest should pass through and be propagated upstream towards the user.
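A tiny sketch of what "handle one kind, pass the rest through" can look like with exceptions. The function `load_config` and its `read_file` parameter are hypothetical; treating a missing file as "use defaults" is an assumption made just for this example.

```python
def load_config(path: str, read_file) -> str:
    """Return the config text, or "" if no config file exists."""
    try:
        return read_file(path)
    except FileNotFoundError:
        # The one failure this component knows how to handle:
        # a missing config file means "use the defaults".
        return ""
    # Everything else (PermissionError, IsADirectoryError, ...) is not
    # caught here, so it propagates upstream towards the user.
```

Note that there is no catch-all `except Exception` here; this component has no idea how to fix a permissions problem, so it doesn't pretend to.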
Do something to fix the failure before retrying. Inside the body of your retry loop, do something like:
And busy looping wastes resources. Without busy looping, an idle process can be paused and paged out, so that it consumes no CPU time, memory, or energy until the operation it's waiting for completes.
And don't limit your number of retries. Either your fix will work eventually, or it's not a fix at all. Putting a limit on the number of retries is just setting an arbitrary timeout.
If a certain component can't fix the failure, then it shouldn't be retrying at all; it should immediately propagate the failure upstream instead. Eventually the failure will reach a level where it can be dealt with properly.
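Putting these points together, a retry loop might look roughly like the sketch below. `Unavailable`, `retry_with_fix`, and the `wait_for_fix` callback are hypothetical names; `wait_for_fix` stands in for whatever blocking step actually addresses the failure, such as waiting for a notification that the resource has come back.

```python
class Unavailable(Exception):
    """The one kind of failure this component knows how to fix."""

def retry_with_fix(operation, wait_for_fix):
    # No retry limit: either the fix eventually works, or this
    # component should not be retrying in the first place.
    while True:
        try:
            return operation()
        except Unavailable:
            # Do something about the failure before trying again.
            # This call is expected to block until the situation has
            # changed, so the loop never busy-waits.
            wait_for_fix()
        # Any other exception is not caught and propagates upstream.
```

The loop contains no sleep and no counter. If `wait_for_fix` cannot block on a real notification, that is a hint that this component has no real fix to offer and should propagate the failure instead.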