Your computer is a distributed system
Most computers are essentially distributed systems internally,
and provide their more familiar programming interface as an abstraction on top of that.
Piercing through these abstractions can yield greater performance,
but most programmers do not need to do so.
This is something unique:
an abstraction that hides the distributed nature of a system
and actually succeeds.
Many have observed that today's computers are distributed systems
implementing the abstraction of the shared-memory multiprocessor.
The wording in the title is from the 2009 paper
"Your computer is already a distributed system. Why isn't your OS?".
Some points from that and other sources:
- In a typical computer,
there are many components running concurrently,
e.g. graphics cards, storage devices, and network cards,
all running their own programs ("firmware") on their own cores,
independent of the main CPU
and communicating over an internal network (e.g. a PCI bus).
- Components can appear, disappear, and fail independently,
up to and including CPU cores,
and other components have to deal with this.
Much like on larger-scale unreliable networks,
timeouts are used to detect communication failures
(see the sketch after this list).
- Components can be malicious and send adversarial inputs;
e.g., a malicious USB or Thunderbolt device
can try to exploit vulnerabilities in other hardware components in the system,
sometimes successfully.
The system should be robust to such attacks,
just as with larger-scale distributed systems.
- These components are all updated on their own schedule.
Often the computer's owner is not allowed
to push their own updates to these components.
- The latency between different components of a computer is highly relevant;
e.g., the latency of communication between the CPU and main memory or storage devices
is huge compared to communication within the CPU.
- To reduce these latencies, we put caches in the middle,
just as we use caches and CDNs in larger distributed systems.
- Those caches themselves are sophisticated distributed systems,
implemented with message passing between individual cores.
Providing cache consistency has a latency cost,
just as it does in larger distributed systems.
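To make the timeout point above concrete,
here is a minimal sketch in Go,
modeling two independently running components as goroutines
that exchange messages over a channel;
the request format and the 100ms deadline are illustrative assumptions,
not real hardware values.

```go
// A minimal sketch of timeout-based failure detection between two
// concurrently running "components", modeled here as goroutines
// exchanging messages over channels.
package main

import (
	"errors"
	"fmt"
	"time"
)

// queryDevice sends a request to a device goroutine and waits for a reply,
// treating silence past the deadline as a failure -- much as a driver
// treats a device that stops answering on the bus.
func queryDevice(requests chan<- string, replies <-chan string, deadline time.Duration) (string, error) {
	requests <- "READ sector 42"
	select {
	case r := <-replies:
		return r, nil
	case <-time.After(deadline):
		return "", errors.New("device did not respond: assuming it failed or was removed")
	}
}

func main() {
	requests := make(chan string)
	replies := make(chan string)

	// The "device": answers the first request, then silently disappears,
	// like hardware that is unplugged or whose firmware hangs.
	go func() {
		<-requests
		replies <- "sector 42 contents"
		<-requests // second request is received but never answered
	}()

	for i := 0; i < 2; i++ {
		reply, err := queryDevice(requests, replies, 100*time.Millisecond)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("reply:", reply)
	}
}
```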
As the shared-memory multiprocessor is actually a distributed system underneath,
we can, if necessary, reason explicitly about that underlying system
and use the same techniques that larger-scale distributed systems use:
- Performing operations closer to the data,
e.g. through compute elements embedded into storage devices and network interfaces,
is faster.
- Offloading heavy computations to a "remote service" can be faster,
both because the service can run with better resources
and because the service will have better locality.
- Communicating directly between devices
rather than going through the centralized main memory
reduces load on the CPU and improves latency.
- Using message passing instead of shared memory is faster,
for the same reasons larger distributed systems don't use distributed shared memory
(a sketch contrasting the two styles follows this list).
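As promised above, here is a minimal sketch contrasting the two styles within a single process:
a counter updated through shared memory (a mutex-protected variable)
versus the same counter owned by one goroutine and updated via message passing (a channel).
The counter workload and worker counts are illustrative assumptions;
this shows the structural difference, not a benchmark.

```go
// Shared-memory style versus message-passing style for the same task.
package main

import (
	"fmt"
	"sync"
)

// sharedMemoryStyle: every goroutine updates the same variable under a lock,
// so its cache line is shuttled back and forth between cores.
func sharedMemoryStyle(workers, increments int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	counter := 0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < increments; i++ {
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

// messagePassingStyle: one goroutine owns the counter; others send it
// messages, so the mutable state stays in one place.
func messagePassingStyle(workers, increments int) int {
	msgs := make(chan int, 1024)
	done := make(chan int)
	go func() { // the owner of the counter
		counter := 0
		for delta := range msgs {
			counter += delta
		}
		done <- counter
	}()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < increments; i++ {
				msgs <- 1
			}
		}()
	}
	wg.Wait()
	close(msgs)
	return <-done
}

func main() {
	fmt.Println(sharedMemoryStyle(4, 10000))   // 40000
	fmt.Println(messagePassingStyle(4, 10000)) // 40000
}
```

The message-passing version keeps the counter's state with a single owner,
which mirrors the locality argument made above.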
But most programs do not do such things;
the abstraction of the shared-memory multiprocessor
is sufficient for them.
That this abstraction is successful is surprising.
In distributed systems,
it is often supposed that you cannot abstract away
the issues of distributed programming.
One classic and representative quote:
It is our contention that
a large number of things may now go wrong
due to the fact that
RPC tries to make remote procedure calls look exactly like local ones,
but is unable to do it perfectly.
Following this,
most approaches to large-scale distributed programming today
make the "distributed" part explicit:
They expose (and require programmers to deal with)
network failures, latency, insecure networks, partial failure, etc.
It's notoriously hard for programmers to deal with these issues.
So it is surprising to find that essentially all programs are written
in an environment that abstracts away its own distributed nature:
the shared-memory multiprocessor, as discussed above.
Such programs are filled with RPCs that appear to be local operations.
Accesses to memory and disk are synchronous RPCs to relatively distant hardware,
which block the execution of the program while waiting for a response.
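For example, a file read in Go looks like an ordinary local call,
yet it is effectively a synchronous RPC to a storage device
that blocks the calling goroutine until the reply arrives
(the file path here is just a placeholder):

```go
// A blocking read that looks local but is really a round trip to
// relatively distant hardware.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	start := time.Now()
	data, err := os.ReadFile("/etc/hostname") // blocks while the "remote" device responds
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("read %d bytes in %v\n", len(data), time.Since(start))
}
```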
But for most programs, this is completely fine.
This suggests that it may be possible to
abstract away the distributed nature
of larger-scale systems.
Perhaps eventually we can use the same abstractions for distributed systems of any size,
from individual computers to globe-spanning networks.