Deploying a distributed system? A type system helps a lot

Deploying and running a distributed system involves a lot of complexity; a type system, even a basic one, can help with much of it.

Here are a few examples which demonstrate this in Python using the type system added by PEP 484.

Together, these techniques make it easier to run your system.

You can express dependencies between services using function arguments.

When service A makes use of services B and C, we need to make sure to start B and C before A.

This is straightforwardly represented with a function:

def start_a(b: B, c: C) -> A:
    ...
    start_process([
      "/bin/a",
      "--b-url", b.url,
      "--c-url", c.url,
      ...
    ])
    ...
    return A(...)

The start_a function takes two arguments, of types B and C, and returns a value of type A. Internally, the function starts up service A, passing command-line arguments using details from the function arguments.

Now if you want to start service A, you've got to create instances of B and C first - presumably by starting services B and C. That's exactly the invariant we want to enforce.

You can define such types for your own services, and include all the relevant information in them, such as the URLs used to interact with them.
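
A, B, and C are never defined above; as a minimal sketch, they could be plain dataclasses holding whatever dependent services need to know, such as a URL (the single url field here is illustrative):

from dataclasses import dataclass


@dataclass
class B:
    url: str


@dataclass
class C:
    url: str


@dataclass
class A:
    url: str

With definitions like these, a PEP 484 checker such as mypy rejects calls that get the dependencies wrong, for example start_a(c, b) or start_a(b).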

You can keep track of complex values using types.

Suppose service D can run in multiple different modes; say, it can listen on either an HTTP URL or an HTTP2 URL. And suppose some other service E can only work with service D when it's in HTTP2 mode.

We can represent this with types:

def start_e(d: D[HTTP2Url]) -> E:
    ...
      "--d-url", d.url, # an HTTP2Url
    ...
    return E(...)


def main(d: D[HTTPUrl]) -> E:
    # type error!
    return start_e(d)

D here takes a type argument specifying what the type of d.url is. We can do this with generics or templates in most modern languages.

In main, we call start_e with the wrong kind of D, and we'll get an error from the static type checker (mypy, for example) as a result.
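
The post doesn't show how D, HTTPUrl, or HTTP2Url are defined; one minimal sketch, assuming both kinds of URL are plain strings underneath, uses NewType and Generic from typing:

from typing import Generic, NewType, TypeVar

HTTPUrl = NewType("HTTPUrl", str)
HTTP2Url = NewType("HTTP2Url", str)

UrlT = TypeVar("UrlT")


class D(Generic[UrlT]):
    def __init__(self, url: UrlT) -> None:
        self.url = url

Because the type parameter is invariant, a D[HTTPUrl] is not accepted where a D[HTTP2Url] is required, which is exactly what makes the error in main detectable.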

You can create different environments by passing different arguments to functions.

Suppose service F has an optional dependency on service G; it can run whether or not service G is available.

We can represent this by passing different arguments to our start_f function in different environments:

def start_f(g: Optional[G]) -> F:
    ...
    if g:
        ... "--g-url", g.url ...
    else:
        pass
    return F(...)

def environment_one() -> None:
    g = ...
    f = start_f(g)
    ...


def environment_two() -> None:
    f = start_f(None)
    ...

We can use environment_one or environment_two, each where appropriate. Such techniques can also be used for configuration more generally.
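
The Optional annotation also means the checker enforces the None check inside start_f. Here's a sketch of the mistake it catches; start_f_unchecked is made up for this example, and G, F, start_process, and the /bin/f path echo the illustrative names from the sketches above:

def start_f_unchecked(g: Optional[G]) -> F:
    # a checker such as mypy rejects this line: g may be None here,
    # so accessing g.url is only allowed after a check like the one in start_f
    start_process(["/bin/f", "--g-url", g.url])
    return F(...)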

Other useful programming language features

There are a few other programming language features (besides a type system) that are helpful for deploying and running distributed systems.

You can test the deployment code like any other program.

If we use a normal program to deploy our distributed system, we can use standard testing libraries to write tests for that program.

For example, Python's standard library has a unittest module which works fine:

import unittest

import requests


class TestThings(unittest.TestCase):
    def setUp(self) -> None:
        ...
        self.h = start_h(...)

    def test_some_things(self) -> None:
        self.assertIn("OK", requests.get(self.h.url + "/status").text)

These tests can be stored and developed alongside the code to deploy the distributed system, just like normal programs.

You can explore interactively using the REPL or debugger.

If we can run our system easily, we'll often want to do quick iterative exploration and changes; or maybe we'll want to explore an already-running system.

REPLs and debuggers provide ready-made sophisticated interfaces to do exactly this kind of exploration.

>>> i = start_i(...)
>>> i
<I object at 0x7fb3a45a4290>
>>> j = start_j(i, ...)
>>> j
<J object at 0x7fb3a45a4490>
>>> j.url
'https://example.com'

You can use asynchronous code to monitor multiple processes in the background.

Every process needs to be monitored for failure. If a process exits, appropriate action needs to be taken, such as restarting that process or a larger collection of processes, or signaling a fatal error.

If your language has support for "asynchronous execution" or "coroutines", you can use that to monitor those processes.

An example with trio:

from subprocess import CalledProcessError
from typing import List

import trio


async def run_command_and_restart_on_error(args: List[str]) -> None:
    # run_command is assumed to be defined elsewhere, e.g. a thin wrapper
    # around trio.run_process that raises CalledProcessError on failure
    for _ in range(5):
        try:
            await run_command(args)
        except CalledProcessError:
            # sleep a second before restarting
            await trio.sleep(1)
            continue
        return
    raise Exception("we kept trying to restart, but we kept failing...")

async def start_k(nursery: trio.Nursery) -> K:
    nursery.start_soon(run_command_and_restart_on_error, ["k", ...])
    return K(...)

async def main() -> None:
    async with trio.open_nursery() as nursery:
        k = await start_k(nursery)
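
To actually start everything, the entry point is handed to trio's scheduler:

trio.run(main)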

Created: 2021-04-26 Mon 18:44
