Docker Considered Harmful

Docker is extremely popular these days. Too bad it's not very good.

A note in advance: This is absolutely not about Docker being too "opinionated" for me, or other tools being more flexible. My claim is that learning and using Docker is just plain more complicated than learning and using the tools I describe below. These tools also happen to be more flexible than Docker, but that's not why I'm recommending them: I'm recommending them because they are simpler to learn and use. If they are indeed more flexible in addition to being simpler, that's just a consequence of an overall superior design.

Docker containers are not mysterious

First, a brief explanation of how containers work. Linux containers1 are built on two kernel features, namespaces and cgroups. Their architecture is quite easy to understand.

I encourage everyone to read the main namespaces man page: man 7 namespaces. It's well written and makes it easy to grok the concept. If you create a new instance of all2 of these namespaces, you have something like a container.

The cgroups documentation (located at Documentation/cgroups-v1/ and Documentation/cgroup-v2.txt in your local copy of the Linux source code) is less straightforward, but still a better explanation than I could write. The basic idea is that cgroups are a mechanism for grouping processes. This mechanism is used to implement other systems like man 7 cpuset, which are used to track and schedule the container processes.

If you make the appropriate system calls to move into a cpuset cgroup and create new namespaces, you have a container. There's not much to it.

One could write a relatively short C program and create a Unix-style utility that uses these system calls to start up a new container. You can see this for yourself by playing around with man 1 nsenter and man 1 unshare, which are the most minimal possible wrappers around the namespaces syscalls.
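To make that concrete, here is a rough sketch of how far unshare alone gets you; the command options are real, but the filesystem tree path is just an example (a tree like it is built in the next section):

  # give a shell its own PID, mount, UTS, IPC and network namespaces,
  # rooted in a filesystem tree you have built somewhere
  sudo unshare --fork --pid --mount --uts --ipc --net \
      chroot /srv/trees/ubuntu /bin/bash
  # inside, mount a fresh /proc so process tools see only this namespace
  mount -t proc proc /proc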

The point of explaining this is to show that the Linux container functionality is all rather simple. Docker (or any other container software) is not doing anything especially mystifying in the specific area of bringing up a container. Armed with that knowledge, let's look at what else Docker is actually doing.

Docker for building containers is superfluous

We'll start off with how Docker builds a container image for you. You pull down some kind of image from the Docker hub, and Docker hums excitingly for a bit while you watch things scroll and progress bars fill. What you end up with is a filesystem tree from some Linux distro, with a few things added in on top.

It may be surprising to some that we have been doing exactly this for multiple decades now.

In fact, you do this every time you install GNU/Linux on a machine. The majority of the files in that filesystem tree come from packages from some distro. And package managers are certainly capable of installing packages into arbitrary directories; that's how they install a new system.

In fact, most even have neat little wrapper scripts to do it for you! And these are only an apt-get install debootstrap (or equivalent) away! To build filesystem trees for a few of the most popular distros3:

  • debootstrap focal /srv/trees/ubuntu
  • debootstrap stable /srv/trees/debian
  • dnf -y --releasever=33 --installroot=/var/lib/machines/f33 --disablerepo='*' --enablerepo=fedora --enablerepo=updates install systemd passwd dnf fedora-release vim-minimal glibc-minimal-langpack
  • pacstrap /srv/trees/arch

And of course you can select additional packages to install using these commands, or make other changes. This has been used for decades to build chroots, which I'll say more about a bit later. There are even more novel package managers, like Nix and Guix, which have interesting features that can make things even easier.

But wait, the distro version of node.js (for example) is too out of date! How am I going to get the latest version?

Well, the first thing you should do if you need more up to date versions is enable the updated package repositories for your distro: Ubuntu backports, Debian backports, Fedora EPEL. Package managers are not just for show; they really do make it easier to keep your system up to date, among their many other advantages.4
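On a Debian system, for instance, enabling backports is a couple of lines; the suite name and package below are only illustrative, and whether a particular package is actually available in backports varies by release:

  # enable the backports suite and install a newer package from it
  echo 'deb http://deb.debian.org/debian bullseye-backports main' \
      > /etc/apt/sources.list.d/backports.list
  apt-get update
  apt-get install -t bullseye-backports some-package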

If a suitably updated version is not available through distro channels, I am obligated to suggest that the next best option is to update the distro package, which may be quite easy, depending on your distro. Or, create a package yourself if one isn't available. This can be a bit of a pain if you are in the early development stages (though there are tools to make it easier5) but again, there are many advantages.4

Most people, however, use the traditional hacks. You can chroot in and just do the usual pip install foo or gem install bar or npm install baz or ./configure && make && make install, just as you would from some "RUN" directives in a Dockerfile.
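Concretely, that looks something like the following; the tree path and package names are placeholders, and this assumes you have already installed the relevant runtimes into the tree:

  # the "traditional hack": chroot into the tree and use the language's own tools
  sudo chroot /srv/trees/ubuntu /bin/sh -c 'pip install foo'
  sudo chroot /srv/trees/ubuntu /bin/sh -c 'npm install -g baz'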

So here, at least, there is no real advantage of Docker.

Crucially, you can use all the same install scripts that you would use with a normal Linux machine. You don't need to rewrite everything into Dockerfiles. You can do it manually, you can use shell scripts, you can use Ansible, you can write a boutique ConfigurationManagementFactory in Java, you can do whatever you like. It's just installing software. It's not complicated unless you make it complicated. Supposedly, Dockerfiles are simpler than running debootstrap at the beginning of your script, but I'm not sure I understand why. It seems to me that Docker is no simpler or easier than the standard way.

Now, it's true that Docker uses layering to be efficient in terms of disk space and time to build new containers. It defaults to using OverlayFS to do this.6 You could reimplement it easily yourself with a small shell script and some calls to mount, but there's no need.
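If you did want to roll it yourself, the whole trick amounts to roughly this (the directory names are made up for the example):

  # an overlay "layer" by hand: a read-only lower tree, a writable upper layer,
  # and a merged view that the container would use as its root
  mkdir -p /srv/overlay/upper /srv/overlay/work /srv/overlay/merged
  mount -t overlay overlay \
      -o lowerdir=/srv/trees/ubuntu,upperdir=/srv/overlay/upper,workdir=/srv/overlay/work \
      /srv/overlay/merged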

Instead, I just use man 8 btrfs-subvolume. btrfs is a copy on write filesystem which can instantly make space-efficient copies of filesystem trees in "subvolumes", which the user sees as just regular directories.

You can build a stock Ubuntu filesystem tree into a subvolume with btrfs subvolume create /srv/trees/ubuntu && debootstrap focal /srv/trees/ubuntu/. Then, when you want to build a new container with specific software, you just copy that subvolume and perform your modifications on the copy; that is, btrfs subvolume snapshot /srv/trees/ubuntu /srv/containers/webapp and work on /srv/containers/webapp. If you want to copy those modifications, you just take another snapshot.

This is actually better than OverlayFS, because there's no need to maintain a lot of state about the mount layerings and set them up again on reboot. Your container filesystem just sits there in a volume waiting for you to start it.

Of course, if you don't like btrfs for some reason, you're perfectly able to use zfs, OverlayFS, AUFS, or whatever; no need to have a "storage driver" implemented just to do some simple copy-on-write or layering operations.

And if you want to do some kind of change tracking as you build the system, you should keep it at the proper layer, or use dedicated tools. /usr should be immutable and built from packages, your application data should live in /srv or /var and be mounted in, and all the configuration data that is part of the system build should be in /etc. To track that, you can just use etckeeper and store your /etc in a git repository; since /usr is immutable and built from packages, /etc is the only part of the system build that actually needs tracking. If you must, OSTree lets you version whole filesystems.
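Getting started with etckeeper is about as simple as it gets (shown for a Debian-flavored tree; the commit message is just an example):

  # put /etc under version control with etckeeper
  apt-get install etckeeper
  etckeeper init
  etckeeper commit "initial system configuration"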

And if you still need to pull in a Docker image for some reason, you can treat it as just another way to build a filesystem tree. There are tools that will let you do that, such as machinectl pull-dkr.

Isolation for deployment is not new

But wait! Docker isn't just a pointless abstraction layer over the simple task of building filesystem trees! It lets you actually use those filesystem trees in containers!

That's not at all novel, though. As I said earlier, these tools have been used for decades to build chroots.

What's a chroot? Well, man 1 chroot is a decades-old tool that lets you change what the root directory / points to; for example, you could point / at /srv/container/webapp. Everything looks for libraries and binaries in subdirectories of the root directory, like /usr/lib and /usr/bin. So, by using chroot you can have an entirely different set of libraries and binaries; when you run things inside the chroot, they will see just the libraries and software that you installed inside that filesystem tree.

To help explain what you can use a chroot for, here's a short little blurb I "wrote" about what you can do with chroot.

Sysadmins use chroot to provide standardized environments for their development, QA, and production teams, reducing "works on my machine" finger-pointing. By "chrooting" the app platform and its dependencies, sysadmins abstract away differences in OS distributions and underlying infrastructure.

That sure sounds useful. But wait, there's this new kid on the block, Docker. Let's see what they have to say.

Sysadmins use Docker to provide standardized environments for their development, QA, and production teams, reducing "works on my machine" finger-pointing. By "Dockerizing" the app platform and its dependencies, sysadmins abstract away differences in OS distributions and underlying infrastructure.

Docker is not novel in giving you these capabilities. It is quite novel in marketing them so intensely, though.

Docker for security is useless by default

But wait! Docker is "containers", new, fancy, exciting. A chroot is old and boring. Surely containers are better than chroots!

Well, chroot being old and boring does have advantages, like "it is not going to randomly break on me". But sure, it's true that containers have significant advantages of their own.

One example: chroots can't be relied upon for security; it's easy to break out of them if you run as root inside the chroot. Containers are especially, uniquely secure, right?

Wrong! For most purposes, the main interesting thing that Docker containers provide is isolated networking. That is, Docker containers prevent the application inside the container from binding ports on the external network interfaces. What else prevents applications from using ports? The firewall that you already have installed on your server. Again, pointless abstraction to address already-solved problems.
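To be concrete, the equivalent of that isolation is a single firewall rule; the port number and tool here are arbitrary, and nftables or your distribution's firewall frontend would do just as well:

  # drop traffic to the application's port on everything except loopback
  iptables -A INPUT -p tcp --dport 8080 ! -i lo -j DROP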

In fact, if you follow the insane default practice of running your applications as root in the container, your system may be substantially less secure than a properly implemented chroot. Breakout from an unprivileged chroot depends on a well-known and well-studied area of exploits: Linux privilege escalation. Linux namespace containers present an entirely new security surface; it's quite possible that they have inherent vulnerabilities that are impossible for the kernel to correct without breaking uncontained functionality. Indeed, Docker's own developers enthusiastically admit that Docker cannot (yet) securely run code as root. For decades people have been running their applications as unprivileged users inside chroots to mitigate this threat. By default, Docker throws this away.

Application containers are ridiculous

But still, containers are cool, right? It's only with the development of namespaces and cgroups that Docker could finally get "application containers" right. The isolation features that Docker brings are an essential increase in power over chroot; finally we can deploy "application containers" in production. We can finally be host-independent with our applications, by shipping entire filesystems around! Right?

For those who don't know the terminology, Docker calls their approach to containers "application containers". The basic idea is that you have all these namespaces and cgroups, and you create a container, and then you run a single piece of software inside the container. That's cool, I guess. The alternative approach is to run an init system inside your container, which will bring up a full "traditional" operating system. Containers provide enough isolation to do this, and so you could treat them as very-light-weight VMs. Docker has planted itself in opposition to this practice, because…

Well, I'm actually not sure what the Docker devs were thinking here. Is it some misguided ideal of making the containers more "lightweight" by not treating them as VMs and running an init system? Did it just occur to them that they could run a single service inside a container rather than a full system, and they never bothered to question whether that might not be a good idea?

The practical problems with "application containers" are well known. Zombie orphan processes7 fill up your container and consume resources with no init to reap them; the traditional cron and syslog daemons are not automatically available; etc., etc. These are problems, but they could certainly be overcome if we wrote enough new software dedicated to making application containers work well.

The more fundamental problem is that "application container" doesn't mean anything. We've already disentangled the filesystem isolation aspect; we know we can do that without Docker and without containers. So what is an "application container"?

It's just another system service! Just another daemon! So if you want to isolate a service, just do that! There's no need to confuse the terminology by calling it a "container".

Just use the Linux namespacing features to get isolation for your application, like everyone else. We've been securing and isolating applications for decades with chroot and su; namespaces and cgroups are just another tool in this toolbox. I'll cite systemd here as leading the way in using these technologies for system services, but sysvinit and other init systems can use namespaces and cgroups for isolation just as easily.
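As a sketch of what this looks like in practice with systemd (the unit name, user, paths, and choice of directives are purely illustrative; there are many more isolation properties available):

  # run a daemon as an unprivileged user, chrooted into its own tree, with a
  # private /tmp and its own network namespace, as a transient systemd unit
  systemd-run --unit=webapp \
      -p User=webapp \
      -p RootDirectory=/srv/containers/webapp \
      -p PrivateTmp=yes \
      -p PrivateNetwork=yes \
      /usr/bin/webapp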

In this light, it's clear that there is nothing especially novel about the idea of an application container. And certainly nothing that warrants the whole new approach of Docker, which throws away so much of the existing GNU/Linux stack!

Alternatives to Docker

I think I've already covered the alternatives to the various parts of Docker in some depth. There is a little bit left to say. I mentioned in the first section that a simple, Unix-style utility could provide the containerization features, on much the same model as chroot. My feeling is that man 1 systemd-nspawn is this utility. Its manpage even explicitly compares it to chroot:

systemd-nspawn may be used to run a command or OS in a light-weight namespace container. In many ways it is similar to chroot(1), but more powerful since it fully virtualizes the file system hierarchy, as well as the process tree, the various IPC subsystems and the host and domain name.

And it's already present on every systemd system, so it's easy to start using. Check out the examples in the man page. Combining it with other parts of the GNU/Linux ecosystem, like debootstrap and btrfs, you can have something with all the power of Docker, or more8, without the complexity overhead. Ultimately, Docker is just too complex for the simple functionality it provides; there's just no need for it.
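Putting the pieces from this article together, the whole "build an image and run a container" workflow looks something like this (paths reused from the earlier examples; --boot assumes an init system is installed in the tree):

  # build a base tree in a btrfs subvolume, snapshot it per container, and boot it
  btrfs subvolume create /srv/trees/ubuntu
  debootstrap focal /srv/trees/ubuntu
  btrfs subvolume snapshot /srv/trees/ubuntu /srv/containers/webapp
  systemd-nspawn -D /srv/containers/webapp --boot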

Footnotes:

1

The use of containers (or more generally, "operating-system level virtualization") is not especially new, of course. For many years now Solaris has had zones, FreeBSD has had jails, and other operating systems have had other such technologies. These are polished and working solutions for their respective operating systems. (at least, I assume so, judging from how their partisans brag about them) Indeed, even with Linux there was Linux-VServer and OpenVZ.

The key difference with Linux containers (or "namespaces-based containers") is that the underlying functionality is actually included in the upstream Linux kernel. Linux-VServer and OpenVZ were "out-of-tree" patchsets, which were maintained separately from the main kernel project, and applied as patches to a vanilla kernel to add their respective features. This tremendously increases maintenance load and decreases the cleanliness of the code, and indeed both of these projects are now unusably out of date. Namespaces and cgroups, on the other hand, are present in the main Linux source tree, and the kernel development policy means that they will be kept up to date with any future changes in the Linux codebase. Thus it seems reasonably likely that all further attempts to bring containerization to Linux will use these technologies as their foundation.

2

User namespaces are useful for securing containers, but are arguably still under development; Docker doesn't implement them, nor do many other container tools. I believe LXC is the only mainstream container tool that does. I've heard it said that user namespaces are a bit strange and unlike other namespaces; they can be used without privileges, for example, and they let you, kind of, "fake" having capabilities. Who knows what new security vulnerabilities this introduces?

3

These examples pulled from here.

4

Package managers save you a lot of work when you need to do upgrades, or widely deploy the software, or install more than one custom library. Here, look at this page.

5

Checkinstall and fpm are tools for quickly building packages, and are suitable for novices who don't care about package management. Of course, at some point, one really should learn how to directly build packages of one's preferred format (rpm or deb).

6

Docker also supports btrfs and the Linux device mapper for implementing layering.

7

Processes on Unix-like operating systems are organized into a hierarchy; a normal process will have one parent and zero or more children. When any process terminates, it is dependent upon its parent process to wait(2) on it; until this happens, the terminated process is known as a "zombie". On orphan processes, from Wikipedia:

An orphan process is a computer process whose parent process has finished or terminated, though it remains running itself.

In a Unix-like operating system any orphaned process will be immediately adopted by the special init system process. This operation is called re-parenting and occurs automatically. Even though technically the process has the "init" process as its parent, it is still called an orphan process since the process that originally created it no longer exists.

Thus if pid 1 does not wait(2) on a terminated ("zombie") orphan process, it will stick around forever. Creating orphan processes that will be cleaned up by init is quite a common Unix programming idiom, so this is a rather significant problem. See the Phusion baseimage for another explanation of the problem, and some software that has been written to work around this problem with Docker.

8

Checkpoint and Restore In Userspace (CRIU) allows the "freezing" and resuming of Linux processes, with all kinds of interesting applications. Docker does not yet support CRIU. Other software like LXC does fully support live migration through the use of CRIU.

Created: 2021-03-22 Mon 21:52
