Miscellaneous Linux ideas

Table of Contents

All of this should be usable without privileges. Most of this will require getting rid of the setuid bit as a prerequisite.

1. filesystems as dirfds

1.1. replace chroot with explicitly passing in a root dirfd

You would need to disallow ".." for this to work.

1.2. replace cwd with explicitly passing in a cwd dirfd

1.3. DONE replace mount with two syscalls: one creating a dirfd, one mounting it

This way we can start a filesystem and access it, and keep it private without creating a mount namespace

(this is essentially possible with the new mount API)

1.4. or, replace kernel mount with per-process userspace mount tables

Instead of mounting in the kernel, just let a process multiplex between a set of dirfds that it has access to.

This allows doing very fancy policy, without paying the cost of that policy in the kernel.

2. FD passing of dirfds as an RPC mechanism

Use Plan 9's approach to designing RPC interfaces as filesystems, but don't have the 9P protocol and mounted filesystems behind it; instead just have raw dirfds and pass those around.

That is, rather than mounting the RPC interface into your global filesystem namespace as with Plan 9 or FUSE, you can just contact some server (perhaps over a Unix socket) and get a dirfd passed back to you. That dirfd doesn't need to exist in any global filesystem namespace.

2.1. Prerequisite: Lightweight userspace implementation of dirfds

We need some way for a userspace program to implement a dirfd.

Both FUSE and 9P have the problem that you can't return an arbitrary FD from, say, an openat call. So you couldn't return a socket, or a dirfd for another server. If you wanted to return such things, you'd need to stick around as a proxy in the middle, which reduces your opportunities for abstraction. In other words, you want to be able to do introduction rather than proxying.

So a new way to implement a dirfd is needed. It would be nice if it didn't even require mounting a filesystemm.

One way is described in the next section: file descriptors backed by uploaded in-kernel eBPF programs.

3. File descriptors backed by uploaded in-kernel eBPF programs

So you can do IPC/privilege isolation, without the cost of switching address spaces.

They should be able to own other file descriptors as well and arbitrate access.

You could use this to implement lots of other file descriptor behaviors which would otherwise need kernel modules.

That would increase the power of programs which take file descriptors as arguments, because they would be much more customizable: You could give them file descriptors which do more sophisticated things.

This is just a specialization of the universal concept of "servers/objects which you access through a capability/interface and which are privilege-isolated/abstract". The capability we have in Unix is file descriptors, so we should be able to create new kinds of objects backing those file descriptors easily, so that we can work with more generic designs which are based around passing around opaque file descriptors and operating on/with them.

You can do this today with FUSE/CUSE to some degree but only for a very limited subset of syscalls. (for example, you couldn't emulate waitid(P_PIDFD))

3.1. Application: Revocability of arbitrary file descriptors through proxying

If a file descriptor is proxied by an eBPF program, that program can hold some extra FD which it monitors for shutdown commands; this gives us revocability.

This would also work if userspace processes were able to emulate arbitrary file descriptors, but an extra roundtrip to userspace for every file access is pretty heavyweight.

3.2. Application: Remote file descriptors

One could have a type of fd which communicates with a remote server, transparently proxying each syscall to an fd on a remote server. Then a local program could operate on remote resources just by being handed an fd for a remote resource.

3.3. Application: Implementing everything else on this page that mentions file descriptors

4. remove setuid dependence in userspace

the filesystem setuid bit is a big limiting factor, we should get userspace ready for a world without the setuid bit.

4.1. IPC-based sudo replacement

ssh can work for this purpose. There's also s6-sudo and systemd-run, but I think ssh is the most plausible way to do this, since it's very popular and powerful already.

4.2. alternatives for mutating inherited environment in privileged ways

How do we inherit our current environment, while mutating it in some way that requires privs? This currently uses setuid

example: unshare

4.2.1. expose the environment for external mutation by privileged programs

Traditionally on Unix you mutate your current environment, relying on fork() to allow you to isolate changes.

If you could mutate environments from the outside, that would "work".

Could be done with CRIU

4.2.2. setuid without persistence

Maybe we could declare that setuid executables can't exist in the filesystem, but still allow them to be created and passed via IPC.

Then you can create these capabilities on the fly.

though they won't be revocable

4.3. set NO_NEW_PRIVS prctl on important components

5. file-descriptor API for processes

I should be able to start a process and get an FD back, signal that process through the FD without races, and when I close that FD the process exits.

see clone_fd, pdfork

5.1. DONE decent, unprivileged process groups

We have cgroups but 1. the API sucks, 2. they're privileged

I still hold out hope for something integrated with the traditional process tree, or at least something informed by a FD-api for processes

(I did this with supervise)

5.2. process supervisor which can function as bash, gdb, systemd, tmux, etc.

Having separate process supervisors hurts flexibility; I want to be able to start debugging a program started in my shell, or access stdin/stdout of a program started by my init system, or so on. Attaching to pids is not a solution; a sane model for ownership of processes is needed.

Perhaps they should be a single program.

5.3. DONE or, composable process supervision

Better yet, perhaps an FD api would allow sharing ownership; or even borrowing, and revocability.

Or possibly it should be possible to move/share process ownership between the programs; an FD API could allow that.

(This is in the kernel now: pidfd)

6. enhanced interactive shell

shell/command line interface is still best. it works remotely and can be detach'd from. it has trivial scriptability (just press up to do your last action again, possibly modified! and that's just the start.),

however it needs some cleanup

6.1. reimplement coreutils to take file descriptors rather than filenames


6.2. integrate with copy-on-write

Python-notebook-style revision of commands and idempotency

6.3. clean up weird language

traditional sh/bash is just not a good language

this might be difficult to do at the same time as adding new features; maybe backwards compatibility is useful, and tolerable if new features are added in a clean way?

7. Improved terminal/GUI

shell is still best interface, no buttons are needed; but it could still have a better display engine: display of completions, inline docs, incremental results, etc., will all enhance the shell

also display of graphics would not hurt

7.1. notebook style shell/terminal

Python (and other languages) notebooks are very powerful interactive workspaces

seems like a natural design to steal for improved shells

7.2. give up and use the browser as the display engine

It will make a lot of the work easy since it's already done, and it's already got built-in remote support.

but it is controversial.

7.3. determine how to sanely output graphics from Unixy tools

Possibly support a new convention for a passed-in FD, "stddisplay" or "stdgui", on which a tool can send and receive GUI information?

Of course this is just the same as DISPLAY in X or WAYLAND_DISPLAY in Wayland.

7.4. pass in dirfd to extensibly provide new interfaces to command line tools

8. better Linux-native scripting language

I want something that makes advanced Linux primitives natively available. It should have the syntactic sugar of the shell, but updated for what is available in modern Linux, not just Unix.

I want to e.g. easily poll/epoll on a set of FDs, or pass FDs/credentials over Unix domain sockets or set up nested epoll structures, or pipe together programs with socketpairs, etc.

It should be GC'd and type-safe.

8.1. DONE expressive library for using Linux features in existing scripting language

Python? C++? OCaml?

I want something with the fluidity of Go, which achieves it through using Linux kernel APIs and syscalls to the maximum degree possible.

(This is now rsyscall)

9. DONE forkat(), file-descriptor API for distributed computing

Doesn't make sense that I can fork() and start competing to consume more of a global resource. Instead I should have to forkat() some passed-in reference to a scheduler timeslice, ala seL4. And that scheduler might be remote, or it might be a timeslice provided by my parent from subdividing their own.

(This was one of the ideas behind rsyscall; a file descriptor which represents control over a process, which you can of course use to fork in the same place as that process)

9.1. generalized ssh which returns an fd for use with forkat

Likewise with containers, I guess.

9.2. shell that understands running on remote systems through this mechanism

(Emacs kind of already does this with TRAMP. And someday I'll write an rsyscall library for Emacs…)

10. remove the implicit authority of the UID

Any process with my UID can screw with any of my other processes, that's not good.

10.1. dynamically allocated users?

A userspace daemon accessible over IPC, which responds to requests by returning a setuid executable owned by an unused UID which does nothing but execvp its arguments.

The holder of one of these executables can then switch into that UID. The executable is basically the capability to switch in the UID, reified.

This means the setuid bit will stay around. Maybe restrict it to root?

How to garbage collect these users?

10.2. or, subusers?

The old standby that everyone seems to want. Some new kernel capability which lets an arbitrary user subusers which they can su to.

It's easier with users being defined by strings rather than UID numbers; you can just allow a user to append "/anystringlikethis" to their username

Kind of like Kerberos. Is it possible for principals to create subprincipals? I don't think so, but maybe it could be added

10.3. or, just a simple prctl to relinquish your UID-based authority?

Users are a stupid concept anyway, we want an object-capability system. So just a prctl which lets you relinquish the authority of your UID?

10.4. filesystem usage without UIDs

How can you use the fileystem without a UID?

Maybe just pass 777 dirfds around?

And pass subdirs around if you need to give access out?

Reconstructing an ACL system in the absence of UIDs would be helpful for user expectations. Maybe if UIDs were reified into a capability you could use to access the filesystem? that's kind of like the setuid idea

11. get Capsicum into Linux

Capsicum is great

12. distributed federated shared community Unix cluster

A distributed compute system which can be freely joined. Like the community Unix clusters of old, but over the net at large.

Same foundations and technologies as large academic or corporate systems, but making it as easy as possible for boxes to federate in and pool resources (while restricting which users they permit access to).

12.1. cert chains/web of trust for authentication

Kerberos won't work, it requires trusting each box. We want a box to be able to join without everyone having to trust that box.

Trusted community/university computing organizations can have certs, which can sign certs of new users to give them access to the cluster

12.2. allocate box-local UID when a user appears for the first time on a box

LDAP sucks, and it's centralized anyway, and hopefully we'll get rid of UIDs eventually anyway

12.3. automated onboarding/administration

It should be possible to bridge a machine into the network automatically, without requiring any human intervention.

Continuing administration should also be as small as possible.

12.4. denylist/allowlist/decentralized authorization/accounting

Each box can do authorization and accounting in its own way. Not sure about this though

12.5. all users unprivileged

It only makes sense if this is the case. This excludes the admins.

There's lots of stuff possible unprivileged, you can even run KVM-accelerated VMs unprivileged.

Created: 2021-12-01 Wed 16:21