Summary of the FlexSC paper

This is a summary of the 2010 paper "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls", which is one of my favorite papers, interspersed with some of my own commentary. (This is adapted from a presentation I gave at work.)

1. Problem

  • A kernel provides safe APIs to operations which require control of hardware.
  • You talk to the kernel with system calls.
  • System calls are function calls that "mode switch" on call/return, so the kernel runs with control of hardware, and you don't.

  • Mode switching is slow.
  • This is the traditional understanding of "why system calls are slow": the direct cost of making the call, compared to an ordinary function call.
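As a rough illustration of that direct cost, you can time a trivial system call against an equally trivial userspace libc function reached through the same `ctypes` call path, so the difference is roughly the mode switch. (A sketch, not a rigorous benchmark; absolute numbers vary by machine and Python's call overhead inflates both sides.)

```python
import ctypes
import time

# On Linux/macOS, CDLL(None) exposes the symbols already loaded into the
# process, which includes libc.
libc = ctypes.CDLL(None)

N = 100_000

# getpid() performs an actual system call (glibc stopped caching the pid
# in version 2.25), so each call mode-switches into the kernel.
t0 = time.perf_counter()
for _ in range(N):
    libc.getpid()
syscall_time = time.perf_counter() - t0

# labs() is a plain userspace function reached through the same ctypes
# machinery, so the gap between the two loops approximates the mode switch.
t0 = time.perf_counter()
for _ in range(N):
    libc.labs(-1)
call_time = time.perf_counter() - t0

print(f"syscall: {syscall_time / N * 1e9:.0f} ns/call, "
      f"plain function: {call_time / N * 1e9:.0f} ns/call")
```

Note that this only measures the *direct* cost the paper is contrasting with; the locality cost discussed next doesn't show up in a tight loop like this.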

  • This paper's observation: Mode switching is not the only cost!
  • The other cost is worse locality.
  • Locality: A program doing the same thing over and over. (e.g. accessing memory, executing the same code, taking the same branch…)
  • The existence of locality is why caches speed up execution!

  • System calls execute in the kernel, which pulls kernel-specific code and data into the caches; those cache misses are slow.
  • When the system call returns, your program pulls its own code and data back into the caches, which is slow again.
  • Changing what you're doing reduces locality, which makes things slow (because you have to refill caches and other processor state).

  • The paper finds that after a system call, the processor's instructions-per-cycle (IPC) is reduced by up to 50%.
  • This isn't the cost of slow mode switching; it's the cost of bad locality!

  • Today's Spectre/Meltdown mitigations make this worse: part of the mitigation is to flush a bunch of caches when switching into the kernel!

  • This locality cost isn't specific to system calls.
  • Many function calls are into libraries that have lots of internal state (OpenOnload, for example).
  • Making any such function call will have these locality costs, and therefore slow you down.

2. Solution

  • The problem is bad locality.
  • The solution is to increase locality.
  • Don't switch your core between kernel and your program: dedicate a core to your program, another core to the kernel, and send system calls from one to the other!
  • Each core will then have much better locality.

  • The program core isn't executing kernel code, so there's no impact on its caches.
  • The kernel core isn't executing program code, so there's no impact on its caches.
  • Both execute faster!

  • They send system calls between cores using shared memory.
  • It's similar to how pipelined, multi-threaded software passes work between stages through shared memory.
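The shared-memory channel can be sketched as an array of call entries that the program fills in and a dedicated worker polls and executes, in the spirit of FlexSC's syscall pages. (A toy model: a Python thread stands in for the syscall core, and the entry layout is my own illustration, not the paper's exact struct.)

```python
import os
import threading

# Each entry mimics a slot on a syscall page: which call to run, its
# arguments, a status word, and space for the result.
FREE, SUBMITTED, DONE = 0, 1, 2

class Entry:
    def __init__(self):
        self.status = FREE
        self.func = None   # the "system call" (here: a Python os.* wrapper)
        self.args = ()
        self.result = None

entries = [Entry() for _ in range(8)]
shutdown = threading.Event()

def syscall_core():
    """Dedicated 'kernel core': poll the entries and execute submitted calls."""
    while not shutdown.is_set():
        for e in entries:
            if e.status == SUBMITTED:
                e.result = e.func(*e.args)
                e.status = DONE

def submit(func, *args):
    """Application side: claim a free entry and submit a call."""
    for e in entries:
        if e.status == FREE:
            e.func, e.args = func, args
            e.status = SUBMITTED
            return e
    raise RuntimeError("no free entries")

worker = threading.Thread(target=syscall_core, daemon=True)
worker.start()

e = submit(os.getpid)
while e.status != DONE:   # the program could run other work here instead
    pass
print("getpid via syscall core:", e.result)
shutdown.set()
```

In the real system the two sides run on different cores, so each core's caches stay warm with only its own code and data; the busy-wait above is exactly where FlexSC's thread library switches to another user thread instead of spinning.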

  • They built a "green thread" / "N:1 threading" / "userspace threads" thread library on top of this.
  • When a thread makes a system call, the system call is sent to the kernel core, and other threads execute until the original thread's result comes back.
  • Their library, FlexSC-Threads, is a drop-in replacement for the standard Linux pthreads library.
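The scheduling idea can be sketched with generators: each "thread" yields when it wants a system call, and a scheduler runs the other threads until a result is available. (My own toy scheduler, which executes the calls inline and synchronously for simplicity; it is not FlexSC-Threads' actual design.)

```python
import os
from collections import deque

def scheduler(threads):
    """Round-robin over green threads; each yield is a syscall request."""
    ready = deque((t, None) for t in threads)
    while ready:
        thread, result = ready.popleft()
        try:
            # Resume the thread, delivering its previous call's result;
            # it runs until it requests its next "system call".
            func, args = thread.send(result)
        except StopIteration:
            continue  # this thread finished
        # In FlexSC this request would be posted to the syscall core and
        # the thread parked until its entry reads DONE; here we run it
        # inline and requeue the thread with the result.
        ready.append((thread, func(*args)))

results = []

def worker(name):
    pid = yield (os.getpid, ())   # "make a system call", yielding the CPU
    results.append((name, pid))

scheduler([worker("a"), worker("b")])
print(results)
```

The key property this models: while one thread's system call is outstanding, the core keeps executing other user threads, so the program core never idles waiting on the kernel.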

3. Result

  • Incredible speedups!

We show how FlexSC improves performance of Apache by up to 116%, MySQL by up to 40%, and BIND by up to 105% while requiring no modifications to the applications.

  • These are from the locality benefits!
  • Basically no cost!

4. Today's implementations and related work

Here are some interesting related works.

4.1. Promise pipelining

  • FlexSC is a generalization of "system call batching": executing multiple operations at once with a single actual system call.
  • A further generalization of FlexSC is promise pipelining.
  • Promise pipelining lets you submit multiple operations at once, where later operations can depend on the results of earlier ones.
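A minimal sketch of the idea: operations are recorded as promises, later operations may take earlier promises as arguments, and the whole dependency chain executes in one batch, i.e. one "round trip". (My own toy model, not any particular RPC system's API.)

```python
class Promise:
    """A placeholder for the result of an operation in a pending batch."""
    def __init__(self, op, args):
        self.op, self.args, self.result = op, args, None

class Batch:
    def __init__(self):
        self.promises = []

    def call(self, op, *args):
        """Record an operation; arguments may be Promises from this batch."""
        p = Promise(op, args)
        self.promises.append(p)
        return p

    def run(self):
        # One "round trip": execute in submission order, resolving any
        # Promise arguments to the results of earlier operations.
        for p in self.promises:
            args = [a.result if isinstance(a, Promise) else a for a in p.args]
            p.result = p.op(*args)

b = Batch()
x = b.call(lambda: 2)             # first operation
y = b.call(lambda v: v + 3, x)    # depends on x's result, same batch
b.run()
print(y.result)                   # → 5
```

With plain batching, `y` would need a second batch because it depends on `x`; promise pipelining lets the dependency ride along in the same submission.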

4.2. Shared-memory data structures

4.3. rsyscall!

  • rsyscall programs run in a Python interpreter thread, and send system calls to dedicated syscall-running processes.
  • Just a happy coincidence; it's written for a completely different purpose.
  • Should be nice and high-performance…

4.4. io_uring

  • Roughly, io_uring is two things: FlexSC-style asynchronous system calls (IORING_SETUP_SQPOLL), and a different in-kernel implementation of filesystem IO.
  • They could be separated, perhaps…

Created: 2022-02-03 Thu 18:17