Summary of the FlexSC paper
This is a summary of the 2010 paper "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls", one of my favorite papers. It's interspersed with some of my own commentary, and adapted from a presentation I gave at work.
1. Problem
- A kernel provides safe APIs for operations that require control of the hardware.
- You talk to the kernel with system calls.
- System calls are function calls that "mode switch" on call/return, so the kernel runs with control of hardware, and you don't.
- Mode switching is slow.
- This is the traditional understanding of "why system calls are slow": the direct cost of making the call is high compared to a plain function call. (A rough microbenchmark sketch of this direct cost is at the end of this section.)
- This paper's observation: Mode switching is not the only cost!
- The other cost is worse locality.
- Locality: a program doing the same thing over and over (e.g. accessing the same memory, executing the same code, taking the same branches…)
- The existence of locality is why caches speed up execution!
- System calls execute kernel code, which pulls kernel-specific code and data into the caches; those cache misses are slow.
- When the system call returns, your program has to pull its own code and data back into the caches, which is slow again.
- Changing what you're doing reduces locality, which makes things slow (because caches and other processor state have to be refilled).
- The paper finds that after a system call, the processor's instructions-per-cycle (IPC) is reduced by up to 50%.
- This isn't the cost of slow mode switching, it's the cost of bad locality!
- Today's Spectre/Meltdown mitigations make this worse: part of the mitigation is to flush a bunch of caches when switching into the kernel!
- This locality cost isn't specific to system calls.
- Many function calls are into libraries that have lots of internal state (OpenOnload, for example)
- Making any such function call will have these locality costs, and therefore slow you down.
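To make the "direct cost" above concrete, here is a rough microbenchmark sketch (mine, not the paper's): it times a trivial system call (getppid) against a trivial function call. The numbers depend heavily on the CPU and on which Spectre/Meltdown mitigations are enabled; running it under perf stat -e instructions,cycles also gives a crude look at the IPC side.

```c
/* Rough sketch: compare the per-call cost of a trivial system call
 * (getppid, which does almost no work in the kernel) with a trivial
 * function call. Numbers vary a lot by CPU and kernel mitigations. */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ITERS 1000000

/* Prevent the compiler from inlining away the "function call" baseline. */
__attribute__((noinline)) static long nop_function(long x) { return x + 1; }

static double elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    struct timespec t0, t1;
    volatile long sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        sink += nop_function(i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("function call: %.1f ns/call\n", elapsed_ns(t0, t1) / ITERS);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        sink += syscall(SYS_getppid);   /* raw syscall, no libc wrapper tricks */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("system call:   %.1f ns/call\n", elapsed_ns(t0, t1) / ITERS);

    return (int)(sink & 1);
}
```

Note that this only captures the direct mode-switch cost; the paper's point is that the indirect cost, i.e. the user code running slower afterwards because its cached state was evicted, can be just as large or larger, and a loop this tiny barely shows it.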
2. Solution
- The problem is bad locality.
- The solution is to increase locality.
- Don't switch your core between kernel and your program: dedicate a core to your program, another core to the kernel, and send system calls from one to the other!
- Each core will then have much better locality.
- The program core isn't executing kernel code, so there's no impact on its caches.
- The kernel core isn't executing program code, so there's no impact on its caches.
- Both execute faster!
- They send system calls between the cores using shared memory. (A toy userspace sketch of the shape of this follows this list.)
- It's similar to how shared-memory, pipelined multi-threaded software passes work between stages.
- They built a "green thread" / "userspace threads" library (the paper calls it FlexSC-Threads) on top of this, multiplexing many user threads onto one kernel-visible thread per core.
- When a thread makes a system call, the system call is sent to the kernel core, and other threads execute until the original thread's result comes back.
- Their library is a drop-in replacement for standard Linux pthreads.
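FlexSC's real mechanism is in-kernel "syscall pages" that kernel-side syscall threads poll; the sketch below is only a userspace approximation of that shape, with a pthread standing in for the syscall core and a single request slot standing in for a syscall page. All names here are made up for the sketch.

```c
/* Userspace-only sketch of the exception-less idea: a worker posts a
 * system-call request into shared memory and a dedicated "syscall thread"
 * (standing in for FlexSC's syscall core) executes it and posts the result.
 * FlexSC itself does this with in-kernel syscall pages; this is just the shape. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

enum { FREE, SUBMITTED, DONE };

struct syscall_entry {                 /* one "syscall page" entry */
    atomic_int state;
    long number;                       /* syscall number */
    long args[6];
    long result;
};

static struct syscall_entry slot = { .state = FREE };
static atomic_int shutting_down = 0;

/* Dedicated syscall thread: poll the shared entry and execute requests. */
static void *syscall_thread(void *arg) {
    (void)arg;
    while (!atomic_load(&shutting_down)) {
        if (atomic_load(&slot.state) == SUBMITTED) {
            slot.result = syscall(slot.number, slot.args[0], slot.args[1],
                                  slot.args[2], slot.args[3], slot.args[4],
                                  slot.args[5]);
            atomic_store(&slot.state, DONE);
        }
    }
    return NULL;
}

/* Post a request and wait for the result. */
static long remote_syscall(long number, long a0) {
    slot.number = number;
    slot.args[0] = a0;
    atomic_store(&slot.state, SUBMITTED);
    while (atomic_load(&slot.state) != DONE)
        ;                              /* FlexSC-Threads would switch threads here */
    long r = slot.result;
    atomic_store(&slot.state, FREE);
    return r;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, syscall_thread, NULL);
    printf("pid via remote syscall: %ld\n", remote_syscall(SYS_getpid, 0));
    atomic_store(&shutting_down, 1);
    pthread_join(t, NULL);
    return 0;
}
```

The interesting design point is the spin-wait: FlexSC-Threads doesn't spin there, it switches to another user thread, which is what keeps both cores busy and their caches warm.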
3. Result
- Incredible speedups!
We show how FlexSC improves performance of Apache by up to 116%, MySQL by up to 40%, and BIND by up to 105% while requiring no modifications to the applications.
- These are from the locality benefits!
- Basically no cost!
4. Today's implementations and related work
Here are some interesting related works.
4.1. Promise pipelining
- FlexSC is a generalization of "system call batching": executing multiple operations at once with a single actual system call.
- A further generalization of FlexSC is promise pipelining.
- Promise pipelining lets you submit multiple operations at once, where later operations can depend on the results of earlier ones.
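As a sketch of that shape (hypothetical types, not the API of any real promise-pipelining system): a batch of operations where a later operation's argument can name "the result of operation i" instead of a concrete value, so a whole dependent chain can be shipped and executed in one go.

```c
/* Sketch of the promise-pipelining shape: a batch of operations where a
 * later operation's argument can be "the result of operation i" rather
 * than a concrete value, so the whole batch can be submitted at once and
 * dependencies resolved on the executing side. Hypothetical types. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

enum arg_kind { ARG_VALUE, ARG_RESULT_OF };
enum op_kind  { OP_OPEN, OP_READ, OP_CLOSE };

struct arg { enum arg_kind kind; long value; int result_index; };
struct op  { enum op_kind kind; const char *path; struct arg fd; void *buf; size_t len; };

static long resolve(struct arg a, const long *results) {
    return a.kind == ARG_VALUE ? a.value : results[a.result_index];
}

/* The executing side: run the batch, letting later ops see earlier results. */
static void run_batch(struct op *ops, int n, long *results) {
    for (int i = 0; i < n; i++) {
        switch (ops[i].kind) {
        case OP_OPEN:  results[i] = open(ops[i].path, O_RDONLY); break;
        case OP_READ:  results[i] = read(resolve(ops[i].fd, results), ops[i].buf, ops[i].len); break;
        case OP_CLOSE: results[i] = close(resolve(ops[i].fd, results)); break;
        }
    }
}

int main(void) {
    char buf[64];
    /* Open, read from the not-yet-known fd, then close it: one submission. */
    struct op batch[] = {
        { .kind = OP_OPEN,  .path = "/etc/hostname" },
        { .kind = OP_READ,  .fd = { ARG_RESULT_OF, 0, 0 }, .buf = buf, .len = sizeof buf },
        { .kind = OP_CLOSE, .fd = { ARG_RESULT_OF, 0, 0 } },
    };
    long results[3];
    run_batch(batch, 3, results);
    printf("read %ld bytes\n", results[1]);
    return 0;
}
```

In a real promise-pipelining system the batch crosses a process or network boundary and the references are promises for results, but the dependency-resolution idea is the same.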
4.2. Shared memory datastructures
- "ffwd: delegation is (much) faster than you think" applies the same approach to shared memory datastructures
4.3. rsyscall!
- rsyscall programs run in a Python interpreter thread, and send system calls to dedicated syscall-running processes.
- Just a happy coincidence; it's written for a completely different purpose.
- Should be nice and high-performance…
4.4. io_uring
- Roughly, io_uring is two things: FlexSC-style asynchronous system calls (IORING_SETUP_SQPOLL), and a different in-kernel implementation of filesystem IO.
- They could be separated, perhaps…
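To make the FlexSC-flavored half concrete, here is a minimal sketch using liburing (mine, not from the paper): with IORING_SETUP_SQPOLL the kernel starts a polling thread that watches the shared submission queue, so in the steady state submitting work doesn't require a system call at all, much like FlexSC's syscall threads. This assumes liburing is installed and a reasonably recent kernel (IORING_OP_READ needs 5.6+, unprivileged SQPOLL roughly 5.11+).

```c
/* Minimal liburing sketch: read a file with IORING_SETUP_SQPOLL, so a
 * kernel thread polls the submission queue and (in the steady state)
 * submissions don't require a system call.  Build with -luring. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    struct io_uring ring;
    char buf[4096];

    /* Ask for an SQ-polling kernel thread, FlexSC-style. */
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000;      /* ms of idleness before the poller sleeps */
    int ret = io_uring_queue_init_params(8, &ring, &params);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init_params: %s\n", strerror(-ret));
        return 1;
    }

    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Post a read request into the shared submission queue. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf) - 1, 0);
    io_uring_submit(&ring);            /* only wakes the poller if it went idle */

    /* Wait for the completion and print the result. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res >= 0) {
        buf[cqe->res] = '\0';
        printf("read %d bytes: %s", cqe->res, buf);
    }
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```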