SCX_MUS, or Mostly Unfair Scheduler, is a custom Linux scheduler built with the sched_ext framework to prioritize selected containers in Kubernetes and improve their performance.
The idea for this project started casually at a bar table. Our group, PATOS, all computer science students, was gathered when a friend showed us the new Linux scheduling technology, sched_ext, and how it was being used to improve the performance of games, for example. We were fascinated, but it felt unrealistic for students to build a scheduler from scratch.
Months later, during a discussion about Kubernetes CPU management, specifically why cpu.limits is often discouraged and why cpu.shares is generally preferred, we began exploring how computational resources are allocated to containers. This led to a question: Is it possible to dynamically give more compute resources to a container, beyond the usual horizontal scaling or static allocation in manifests?
The idea was born: to attempt giving a container higher priority via a custom scheduler.
The system consists of three main components:
- The Scheduler (eBPF SCX): A sched_ext program loaded into the kernel. It implements a Weighted Virtual Time (vtime) algorithm with a Global Dispatch Queue.
  main.bpf.c defines a single shared dispatch queue. Every runnable task is inserted with its dsq_vtime[1], and any CPU can pull from that queue. For simplicity, we settled on two weights to drive fairness: HIGH_PRIORITY (4096) for tasks whose cgroup ID matches the selected container, and NORMAL_PRIORITY (1024). In kube_stopping, we advance vruntime[2] by delta_exec * NORMAL_PRIORITY / task_weight, so the prioritized cgroup's tasks accumulate virtual time four times more slowly and therefore run more often.
- The Runner (C): A native loader responsible for loading the BPF object, pinning maps, and attaching the scheduler to the CPUs.
  runner.c handles the complete lifecycle of the scheduler program. It loads the compiled scheduler, pins the high_prio_cgroups BPF map at /sys/fs/bpf/high_prio_cgroups, and attaches the kube_ops struct-ops to the kernel. This process must stay alive while our scheduler is running. When it exits, it unpins the BPF map and the kernel falls back to CFS.
- The Agent (Go): A Kubernetes-aware TUI (Terminal User Interface) that interacts with the K8s API and the container runtime (containerd/Docker) to identify target cgroups and update the BPF priority map.
  main.go gives Kubernetes cluster operators an interactive interface to select which pod/container should be prioritized. It enumerates pods on the current node via the Kubernetes API, lets the user pick one with a Survey prompt, then uses containerd to resolve the container's host PID. From /proc/<pid>/cgroup we find the unified cgroup path, stat it to derive the inode (the cgroup ID the BPF program expects), and write it into the pinned BPF map using the cilium/ebpf library. The agent currently writes a single entry into the map, matching our current idea of boosting one cgroup at a time.
- Test Workload: The manifests can be used to deploy a Redis target and a noisy-neighbor workload that eats CPU. The Go agent flags the Redis pod's cgroup so the scheduler favors its CPU slices even while the stress pod churns.
- Benchmarks: benchmark.py repeatedly runs memtier_benchmark against the NodePort exposed by Redis (30001), capturing ops/sec and percentile latencies into CSVs. We run the scripts twice, once under vanilla CFS and once with SCX_MUS active, to quantify how much the prioritized Redis pod's latency improves while noise is present.
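The weighting rule used by the scheduler can be illustrated outside the kernel. The following is a minimal Python sketch, not the BPF code: charge_vtime and simulate are hypothetical names, a single CPU is assumed, both tasks are assumed to be always runnable, and picking the task with the lowest virtual time stands in for the shared vtime-ordered dispatch queue. Only the two weights and the delta_exec * NORMAL_PRIORITY / task_weight formula come from the description above.

```python
NORMAL_PRIORITY = 1024
HIGH_PRIORITY = 4096


def charge_vtime(vtime, delta_exec_ns, weight):
    """Advance a task's virtual time by its runtime, scaled down by its weight."""
    return vtime + delta_exec_ns * NORMAL_PRIORITY // weight


def simulate(slices, slice_ns=1_000_000):
    """Hand out `slices` fixed-length time slices on one CPU.

    Each round runs the task with the lowest virtual time
    (ties are broken by task name), then charges it for the slice.
    """
    tasks = {
        "boosted": {"weight": HIGH_PRIORITY, "vtime": 0, "runs": 0},
        "normal": {"weight": NORMAL_PRIORITY, "vtime": 0, "runs": 0},
    }
    for _ in range(slices):
        name = min(tasks, key=lambda n: (tasks[n]["vtime"], n))
        task = tasks[name]
        task["vtime"] = charge_vtime(task["vtime"], slice_ns, task["weight"])
        task["runs"] += 1
    return {name: task["runs"] for name, task in tasks.items()}


if __name__ == "__main__":
    print(simulate(1000))
```

Under these assumptions, simulate(1000) gives the boosted task 800 of the 1000 slices and the normal task 200, which is the 4:1 split implied by the 4096/1024 weights.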
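The cgroup ID lookup described for the agent can also be sketched in a few lines. This is an illustrative Python version, not the Go agent: unified_cgroup_path and cgroup_id are hypothetical helper names, the cgroup v2 hierarchy is assumed to be mounted at /sys/fs/cgroup, and the sketch stops before writing anything to the pinned BPF map.

```python
import os


def unified_cgroup_path(proc_cgroup_text, mount="/sys/fs/cgroup"):
    """Return the cgroup v2 directory named in the contents of /proc/<pid>/cgroup.

    Each line has the form "hierarchy-ID:controller-list:path";
    the unified (v2) hierarchy is the entry that starts with "0::".
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        hierarchy_id, controllers, path = parts
        if hierarchy_id == "0" and controllers == "":
            return os.path.join(mount, path.lstrip("/"))
    raise ValueError("no cgroup v2 (0::) entry found")


def cgroup_id(cgroup_dir):
    """Return the inode number of the cgroup directory, used here as the cgroup ID."""
    return os.stat(cgroup_dir).st_ino
```

On a cgroup v2 host, calling cgroup_id(unified_cgroup_path(open("/proc/self/cgroup").read())) returns the ID for the current process. For a container, the PID in the path would be the host PID that the agent resolves through the container runtime.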
This repository is organized as follows:
- scheduler/: Source code for the eBPF scheduler (kernel-space C) and the user-space agent (Go).
  - bpf/: The sched_ext logic written in C.
  - main.go: The Kubernetes agent CLI.
- manifests/: Kubernetes YAML manifests used to deploy the test environment (Redis and Stress-ng).
- evaluation/: Comprehensive performance analysis and benchmarking tools.
  - datasets/: Raw CSV data from benchmark runs.
  - benchmark/: Python automation scripts for memtier_benchmark.
- writeups/: Personal writeups on the inner workings of multiple parts of the system and the process of creating it.
For complete instructions, check our writeup.
To run the benchmarks, check the guide.
For the overall performance results, see evaluation/scheduler_analysis.ipynb.
One of the hardest parts was figuring out how to implement and run a scheduler using sched_ext_ops. There is very little documentation and there are few guides online on how to load and execute your own custom scheduler; most of what we found only shows people running the pre-made schedulers included with SCX.
Another challenge was our initial workflow: we wrote most of the code without compiling or testing it (a terrible practice, we know). Only after reaching a reasonable implementation did we start the work of compiling the kernel with sched_ext support, setting up Kubernetes clusters and designing the benchmark.
The debugging phase involved days of solving compilation mysteries and chasing unexpected behaviors before everything finally worked.
The goal of our custom scheduler wasn't to critique the Completely Fair Scheduler (CFS); our scheduler is, in fact, a highly simplified version of CFS. Instead, it was a two-fold endeavor:
- To see if students could successfully implement a scheduler from scratch.
- To explore different approaches to container scalability by dynamically adjusting a container's priority/resource share via a custom scheduler.
We consider the project a success in demonstrating both of these concepts.
We learned about:
- The Linux scheduler architecture and its multiple scheduling classes
- Kernel internals and low-level scheduling paths
- Kubernetes resource management and API communication
- eBPF development and the sched_ext subsystem
- Performance evaluation, benchmarking, and debugging complex systems
Possible next steps:
- Building a more sophisticated control mechanism that uses the Kubernetes API to gather metrics and automatically adjusts a container's priority share based on workload (e.g. a hook that prioritizes a container once its network namespace handles more than X packets)
- Refining SCX_MUS by:
  - Adding multiple DSQs for multi-core environments
  - Developing more sophisticated migration heuristics and improving L2/L3 cache locality
PATOS is an open source group that focuses on giving talks and contributing to the open source community.
Our team for this Hackathon consisted of 3 members of PATOS.
We had no practical experience working with eBPF or sched_ext, so this Hackathon was a great learning opportunity!
We welcome contributions from the community! If you'd like to contribute, feel free to submit a pull request.

