This repository contains scripts and instructions for reproducing the primary experiments in our OSDI '25 paper "XSched: Preemptive Scheduling for Diverse XPUs". This repository is also archived on Zenodo (https://doi.org/10.5281/zenodo.15327992) as "Artifacts for XSched [OSDI'25] (Version v3)".
Please check out our XSched repo for more feature updates! ^_^
git clone https://github.com/XpuOS/xsched-artifacts.git
cd xsched-artifacts
git submodule update --init --recursive
# download assets (models and media)
cd assets
bash ./download.sh

XSched is a preemptive scheduling framework for diverse heterogeneous accelerators (e.g., GPUs, NPUs, and ASICs). It is designed to be both efficient in scheduling XPU tasks and transparent to applications. We have evaluated XSched on 9 distinct XPUs, including four GPUs (NVIDIA GV100, NVIDIA K40m, AMD MI50, and Intel Arc iGPU), three NPUs (Intel NPU3720, Ascend 910b, and NVIDIA DLA), and two ASICs (NVIDIA PVA and OFA).
This artifact contains the source code of XSched, and the code and scripts to reproduce the primary experiments in the paper. The included experiments are:
- The effectiveness of XSched's preemptive scheduling (fig7, or Figure 9 in our final paper)
- The effectiveness of XSched's multi-level hardware model (fig9, or Figure 11 in our final paper)
- Case study 1 - GPU harvesting on multi-tenant server (fig11, or Figure 13 in our final paper)
- Case study 2 - Video conferencing on AIPC (fig12, or Figure 14 in our final paper)
- Case study 3 - Multi-model inference serving (fig13, or Figure 15 in our final paper)
- Project Structure
- Paper's Hardware & Software Configuration
- Build XSched
- Experiments Overview
- Plotting the Experiments Results
> tree .
├── assets      # Assets for the experiments
├── cuda        # Experiment codes for NVIDIA GPUs
│   ├── fig7    # Experiment codes to reproduce fig7 (or Figure 9 in the final paper)
│   ├── fig9
│   └── fig11
├── ascend      # Experiment codes for Ascend 910b
├── dla         # Experiment codes for NVDLA
├── igpu        # Experiment codes for Intel Arc iGPU
├── npu3720     # Experiment codes for Intel NPU3720
├── vpi         # Experiment codes for NVIDIA PVA and OFA
├── results     # Results of the experiments
│   └── fig7    # Results to reproduce fig7 (or Figure 9 in the final paper)
│       ├── raw      # Raw results
│       └── scripts  # Scripts to process and plot the results
└── sys
    ├── tgs     # TGS system used for the experiments of case study 1
    ├── vcuda   # vCUDA system used for the experiments of case study 1
    └── xsched  # Source code of XSched
The experiments were conducted on six different hardware platforms and nine different XPUs, as follows:
| Platform | XPUs | SDK/Driver |
|---|---|---|
| NVIDIA GV100 Server | GV100 | CUDA 11.4 |
| NVIDIA K40m Server | K40m | CUDA 11.4 |
| AMD MI50 Server | MI50 | ROCm 5.6 |
| Ascend 910b Server | 910b | CANN 8.0.0 |
| Intel Core Ultra 9 185H | Intel Arc iGPU, Intel NPU3720 | LevelZero 1.17, OpenVINO 2024.4, PyTorch-ipex 2.6.0 |
| NVIDIA Jetson Orin 32GB | NVIDIA DLA/PVA/OFA | JetPack 5.1.4 |
Important: XSched can also run on other platforms, but compatibility is not guaranteed. For example, XSched also runs on other NVIDIA GPUs (e.g., server GPUs like the A100 and consumer GPUs like the RTX 3080). However, the experiment code in the paper has only been tested on the platforms listed above.
The source code of XSched is located in sys/xsched.
XSched has minimal dependencies, and all 3rd-party libraries are included in the sys/xsched/3rdparty directory. The only requirements are a C++14 compiler and CMake (3.14 or later).
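You can verify both prerequisites up front:

cmake --version   # should report 3.14 or later
g++ --version     # any C++14-capable compiler works (clang++ is fine too)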
To build XSched for a given type of XPU, run the corresponding command:
cd sys/xsched
make cuda # Build for NVIDIA GPUs
# Or you can build for other XPUs
make hip # Build for AMD GPUs
make levelzero # Build for Intel Arc iGPU and Intel NPU3720
make ascend # Build for Ascend 910b
make cudla # Build for NVIDIA DLA
make vpi # Build for NVIDIA PVA and OFA

It is worth noting that building XSched does not require the XPU SDK (e.g., CUDA, ROCm) to be installed, so these build commands can be run on any machine without the SDK. However, the experiments must be run on machines with the corresponding XPU SDK installed.
These commands build XSched and install it into the sys/xsched/output directory, which contains the following files:

output/
├── bin
│   ├── xcli        # XSched Command Line Interface
│   └── xserver     # XSched Scheduler Server
├── include         # XSched header files
└── lib
    ├── cmake                           # CMake files for including XSched in your project
    ├── libcuda.so -> libshimcuda.so    # For interception of CUDA driver calls
    ├── libcuda.so.1 -> libshimcuda.so
    ├── libhalcuda.so                   # Hardware Abstraction Layer for CUDA
    ├── libpreempt.so                   # XSched preemption library
    └── libshimcuda.so                  # Shim library for CUDA

We provide a simple example to show how to use XSched to schedule tasks on NVIDIA GPUs.
See get_started/README.md for more details.
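As a minimal usage sketch (the application name is hypothetical and the exact server invocation may differ; get_started/README.md is authoritative), an application is scheduled by loading the shim library in place of the vendor library while xserver enforces the policy:

# Start the XSched scheduler server (assumed invocation).
./sys/xsched/output/bin/xserver &
# Prepend output/lib so the app resolves libcuda.so to libshimcuda.so,
# which intercepts CUDA driver calls. my_cuda_app is a placeholder.
LD_LIBRARY_PATH=$PWD/sys/xsched/output/lib:$LD_LIBRARY_PATH ./my_cuda_app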
Because the experiments are conducted on different hardware platforms, we provide scripts to reproduce the results on each platform.
Here is an overview of the included experiments and the XPUs they require.
| XPU | Included Experiments |
|---|---|
| NVIDIA GV100 | fig7, fig9, fig11, fig13 |
| NVIDIA K40m | fig7, fig9 |
| AMD MI50 | fig7, fig11 |
| Ascend 910b | fig7 |
| Intel Arc iGPU | fig7, fig12 |
| Intel NPU3720 | fig7, fig9, fig11, fig12 |
| NVIDIA DLA | fig7 |
| NVIDIA PVA | fig7 |
| NVIDIA OFA | fig7 |
This experiment evaluates the effectiveness of XSched's two scheduling policies (fixed-priority and bandwidth-partition) in scheduling a set of tasks on different XPUs.
All XPUs are used in this experiment, but you can run partial experiments with the XPUs you have available.
In each experiment, we run two processes that periodically or continuously submit tasks of the same type to the same XPU. One process is designated as the foreground process, while the other is the background process.
For each policy and each XPU, we run three experiments with different approaches:
- Standalone: Only the foreground process is executed.
- Native (or Base): Two processes use the hardware native scheduler.
- XSched: Two processes are scheduled by XSched with the corresponding policy.
fig7 (top) shows the latency CDF of the foreground process under the fixed-priority policy. fig7 (bottom) shows the normalized throughput of the background process under the bandwidth-partition policy.
Expected results:
- XSched can effectively reduce the latency of the foreground process (close to standalone) under the fixed-priority policy.
- XSched can effectively partition the bandwidth between the two processes (3:1) under the bandwidth-partition policy.
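The three runs can be sketched as follows (process names are placeholders; use the per-XPU READMEs below for the actual scripts):

# Standalone: foreground process only.
./fg_proc
# Native: both processes contend under the hardware native scheduler.
./bg_proc & BGPID=$!
./fg_proc
kill $BGPID
# XSched: same two processes, with the shim loaded and xserver enforcing the policy.
./sys/xsched/output/bin/xserver &
LD_LIBRARY_PATH=$PWD/sys/xsched/output/lib ./bg_proc &
LD_LIBRARY_PATH=$PWD/sys/xsched/output/lib ./fg_proc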
| XPU | README |
|---|---|
| NVIDIA GV100 | README |
| NVIDIA K40m | README |
| AMD MI50 | README |
| Ascend 910b | README |
| Intel Arc iGPU | README |
| Intel NPU3720 | README |
| NVIDIA DLA | README |
| NVIDIA PVA | README |
| NVIDIA OFA | README |
This experiment evaluates how XSched's multi-level hardware model affects scheduling performance.
An NVIDIA GV100 (or V100) is required for this experiment.
This experiment runs hand-crafted kernels that execute for a configurable duration to simulate the task being preempted. We use XSched to suspend the task and measure the preemption latency.
We compare the P99 preemption latency of three different preemption levels:
- Level 1: XSched uses only the level-1 API (launch & sync) of the hwQueue to suspend the task.
- Level 2: XSched uses the level-2 API (deactivate & activate) of the hwQueue to suspend the task.
- Level 3: XSched uses the level-3 API (interrupt & restore) of the hwQueue to suspend the task.
Implementing a higher-level interface (e.g., Level 2 or Level 3) can effectively reduce the preemption latency.
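For reference, the P99 latency can be computed from a one-number-per-line log with a one-liner like this (the log name is hypothetical; the artifact's processing scripts under results/ do the real work):

sort -n preempt_latency.log | awk '{v[NR]=$1} END {print v[int(NR*0.99)]}'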
| XPU | README |
|---|---|
| NVIDIA GV100 | README |
This experiment is a case study on the scheduling of GPU tasks on a multi-tenant server.
NVIDIA GV100 and AMD MI50 are used in this experiment; at minimum, an NVIDIA GV100 is required.
The basic setup of this case study is a multi-tenant server running two containers, one for production job (Pjob) and the other for opportunistic job (Ojob). Pjobs have stringent performance requirements with minimal degradation, while Ojobs should harvest remaining GPU resources on a best-effort basis.
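The container setup can be sketched as follows (image names are placeholders, and --gpus applies to the NVIDIA runs; the launch scripts linked below handle the real configuration):

docker run --gpus all -d --name pjob pjob-image   # production job (latency-sensitive)
docker run --gpus all -d --name ojob ojob-image   # opportunistic job (best-effort)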
This experiment consists of two workloads:
- Co-training workload: Two containers each running a DL model training job.
- Sci-Fin workload: One container (Pjob) running a financial algorithm (BlackScholes) and one container (Ojob) running a scientific-computing workload (CFD).
For NVIDIA GV100, we run each workload with five approaches:
- Standalone: Only the Pjob is executed.
- Native: Two containers use the hardware native scheduler.
- TGS: Two containers are scheduled by TGS.
- vCUDA: Two containers are scheduled by vCUDA.
- XSched: Two containers are scheduled by XSched.
For AMD MI50, we run each workload with four approaches:
- Standalone: Only the Pjob is executed.
- Native: Two containers use the hardware native scheduler.
- XSched: Two containers are scheduled by XSched.
- XSched w/o prog: Two containers are scheduled by XSched, but without the progressive command launching technique.
Using XSched, the Pjob achieves stable performance (close to standalone) while the Ojob still makes progress.
| XPU | README |
|---|---|
| NVIDIA GV100 | README |
| AMD MI50 | README |
This experiment is a case study on the scheduling of video conferencing tasks on an AIPC.
This experiment must be run on an Intel Core Ultra 9 185H.
This experiment runs two processes to simulate the video conferencing scenario:
- lfbw: Linux Fake Background Webcam. We modified it to read an MP4 video file as the input video stream instead of a real webcam device. The task runs at 25 frames per second.
- whisper: A speech-to-text model. We modified it to read a WAV audio file as the input audio stream instead of a microphone device. The task runs every three seconds and outputs the transcribed text to the console.
We compare the performance of two approaches:
- Native (or Base): Two processes use the hardware native scheduler.
- XSched: Two processes are scheduled by XSched with a laxity-based scheduling policy (when lfbw is about to miss its deadline, its priority is raised).
XSched can effectively reduce the P99 frame latency of the fake-background task and avoid frame freezes.
| XPU | README |
|---|---|
| Intel NPU3720 | README |
This experiment demonstrates how XSched can be integrated into Triton Inference Server to enable priority-based scheduling of multi-model inference tasks.
This experiment must be run on NVIDIA GV100.
This experiment uses two Triton clients to send inference requests for two BERT-Large models to a Triton server. The high-priority client sends requests at 10 req/s, while the low-priority client sends requests continuously.
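For reference, a single request to a locally running Triton server goes through its standard KServe-v2 HTTP endpoint (the model name and request body here are placeholders; the artifact's clients drive their own request loops):

curl -s -X POST localhost:8000/v2/models/bert_large/infer -d @request.json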
We compare the performance of four approaches:
- Standalone: Only the high-priority client is executed.
- Triton: Using the vanilla Triton Inference Server.
- Triton+Priority: Using the Triton Inference Server with its priority-based scheduling feature.
- XSched: Using the Triton Inference Server integrated with XSched.
By integrating XSched into the Triton Inference Server, the high-priority client achieves stable performance (close to standalone).
| XPU | README |
|---|---|
| NVIDIA GV100 | README |
This experiment compares the performance of XSched with Paella (SOSP'23, a state-of-the-art inference serving system) on NVIDIA GV100.
This experiment must be run on NVIDIA GV100.
This experiment uses the same setup and workloads as in Paella's paper (i.e., concurrent serving of multiple DNN models).
We compare the performance of three approaches:
- Native: The CUDA multi-stream (CUDA-MS) baseline used in Paella's paper (i.e., directly using multiple CUDA streams to serve requests concurrently).
- Paella: The Paella system.
- XSched: Integrating XSched into the CUDA-MS baseline.
With XSched, the CUDA-MS baseline achieves performance comparable to Paella's.
| XPU | README |
|---|---|
| NVIDIA GV100 | README |
See README for more details.