Less Moves, More Queries: Exploiting Access Redundancy in CXL-enabled Database Systems
Rubicon is a disaggregated database system prototype that reduces redundant data movement in read-heavy analytical workloads over far-memory interfaces such as CXL. It batches incoming queries, groups them by overlapping column access patterns (pipeline grouping), and selectively copies shared hot data from far-memory (CXL) to near-memory (DRAM) using a cost model that weighs one-time copy overhead against reuse savings.
Paper under review at PVLDB 19, 2026
Rubicon/
├── computeUnit/ # Compute Unit (CU): executes grouped query plans via SIMD operators
├── grouper/ # Grouper: batching, pipeline grouping, and cost-model decisions
├── experiments/ # Python experiment drivers and workload configs
│ ├── configs/ # YAML workload definitions (Morning Dashboard, etc.)
│ └── utils/ # Shared utilities (TCP client, query definitions, arrival models)
├── data/ # SSB data generation and binary conversion
├── ext/SIMDOperators # Git submodule: SIMD-based operator library (includes TSL generator)
├── pg_potential_experiment/ # Standalone microbenchmark (Figure 1 in the paper)
├── proto/ # Protobuf message definitions
├── CMakeLists.txt # Top-level CMake build
└── generate_python_proto.sh # Script to generate Python protobuf bindings
Linux (tested on Ubuntu). NUMA support is required.
sudo apt update
sudo apt install -y \
build-essential \
cmake \
git \
libnuma-dev \
libprotobuf-dev \
protobuf-compiler \
python3 \
python3-pip \
graphviz-devMinimum versions: CMake ≥ 3.20, C++20-capable compiler (GCC ≥ 11 or Clang ≥ 14), protobuf (compatible with protobuf==4.21.12 for Python bindings).
The ext/SIMDOperators submodule includes the TSL (Template SIMD Library) generator as a nested submodule (tools/tslgen). The CMake build invokes this generator, which requires Python 3 and several packages. Install the TSL generator's dependencies:
pip install -r ext/SIMDOperators/tools/tslgen/requirements.txtNote: If you cannot install
graphviz-dev(needed forpygraphviz), removepygraphvizfrom thatrequirements.txtand pass--no-draw-test-dependenciesto the generator if invoked manually.
The data/ directory uses uv for dependency management. Install uv:
curl -LsSf https://astral.sh/uv/install.sh | shDependencies (installed automatically by uv run): numpy, pandas, progressbar2 (see data/pyproject.toml).
Required Python version: ≥ 3.12, < 3.13
The experiments/ directory also uses uv. Dependencies (see experiments/pyproject.toml):
humanize, matplotlib, netifaces, pandas, protobuf==4.21.12, pyyaml.
Required Python version: ≥ 3.12, < 3.13
The paper evaluates on:
- Compute Unit server: Intel Xeon 6741P (48 cores, up to 3.8 GHz), 512 GB DDR5, SMART Modular Technologies CXA-8F2W CXL 2.0 Type 3 add-in card (512 GB), configured as a headless NUMA node.
- Grouper server: AMD Ryzen Threadripper PRO 7955WX (16 cores), 256 GB DDR5. Connected via 1 GbE Ethernet.
For testing without CXL hardware, the NUMA evaluation (Section 7 of the paper) shows the approach also works on a dual-socket server where remote-socket DRAM serves as far-memory.
Both the Grouper and the Compute Unit can run on the same machine for functional testing.
git clone --recurse-submodules https://github.com/AndreasGeyerTUD/Rubicon.git
cd RubiconIf you already cloned without --recurse-submodules:
git submodule update --init --recursivecd dataEdit setup.sh to set the desired scale factors. By default it generates SF 100. For the paper's experiments, use SF 100 (or SF 50 for the NUMA evaluation):
# In setup.sh, change SCALE_FACTORS to your desired value, e.g.:
# SCALE_FACTORS=(100)Then run:
bash setup.sh
cd ..This clones the ssb-kit generator, builds it, generates CSV data, converts it to binary column format using uv run convert_to_binary.py, and cleans up intermediate files. The resulting binary data will be in data/binary/sf<N>/.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)This builds three executables in build/bin/:
| Executable | Description |
|---|---|
grouper |
The Grouper server (batching, grouping, cost model) |
computeUnit |
The Compute Unit (query execution engine) |
pg_potential_bench |
Standalone microbenchmark for pipeline grouping potential |
Optional CMake flags:
-DLOG_LEVEL=LOG_DEBUG1— Enable verbose logging (default:LOG_INFO)-DGROUPER_VERBOSE=ON— Enable verbose grouper output-DGROUPER_PROFILING=ON— Enable CSV timing for grouping analysis
The experiment scripts communicate with the Grouper using Protobuf over TCP. Generate the Python bindings:
bash generate_python_proto.shThis creates experiments/proto/*_pb2.py files.
The system consists of three components that must be started in this order:
Terminal 1 — Start the Grouper:
./build/bin/grouper -port 23232Terminal 2 — Start the Compute Unit:
./build/bin/computeUnit \
-ip 127.0.0.1 \
-port 23232 \
-path data/binary/sf100 \
-basedata bin \
-worker 48 \
-node 0 \
-cxl_node 1| Flag | Description |
|---|---|
-ip |
IP address of the Grouper (use 127.0.0.1 if on the same machine) |
-port |
Port the Grouper is listening on |
-path |
Path to the binary SSB data directory |
-basedata |
Data format: bin (binary) or csv |
-worker |
Number of worker threads (default: logical cores per NUMA node) |
-node |
NUMA node affinity for compute threads (-1 = all nodes) |
-cxl_node |
NUMA node where base data resides, e.g., the CXL device node (-1 = any) |
Tip: Use
numactl -Horlscputo identify which NUMA node corresponds to your CXL device or remote socket.
Terminal 3 — Run an experiment (see below).
All experiment scripts are in the experiments/ directory and use uv for dependency management. Run them with uv run from within the experiments/ directory.
Measures the overhead introduced by the pipeline grouping algorithm at varying concurrency levels.
cd experiments
uv run grouping_overhead_experiment.py \
-ip 127.0.0.1 \
-port 23232 \
--scale-factor 100 \
--workers-per-cu 48 \
--n-values "10,25,50,100,150,200,300,400,500,650,800,1000" \
--iterations 10Compares Naïve vs. Pipeline Grouping execution at different concurrency levels.
cd experiments
uv run concurrent_query_execution_experiment.py \
-ip 127.0.0.1 \
-port 23232 \
-q "q_1_1,q_2_1,q_3_1,q_4_2" \
--scale-factor 100 \
--workers-per-cu 48 \
--n-values "5,10,15,20,25,30,40,50,60,80,100" \
--window-size 100 \
--iterations 10 \
--out results/concurrent_experimentThe -q flag specifies which SSB queries to evaluate (CSV). Available queries: q_1_1 through q_1_3, q_2_1 through q_2_3, q_3_1 through q_3_4, q_4_1 through q_4_3.
Runs production-inspired workload profiles defined as YAML configs. To run the Morning Dashboard workload:
cd experiments
uv run workload_experiment.py \
-config configs/workload_morning_dashboard.yaml \
-ip 127.0.0.1 \
-port 23232 \
--window-size 2000 \
--threshold 10Available workload configs in experiments/configs/:
| Config File | Paper Section |
|---|---|
workload_morning_dashboard.yaml |
Section 6.1 |
workload_adhoc_exploration.yaml |
Section 6.2 |
workload_peak_mixed.yaml |
Section 6.3 |
workload_microbursts.yaml |
Section 6.4 |
workload_stress_test.yaml |
Section 6.5 |
To reproduce Table 5, run each workload config with multiple --window-size values (100, 500, 1000, 2000, 3000 ms). Each experiment runs the baseline before it runs the workload.
Additional flags:
| Flag | Description | Default |
|---|---|---|
--window-size |
Batching window in ms | 100 |
--threshold |
Overhead threshold for grouping | 10 |
--max-overhead |
Maximum overhead ratio (τ_ρ) | 2.0 |
-dry-run |
Validate config without executing | — |
-o |
Override output CSV path | — |
This microbenchmark does not require the Grouper or experiment scripts. It runs directly on a NUMA system:
./build/bin/pg_potential_bench <cxl_node> <dram_node> [reps] [chunk_mib] [csv_file] [--scalar]Example (CXL on NUMA node 1, DRAM on node 0):
./build/bin/pg_potential_bench 1 0 5 64 results.csvNote: This benchmark allocates 48 × 5 GiB on the CXL/remote node plus 5 GiB on the local node. Ensure sufficient memory.
The cost model generalizes to any near-/far-memory asymmetry (Paper Section 7). On a dual-socket server, use remote-socket DRAM as far-memory:
./build/bin/computeUnit \
-ip 127.0.0.1 \
-port 23232 \
-path data/binary/sf50 \
-basedata bin \
-worker 24 \
-node 0 \
-cxl_node 1Here -node 0 pins compute to socket 0, and -cxl_node 1 places base data on socket 1's DRAM (the "far-memory" tier).
- CXL thermal throttling: The CXL device used in the paper is thermally unstable at high concurrency. Experiments are capped at 100 concurrent queries. See Section 5.1 of the paper for details.
- Single Compute Unit: The current prototype supports only a single CU. Multi-CU execution is architecturally supported but not evaluated.
- No cross-batch caching: Each query batch operates independently; copied data is released after batch completion. Cross-batch caching is discussed as future work (Section 3.3).
This project is licensed under the GPL-3.0 License.
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) through Reinhart Koselleck Project: Serverless Data Management Primitives for Software-defined Composable Systems — project number 450276976.