Multi-tier KV-cache simulator for LLM serving, from a single node to a 10,000-GPU cluster with EIC disaggregated memory and Prefill-Decode separation.
Three simulation modes:
- Single-node: 4 workers, HBM → DRAM → SSD hierarchy, 6 eviction/prefetch policies
- Cluster (10K-GPU scale): 10,240 GPUs across 160 racks, shared EIC (CXL/RDMA) per rack, prefix-aware routing
- PD Separated: Prefill-Decode disaggregated serving with radix tree KV cache, continuous batching, KV transfer modeling

```
TraceGenerator ──▶ Router (prefix trie) ──▶ Worker[0..3]
                                                 │
                                           CacheManager
                                       HBM ──▶ DRAM ──▶ SSD
                                                 │
                              EvictionPolicy (LRU/ARC/Learned/Belady)
                                PrefetchPolicy (None/SessionAware)
```

```
Cluster: 10,240 GPUs (160 racks × 64 GPUs, simulated in full)
┌──────────────────────────────────────────────────────────────────┐
│        ClusterRouter (session affinity + prefix scoring)         │
├──────────────────────────────────────────────────────────────────┤
│  Rack 0                    Rack 1          ...         Rack 159  │
│  ┌──────────────────────┐  ┌──────────────────────┐              │
│  │ GPU 0  GPU 1 ... 63  │  │ GPU 64 ... 127       │              │
│  │ ┌───┐  ┌───┐         │  │ ┌───┐  ┌───┐         │              │
│  │ │HBM│  │HBM│   ...   │  │ │HBM│  │HBM│   ...   │              │
│  │ └─┬─┘  └─┬─┘         │  │ └─┬─┘  └─┬─┘         │              │
│  │   └───┬──┘           │  │   └───┬──┘           │              │
│  │  ┌────▼───────┐      │  │  ┌────▼───────┐      │              │
│  │  │  EIC Pool  │      │  │  │  EIC Pool  │      │              │
│  │  │(shared CXL)│      │  │  │  (shared)  │      │              │
│  │  └────────────┘      │  │  └────────────┘      │              │
│  └──────────────────────┘  └──────────────────────┘              │
│  Network: intra-rack 3 μs (RDMA) · cross-rack 15 μs · SSD 200 μs │
└──────────────────────────────────────────────────────────────────┘
```

```
PDCluster: 10,240 GPUs (2,560 Prefill + 7,680 Decode, P:D = 1:3)
┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  Request ──▶ PrefillRouter ──▶ PrefillNode                       │
│              (prefix match      ├─ RadixTree lookup              │
│               + load balance)   │  (prefix sharing, ref counting)│
│                                 ├─ Compute new KV blocks         │
│                                 ├─ SessionAware prefetch         │
│                                 ▼                                │
│                      KV Transfer (RDMA push)                     │
│                      pipelined first-chunk latency               │
│                                 │                                │
│  DecodeRouter ◀─────────────────┘                                │
│  (same-rack pref                                                 │
│   + capacity)                                                    │
│       │                                                          │
│       ▼                                                          │
│  DecodeNode                                                      │
│    ├─ Continuous batching                                        │
│    ├─ All active sequences per step                              │
│    └─ Memory-bandwidth bound                                     │
│       ▼                                                          │
│  Output tokens                                                   │
│                                                                  │
│  Per rack: [P P P P | D D D D D D D D D D D D] + shared EIC      │
└──────────────────────────────────────────────────────────────────┘
```
`TTFT = queue_wait + prefill_compute + kv_transfer + first_decode`

Key insight: a unified GPU stays blocked for prefill plus ALL decode steps, while PD separation frees the prefill GPU as soon as the prefill compute finishes.
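
A back-of-envelope sketch of that decomposition (illustrative Python; the per-step costs come from the 70B cost-model table further down, and the queue wait is a made-up example value):

```python
# Back-of-envelope TTFT sketch (illustrative numbers, not simulator code).
PREFILL_MS_PER_TOKEN = 0.35     # prefill compute per prompt token (70B profile)
DECODE_MS_PER_STEP = 83.6       # one decode step = the first output token
FIRST_CHUNK_TRANSFER_MS = 6.7   # pipelined first-chunk KV transfer (PD only)

def ttft_ms(prompt_tokens: int, queue_wait_ms: float, pd: bool) -> float:
    """TTFT = queue_wait + prefill_compute + kv_transfer + first_decode."""
    prefill = prompt_tokens * PREFILL_MS_PER_TOKEN
    transfer = FIRST_CHUNK_TRANSFER_MS if pd else 0.0
    return queue_wait_ms + prefill + transfer + DECODE_MS_PER_STEP

# For a 4K-token prompt, PD adds a few ms of pipelined transfer per request;
# its payoff is in queue_wait, which drops once prefill GPUs are no longer
# held for entire decode phases.
print(ttft_ms(4096, queue_wait_ms=20.0, pd=False))  # unified
print(ttft_ms(4096, queue_wait_ms=20.0, pd=True))   # PD separated
```
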

```bash
# Install
pip install -r requirements.txt
# Single-node demo (6 policies, HBM → DRAM → SSD)
python main.py
# 10K-GPU cluster + EIC demo
python main.py --cluster
# PD separation analysis (unified vs PD, P:D ratio sweep, transfer strategies)
python main.py --pd
# Fast smoke run with machine-readable reports
python main.py --cluster --preset smoke \
--report-json results/smoke_cluster.json \
--report-csv results/smoke_cluster.csv
# Replay a production-style Azure LLM trace instead of a synthetic trace
python main.py --pd \
--workload-trace /path/to/AzureLLMInferenceTrace_code.csv \
--workload-format azure \
  --no-plot --skip-context-sweep
```

`trace/workload.py` loads public production-style LLM traces into the same Request objects used by the synthetic generator. Timestamps are normalized to milliseconds, input tokens become prompt KV blocks, and per-request output tokens drive PD decode length when present.

Supported schemas:

| Source | Useful fields | Notes |
|---|---|---|
| BurstGPT | Timestamp, Session ID, Model, Request tokens, Response tokens | Real Azure-powered ChatGPT/GPT-4 workload; session IDs preserve conversation prefix reuse when present. |
| Azure LLM Inference 2023/2024 | TIMESTAMP, ContextTokens, GeneratedTokens | Production Azure traces used by Splitwise and DynamoLLM; no prompt text, only token counts. |
| Mooncake traces | timestamp, input_length, output_length, hash_ids | Best fit for KV-cache experiments because hash_ids preserve prefix-sharing relationships. |
| SplitwiseSim | arrival_timestamp, prompt_size, token_size | Compatible with Splitwise-style generated traces. |
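
For the Azure schema, the normalization described above roughly amounts to the following (a hypothetical helper for illustration, not the actual trace/workload.py API):

```python
import csv

# Rough sketch of the normalization described above; the real logic lives
# in trace/workload.py and this is not its actual interface.
TOKENS_PER_BLOCK = 16  # matches pd_separation.compute.tokens_per_block

def azure_rows_to_requests(path: str):
    """Yield minimal request dicts from an Azure-schema CSV
    (TIMESTAMP, ContextTokens, GeneratedTokens columns)."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            prompt_tokens = int(row["ContextTokens"])
            yield {
                # the loader converts TIMESTAMP to milliseconds relative
                # to the first request
                "timestamp": row["TIMESTAMP"],
                "prompt_blocks": -(-prompt_tokens // TOKENS_PER_BLOCK),  # ceil
                "output_tokens": int(row["GeneratedTokens"]),  # PD decode length
            }
```
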
Examples:

```bash
# BurstGPT CSV, excluding failed rows with zero response tokens by default
python main.py --pd \
--workload-trace /path/to/BurstGPT_1.csv \
--workload-format burstgpt \
--workload-limit 4000 \
--no-plot --skip-context-sweep
# Mooncake-style JSONL/CSV with hash_ids
python main.py --cluster \
--workload-trace /path/to/mooncake.jsonl \
--workload-format mooncake \
--workload-time-unit ms \
  --no-plot --skip-context-sweep
```

If a trace has hash_ids, the loader uses them as the actual KV block IDs.
Otherwise it synthesizes deterministic block IDs from session_id when present
and falls back to unique per-request blocks. This keeps token-count-only traces
useful without pretending they contain exact prefix-sharing metadata.
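
A minimal sketch of that fallback logic (hypothetical helper, not the real trace/workload.py code):

```python
import hashlib
from itertools import count

_unique = count()

def kv_block_ids(hash_ids, session_id, num_blocks):
    """Sketch of the fallback described above: prefer trace-provided hash_ids,
    else derive deterministic per-session prefix blocks, else unique blocks."""
    if hash_ids:                      # Mooncake-style traces: use the real IDs
        return list(hash_ids)
    if session_id is not None:        # same session -> same prefix block IDs
        seed = hashlib.sha1(str(session_id).encode()).hexdigest()[:12]
        return [f"{seed}:{i}" for i in range(num_blocks)]
    return [f"uniq:{next(_unique)}" for _ in range(num_blocks)]  # no sharing possible
```
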
Use presets for repeatable run scale:

- `--preset smoke`: small, fast CI/sanity run; disables training, plots, and context sweeps.
- `--preset dev`: medium local run; disables plots and context sweeps.
- `--preset prod-eval`: keeps the configured production-scale values.

For production-adjacent runs, prefer:

```bash
python main.py --cluster --preset prod-eval \
--calibration-profile profiles/h100_70b_reference.yaml \
--workload-trace /path/to/mooncake.jsonl \
--workload-format mooncake \
--strict-workload-validation \
--report-json results/prod_eval_cluster.json \
--report-csv results/prod_eval_cluster.csv \
  --no-plot --skip-context-sweep
```

Reports include run metadata, the full config snapshot, calibration readiness,
workload validation, cluster diagnostics, and serialized result metrics.
PRODUCTION_READINESS.md describes acceptance gates and confidence levels.

```
kvcache-sim/
├── sim/
│   ├── storage.py           # StorageTier, KVBlock
│   ├── policies.py          # LRU, ARC, Learned, BeladyOracle, prefetch policies
│   ├── cache_manager.py     # Single-node multi-tier cache orchestrator
│   ├── router.py            # Prefix trie + worker pool router
│   ├── metrics.py           # Counters, KPIs, matplotlib visualiser
│   ├── network.py           # Network latency model (intra/cross-rack, P2P RDMA)
│   ├── calibration.py       # External benchmark/simulator profile overlays
│   ├── cluster.py           # GPUNode, EICPool, Rack, Cluster, ClusterRouter
│   ├── radix_tree.py        # KV cache radix tree (prefix sharing, ref counting)
│   ├── pd_nodes.py          # PrefillNode, DecodeNode, compute models
│   ├── pd_router.py         # PrefillRouter, DecodeRouter, PDOrchestrator
│   ├── pd_cluster.py        # PDCluster, PDConfig, build_pd_cluster
│   ├── pd_metrics.py        # TTFT/TPOT distributions, transfer stats
│   └── kv_transfer.py       # KV transfer protocol (push/pull/pipeline)
├── trace/
│   ├── generator.py         # Synthetic multi-turn trace (shared system prompts)
│   ├── workload.py          # Public CSV/JSONL workload loader
│   ├── replay.py            # Single-node trace replay
│   ├── cluster_replay.py    # Cluster-scale trace replay
│   └── pd_replay.py         # PD-separated trace replay
├── learned/
│   ├── features.py          # 8-dim feature engineering
│   ├── train.py             # LightGBM training pipeline
│   └── model.py             # Online inference wrapper
├── experiments/
│   ├── run_all.py           # Single-node + cluster experiments
│   ├── pd_experiments.py    # PD separation experiments
│   ├── network_variance.py  # Jitter/contention sensitivity experiments
│   └── plot.py              # matplotlib comparison plots
├── profiles/
│   └── h100_70b_reference.yaml  # Example calibrated overlay
├── config.yaml              # Full configuration (all three modes)
├── requirements.txt
└── main.py                  # Entry point (--cluster / --pd)
```
In unified serving, a GPU does prefill (process the prompt) and then decode (generate tokens) sequentially. With the default 70B profile, the decode phase (128 tokens × ~83.6 ms) can occupy a GPU for ~10.7 s, blocking new prefill work: this is head-of-line blocking.
PD separation dedicates GPUs to each phase:
- Prefill nodes: Compute-bound, process prompts, free immediately after
- Decode nodes: Memory-bandwidth-bound, generate tokens via continuous batching
- KV transfer: RDMA push of the KV cache from the prefill node to the decode node
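
Putting the profile numbers together shows where the benefit comes from: GPU occupancy rather than raw per-request latency (illustrative arithmetic, not simulator output):

```python
# Illustrative GPU-occupancy arithmetic for the default 70B profile
# (per-token costs from the cost-model table below; not simulator output).
prompt_tokens, output_tokens = 4096, 128
prefill_ms = prompt_tokens * 0.35   # ~1.4 s of compute-bound prefill
decode_ms = output_tokens * 83.6    # ~10.7 s of bandwidth-bound decode

unified_busy_s = (prefill_ms + decode_ms) / 1000  # GPU held for the whole request
pd_prefill_busy_s = prefill_ms / 1000             # prefill GPU freed after prefill

print(f"unified GPU busy:    {unified_busy_s:.1f} s per request")
print(f"PD prefill GPU busy: {pd_prefill_busy_s:.1f} s per request "
      "(decode continues on continuously batched decode nodes)")
```
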
| Component | Description |
|---|---|
| RadixTree | Prefix-sharing block tree with reference counting and leaf-only eviction |
| PrefillNode | RadixTree-backed cache + session-aware prefetch + continuous batching |
| DecodeNode | Receives KV via RDMA, continuous batching of active sequences |
| KVTransferModel | Push/pull strategies, pipeline support, bandwidth modeling |
| PrefillRouter | Prefix cache hit scoring + queue-aware load balancing |
| DecodeRouter | Same-rack preference (fast transfer) + capacity-aware |
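
A toy version of the prefix-sharing and reference-counting idea behind RadixTree (illustrative sketch only; the real structure lives in sim/radix_tree.py):

```python
class Node:
    """Toy prefix-sharing tree keyed by KV block IDs (illustrative sketch;
    the real implementation lives in sim/radix_tree.py)."""
    def __init__(self):
        self.children = {}   # block_id -> Node
        self.refs = 0        # sequences currently pinning this block

def insert(root, block_ids):
    """Walk/extend the tree for one sequence; shared prefixes reuse nodes."""
    node = root
    for b in block_ids:
        node = node.children.setdefault(b, Node())
        node.refs += 1
    return node

def release(root, block_ids):
    """Drop a finished sequence's references; blocks stay cached until evicted."""
    node = root
    for b in block_ids:
        node = node.children[b]
        node.refs -= 1

def evictable_leaves(node, path=()):
    """Leaf-only eviction: only unreferenced leaves are candidates."""
    if not node.children and node.refs == 0:
        yield path
    for b, child in node.children.items():
        yield from evictable_leaves(child, path + (b,))
```
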

| Phase | Formula | Value |
|---|---|---|
| Prefill | 2 × params / TFLOPS | ~0.35 ms/token |
| Decode | 2 × params / HBM_BW | ~83.6 ms/token |
| Decode (64-seq batch) | base + marginal KV overhead | ~93.6 ms/step |
| KV transfer, first chunk | 16 blocks × 5 MiB / 12.5 GB/s | ~6.7 ms |
| KV transfer, full 8K prompt | 512 blocks × 5 MiB / 12.5 GB/s | ~215 ms |
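
The two transfer rows follow directly from the configured block size and link bandwidth; the sketch below reproduces the arithmetic (illustrative, not simulator code):

```python
# Reproducing the KV-transfer rows above from the configured values.
KV_BYTES_PER_TOKEN = 327_680   # 320 KiB (pd_separation.compute.kv_bytes_per_token)
TOKENS_PER_BLOCK = 16
RDMA_GBPS = 12.5               # effective GB/s for 100 Gbps RDMA

block_bytes = KV_BYTES_PER_TOKEN * TOKENS_PER_BLOCK   # 5 MiB per block

def transfer_ms(blocks: int) -> float:
    return blocks * block_bytes / (RDMA_GBPS * 1e9) * 1e3

print(f"block size: {block_bytes / 2**20:.1f} MiB")
print(f"first chunk (16 blocks): {transfer_ms(16):.1f} ms")       # ~6.7 ms
print(f"full 8K prompt (512 blocks): {transfer_ms(512):.1f} ms")  # ~215 ms
```
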

```
================================================================
kvcache-sim – PD Separation Mode
PDCluster: 10,240 GPUs (2,560P + 7,680D, ratio 1:3) across 160 racks
================================================================
```
The exact numbers are workload-dependent. The PD replayer now keeps decode
sequences active across requests, so TTFT/TPOT reflect queueing, decode overlap,
P:D ratio, prompt length, output length, and KV transfer settings.
Key interpretation:
- Unified serving blocks a GPU for prefill plus the full decode phase.
- PD serving frees prefill GPUs after prefill, then admits transferred KV into decode nodes.
- Transfer reports both full KV movement and pipelined first-chunk latency for TTFT.
- P:D sweeps are meaningful only for the configured workload and hardware profile.
| # | Policy | Eviction | Prefetch | Notes |
|---|---|---|---|---|
| 1 | Baseline LRU | Least Recently Used | None | Classic, near-optimal for sequential workloads |
| 2 | +ARC | Adaptive Replacement Cache | None | Balances recency & frequency (T1/T2 + ghost lists) |
| 3 | +SessionPrefetch | LRU | Session-Aware | Predicts next blocks from session patterns |
| 4 | +SelectiveWrite | LRU | None | Only caches shallow-prefix blocks (depth <= 3) |
| 5 | +Learned | LightGBM reuse predictor | None | Trained on trace; predicts reuse distance |
| 6 | Belady Oracle | Optimal (offline) | None | Upper bound; evicts the farthest-future block |
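
For intuition, a toy Belady eviction decision, the offline upper bound in row 6 (illustrative sketch; the simulator's oracle lives in sim/policies.py):

```python
def belady_evict(cached_blocks, future_accesses):
    """Toy Belady oracle: evict the cached block whose next use lies farthest
    in the future (illustrative; not the sim/policies.py implementation)."""
    def next_use(block):
        try:
            return future_accesses.index(block)   # offline knowledge of the trace
        except ValueError:
            return float("inf")                   # never reused: perfect victim
    return max(cached_blocks, key=next_use)

# With future accesses [a, c, a, b], block "d" is never reused and is evicted.
print(belady_evict({"a", "b", "c", "d"}, ["a", "c", "a", "b"]))  # -> d
```
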
All parameters live in config.yaml. Key PD separation settings:

```yaml
pd_separation:
  pd_ratio: [1, 3]                 # Prefill:Decode GPU ratio
  compute:
    prefill_tflops: 800            # H100 FP16 effective TFLOPS
    decode_memory_bw_gbps: 3200    # H100 HBM bandwidth
    model_params_b: 70             # Model size (billions)
    kv_bytes_per_token: 327680     # 320 KiB for Llama-3-70B GQA
    tokens_per_block: 16
    prefill_batch_efficiency: 0.85
    decode_kv_overhead_factor: 0.02
  transfer:
    strategy: push                 # push | pull | pull_on_demand
    rdma_bw_gbps: 12.5             # effective GB/s for 100 Gbps RDMA
    pipelining: true
```

This simulator is intentionally system-level: it does not embed slow cycle-level GPU, DRAM, or SSD simulators in the main replay loop. Instead, run microbenchmarks or external simulators offline, then overlay their calibrated parameters:

```bash
python main.py --config config.yaml \
--calibration-profile profiles/h100_70b_reference.yaml --pd
python -m experiments.network_variance \
--config config.yaml \
  --calibration-profile profiles/h100_70b_reference.yaml
```

Supported overlay sections are hardware, cache, cluster, and pd_separation. Unknown sections are rejected, so a typo in a calibration key fails loudly instead of silently doing nothing.

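
As an example, a point-to-point RDMA microbenchmark can be reduced to the overlay value for transfer.rdma_bw_gbps (assumed measurement numbers, purely illustrative):

```python
# Fold a point-to-point RDMA microbenchmark into the overlay value for
# transfer.rdma_bw_gbps (assumed numbers, not a real measurement).
bytes_moved = 8 * 2**30   # 8 GiB moved during the benchmark
elapsed_s = 0.71          # measured wall-clock time
print(f"rdma_bw_gbps: {bytes_moved / elapsed_s / 1e9:.1f}")  # ~12.1 effective GB/s
```
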
Recommended calibration sources:

| Layer | Use in this repo | External source |
|---|---|---|
| LLM operator timing | pd_separation.compute.* | Vidur-style profiling or real serving traces |
| GPU/HBM bandwidth | hardware.hbm, cluster.gpu | Accel-Sim for kernels, plus bandwidth microbenchmarks |
| CPU DRAM/HBM details | hardware.dram, EIC params | Ramulator2 or DRAMsim3 |
| SSD/NVMe | hardware.ssd | MQSim or SimpleSSD |
| Network/RDMA/NVLink | cluster.network, transfer params | RDMA/NVLink microbenchmarks, ASTRA-sim/ns-3 for topology effects |

The default profile in profiles/h100_70b_reference.yaml mirrors the default
configuration and documents the expected shape for calibrated values.

- P:D Ratio Selection: find the optimal prefill/decode GPU split for your QPS and prompt lengths
- Prefix Cache Capacity Planning: how much HBM/EIC to allocate for KV cache vs model weights
- Interconnect Bandwidth ROI: compare 25/50/100/200 Gbps for KV transfer overhead
- Eviction Policy Selection: LRU vs ARC vs Learned under different workload patterns
- EIC Sizing: how much shared CXL memory per rack for cross-GPU prefix reuse
- Context Length Impact: how 4K vs 32K vs 128K contexts affect cache dynamics and PD benefit

MIT