kvcache-sim

Multi-tier KV-cache simulator for LLM serving, from single-node to a 10K-GPU cluster with EIC disaggregated memory and Prefill-Decode separation.

Three simulation modes:

  • Single-node: 4 workers, HBM → DRAM → SSD hierarchy, 6 eviction/prefetch policies
  • Cluster (10K-GPU): 10,240 GPUs across 160 racks, shared EIC (CXL/RDMA) per rack, prefix-aware routing
  • PD Separated: Prefill-Decode disaggregated serving with radix tree KV cache, continuous batching, KV transfer modeling

Architecture

Single-Node Mode

  TraceGenerator ──▶ Router (prefix trie) ──▶ Worker[0..3]
                                                  │
                                            CacheManager
                                     HBM ──▶ DRAM ──▶ SSD
                                                  │
                                        EvictionPolicy (LRU/ARC/Learned/Belady)
                                        PrefetchPolicy (None/SessionAware)

Cluster Mode (10K-GPU + EIC)

  Cluster: 10,240 GPUs  (simulating full 160 racks × 64 GPUs)
  ┌─────────────────────────────────────────────────────────────────┐
  │  ClusterRouter (session affinity + prefix scoring)              │
  ├─────────────────────────────────────────────────────────────────┤
  │  Rack 0                          Rack 1             ...  Rack 7 │
  │  ┌─────────────────────┐        ┌────────────────┐              │
  │  │ GPU 0  GPU 1 ... 15 │        │ GPU 16 ... 31  │              │
  │  │ ┌───┐  ┌───┐        │        │ ┌───┐          │              │
  │  │ │HBM│  │HBM│  ...   │        │ │HBM│   ...    │              │
  │  │ └─┬─┘  └─┬─┘        │        │ └─┬─┘          │              │
  │  │   └───┬──┘          │        │   └──┬───      │              │
  │  │  ┌────▼────────┐    │        │  ┌───▼───────┐ │              │
  │  │  │  EIC Pool   │    │        │  │ EIC Pool  │ │              │
  │  │  │ (shared CXL)│    │        │  │ (shared)  │ │              │
  │  │  └─────────────┘    │        │  └───────────┘ │              │
  │  └─────────────────────┘        └────────────────┘              │
  │  Network: intra-rack 3 μs (RDMA) │ cross-rack 15 μs │ SSD 200 μs │
  └─────────────────────────────────────────────────────────────────┘

PD Separation Mode

  PDCluster: 10,240 GPUs (2,560 Prefill + 7,680 Decode, P:D = 1:3)
  ┌──────────────────────────────────────────────────────────────────┐
  │                                                                  │
  │  Request ──▶ PrefillRouter ──▶ PrefillNode                       │
  │              (prefix match      │                                │
  │               + load balance)   │ RadixTree lookup               │
  │                                 │ (prefix sharing, ref counting) │
  │                                 │ Compute new KV blocks          │
  │                                 │ SessionAware prefetch          │
  │                                 ▼                                │
  │                          KV Transfer (RDMA push)                 │
  │                          pipelined first-chunk latency           │
  │                                 │                                │
  │              DecodeRouter ◀─────┘                                │
  │              (same-rack pref    │                                │
  │               + capacity)       ▼                                │
  │                           DecodeNode                             │
  │                           │ Continuous batching                  │
  │                           │ All active sequences per step        │
  │                           │ Memory-bandwidth bound               │
  │                           ▼                                      │
  │                     Output tokens                                │
  │                                                                  │
  │  Per rack: [P P P P | D D D D D D D D D D D D] + shared EIC      │
  └──────────────────────────────────────────────────────────────────┘

  TTFT = queue_wait + prefill_compute + kv_transfer + first_decode
  Key insight: Unified GPU is blocked for prefill + ALL decode steps.
               PD separation frees the prefill GPU after compute only.
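The TTFT formula and the insight above can be turned into back-of-envelope arithmetic using the default profile's per-token costs from the compute-model table in this README (~0.35 ms/token prefill, ~83.6 ms/token decode, ~6.7 ms pipelined first-chunk transfer). This is an illustration only, not the simulator's code, and it ignores queue wait:

```python
# Back-of-envelope TTFT / GPU-occupancy comparison using the default
# 70B profile numbers quoted in this README (illustrative, queue wait = 0).
PREFILL_MS_PER_TOKEN = 0.35
DECODE_MS_PER_TOKEN = 83.6
FIRST_CHUNK_TRANSFER_MS = 6.7   # pipelined KV first-chunk latency

def unified(prompt_tokens, output_tokens):
    prefill = prompt_tokens * PREFILL_MS_PER_TOKEN
    decode = output_tokens * DECODE_MS_PER_TOKEN
    ttft = prefill + DECODE_MS_PER_TOKEN      # first token after prefill
    gpu_busy_ms = prefill + decode            # GPU blocked for ALL decode steps
    return ttft, gpu_busy_ms

def pd_separated(prompt_tokens, output_tokens):
    prefill = prompt_tokens * PREFILL_MS_PER_TOKEN
    ttft = prefill + FIRST_CHUNK_TRANSFER_MS + DECODE_MS_PER_TOKEN
    prefill_gpu_busy_ms = prefill             # freed right after compute
    return ttft, prefill_gpu_busy_ms

u_ttft, u_busy = unified(2048, 128)
p_ttft, p_busy = pd_separated(2048, 128)
print(f"unified: TTFT {u_ttft:.0f} ms, prefill GPU busy {u_busy / 1000:.1f} s")
print(f"PD:      TTFT {p_ttft:.0f} ms, prefill GPU busy {p_busy / 1000:.1f} s")
```

With these numbers, PD pays a few milliseconds of transfer on TTFT but frees the prefill GPU roughly 15x sooner.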

Quick Start

# Install
pip install -r requirements.txt

# Single-node demo (6 policies, HBM → DRAM → SSD)
python main.py

# 10K-GPU cluster + EIC demo
python main.py --cluster

# PD separation analysis (unified vs PD, P:D ratio sweep, transfer strategies)
python main.py --pd

# Fast smoke run with machine-readable reports
python main.py --cluster --preset smoke \
  --report-json results/smoke_cluster.json \
  --report-csv results/smoke_cluster.csv

# Replay a production-style Azure LLM trace instead of a synthetic trace
python main.py --pd \
  --workload-trace /path/to/AzureLLMInferenceTrace_code.csv \
  --workload-format azure \
  --no-plot --skip-context-sweep

Real Workload Traces

trace/workload.py loads public production-style LLM traces into the same Request objects used by the synthetic generator. Timestamps are normalized to milliseconds, input tokens become prompt KV blocks, and per-request output tokens drive PD decode length when present.
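The normalization described above can be sketched roughly as follows. Field and class names here are illustrative assumptions, not the actual trace/workload.py API:

```python
# Hypothetical sketch of the normalization trace/workload.py performs:
# timestamps -> milliseconds, input tokens -> prompt KV blocks (ceil by
# block size), output tokens kept for PD decode length.
from dataclasses import dataclass

TOKENS_PER_BLOCK = 16  # matches the config.yaml default in this README

@dataclass
class Request:                  # stand-in for the simulator's Request type
    arrival_ms: float
    session_id: str
    prompt_blocks: list
    output_tokens: int

def normalize_row(row: dict, time_unit: str = "s") -> Request:
    scale = {"s": 1000.0, "ms": 1.0, "us": 0.001}[time_unit]
    # Ceiling division: 100 input tokens -> 7 blocks of 16 tokens.
    n_blocks = -(-int(row["input_tokens"]) // TOKENS_PER_BLOCK)
    return Request(
        arrival_ms=float(row["timestamp"]) * scale,
        session_id=str(row.get("session_id", "")),
        prompt_blocks=list(range(n_blocks)),   # placeholder block IDs
        output_tokens=int(row.get("output_tokens", 0)),
    )

req = normalize_row({"timestamp": 1.5, "input_tokens": 100,
                     "output_tokens": 32}, time_unit="s")
print(req.arrival_ms, len(req.prompt_blocks))   # 1500.0 7
```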

Supported schemas:

  • BurstGPT (Timestamp, Session ID, Model, Request tokens, Response tokens): real Azure-powered ChatGPT/GPT-4 workload; session IDs preserve conversation prefix reuse when present.
  • Azure LLM Inference 2023/2024 (TIMESTAMP, ContextTokens, GeneratedTokens): production Azure traces used by Splitwise and DynamoLLM; no prompt text, only token counts.
  • Mooncake traces (timestamp, input_length, output_length, hash_ids): best fit for KV-cache experiments because hash_ids preserve prefix-sharing relationships.
  • SplitwiseSim (arrival_timestamp, prompt_size, token_size): compatible with Splitwise-style generated traces.

Examples:

# BurstGPT CSV, excluding failed rows with zero response tokens by default
python main.py --pd \
  --workload-trace /path/to/BurstGPT_1.csv \
  --workload-format burstgpt \
  --workload-limit 4000 \
  --no-plot --skip-context-sweep

# Mooncake-style JSONL/CSV with hash_ids
python main.py --cluster \
  --workload-trace /path/to/mooncake.jsonl \
  --workload-format mooncake \
  --workload-time-unit ms \
  --no-plot --skip-context-sweep

If a trace has hash_ids, the loader uses them as the actual KV block IDs. Otherwise it synthesizes deterministic block IDs from session_id when present and falls back to unique per-request blocks. This keeps token-count-only traces useful without pretending they contain exact prefix-sharing metadata.
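The fallback logic can be illustrated like this. The function name and hashing scheme are assumptions for illustration, not the loader's actual implementation:

```python
# Illustrative reconstruction of the fallback described above: with no
# hash_ids, derive deterministic per-session block IDs so repeated turns
# within a session still share prefix blocks; otherwise use unique IDs.
import hashlib
import itertools

_unique = itertools.count()

def block_ids(n_blocks, session_id=None, hash_ids=None):
    if hash_ids is not None:
        return list(hash_ids)[:n_blocks]       # trace-provided real IDs
    if session_id:
        # Same session + same position -> same ID, so a session's later
        # requests collide with (i.e. share) its earlier prefix blocks.
        return [hashlib.sha1(f"{session_id}:{i}".encode()).hexdigest()[:16]
                for i in range(n_blocks)]
    return [f"uniq-{next(_unique)}" for _ in range(n_blocks)]

a = block_ids(4, session_id="s1")
b = block_ids(6, session_id="s1")
assert b[:4] == a                       # prefix reuse within a session
assert block_ids(2) != block_ids(2)     # token-count-only traces: no reuse
```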


Production Evaluation Workflow

Use presets for repeatable run scale:

  • --preset smoke: small, fast CI/sanity run; disables training, plots, and context sweeps.
  • --preset dev: medium local run; disables plots and context sweeps.
  • --preset prod-eval: keeps the configured production-scale values.

For production-adjacent runs, prefer:

python main.py --cluster --preset prod-eval \
  --calibration-profile profiles/h100_70b_reference.yaml \
  --workload-trace /path/to/mooncake.jsonl \
  --workload-format mooncake \
  --strict-workload-validation \
  --report-json results/prod_eval_cluster.json \
  --report-csv results/prod_eval_cluster.csv \
  --no-plot --skip-context-sweep

Reports include run metadata, the full config snapshot, calibration readiness, workload validation, cluster diagnostics, and serialized result metrics. PRODUCTION_READINESS.md describes acceptance gates and confidence levels.


File Structure

kvcache-sim/
β”œβ”€β”€ sim/
β”‚   β”œβ”€β”€ storage.py        # StorageTier, KVBlock
β”‚   β”œβ”€β”€ policies.py       # LRU, ARC, Learned, BeladyOracle, prefetch policies
β”‚   β”œβ”€β”€ cache_manager.py  # Single-node multi-tier cache orchestrator
β”‚   β”œβ”€β”€ router.py         # Prefix trie + worker pool router
β”‚   β”œβ”€β”€ metrics.py        # Counters, KPIs, matplotlib visualiser
β”‚   β”œβ”€β”€ network.py        # Network latency model (intra/cross-rack, P2P RDMA)
β”‚   β”œβ”€β”€ calibration.py    # External benchmark/simulator profile overlays
β”‚   β”œβ”€β”€ cluster.py        # GPUNode, EICPool, Rack, Cluster, ClusterRouter
β”‚   β”œβ”€β”€ radix_tree.py     # KV cache radix tree (prefix sharing, ref counting)
β”‚   β”œβ”€β”€ pd_nodes.py       # PrefillNode, DecodeNode, compute models
β”‚   β”œβ”€β”€ pd_router.py      # PrefillRouter, DecodeRouter, PDOrchestrator
β”‚   β”œβ”€β”€ pd_cluster.py     # PDCluster, PDConfig, build_pd_cluster
β”‚   β”œβ”€β”€ pd_metrics.py     # TTFT/TPOT distributions, transfer stats
β”‚   └── kv_transfer.py    # KV transfer protocol (push/pull/pipeline)
β”œβ”€β”€ trace/
β”‚   β”œβ”€β”€ generator.py      # Synthetic multi-turn trace (shared system prompts)
β”‚   β”œβ”€β”€ workload.py       # Public CSV/JSONL workload loader
β”‚   β”œβ”€β”€ replay.py         # Single-node trace replay
β”‚   β”œβ”€β”€ cluster_replay.py # Cluster-scale trace replay
β”‚   └── pd_replay.py      # PD-separated trace replay
β”œβ”€β”€ learned/
β”‚   β”œβ”€β”€ features.py       # 8-dim feature engineering
β”‚   β”œβ”€β”€ train.py          # LightGBM training pipeline
β”‚   └── model.py          # Online inference wrapper
β”œβ”€β”€ experiments/
β”‚   β”œβ”€β”€ run_all.py        # Single-node + cluster experiments
β”‚   β”œβ”€β”€ pd_experiments.py # PD separation experiments
β”‚   β”œβ”€β”€ network_variance.py # Jitter/contention sensitivity experiments
β”‚   └── plot.py           # matplotlib comparison plots
β”œβ”€β”€ profiles/
β”‚   └── h100_70b_reference.yaml # Example calibrated overlay
β”œβ”€β”€ config.yaml           # Full configuration (all three modes)
β”œβ”€β”€ requirements.txt
└── main.py               # Entry point (--cluster / --pd)

PD Separation: Key Concepts

Why PD Separation?

In unified serving, a GPU does prefill (process the prompt) then decode (generate tokens) sequentially. With the default 70B profile, the decode phase (128 tokens × ~83.6 ms) can occupy a GPU for ~10.7 s, blocking new prefill work; this is head-of-line blocking.

PD separation dedicates GPUs to each phase:

  • Prefill nodes: Compute-bound, process prompts, free immediately after
  • Decode nodes: Memory-bandwidth-bound, generate tokens via continuous batching
  • KV transfer: RDMA push of KV cache from prefill → decode node

Components

  • RadixTree: prefix-sharing block tree with reference counting and leaf-only eviction
  • PrefillNode: RadixTree-backed cache + session-aware prefetch + continuous batching
  • DecodeNode: receives KV via RDMA, continuous batching of active sequences
  • KVTransferModel: push/pull strategies, pipeline support, bandwidth modeling
  • PrefillRouter: prefix cache hit scoring + queue-aware load balancing
  • DecodeRouter: same-rack preference (fast transfer) + capacity-aware placement
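A minimal sketch of the prefix-sharing, reference-counting, and leaf-only-eviction mechanics behind RadixTree (simplified to one block per node; the repo's radix_tree.py compresses runs of blocks and will differ in detail):

```python
# Toy prefix tree: insert pins a path (ref counts), eviction only ever
# removes unreferenced leaves, so shared prefixes survive.
class Node:
    def __init__(self, block_id, parent=None):
        self.block_id, self.parent = block_id, parent
        self.children, self.refs = {}, 0

class PrefixTree:
    def __init__(self):
        self.root = Node(None)

    def insert(self, blocks):
        """Walk/extend the tree; ref-count every node on the path."""
        node = self.root
        for b in blocks:
            node = node.children.setdefault(b, Node(b, node))
            node.refs += 1
        return node

    def release(self, blocks):
        """Drop one reference along a previously inserted path."""
        node = self.root
        for b in blocks:
            node = node.children[b]
            node.refs -= 1

    def evict_leaf(self):
        """Evict one unreferenced leaf (leaf-only eviction), or None."""
        stack = [self.root]
        while stack:
            n = stack.pop()
            if n.block_id is not None and not n.children and n.refs == 0:
                del n.parent.children[n.block_id]
                return n.block_id
            stack.extend(n.children.values())
        return None

t = PrefixTree()
t.insert(["sys", "a"])
t.insert(["sys", "b"])       # "sys" is shared, refs == 2
t.release(["sys", "a"])
print(t.evict_leaf())        # -> "a": the only unreferenced leaf
print(t.evict_leaf())        # -> None: "sys" and "b" are still pinned
```

The invariant this preserves is the important part: a shared system-prompt prefix can never be evicted out from under an active sequence.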

Compute Model (H100, 70B default profile)

  • Prefill: 2 × params / TFLOPS ≈ 0.35 ms/token
  • Decode: 2 × params / HBM_BW ≈ 83.6 ms/token
  • Decode (64-seq batch): base + marginal KV overhead ≈ 93.6 ms/step
  • KV transfer, first chunk: 16 blocks × 5 MiB / 12.5 GB/s ≈ 6.7 ms
  • KV transfer, full 8K prompt: 512 blocks × 5 MiB / 12.5 GB/s ≈ 215 ms
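The KV-transfer rows follow directly from the block geometry (16 tokens × 320 KiB per token = 5 MiB per block); the prefill/decode rows additionally fold in the profile's efficiency factors, so only the transfer rows are reproduced here:

```python
# Reproducing the KV-transfer values in the table above from the formulas.
blk_bytes = 5 * 2**20     # 5 MiB per block (16 tokens x 320 KiB)
rdma_bps = 12.5e9         # effective bytes/s for 100 Gbps RDMA

first_chunk_ms = 16 * blk_bytes / rdma_bps * 1e3
full_8k_ms = 512 * blk_bytes / rdma_bps * 1e3
print(f"first chunk: {first_chunk_ms:.1f} ms")   # ~6.7 ms
print(f"full 8K:     {full_8k_ms:.0f} ms")       # ~215 ms
```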

PD Separation: Example Output

================================================================
  kvcache-sim  --  PD Separation Mode
  PDCluster: 10,240 GPUs (2,560P + 7,680D, ratio 1:3) across 160 racks
================================================================

The exact numbers are workload-dependent. The PD replayer now keeps decode
sequences active across requests, so TTFT/TPOT reflect queueing, decode overlap,
P:D ratio, prompt length, output length, and KV transfer settings.

Key interpretation:
- Unified serving blocks a GPU for prefill plus the full decode phase.
- PD serving frees prefill GPUs after prefill, then admits transferred KV into decode nodes.
- Transfer reports both full KV movement and pipelined first-chunk latency for TTFT.
- P:D sweeps are meaningful only for the configured workload and hardware profile.

Single-Node: Policy Comparison

  1. Baseline LRU: eviction Least Recently Used, no prefetch. Classic, near-optimal for sequential workloads.
  2. +ARC: eviction Adaptive Replacement Cache, no prefetch. Balances recency & frequency (T1/T2 + ghost lists).
  3. +SessionPrefetch: eviction LRU, Session-Aware prefetch. Predicts next blocks from session patterns.
  4. +SelectiveWrite: eviction LRU, no prefetch. Only caches shallow-prefix blocks (depth <= 3).
  5. +Learned: eviction by LightGBM reuse predictor, no prefetch. Trained on trace; predicts reuse distance.
  6. Belady Oracle: optimal (offline) eviction, no prefetch. Upper bound; evicts the farthest-future block.
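Why Belady is the upper bound can be seen on a toy cyclic trace, where LRU degenerates completely. This is a self-contained illustration, not the repo's policy classes:

```python
# LRU vs Belady on a cyclic access pattern with a 2-slot cache.
from collections import OrderedDict

def lru_hits(trace, capacity):
    cache, hits = OrderedDict(), 0
    for b in trace:
        if b in cache:
            hits += 1
            cache.move_to_end(b)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[b] = True
    return hits

def belady_hits(trace, capacity):
    cache, hits = set(), 0
    for i, b in enumerate(trace):
        if b in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            # Evict the block reused farthest in the future (or never).
            def next_use(x):
                try:
                    return trace.index(x, i + 1)
                except ValueError:
                    return float("inf")
            cache.discard(max(cache, key=next_use))
        cache.add(b)
    return hits

trace = ["a", "b", "c"] * 3
print(lru_hits(trace, 2), belady_hits(trace, 2))   # 0 3
```

On a cyclic scan LRU always evicts exactly the block needed next, while the oracle keeps one block stable; real serving traces sit between these extremes.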

Configuration

All parameters in config.yaml. Key PD separation settings:

pd_separation:
  pd_ratio: [1, 3]              # Prefill:Decode GPU ratio
  compute:
    prefill_tflops: 800          # H100 FP16 effective TFLOPS
    decode_memory_bw_gbps: 3200  # H100 HBM bandwidth
    model_params_b: 70           # Model size (billions)
    kv_bytes_per_token: 327680   # 320 KiB for Llama-3-70B GQA
    tokens_per_block: 16
    prefill_batch_efficiency: 0.85
    decode_kv_overhead_factor: 0.02
  transfer:
    strategy: push               # push | pull | pull_on_demand
    rdma_bw_gbps: 12.5           # effective GB/s for 100 Gbps RDMA
    pipelining: true
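The kv_bytes_per_token value follows from Llama-3-70B's GQA geometry (80 layers, 8 KV heads, head dimension 128, FP16 K and V). A quick derivation:

```python
# Where 327,680 bytes (320 KiB) of KV per token comes from for
# Llama-3-70B with grouped-query attention (GQA).
layers = 80
kv_heads = 8          # GQA: 8 KV heads (vs 64 query heads)
head_dim = 128
bytes_per_elem = 2    # FP16
kv_pair = 2           # one K and one V vector per head per layer

kv_bytes_per_token = layers * kv_heads * head_dim * bytes_per_elem * kv_pair
print(kv_bytes_per_token)        # 327680 == 320 KiB
print(kv_bytes_per_token * 16)   # 5242880 == 5 MiB per 16-token block
```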

Calibration Profiles

This simulator is intentionally system-level. It should not embed slow cycle-level GPU, DRAM, or SSD simulators in the main replay loop. Instead, run microbenchmarks or external simulators offline, then overlay their calibrated parameters:

python main.py --config config.yaml \
  --calibration-profile profiles/h100_70b_reference.yaml --pd

python -m experiments.network_variance \
  --config config.yaml \
  --calibration-profile profiles/h100_70b_reference.yaml

Supported overlay sections are hardware, cache, cluster, and pd_separation. Unknown sections are rejected, so a typoed calibration key fails loudly instead of silently doing nothing.
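The strict-section check can be sketched as follows; the function name and merge semantics are hypothetical, not sim/calibration.py's actual API:

```python
# Hypothetical sketch of the strict overlay check described above:
# accept only known top-level sections so a typo raises instead of
# silently leaving the config unchanged.
ALLOWED_SECTIONS = {"hardware", "cache", "cluster", "pd_separation"}

def apply_overlay(config: dict, overlay: dict) -> dict:
    unknown = set(overlay) - ALLOWED_SECTIONS
    if unknown:
        raise ValueError(f"unknown calibration sections: {sorted(unknown)}")
    merged = {**config}
    for section, values in overlay.items():
        # Shallow merge: overlay keys win within each known section.
        merged[section] = {**config.get(section, {}), **values}
    return merged

cfg = apply_overlay({"hardware": {"hbm_gbps": 3200}},
                    {"hardware": {"hbm_gbps": 3350}})
print(cfg["hardware"]["hbm_gbps"])   # 3350
```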

Recommended calibration sources:

  • LLM operator timing (pd_separation.compute.*): Vidur-style profiling or real serving traces
  • GPU/HBM bandwidth (hardware.hbm, cluster.gpu): Accel-Sim for kernels, plus bandwidth microbenchmarks
  • CPU DRAM/HBM details (hardware.dram, EIC params): Ramulator2 or DRAMsim3
  • SSD/NVMe (hardware.ssd): MQSim or SimpleSSD
  • Network/RDMA/NVLink (cluster.network, transfer params): RDMA/NVLink microbenchmarks, ASTRA-sim/ns-3 for topology effects

The default profile in profiles/h100_70b_reference.yaml mirrors the default configuration and documents the expected shape for calibrated values.


What You Can Optimize With This Simulator

  1. P:D Ratio Selection: find the optimal prefill/decode GPU split for your QPS and prompt lengths
  2. Prefix Cache Capacity Planning: how much HBM/EIC to allocate for KV cache vs model weights
  3. Interconnect Bandwidth ROI: compare 25/50/100/200 Gbps for KV transfer overhead
  4. Eviction Policy Selection: LRU vs ARC vs Learned under different workload patterns
  5. EIC Sizing: how much shared CXL memory per rack for cross-GPU prefix reuse
  6. Context Length Impact: how 4K vs 32K vs 128K contexts affect cache dynamics and PD benefit

License

MIT
