
[router] cache-aware load-balancing router v1 #2114

Merged · 9 commits · Nov 23, 2024

Conversation

@ByronHsu (Collaborator) commented Nov 21, 2024

Motivation

Related to #1732

This PR completes the first version of the cache-aware load-balancing router. For workloads with long shared prefixes, it achieves 2x the throughput of the existing round-robin DP controller.

Usage

The router offers two modes:

1. Co-launch workers and router

This is a drop-in replacement for the existing --dp-size option; this part of the code will be moved into sglang core.
Under the hood, it uses multiple processes to launch the sglang workers, waits for them to become healthy, and then launches the router.

$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 8

2. Launch only router

This is useful for multi-node DP: you can launch workers on different nodes, then connect the router to them.

$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000

$ python -m sglang_router.launch_router --help
usage: launch_router.py [-h] [--host HOST] [--port PORT] [--worker-urls WORKER_URLS [WORKER_URLS ...]]
                        [--policy {random,round_robin,cache_aware}] [--cache-threshold CACHE_THRESHOLD]
                        [--cache-routing-prob CACHE_ROUTING_PROB] [--eviction-interval EVICTION_INTERVAL]
                        [--max-tree-size MAX_TREE_SIZE]

options:
  -h, --help            show this help message and exit
  --host HOST           Host address to bind the router server (default: 127.0.0.1)
  --port PORT           Port number to bind the router server (default: 30000)
  --worker-urls WORKER_URLS [WORKER_URLS ...]
                        List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None)
  --policy {random,round_robin,cache_aware}
                        Load balancing policy to use (default: cache_aware)
  --cache-threshold CACHE_THRESHOLD
                        Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5)
  --cache-routing-prob CACHE_ROUTING_PROB
                        Probability of using cache-aware routing (0.0-1.0) (default: 1.0)
  --eviction-interval EVICTION_INTERVAL
                        Interval in seconds between cache eviction operations (default: 60)
  --max-tree-size MAX_TREE_SIZE
                        Maximum size of the approximation tree for cache-aware routing (default: 16777216)

Strategy

Cache-Aware Load-Balancing Router

This router combines two strategies to optimize both cache utilization and request distribution:

  1. Cache-Aware Routing (Approximate Tree)
  2. Load-Balancing Routing (Shortest Queue)

1. Cache-Aware Routing (Approximate Tree)

This strategy maintains an approximate radix tree for each worker based on request history,
eliminating the need for direct cache state queries. The tree stores raw text characters
instead of token IDs to avoid tokenization overhead.
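
As an illustration, here is a minimal Rust sketch of what such a character-based tree might look like; all names here are hypothetical, not the actual router types:

```rust
use std::collections::HashMap;

/// Hypothetical node of a character-based approximate radix tree; the real
/// types may differ. Edges store raw text, so no tokenizer is needed.
#[derive(Default)]
struct Node {
    text: String,                  // edge label: a run of raw characters
    children: HashMap<char, Node>, // keyed by the first char of each child's edge
    last_access_ms: u128,          // timestamp used for LRU leaf eviction
}

/// Return how many leading characters of `s` are matched by the tree.
fn match_prefix_len(root: &Node, s: &str) -> usize {
    let (mut node, mut matched, mut rest) = (root, 0, s);
    loop {
        let Some(first) = rest.chars().next() else { return matched };
        let Some(child) = node.children.get(&first) else { return matched };
        // Count how many leading chars of `rest` agree with the edge label.
        let common = child
            .text
            .chars()
            .zip(rest.chars())
            .take_while(|(a, b)| a == b)
            .count();
        matched += common;
        if common < child.text.chars().count() {
            return matched; // partial match inside the edge: stop here
        }
        rest = &rest[child.text.len()..]; // full edge matched: descend
        node = child;
    }
}

fn main() {
    // Tiny hand-built tree: root -> "hello " -> "world"
    let mut hello = Node { text: "hello ".into(), ..Default::default() };
    hello.children.insert('w', Node { text: "world".into(), ..Default::default() });
    let mut root = Node::default();
    root.children.insert('h', hello);
    assert_eq!(match_prefix_len(&root, "hello wor"), 9);
}
```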

Process (see the sketch after this list):

  • For each request, find the worker with the highest prefix match
  • If match rate > cache_threshold:
    • Route to the worker with highest match (likely has relevant data cached)
  • If match rate ≤ cache_threshold:
    • Route to the worker with smallest tree size (most available cache capacity)
  • Background maintenance:
    • Periodically evict least recently used leaf nodes to prevent memory overflow
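
A minimal sketch of that decision in Rust, assuming the per-worker match rates and tree sizes are already computed (names are illustrative):

```rust
/// Hypothetical decision mirroring the steps above. `match_rates[i]` is
/// worker i's prefix match rate for the request, `tree_sizes[i]` the node
/// count of worker i's approximate tree.
fn pick_worker(match_rates: &[f64], tree_sizes: &[usize], cache_threshold: f64) -> usize {
    let (best, &best_rate) = match_rates
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .expect("at least one worker");
    if best_rate > cache_threshold {
        best // likely has the shared prefix cached
    } else {
        // Low match everywhere: pick the smallest tree, i.e. the worker
        // with the most available cache capacity.
        tree_sizes
            .iter()
            .enumerate()
            .min_by_key(|&(_, &s)| s)
            .map(|(i, _)| i)
            .expect("at least one worker")
    }
}

fn main() {
    assert_eq!(pick_worker(&[0.9, 0.2], &[100, 40], 0.5), 0); // 0.9 > threshold
    assert_eq!(pick_worker(&[0.1, 0.2], &[100, 40], 0.5), 1); // fall back to smallest tree
}
```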

2. Load-Balancing (Shortest Queue)

This strategy tracks pending request counts per worker and routes new requests
to the least busy worker for optimal load distribution.
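
A sketch of the bookkeeping, with illustrative names: each worker carries an atomic in-flight counter that is incremented on dispatch and decremented by the caller on completion:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Pick the worker with the fewest pending requests and mark one more
/// request as in flight. The caller decrements the counter on completion.
fn shortest_queue(pending: &[AtomicUsize]) -> usize {
    let idx = pending
        .iter()
        .enumerate()
        .min_by_key(|(_, c)| c.load(Ordering::Relaxed))
        .map(|(i, _)| i)
        .expect("at least one worker");
    pending[idx].fetch_add(1, Ordering::Relaxed);
    idx
}

fn main() {
    let pending = [AtomicUsize::new(3), AtomicUsize::new(1), AtomicUsize::new(2)];
    assert_eq!(shortest_queue(&pending), 1); // worker 1 is least busy
    assert_eq!(pending[1].load(Ordering::Relaxed), 2);
}
```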

Configuration Parameters

  1. cache_routing_prob: (float, 0.0 to 1.0)

    • 0.0: Exclusively use load balancing
    • 1.0: Exclusively use cache-aware routing
    • Between 0.0 and 1.0: probability of using cache-aware routing vs. load balancing (see the dispatch sketch after this list)
  2. cache_threshold: (float, 0.0 to 1.0)

    • Minimum prefix match ratio to use highest-match routing
    • Below this threshold, routes to worker with most available cache space
  3. eviction_interval_secs: (integer)

    • Interval between LRU eviction cycles for the approximate trees
  4. max_tree_size: (integer)

    • Maximum nodes per tree
    • When exceeded, LRU leaf nodes are evicted during the next eviction cycle
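
To show how cache_routing_prob ties the two strategies together, a hypothetical dispatch sketch (the coin sample would come from an RNG, e.g. the rand crate; this is not the actual router code):

```rust
/// Hypothetical top-level dispatch. `coin` is a uniform sample in [0.0, 1.0)
/// drawn by the caller; the other arguments are the two policies' choices.
fn route(
    coin: f64,
    cache_routing_prob: f64,
    cache_aware_choice: usize,
    shortest_queue_choice: usize,
) -> usize {
    if coin < cache_routing_prob {
        cache_aware_choice // always taken when cache_routing_prob = 1.0
    } else {
        shortest_queue_choice // always taken when cache_routing_prob = 0.0
    }
}

fn main() {
    // With cache_routing_prob = 0.5, roughly half of requests go to each policy.
    assert_eq!(route(0.3, 0.5, 7, 2), 7);
    assert_eq!(route(0.8, 0.5, 7, 2), 2);
}
```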

Benchmark Results

Generated Shared Prefix Dataset

python bench_serving.py --host 127.0.0.1 --port 30000 --dataset-name generated-shared-prefix \
    --generated-input-path ~/.cache/gen.json --generated-input-save-path ~/.cache/gen.json

| Method         | Throughput | Cache Rate |
|----------------|------------|------------|
| Original RR DP | 82,665     | 20%        |
| Cache Aware v1 | 158,596.72 | 75%        |
| Perfect        | 160,288    | 75%        |

ShareGPT Dataset

python bench_serving.py --host 127.0.0.1 --port 30000

Performance does not degrade in the non-cache-heavy case:

| Method         | Throughput | Cache Rate |
|----------------|------------|------------|
| Original RR DP | 17,164     | 2%         |
| Cache Aware v1 | 17,775     | 2%         |

Multi Turn Dataset

python long_prompt_multi_turn.py --port 30000 --tokenizer "/shared/public/elr-models/meta-llama/Meta-Llama-3.1-8B-Instruct/07eb05b21d191a58c577b4a45982fe0c049d0693/" | tee client.log

| Method         | Latency | Cache Rate |
|----------------|---------|------------|
| Original RR DP | 34      | 35%        |
| Cache Aware v1 | 19      | 88%        |
| Perfect        | 19      | 88%        |

Generated Shared Prefix Dataset with only one system prompt

python bench_serving.py --host 127.0.0.1 --port 30000 --dataset-name generated-shared-prefix --gen-num-groups 1 --gen-prompts-per-group 1024

Fully cache-aware routing degrades performance here because all requests are routed to a single node. Tuning the routing probability down to 0.5 beats the naive RR baseline.

| Version                          | Throughput | Cache Rate |
|----------------------------------|------------|------------|
| Original RR DP                   | 154535.56  |            |
| Cache aware v1                   | 36510.71   |            |
| Cache aware v1, routing prob 0.5 | 190026.64  |            |

Reference: https://docs.google.com/spreadsheets/d/1Y_dY4EGpk26MsehoWf6K85p7BXBBWlXI-gk6-Ei5-cs/edit?gid=1463925947#gid=1463925947

@ByronHsu changed the title from "cache aware dp v1" to "[router] cache aware dp v1" on Nov 21, 2024
@ByronHsu force-pushed the byhsu/approx-v2-new branch from dadf3e1 to 4f3c5d7 on Nov 21, 2024 at 20:39
@ByronHsu changed the title from "[router] cache aware dp v1" to "[router] cache-aware load-balancing router" on Nov 21, 2024
@ByronHsu changed the title from "[router] cache-aware load-balancing router" to "[router] cache-aware load-balancing router v1" on Nov 21, 2024
@ByronHsu force-pushed the byhsu/approx-v2-new branch from 1b37cd3 to 961c5a6 on Nov 21, 2024 at 22:50
@ByronHsu marked this pull request as ready for review on Nov 21, 2024 at 22:50
@ByronHsu force-pushed the byhsu/approx-v2-new branch from da1e53d to 98af490 on Nov 22, 2024 at 07:34
```rust
.map(|kv| kv.key().to_owned())
.unwrap_or("empty".to_string());

// Traverse from the curr node to the root and update the timestamp
```
A reviewer (Member) commented:
Maybe not important, but this could also happen during the matching process (the traversal takes time, though it is probably not the bottleneck now).


```rust
if curr.children.contains_key(first_id) {
    let child = curr.children.get(first_id).unwrap();
```

```rust
pub fn evict_tenant_data(&self, max_size: usize) {
```
A reviewer (Member) commented:
A priority queue (actually, a linked list would be better) could be maintained instead of being recomputed at each eviction. The current implementation is also fine for moving fast, since it is effectively lazy eviction and the cost is amortized.
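
A minimal sketch of the recompute-per-cycle eviction described above, with illustrative names rather than the PR's actual API:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Each eviction cycle, rebuild a min-heap of (last_access, node_id) over all
/// leaves and evict the least recently used until `size <= max_size`.
fn evict_lru_leaves(
    leaves: Vec<(u128, u64)>, // (last_access_ms, node_id) for every leaf
    mut size: usize,
    max_size: usize,
) -> Vec<u64> {
    let mut heap: BinaryHeap<Reverse<(u128, u64)>> =
        leaves.into_iter().map(Reverse).collect();
    let mut evicted = Vec::new();
    while size > max_size {
        let Some(Reverse((_, id))) = heap.pop() else { break };
        evicted.push(id); // oldest leaf goes first
        size -= 1;
    }
    evicted
}

fn main() {
    // Tree of 4 nodes, limit 2: the two oldest leaves are evicted.
    let leaves = vec![(30, 1), (10, 2), (20, 3)];
    assert_eq!(evict_lru_leaves(leaves, 4, 2), vec![2, 3]);
}
```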

@ByronHsu (Author) replied:

Could you explain how the LL (linked list) approach would work?

@ByronHsu merged commit cbedd1d into sgl-project:main on Nov 23, 2024
14 of 15 checks passed
@merrymercy mentioned this pull request on Nov 24, 2024