[router] cache-aware load-balancing router v1 #2114
Conversation
.map(|kv| kv.key().to_owned())
.unwrap_or("empty".to_string());

// Traverse from the curr node to the root and update the timestamp
Maybe not important, but this could happen during the matching process (the traversal takes time, though it is probably not the bottleneck now).
if curr.children.contains_key(first_id) {
    let child = curr.children.get(first_id).unwrap();

pub fn evict_tenant_data(&self, max_size: usize) {
A priority queue (actually, a linked list would be better) could be maintained instead of being recomputed on every eviction. That said, the current implementation is fine for moving fast, since it is effectively lazy eviction and the cost is amortized.
Could you explain how the linked-list approach would work?
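For context on the eviction discussion above, here is a minimal sketch of the lazy, recompute-per-pass eviction style the review describes: on each eviction pass, re-sort leaves by their last-access timestamp and drop the oldest until the tree fits under `max_size`. The `Leaf` struct, field names, and function name are all hypothetical, not the PR's actual code.

```rust
// Hypothetical leaf record: when it was last touched and how much it stores.
struct Leaf {
    last_access: u64,
    size: usize,
}

// Lazy eviction sketch: re-sorting here is the "recompute each eviction"
// mentioned in the review; a maintained heap or LRU linked list would
// avoid the rebuild at the cost of extra bookkeeping per request.
fn evict_until_fits(leaves: &mut Vec<Leaf>, max_size: usize) {
    leaves.sort_by_key(|l| l.last_access);
    let mut total: usize = leaves.iter().map(|l| l.size).sum();
    // Drop the least recently used leaves until we fit under the cap.
    while total > max_size && !leaves.is_empty() {
        total -= leaves[0].size;
        leaves.remove(0);
    }
}
```

The amortized argument in the comment above is that the sort cost is paid only once per eviction pass, not per request.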
Motivation
Related to #1732
This PR finishes the first version of the cache-aware load-balancing router. For data with long shared prefixes, it achieves 2x throughput compared with the existing round-robin DP controller.
Usage
The router offers two modes:
1. Co-launch workers and router
This will be a drop-in replacement for the existing --dp-size argument. This part of the code will be moved into sglang core. Under the hood, it uses multiple processes to launch several sglang workers, waits for them to become healthy, then launches the router.
2. Launch only router
This is useful for multi-node DP. You can launch workers on different nodes, then connect the router to them.
Strategy
Cache-Aware Load-Balancing Router
This router combines two strategies to optimize both cache utilization and request distribution:
1. Cache-Aware Routing (Approximate Tree)
This strategy maintains an approximate radix tree for each worker based on request history,
eliminating the need for direct cache state queries. The tree stores raw text characters
instead of token IDs to avoid tokenization overhead.
Process:
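The process steps are not reproduced here, but the approximate tree described above could be sketched as a per-worker, character-level prefix tree (a simplified version: a real radix tree compresses chains of single-child nodes, and all names below are illustrative, not the PR's actual API):

```rust
use std::collections::HashMap;

// One tree per worker; nodes hold one character each for clarity.
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
}

#[derive(Default)]
struct ApproxTree {
    root: Node,
}

impl ApproxTree {
    // Record the raw text of a request routed to this worker.
    fn insert(&mut self, text: &str) {
        let mut curr = &mut self.root;
        for ch in text.chars() {
            curr = curr.children.entry(ch).or_default();
        }
    }

    // Length of the longest prefix of `text` previously seen by this worker,
    // i.e. an approximation of how much of the request is already cached.
    fn match_prefix(&self, text: &str) -> usize {
        let mut curr = &self.root;
        let mut matched = 0;
        for ch in text.chars() {
            match curr.children.get(&ch) {
                Some(child) => {
                    curr = child;
                    matched += 1;
                }
                None => break,
            }
        }
        matched
    }
}
```

Because the tree stores raw characters rather than token IDs, insertion and matching need no tokenizer call, matching the overhead argument above.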
2. Load-Balancing (Shortest Queue)
This strategy tracks pending request counts per worker and routes new requests
to the least busy worker for optimal load distribution.
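One plausible way the two strategies combine, given the configuration parameters below (cache_routing_prob, cache_threshold), is: with probability cache_routing_prob, try the cache-aware route, falling back to shortest queue when no worker's prefix match clears the threshold. This is a hedged sketch, not the PR's actual routing code; the function and parameter names are assumptions.

```rust
// Hypothetical selection logic combining cache-aware routing with a
// shortest-queue fallback.
fn select_worker(
    prefix_match_fractions: &[f64], // per-worker matched-prefix fraction of the request
    pending_counts: &[usize],       // per-worker outstanding request counts
    cache_threshold: f64,           // minimum match fraction to trust the cache route
    use_cache_route: bool,          // drawn with probability cache_routing_prob
) -> usize {
    if use_cache_route {
        // Cache-aware: pick the worker with the best approximate prefix
        // match, as long as it clears the threshold.
        let (best, &frac) = prefix_match_fractions
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .unwrap();
        if frac >= cache_threshold {
            return best;
        }
    }
    // Load-balancing fallback: route to the least busy worker.
    pending_counts
        .iter()
        .enumerate()
        .min_by_key(|&(_, &c)| c)
        .map(|(i, _)| i)
        .unwrap()
}
```

Setting the routing probability to 0.5, as in the benchmark section below, would make half the requests take the shortest-queue path regardless of cache state, spreading load when one worker holds all the hot prefixes.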
Configuration Parameters
cache_routing_prob: (float, 0.0 to 1.0)
cache_threshold: (float, 0.0 to 1.0)
eviction_interval_secs: (integer)
max_tree_size: (integer)
Benchmark Results
Generated Shared Prefix Dataset
SharedGPT Dataset
Performance does not degrade in the non-cache-heavy case.
Multi Turn Dataset
Generated Shared Prefix Dataset (with only one system prompt)
Fully cache-aware routing shows a performance degradation here because all requests are routed to one node. Tuning the routing probability to 0.5 beats naive round-robin.
Reference: https://docs.google.com/spreadsheets/d/1Y_dY4EGpk26MsehoWf6K85p7BXBBWlXI-gk6-Ei5-cs/edit?gid=1463925947#gid=1463925947