Efficient LLM Serving with Multi-Tier, Prefix-Aware KV Cache Sharing for Scalable Multi-Agent Systems
This repository contains the implementation of benchmarks described in the report and scripts for deploying the vLLM server and client programs. The implementation of MT-APC is available in a separate repository here.
Recent advancements in large language models (LLMs) have significantly enhanced their performance and efficiency, enabling the development of single-agent and multi-agent (MA) systems for complex tasks such as automation and agent-based simulations. However, the high computational time and cost of LLM inference limit their applicability in sophisticated or large-scale tasks, particularly as input size grows with the number of agents in the system.
Since agents in most MA systems often share identical system prompts and engage in multi-turn interactions, KV cache optimization techniques like prefix-aware KV caching can be used to reuse the KV cache across requests with common prefixes, thereby eliminating redundant computations and reducing memory requirements. Nevertheless, as the number of agents scales to thousands or millions, the effectiveness of KV cache sharing diminishes, leading to increased recomputations.
To address this, we introduce MT-APC, a novel prefix-aware KV caching mechanism that leverages a memory hierarchy, including GPU, CPU, and disk storage, to store a large KV cache. Combined with asynchronous KV cache movement and device- and prefix-aware scheduling mechanisms, our system efficiently caches and reuses a larger amount of KV tensors. We evaluate our system on large-scale MA workloads, demonstrating its effectiveness in improving throughput and reducing time to first token.
This repository supports evaluating and benchmarking MT-APC and other baselines under different workloads and includes:
- Scripts to set up and deploy the vLLM server and client programs.
- Tools for evaluating the serving systems' performance with different kinds of workloads, including large-scale multi-agent applications.
-
Clone this repository:
git clone https://github.com/kaiitunnz/elmas.git cd elmas -
Create a Conda environment:
conda create -n elmas python=3.11 && conda activate elmas -
Install dependencies:
bash scripts/install.sh
-
Create a
.envfile for default server configuration. See .env.example for reference.
Note: If you encounter the following error: ImportError: cannot import name 'cached_download' from 'huggingface_hub', add the following line to <PYTHON_LIBRARY_PATH>/huggingface_hub/__init__.py.
cached_download = NoneThe project depends on the following:
- Python version 3.11 (tested only on this version)
- CUDA-compatible GPU (for GPU acceleration)
- Dependencies such as our versions of vLLM and GPTSwarm. See scripts/install.sh for installation.
After installing this package and its dependencies with scripts/install.sh, you can perform the following actions.
We provide a script to run the benchmarks we used in the report. See scripts/experiments.sh or run the following command.
bash scripts/experiments.shThe following command template can be used to run individual benchmarks.
python -O benchmarks/runner.py \
--benchmarks <benchmark-name> \
--servers <server-name> \
--num-trials=5 \
--result-dir=</path/to/result/dir> \
--clear-result-dirSee benchmarks/runner.py or run the following command to see the lists of benchmarks and servers.
python benchmarks/runner.py --helpWe provide a script for starting the vLLM server with various options. Below are some examples. You can remove the -O flag to enable assert statements.
-
vLLM server without prefix caching
python -Om agents.utils.vllm.start_server \ --preemption-mode=recompute
-
vLLM server with APC
python -Om agents.utils.vllm.start_server \ --enable-prefix-caching \ --preemption-mode=recompute
-
vLLM server with MT-APC
python -Om agents.utils.vllm.start_server \ --enable-prefix-caching \ --enable-multi-tier-prefix-caching \ --enable-async-swapping \ --enable-prefix-aware-scheduling \ --enable-async-prefetching \ --scheduler-window-size=10 \ --preemption-mode=recompute
-
vLLM server with MT-APC and profiling enabled
python -Om agents.utils.vllm.start_server \ --enable-prefix-caching \ --enable-multi-tier-prefix-caching \ --enable-async-swapping \ --enable-prefix-aware-scheduling \ --enable-async-prefetching \ --scheduler-window-size=10 \ --profiling \ --preemption-mode=recompute
We provide several example client programs listed below.
-
Chatbot applications
python -m agents.chatbot.chatbot # Chatbot assistant python -m agents.chatbot.completion # LLM completion python -m agents.chatbot.profile # Simple prompts for profiling the server
-
GPTSwarm's agent applications
python -m agents.gptswarm.guessing_game --num-participants=20 --num-steps=5 # Guessing Game simulation python -m agents.gptswarm.gaia # GAIA application python -m agents.gptswarm.crosswords # Mini CrossWords application
We welcome contributions! Please: 1. Fork the repository. 2. Create a new branch for your feature or bugfix. 3. Submit a pull request detailing your changes.
For further questions, please contact the authors listed in the paper or open an issue in this repository.