10 changes: 10 additions & 0 deletions doc/source/cluster/kubernetes/examples/rayserve-llm-example.md
@@ -163,3 +163,13 @@ kubectl port-forward svc/ray-serve-llm-head-svc 8265

Once forwarded, navigate to the Serve tab on the dashboard to review application status, deployments, routers, logs, and other relevant features.
![LLM Serve Application](../images/ray_dashboard_llm_application.png)

## Add custom dependencies

To install custom packages that your LLM services need but that can't be installed directly through `runtime_env`, such as KV cache backends, see [Add custom dependencies](kuberay-rayservice-custom-deps) in the RayService user guide. For a complete example, see the LMCache and Mooncake distributed KV cache integration at [Deploy on Kubernetes with LMCache and Mooncake](kv-cache-offloading-guide).

Download a basic example:

```sh
curl -o ray-service.extra-dependency.yaml https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.extra-dependency.yaml
```
58 changes: 58 additions & 0 deletions doc/source/cluster/kubernetes/user-guides/rayservice.md
@@ -296,6 +296,64 @@ helm uninstall kuberay-operator
kubectl delete pod curl
```

(kuberay-rayservice-custom-deps)=
## Add custom dependencies

You may need to install additional packages in your Ray containers. KubeRay supports two approaches depending on whether dependencies are shared across all applications or specific to one.

### Install shared dependencies via args

Use the `args` field to install system packages and Python dependencies that all applications in the RayService need. These packages are installed at container startup and are available to every application:

```yaml
workerGroupSpecs:
- groupName: worker-group
template:
spec:
containers:
- name: ray-worker
image: rayproject/ray:2.53.0
args:
- |
sudo apt-get update && \
sudo apt-get install -y --no-install-recommends curl && \
sudo rm -rf /var/lib/apt/lists/* && \
          pip install httpx
```

You can also install Python packages via `runtime_env`, but using `args` makes them available to all applications and avoids repeated installation.

### Install application-specific dependencies via runtime_env

For dependencies that only a specific application needs, use `runtime_env` in your Serve configuration. This approach lets you install different packages for different applications:

```yaml
serveConfigV2: |
applications:
- name: fruit_app
import_path: fruit.deployment_graph
runtime_env:
pip:
- pandas
- name: math_app
import_path: conditional_dag.serve_dag
runtime_env:
pip:
- numpy
```

Download a complete example combining both approaches:

```sh
curl -o ray-service.extra-dependency.yaml https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.extra-dependency.yaml
```

:::{note}
Packages installed via `args` are reinstalled on every container restart. For production, consider building a custom image with shared dependencies pre-installed.
:::
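
As a rough sketch, a custom image that bakes in the shared dependencies from the example above might look like the following; the base image tag is illustrative, so match it to the Ray version you deploy:

```dockerfile
# Sketch: pre-install shared dependencies instead of installing them via args.
# Base image tag is illustrative; rayproject/ray images allow passwordless sudo.
FROM rayproject/ray:2.53.0
RUN sudo apt-get update \
    && sudo apt-get install -y --no-install-recommends curl \
    && sudo rm -rf /var/lib/apt/lists/* \
    && pip install httpx
```

With the dependencies baked in, you can drop the install commands from `args` entirely and point the head and worker groups at the custom image.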

For advanced container command customization, see [Specify container commands](kuberay-pod-command).

## Next steps

* See [RayService high availability](kuberay-rayservice-ha) for more details on RayService HA.
78 changes: 78 additions & 0 deletions doc/source/serve/llm/user-guides/kv-cache-offloading.md
@@ -263,7 +263,85 @@ Extending KV cache beyond local GPU memory introduces overhead for managing and
**Network transfer costs**: When combining MultiConnector with cross-instance transfer (such as NIXL), ensure that the benefits of disaggregation outweigh the network transfer costs.


## Deploy on Kubernetes with LMCache and Mooncake

For distributed KV cache sharing across multiple GPU workers, you can use LMCache with Mooncake as the storage backend. Mooncake creates a distributed memory pool by aggregating memory from multiple nodes, supporting high-bandwidth RDMA or TCP transfer.

### Install system packages

Mooncake requires system-level dependencies. Use the `args` field in your RayService worker spec to install them at container startup:

```yaml
# Reference for system packages required by Mooncake: https://kvcache-ai.github.io/Mooncake/getting_started/build.html
- name: ray-worker
image: rayproject/ray-llm:2.53.0-py311-cu128
args:
- |
sudo apt-get update && \
sudo apt-get install -y --no-install-recommends \
build-essential cmake libibverbs-dev libgoogle-glog-dev \
libgtest-dev libjsoncpp-dev libnuma-dev libunwind-dev \
libpython3-dev libboost-all-dev libssl-dev pybind11-dev \
libcurl4-openssl-dev libhiredis-dev pkg-config patchelf && \
    sudo rm -rf /var/lib/apt/lists/*
```

For general Kubernetes dependency patterns, also see [Add custom dependencies](kuberay-rayservice-custom-deps) in the RayService user guide.

### Install LMCache and Mooncake via runtime_env

Add the Python packages through `runtime_env` in your LLM configuration:

```yaml
runtime_env:
pip:
- lmcache
- mooncake-transfer-engine
env_vars:
LMCACHE_CONFIG_FILE: "/mnt/configs/lmcache-config.yaml"
    PYTHONHASHSEED: "0"  # Required for consistent cache keys across processes
```

### Configure kv_transfer_config

Enable the LMCache connector in your engine configuration:

```yaml
engine_kwargs:
kv_transfer_config:
kv_connector: "LMCacheConnectorV1"
kv_role: "kv_both"
```
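
Putting the pieces together, the following sketch shows where the `runtime_env` and `kv_transfer_config` fragments sit in a Serve LLM application config, assuming the `ray.serve.llm:build_openai_app` application builder; the application name, model id, and model source are placeholders:

```yaml
# Sketch: combined Serve LLM application config (placeholders noted inline).
applications:
  - name: llm_app                                   # placeholder
    route_prefix: /
    import_path: ray.serve.llm:build_openai_app
    args:
      llm_configs:
        - model_loading_config:
            model_id: my-model                      # placeholder
            model_source: Qwen/Qwen2.5-7B-Instruct  # placeholder
          runtime_env:
            pip:
              - lmcache
              - mooncake-transfer-engine
            env_vars:
              LMCACHE_CONFIG_FILE: "/mnt/configs/lmcache-config.yaml"
              PYTHONHASHSEED: "0"
          engine_kwargs:
            kv_transfer_config:
              kv_connector: "LMCacheConnectorV1"
              kv_role: "kv_both"
```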

### LMCache configuration for Mooncake

Create a ConfigMap with your LMCache configuration file and mount it on your worker containers. The configuration specifies the Mooncake master address, metadata server, and transfer protocol. For the full configuration reference, see the [LMCache Mooncake backend documentation](https://docs.lmcache.ai/kv_cache/storage_backends/mooncake.html).

Example `lmcache-config.yaml`:

```yaml
chunk_size: 256
remote_url: "mooncakestore://mooncake-master.default.svc.cluster.local:50051/" # Requires a Mooncake master deployed in advance; see the note below.
remote_serde: "naive"
local_cpu: true
max_local_cpu_size: 64 # GB

extra_config:
  # Use etcd instead of the Mooncake metadata server, which is currently unstable; etcd is the recommended option.
  # See https://etcd.io/docs/v3.6/op-guide/kubernetes/ for setting up etcd on Kubernetes.
metadata_server: "etcd://etcd.default.svc.cluster.local:2379"
protocol: "tcp" # Use "rdma" for RDMA-capable networks
master_server_address: "mooncake-master.default.svc.cluster.local:50051"
  global_segment_size: 21474836480 # 20 GiB per worker
```
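
To make this file available at the path set in `LMCACHE_CONFIG_FILE`, store it in a ConfigMap (for example, `kubectl create configmap lmcache-config --from-file=lmcache-config.yaml`) and mount it on the worker pod spec. A minimal sketch, with illustrative ConfigMap and volume names:

```yaml
# Sketch: mount the LMCache config so LMCACHE_CONFIG_FILE resolves
# (ConfigMap and volume names are illustrative).
volumes:
  - name: lmcache-config
    configMap:
      name: lmcache-config
containers:
  - name: ray-worker
    volumeMounts:
      - name: lmcache-config
        mountPath: /mnt/configs  # LMCACHE_CONFIG_FILE points here
```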

:::{note}
This setup requires a running Mooncake master service and metadata server (etcd or HTTP). See the [Mooncake deployment guide](https://kvcache-ai.github.io/Mooncake/getting_started/build.html) for infrastructure setup.
:::
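
As a rough illustration only, an in-cluster Mooncake master could be deployed along these lines; this assumes an image that provides the `mooncake_master` binary (for example, one with the `mooncake-transfer-engine` pip package installed), and the image name, flags, and resources are placeholders to validate against the Mooncake deployment guide:

```yaml
# Sketch: minimal Mooncake master Deployment and Service (placeholder image).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mooncake-master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mooncake-master
  template:
    metadata:
      labels:
        app: mooncake-master
    spec:
      containers:
        - name: mooncake-master
          image: my-registry/mooncake-master:latest  # placeholder image
          command: ["mooncake_master", "--port", "50051"]
          ports:
            - containerPort: 50051
---
apiVersion: v1
kind: Service
metadata:
  name: mooncake-master  # resolves to mooncake-master.default.svc.cluster.local
spec:
  selector:
    app: mooncake-master
  ports:
    - port: 50051
      targetPort: 50051
```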

## See also

- {doc}`Prefill/decode disaggregation <prefill-decode>` - Deploy LLMs with separated prefill and decode phases
- [LMCache documentation](https://docs.lmcache.ai/) - Comprehensive LMCache configuration and features
- [LMCache Mooncake backend](https://docs.lmcache.ai/kv_cache/storage_backends/mooncake.html) - Distributed KV cache storage setup
- [Mooncake build guide](https://kvcache-ai.github.io/Mooncake/getting_started/build.html) - System dependencies and installation
4 changes: 4 additions & 0 deletions doc/source/serve/production-guide/handling-dependencies.md
@@ -59,3 +59,7 @@ Example:
```{literalinclude} ../doc_code/delayed_import.py
:language: python
```

## Handle dependencies on Kubernetes

When deploying Ray Serve on Kubernetes with KubeRay, you can install packages at container startup using the `args` field or via `runtime_env` in your Serve config. See [Add custom dependencies](kuberay-rayservice-custom-deps) in the RayService user guide for details and examples.
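
For example, a minimal sketch of the startup-install pattern in a worker container spec (the image tag is illustrative):

```yaml
# Sketch: install a shared Python package at container startup via args.
containers:
  - name: ray-worker
    image: rayproject/ray:2.53.0  # illustrative tag
    args:
      - |
        pip install httpx
```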