diff --git a/doc/source/cluster/kubernetes/examples/rayserve-llm-example.md b/doc/source/cluster/kubernetes/examples/rayserve-llm-example.md
index 77e48dc45064..b4048e151f4b 100644
--- a/doc/source/cluster/kubernetes/examples/rayserve-llm-example.md
+++ b/doc/source/cluster/kubernetes/examples/rayserve-llm-example.md
@@ -163,3 +163,13 @@ kubectl port-forward svc/ray-serve-llm-head-svc 8265
 Once forwarded, navigate to the Serve tab on the dashboard to review application status, deployments, routers, logs, and other relevant features.
 
 ![LLM Serve Application](../images/ray_dashboard_llm_application.png)
+
+## Add custom dependencies
+
+To install additional packages that your LLM services need but that you can't install directly through `runtime_env`, such as KV cache backends, see [Add custom dependencies](kuberay-rayservice-custom-deps) in the RayService user guide. For a complete example that integrates LMCache and Mooncake for distributed KV cache, see [Deploy on Kubernetes with LMCache and Mooncake](kv-cache-offloading-guide).
+
+Download a basic example:
+
+```sh
+curl -o ray-service.extra-dependency.yaml https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.extra-dependency.yaml
+```
diff --git a/doc/source/cluster/kubernetes/user-guides/rayservice.md b/doc/source/cluster/kubernetes/user-guides/rayservice.md
index 05eba6564430..1b92ef237aff 100644
--- a/doc/source/cluster/kubernetes/user-guides/rayservice.md
+++ b/doc/source/cluster/kubernetes/user-guides/rayservice.md
@@ -296,6 +296,64 @@ helm uninstall kuberay-operator
 kubectl delete pod curl
 ```
 
+(kuberay-rayservice-custom-deps)=
+## Add custom dependencies
+
+You may need to install additional packages in your Ray containers. KubeRay supports two approaches, depending on whether the dependencies are shared across all applications or specific to a single one.
+
+### Install shared dependencies via args
+
+Use the `args` field to install system packages and Python dependencies that all applications in the RayService need. The packages are installed at container startup and are available to every application:
+
+```yaml
+workerGroupSpecs:
+  - groupName: worker-group
+    template:
+      spec:
+        containers:
+          - name: ray-worker
+            image: rayproject/ray:2.53.0
+            args:
+              - |
+                sudo apt-get update && \
+                sudo apt-get install -y --no-install-recommends curl && \
+                sudo rm -rf /var/lib/apt/lists/* && \
+                pip install httpx
+```
+
+You can also install Python packages through `runtime_env`, but installing them through `args` makes them available to all applications and avoids installing them separately for each one.
+
+### Install application-specific dependencies via runtime_env
+
+For dependencies that only a specific application needs, use `runtime_env` in your Serve configuration. This approach lets you install different packages for different applications:
+
+```yaml
+serveConfigV2: |
+  applications:
+    - name: fruit_app
+      import_path: fruit.deployment_graph
+      runtime_env:
+        pip:
+          - pandas
+    - name: math_app
+      import_path: conditional_dag.serve_dag
+      runtime_env:
+        pip:
+          - numpy
+```
+
+Download a complete example combining both approaches:
+
+```sh
+curl -o ray-service.extra-dependency.yaml https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.extra-dependency.yaml
+```
+
+:::{note}
+Packages installed through `args` are reinstalled on every container restart. For production, consider building a custom image with shared dependencies pre-installed.
+:::
+
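+For example, once you've built and pushed such an image, point the worker group at it and drop the installation `args`. A minimal sketch, assuming a hypothetical image that already contains `curl` and `httpx`:
+
+```yaml
+workerGroupSpecs:
+  - groupName: worker-group
+    template:
+      spec:
+        containers:
+          - name: ray-worker
+            # Hypothetical prebuilt image with curl and httpx baked in;
+            # no startup installation args are needed.
+            image: your-registry.example.com/ray-custom:2.53.0
+```
+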
+For advanced container command customization, see [Specify container commands](kuberay-pod-command).
+
 ## Next steps
 
 * See [RayService high availability](kuberay-rayservice-ha) for more details on RayService HA.
diff --git a/doc/source/serve/llm/user-guides/kv-cache-offloading.md b/doc/source/serve/llm/user-guides/kv-cache-offloading.md
index 88a85d3cffbb..1cf9082b571e 100644
--- a/doc/source/serve/llm/user-guides/kv-cache-offloading.md
+++ b/doc/source/serve/llm/user-guides/kv-cache-offloading.md
@@ -263,7 +263,85 @@ Extending KV cache beyond local GPU memory introduces overhead for managing and
 
 **Network transfer costs**: When combining MultiConnector with cross-instance transfer (such as NIXL), ensure that the benefits of disaggregation outweigh the network transfer costs.
 
+## Deploy on Kubernetes with LMCache and Mooncake
+
+For distributed KV cache sharing across multiple GPU workers, you can use LMCache with Mooncake as the storage backend. Mooncake creates a distributed memory pool by aggregating memory from multiple nodes and supports high-bandwidth RDMA or TCP transfers.
+
+### Install system packages
+
+Mooncake requires system-level dependencies. Use the `args` field in your RayService worker spec to install them at container startup:
+
+```yaml
+# Reference for system packages required by Mooncake: https://kvcache-ai.github.io/Mooncake/getting_started/build.html
+- name: ray-worker
+  image: rayproject/ray-llm:2.53.0-py311-cu128
+  args:
+    - |
+      sudo apt-get update && \
+      sudo apt-get install -y --no-install-recommends \
+        build-essential cmake libibverbs-dev libgoogle-glog-dev \
+        libgtest-dev libjsoncpp-dev libnuma-dev libunwind-dev \
+        libpython3-dev libboost-all-dev libssl-dev pybind11-dev \
+        libcurl4-openssl-dev libhiredis-dev pkg-config patchelf && \
+      sudo rm -rf /var/lib/apt/lists/*
+```
+
+For general Kubernetes dependency patterns, also see [Add custom dependencies](kuberay-rayservice-custom-deps) in the RayService user guide.
+
+### Install LMCache and Mooncake via runtime_env
+
+Add the Python packages through `runtime_env` in your LLM configuration:
+
+```yaml
+runtime_env:
+  pip:
+    - lmcache
+    - mooncake-transfer-engine
+  env_vars:
+    LMCACHE_CONFIG_FILE: "/mnt/configs/lmcache-config.yaml"
+    PYTHONHASHSEED: "0" # Required for consistent cache keys across workers
+```
+
+### Configure kv_transfer_config
+
+Enable the LMCache connector in your engine configuration:
+
+```yaml
+engine_kwargs:
+  kv_transfer_config:
+    kv_connector: "LMCacheConnectorV1"
+    kv_role: "kv_both"
+```
+
+### LMCache configuration for Mooncake
+
+Create a ConfigMap with your LMCache configuration file and mount it into your worker containers. The configuration specifies the Mooncake master address, the metadata server, and the transfer protocol. For the full configuration reference, see the [LMCache Mooncake backend documentation](https://docs.lmcache.ai/kv_cache/storage_backends/mooncake.html).
+
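+A minimal sketch of the pod wiring, assuming the ConfigMap is named `lmcache-config` (a placeholder) and is mounted at the `/mnt/configs` path that `LMCACHE_CONFIG_FILE` references above:
+
+```yaml
+spec:
+  containers:
+    - name: ray-worker
+      volumeMounts:
+        - name: lmcache-config-volume
+          mountPath: /mnt/configs
+  volumes:
+    - name: lmcache-config-volume
+      configMap:
+        name: lmcache-config # Created from lmcache-config.yaml
+```
+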
+Example `lmcache-config.yaml`:
+
+```yaml
+chunk_size: 256
+remote_url: "mooncakestore://mooncake-master.default.svc.cluster.local:50051/" # Deploy the Mooncake master server beforehand, for example as a custom Deployment.
+remote_serde: "naive"
+local_cpu: true
+max_local_cpu_size: 64 # GB
+
+extra_config:
+  # Use etcd instead of the Mooncake metadata server, which is currently unstable; etcd is the recommended option.
+  # For setting up etcd in a Kubernetes environment, see https://etcd.io/docs/v3.6/op-guide/kubernetes/
+  metadata_server: "etcd://etcd.default.svc.cluster.local:2379"
+  protocol: "tcp" # Use "rdma" for RDMA-capable networks
+  master_server_address: "mooncake-master.default.svc.cluster.local:50051"
+  global_segment_size: 21474836480 # 20 GiB per worker
+```
+
+:::{note}
+This setup requires a running Mooncake master service and a metadata server (etcd or HTTP). See the [Mooncake deployment guide](https://kvcache-ai.github.io/Mooncake/getting_started/build.html) for infrastructure setup.
+:::
+
 ## See also
 
 - {doc}`Prefill/decode disaggregation ` - Deploy LLMs with separated prefill and decode phases
 - [LMCache documentation](https://docs.lmcache.ai/) - Comprehensive LMCache configuration and features
+- [LMCache Mooncake backend](https://docs.lmcache.ai/kv_cache/storage_backends/mooncake.html) - Distributed KV cache storage setup
+- [Mooncake build guide](https://kvcache-ai.github.io/Mooncake/getting_started/build.html) - System dependencies and installation
diff --git a/doc/source/serve/production-guide/handling-dependencies.md b/doc/source/serve/production-guide/handling-dependencies.md
index 8339bc1e830a..6027bc79938d 100644
--- a/doc/source/serve/production-guide/handling-dependencies.md
+++ b/doc/source/serve/production-guide/handling-dependencies.md
@@ -59,3 +59,7 @@ Example:
 ```{literalinclude} ../doc_code/delayed_import.py
 :language: python
 ```
+
+## Handle dependencies on Kubernetes
+
+When deploying Ray Serve on Kubernetes with KubeRay, you can install packages at container startup through the `args` field, or per application through `runtime_env` in your Serve config. See [Add custom dependencies](kuberay-rayservice-custom-deps) in the RayService user guide for details and examples.
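+
+For instance, a minimal `serveConfigV2` fragment taking the `runtime_env` route (the application name, import path, and package are placeholders):
+
+```yaml
+serveConfigV2: |
+  applications:
+    - name: my_app
+      import_path: my_module:app
+      runtime_env:
+        pip:
+          - httpx
+```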