5 changes: 4 additions & 1 deletion examples/inference/README.md
@@ -5,7 +5,10 @@ Framework-centric inference engine examples, organized by serving engine.
| Engine | Example | Description |
|---|---|---|
| [`vllm`](./vllm) | [`dsv3-uccl-nixl`](./vllm/dsv3-uccl-nixl) | DeepSeek-V3 disaggregated (prefill/decode) inference with vLLM, UCCL-EP, and NIXL on EKS |
| [`sglang`](./sglang) | [`qwen3.5-27b-b300-intra-pd`](./sglang/qwen3.5-27b-b300-intra-pd) | Qwen3.5-27B with intra-node prefill/decode disaggregation on a single B300 node |
| [`sglang`](./sglang) | [`kimi2.6-h200-1p1d`](./sglang/kimi2.6-h200-1p1d) | Kimi2.6 with node-level 1-prefill / 1-decode disaggregation across two H200 nodes |
| [`sglang`](./sglang) | [`dsv4pro-b300-single-node`](./sglang/dsv4pro-b300-single-node) | DeepSeek V4 Pro unified (non-PD) serving on a single B300 node |

More engines (SGLang, TRT-LLM, NIM, Dynamo, Ray Serve, …) are planned, including
More engines (TRT-LLM, NIM, Dynamo, Ray Serve, …) are planned, including
content to be merged from [`aws-samples/awsome-inference`](https://github.com/aws-samples/awsome-inference)
(see issue #1056).
99 changes: 99 additions & 0 deletions examples/inference/sglang/README.md
@@ -0,0 +1,99 @@
<!--
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0
-->

# SGLang test cases

[SGLang](https://github.com/sgl-project/sglang) deployments on AWS EKS /
SageMaker HyperPod. Each sub-directory is a self-contained sample — apply its
manifest with `kubectl`.

| Test case | Hardware | Topology |
| --- | --- | --- |
| [`qwen3.5-27b-b300-intra-pd`](./qwen3.5-27b-b300-intra-pd) | 1× B300 (8 GPU) | Intra-node PD — 6 prefill + 2 decode in one pod, NIXL, SGLang router sidecar |
| [`kimi2.6-h200-1p1d`](./kimi2.6-h200-1p1d) | 2× H200 nodes | Node-level 1P1D — prefill + decode StatefulSets, NIXL over EFA |
| [`dsv4pro-b300-single-node`](./dsv4pro-b300-single-node) | 1× B300 (8 GPU) | Unified (non-PD) baseline |

## Shared helpers

Reusable across all the samples above:

### Pre-stage model weights

Download a Hugging Face repo to every matching node's local NVMe
(`/opt/dlami/nvme`) so the serving pods read weights from fast local disk
instead of pulling them at startup. [`download-model.sh`](./download-model.sh)
renders [`download-model-daemonset.yaml`](./download-model-daemonset.yaml) and
applies it — `LOCAL_DIR_NAME` defaults to the repo id with `/` → `-`:

```bash
./download-model.sh moonshotai/Kimi-K2.5 ml.p5en.48xlarge
./download-model.sh deepseek-ai/DeepSeek-V4-Pro ml.p6-b300.48xlarge
# watch: kubectl logs -f -l app=model-downloader (each node prints "Download complete!")
# then: kubectl delete daemonset model-downloader
```
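With several nodes, following every log stream at once can be hard to read. As a minimal sketch (the function name `downloads_complete` is hypothetical and not part of `download-model.sh`), the loop below checks each `model-downloader` pod's log for the `Download complete!` marker and succeeds only when all pods have printed it. It assumes the DaemonSet runs in the current namespace.

```bash
#!/usr/bin/env bash
# Sketch: report which model-downloader pods have finished staging.
# Assumption: the DaemonSet was applied in the current kubectl namespace.
downloads_complete() {
  local pod pending=0
  for pod in $(kubectl get pods -l app=model-downloader -o name); do
    # The downloader prints this marker once snapshot_download returns.
    if kubectl logs "${pod}" | grep -q 'Download complete!'; then
      echo "done:    ${pod}"
    else
      echo "pending: ${pod}"
      pending=$((pending + 1))
    fi
  done
  # Succeed only when every pod has logged the completion marker.
  [ "${pending}" -eq 0 ]
}

# Usage: downloads_complete && kubectl delete daemonset model-downloader
```

Deleting the DaemonSet only after the function succeeds avoids tearing down a node that is still downloading.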

### Monitoring (Prometheus + Grafana)

The serving pods already expose SGLang metrics on `:30000/metrics` (started with
`--enable-metrics`) and carry the `sglang-metrics=true` label plus the
`prometheus.io/*` scrape annotations. The monitoring path is fully AWS-managed:
an in-cluster Prometheus **agent** remote-writes to **Amazon Managed Service for
Prometheus (AMP)**, and **Amazon Managed Grafana** reads from AMP. There is no
in-cluster Grafana.

**1. AMP + Prometheus agent (scripted)** —
[`setup-amp-monitoring.sh`](./setup-amp-monitoring.sh) is idempotent and does the
three one-time steps in order: create (or reuse) an AMP workspace, enable the
cluster OIDC provider and create the AMP ingest IAM role bound to the
`amp-iamproxy-ingest-service-account` ServiceAccount, then render
[`prometheus-agent-amp.yaml`](./prometheus-agent-amp.yaml) with the real
workspace id / role ARN / region and apply it.

```bash
./setup-amp-monitoring.sh <CLUSTER_NAME> [REGION] [AMP_ALIAS]
# e.g. ./setup-amp-monitoring.sh eks-hypd-0512-b2ad us-west-2 sglang-kimi
# then watch the agent leave CrashLoopBackOff:
# kubectl rollout status deployment/prometheus-agent
```

The agent scrapes every pod labeled `sglang-metrics=true` or `dcgm-metrics=true`
and remote-writes via SigV4. (Requires `awscli`, `eksctl`, `kubectl`, `envsubst`
and AWS creds with AMP + IAM permissions.)

**2. GPU metrics** — [`dcgm-exporter-daemonset.yaml`](./dcgm-exporter-daemonset.yaml)
runs a DCGM exporter DaemonSet on `:9400` (labeled `dcgm-metrics=true`, so the
agent above picks it up automatically). Apply it:

```bash
kubectl apply -f dcgm-exporter-daemonset.yaml
```

The manifest schedules onto nodes labeled `nvidia.com/gpu.present=true`. This
label is **not** present by default on SageMaker HyperPod nodes — it is the
NVIDIA GPU Operator convention, and HyperPod doesn't run the Operator. So on a
plain HyperPod cluster the DaemonSet comes up with `DESIRED 0` and never starts
a pod. Two ways to fix it:

- **Quick:** label the GPU nodes by hand —
`kubectl label nodes <node>... nvidia.com/gpu.present=true`. Simple, but the
label does **not** survive node replacement: if HyperPod swaps a node, the new
one won't carry it and no DCGM pod will schedule there until you re-label.
- **Durable:** install the NVIDIA GPU Operator / device-plugin, which labels GPU
nodes automatically (and can manage DCGM itself).
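
For the quick option, labeling nodes one at a time gets tedious on larger clusters. A minimal sketch, assuming the GPU nodes can be selected by the `node.kubernetes.io/instance-type` label (the same label the model downloader uses); the function name `label_gpu_nodes` is hypothetical:

```bash
#!/usr/bin/env bash
# Sketch: add nvidia.com/gpu.present=true to every node of one instance type
# so the DCGM DaemonSet's nodeSelector matches them.
# Assumption: GPU nodes are identifiable by node.kubernetes.io/instance-type.
label_gpu_nodes() {
  local instance_type="$1"
  local node
  kubectl get nodes -l "node.kubernetes.io/instance-type=${instance_type}" -o name |
    while read -r node; do
      # --overwrite keeps the call safe to repeat after a node is replaced.
      kubectl label "${node}" nvidia.com/gpu.present=true --overwrite
    done
}

# Usage: label_gpu_nodes ml.p6-b300.48xlarge
```

Because the label is lost on node replacement, this would need to be re-run (or scheduled) whenever HyperPod swaps a node.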

Verify the pods landed (one per GPU node) before checking Grafana:

```bash
kubectl get ds dcgm-exporter # DESIRED should match your GPU node count
kubectl get pods -l app=dcgm-exporter -o wide
```
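
The same check can be scripted. The sketch below (the function name `dcgm_ready` is hypothetical) reads the DaemonSet's `desiredNumberScheduled` and `numberReady` status fields and fails when nothing is scheduled, which is the `DESIRED 0` symptom described above, or when some pods are not ready yet.

```bash
#!/usr/bin/env bash
# Sketch: fail fast when the DCGM DaemonSet scheduled no pods or is not ready.
# Assumption: the DaemonSet is named dcgm-exporter in the current namespace.
dcgm_ready() {
  local desired ready
  desired="$(kubectl get ds dcgm-exporter -o jsonpath='{.status.desiredNumberScheduled}')"
  ready="$(kubectl get ds dcgm-exporter -o jsonpath='{.status.numberReady}')"
  echo "desired=${desired:-0} ready=${ready:-0}"
  # desired=0 means no node matched the nvidia.com/gpu.present nodeSelector.
  [ "${desired:-0}" -gt 0 ] && [ "${ready:-0}" -eq "${desired:-0}" ]
}

# Usage: dcgm_ready || echo "DCGM exporter is not running on all GPU nodes"
```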

**3. Amazon Managed Grafana** — create an Amazon Managed Grafana workspace
(console or `aws grafana create-workspace`) with the **Amazon Managed Service for
Prometheus** data-source / IAM permission enabled. In the workspace, add a
Prometheus data source pointing at the AMP query endpoint
(`https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/`) with
**SigV4 auth** turned on, then import an SGLang or DCGM dashboard.
`setup-amp-monitoring.sh` prints the workspace id and remote-write URL when it
finishes.
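
The query endpoint and the remote-write URL share one base, so either can be derived from the region and workspace id. A minimal sketch: the function names and the `ws-EXAMPLE` id are hypothetical and are not taken from `setup-amp-monitoring.sh`; the `api/v1/remote_write` suffix is the standard AMP remote-write path.

```bash
#!/usr/bin/env bash
# Sketch: build AMP URLs from a region and workspace id.
# Assumption: function names are illustrative; ws-EXAMPLE is a placeholder id.
amp_base_url() {
  printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/' "$1" "$2"
}

# URL to paste into the Grafana Prometheus data source.
amp_query_endpoint() {
  printf '%s\n' "$(amp_base_url "$1" "$2")"
}

# URL the in-cluster Prometheus agent remote-writes to.
amp_remote_write_url() {
  printf '%sapi/v1/remote_write\n' "$(amp_base_url "$1" "$2")"
}

# Usage: amp_query_endpoint us-west-2 ws-EXAMPLE
```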
96 changes: 96 additions & 0 deletions examples/inference/sglang/dcgm-exporter-daemonset.yaml
@@ -0,0 +1,96 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
# DCGM Exporter DaemonSet — collects NVIDIA GPU metrics (utilization, memory,
# power, temperature, NVLink/PCIe bandwidth) on every GPU node and exposes
# them on :9400 for Prometheus scraping. Generic; no per-model values.
#
# kubectl apply -f dcgm-exporter-daemonset.yaml
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
        dcgm-metrics: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9400"
        prometheus.io/path: "/metrics"
    spec:
      # only schedule onto GPU nodes
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvidia/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
          ports:
            - name: metrics
              containerPort: 9400
              hostPort: 9400
          securityContext:
            runAsNonRoot: false
            runAsUser: 0

Collaborator review comment:

DCGM runs as root — add a one-line rationale

`runAsNonRoot: false` + `runAsUser: 0` is the standard DCGM requirement (host GPU/driver access), so this is likely fine, but the checklist asks for a rationale comment so a future reader doesn't "fix" it. A short `# DCGM requires root for host GPU access` comment above the `securityContext` does it.

            capabilities:
              add:
                - SYS_ADMIN
          env:
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
            - name: DCGM_EXPORTER_LISTEN
              value: ":9400"
          volumeMounts:
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 9400
            initialDelaySeconds: 45
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /health
              port: 9400
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
    - name: metrics
      port: 9400
      targetPort: 9400
  clusterIP: None
69 changes: 69 additions & 0 deletions examples/inference/sglang/download-model-daemonset.yaml
@@ -0,0 +1,69 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
# Generic model pre-stager: downloads a Hugging Face repo to every matching
# node's local NVMe (/opt/dlami/nvme) so the serving pods read weights from
# fast local disk instead of pulling them at startup.
#
# Rendered and applied by download-model.sh, which fills INSTANCE_TYPE,
# HF_REPO_ID, and LOCAL_DIR_NAME. To apply by hand instead:
# export INSTANCE_TYPE=ml.p5en.48xlarge HF_REPO_ID=moonshotai/Kimi-K2.5 \
# LOCAL_DIR_NAME=moonshotai-Kimi-K2.5
# envsubst '${INSTANCE_TYPE} ${HF_REPO_ID} ${LOCAL_DIR_NAME}' \
# < download-model-daemonset.yaml | kubectl apply -f -
# # wait until each pod logs "Download complete!", then deploy the engine
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: model-downloader
  labels:
    app: model-downloader
spec:
  selector:
    matchLabels:
      app: model-downloader
  template:
    metadata:
      labels:
        app: model-downloader
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: ${INSTANCE_TYPE}
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: download
          image: python:3.11-slim
          command: ["/bin/bash", "-c"]
          args:
            - |
              pip install -q huggingface_hub &&
              python3 -c "
              from huggingface_hub import snapshot_download
              snapshot_download(
                  repo_id='${HF_REPO_ID}',
                  local_dir='/nvme/${LOCAL_DIR_NAME}',
                  local_dir_use_symlinks=False,
              )
              print('Download complete!')
              " &&
              echo "Model downloaded to /opt/dlami/nvme/${LOCAL_DIR_NAME} on node $(hostname)" &&
              sleep infinity
          volumeMounts:
            - name: nvme
              mountPath: /nvme
          resources:
            requests:
              cpu: "1"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
      volumes:
        - name: nvme
          hostPath:
            path: /opt/dlami/nvme
            type: DirectoryOrCreate
      restartPolicy: Always
44 changes: 44 additions & 0 deletions examples/inference/sglang/download-model.sh
@@ -0,0 +1,44 @@
#!/usr/bin/env bash
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
# Render download-model-daemonset.yaml for a given model + node type and apply
# it, pre-staging the weights to every matching node's NVMe (/opt/dlami/nvme).
#
# Usage:
# ./download-model.sh <HF_REPO_ID> <INSTANCE_TYPE> [LOCAL_DIR_NAME]
#
# Examples:
# ./download-model.sh moonshotai/Kimi-K2.5 ml.p5en.48xlarge
# ./download-model.sh deepseek-ai/DeepSeek-V4-Pro ml.p6-b300.48xlarge
#
# LOCAL_DIR_NAME defaults to the repo id with '/' replaced by '-'
# (e.g. moonshotai/Kimi-K2.5 -> moonshotai-Kimi-K2.5). The weights land at
# /opt/dlami/nvme/<LOCAL_DIR_NAME> on each node.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

if [[ $# -lt 2 ]]; then
  echo "Usage: $0 <HF_REPO_ID> <INSTANCE_TYPE> [LOCAL_DIR_NAME]" >&2
  exit 1
fi

export HF_REPO_ID="$1"
export INSTANCE_TYPE="$2"
export LOCAL_DIR_NAME="${3:-${HF_REPO_ID//\//-}}"

echo "==> Pre-staging ${HF_REPO_ID}"
echo " nodes: ${INSTANCE_TYPE}"
echo " target: /opt/dlami/nvme/${LOCAL_DIR_NAME}"

envsubst '${INSTANCE_TYPE} ${HF_REPO_ID} ${LOCAL_DIR_NAME}' \
< "${SCRIPT_DIR}/download-model-daemonset.yaml" \
| kubectl apply -f -

echo
echo "==> Applied. Watch progress with:"
echo " kubectl logs -f -l app=model-downloader"
echo " Each node prints 'Download complete!' when its copy is staged."
echo " Remove the downloader once done: kubectl delete daemonset model-downloader"
66 changes: 66 additions & 0 deletions examples/inference/sglang/dsv4pro-b300-single-node/README.md
@@ -0,0 +1,66 @@
<!--
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0
-->

# DeepSeek V4 Pro — Unified on B300 (EKS / HyperPod)

Single-node, non-disaggregated SGLang serving of **DeepSeek V4 Pro** on one
B300 node. One engine spans all 8 GPUs (`tp=8, dp=8, --enable-dp-attention`,
MXFP4 MoE, EAGLE speculative decoding).

## Deploy

```bash
kubectl apply -f dsv4pro-deploy.yaml
kubectl rollout status deploy/dsv4pro-unified
```

Targets `ml.p6-b300.48xlarge` nodes (`nodeSelector` in the manifest).

OpenAI-compatible endpoint on `dsv4pro:30000` (`ClusterIP`) — port-forward to
call it:

```bash
kubectl port-forward svc/dsv4pro 30000:30000
curl http://localhost:30000/v1/completions \
-H 'Content-Type: application/json' \
-d '{"model": "deepseek-ai/DeepSeek-V4-Pro", "prompt": "The capital of France is", "max_tokens": 32}'
```

Tear down with `kubectl delete -f dsv4pro-deploy.yaml`.

## Benchmark

```bash
kubectl exec deploy/dsv4pro-unified -- \
python3 -m sglang.bench_serving --backend sglang \
--dataset-name random --num-prompts 1000 \
--random-input 2048 --random-output 256 \
--request-rate inf --max-concurrency 25
```

Reference numbers (`random`, input 2048 / output 256, `--request-rate inf`):

| Concurrency | Req/s | Total tok/s | Output tok/s | Median TTFT | Median TPOT | Mean E2E |
|---:|---:|---:|---:|---:|---:|---:|
| 25 | 2.56 | 2,953 | 329.6 | 396 ms | 56 ms | 9.7 s |
| 50 | 4.28 | 4,946 | 552.1 | 407 ms | 84 ms | 11.6 s |
| 75 | 5.2 | 6,003 | 670.1 | 418 ms | 105 ms | 14.3 s |
| 100 | 6.45 | 7,452 | 831.9 | 475 ms | 119 ms | 15.3 s |
| 150 | 7.77 | 8,974 | 1,001.8 | 500 ms | 141 ms | 18.9 s |
| 200 | 9.99 | 11,535 | 1,287.6 | 592 ms | 158 ms | 19.5 s |
| 300 | 12.95 | 14,954 | 1,669.3 | 4.4 s | 143 ms | 22.0 s |
| 500 | 14.16 | 16,347 | 1,824.7 | 16.8 s | 135 ms | 30.5 s |

Throughput keeps climbing to ~16k tok/s at concurrency 500, but median TTFT
degrades sharply beyond ~200 concurrent requests on a single node (592 ms at
200, 4.4 s at 300, 16.8 s at 500).
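
To reproduce a table like the one above, the benchmark command can be repeated once per concurrency level. This is a sketch under assumptions: the function name `bench_sweep` is hypothetical, and it reuses `--num-prompts 1000` for every level, which the reference run may or may not have done.

```bash
#!/usr/bin/env bash
# Sketch: run sglang.bench_serving once per --max-concurrency value.
# Assumption: every level uses the same flags as the single-run example.
bench_sweep() {
  local c
  for c in "$@"; do
    echo "==> max-concurrency ${c}"
    kubectl exec deploy/dsv4pro-unified -- \
      python3 -m sglang.bench_serving --backend sglang \
      --dataset-name random --num-prompts 1000 \
      --random-input 2048 --random-output 256 \
      --request-rate inf --max-concurrency "${c}"
  done
}

# Usage: bench_sweep 25 50 75 100 150 200 300 500
```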

All model and tuning knobs (env vars + serve flags) live inline in
[`dsv4pro-deploy.yaml`](./dsv4pro-deploy.yaml). Weights load from the node's
NVMe at `/opt/dlami/nvme/huggingface` — optionally pre-stage them with the
shared [`../download-model.sh`](../download-model.sh):

```bash
../download-model.sh deepseek-ai/DeepSeek-V4-Pro ml.p6-b300.48xlarge
```
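
Without a third argument, the helper stages weights under a directory named after the repo id with `/` replaced by `-`. The sketch below mirrors that rule so the resulting path can be compared with the path the manifest reads; the function name `staged_path` is hypothetical, and it uses `tr` rather than the bash parameter expansion in `download-model.sh`. Whether the default matches `/opt/dlami/nvme/huggingface` is worth confirming against `dsv4pro-deploy.yaml`.

```bash
#!/usr/bin/env bash
# Sketch: compute where download-model.sh stages a repo on each node.
# $1 = Hugging Face repo id, $2 = optional LOCAL_DIR_NAME override.
staged_path() {
  local dir_name
  # Default: repo id with every '/' replaced by '-'.
  dir_name="${2:-$(printf '%s' "$1" | tr '/' '-')}"
  printf '/opt/dlami/nvme/%s\n' "${dir_name}"
}

# Usage: staged_path deepseek-ai/DeepSeek-V4-Pro
```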