awslabs · ybalbert001 · Jun 9, 2026 · Jun 10, 2026 · Jun 10, 2026 · Jun 11, 2026
diff --git a/examples/inference/README.md b/examples/inference/README.md
@@ -5,7 +5,10 @@ Framework-centric inference engine examples, organized by serving engine.
 | Engine | Example | Description |
 |---|---|---|
 | [`vllm`](./vllm) | [`dsv3-uccl-nixl`](./vllm/dsv3-uccl-nixl) | DeepSeek-V3 disaggregated (prefill/decode) inference with vLLM, UCCL-EP, and NIXL on EKS |
+| [`sglang`](./sglang) | [`qwen3.5-27b-b300-intra-pd`](./sglang/qwen3.5-27b-b300-intra-pd) | Qwen3.5-27B with intra-node prefill/decode disaggregation on a single B300 node |
+| [`sglang`](./sglang) | [`kimi2.6-h200-1p1d`](./sglang/kimi2.6-h200-1p1d) | Kimi2.6 with node-level 1-prefill / 1-decode disaggregation across two H200 nodes |
+| [`sglang`](./sglang) | [`dsv4pro-b300-single-node`](./sglang/dsv4pro-b300-single-node) | DeepSeek V4 Pro unified (non-PD) serving on a single B300 node |
 
-More engines (SGLang, TRT-LLM, NIM, Dynamo, Ray Serve, …) are planned, including
+More engines (TRT-LLM, NIM, Dynamo, Ray Serve, …) are planned, including
 content to be merged from [`aws-samples/awsome-inference`](https://github.com/aws-samples/awsome-inference)
 (see issue #1056).
diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md
@@ -0,0 +1,99 @@
+<!--
+Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+SPDX-License-Identifier: MIT-0
+-->
+
+# SGLang test cases
+
+[SGLang](https://github.com/sgl-project/sglang) deployments on AWS EKS /
+SageMaker HyperPod. Each sub-directory is a self-contained sample — apply its
+manifest with `kubectl`.
+
+| Test case | Hardware | Topology |
+| --- | --- | --- |
+| [`qwen3.5-27b-b300-intra-pd`](./qwen3.5-27b-b300-intra-pd) | 1× B300 (8 GPU) | Intra-node PD — 6 prefill + 2 decode in one pod, NIXL, SGLang router sidecar |
+| [`kimi2.6-h200-1p1d`](./kimi2.6-h200-1p1d) | 2× H200 nodes | Node-level 1P1D — prefill + decode StatefulSets, NIXL over EFA |
+| [`dsv4pro-b300-single-node`](./dsv4pro-b300-single-node) | 1× B300 (8 GPU) | Unified (non-PD) baseline |
+
+## Shared helpers
+
+These helpers are reusable across all the samples above.
+
+### Pre-stage model weights
+
+Download a Hugging Face repo to every matching node's local NVMe
+(`/opt/dlami/nvme`) so the serving pods read weights from fast local disk
+instead of pulling them at startup. [`download-model.sh`](./download-model.sh)
+renders [`download-model-daemonset.yaml`](./download-model-daemonset.yaml) and
+applies it. `LOCAL_DIR_NAME` defaults to the repo id with each `/` replaced by `-`:
+
+```bash
+./download-model.sh moonshotai/Kimi-K2.5       ml.p5en.48xlarge
+./download-model.sh deepseek-ai/DeepSeek-V4-Pro ml.p6-b300.48xlarge
+# watch: kubectl logs -f -l app=model-downloader   (each node prints "Download complete!")
+# then:  kubectl delete daemonset model-downloader
+```
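+
+The default directory name can be previewed before applying anything. A minimal
+sketch, assuming a hypothetical helper `local_dir_name` (not part of this
+directory) that mirrors the `/` to `-` replacement `download-model.sh` performs:
+
+```bash
+# Hypothetical helper: print the default LOCAL_DIR_NAME for a Hugging Face repo id.
+local_dir_name() {
+    # Replace every '/' in the repo id with '-'.
+    printf '%s\n' "$1" | tr '/' '-'
+}
+
+local_dir_name moonshotai/Kimi-K2.5   # prints moonshotai-Kimi-K2.5
+echo "/opt/dlami/nvme/$(local_dir_name deepseek-ai/DeepSeek-V4-Pro)"
+```
+
+The second command prints the on-node path the weights would land at when no
+third argument is passed to the script.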
+
+### Monitoring (Prometheus + Grafana)
+
+The serving pods already expose SGLang metrics on `:30000/metrics` (started with
+`--enable-metrics`) and carry the `sglang-metrics=true` label plus the
+`prometheus.io/*` scrape annotations. The monitoring path is fully AWS-managed:
+an in-cluster Prometheus **agent** remote-writes to **Amazon Managed Prometheus
+(AMP)**, and **Amazon Managed Grafana** reads from AMP — there is no in-cluster
+Grafana.
+
+**1. AMP + Prometheus agent (scripted)** —
+[`setup-amp-monitoring.sh`](./setup-amp-monitoring.sh) is idempotent and does the
+three one-time steps in order: create (or reuse) an AMP workspace, enable the
+cluster OIDC provider and create the AMP ingest IAM role bound to the
+`amp-iamproxy-ingest-service-account` ServiceAccount, then render
+[`prometheus-agent-amp.yaml`](./prometheus-agent-amp.yaml) with the real
+workspace id / role ARN / region and apply it.
+
+```bash
+./setup-amp-monitoring.sh <CLUSTER_NAME> [REGION] [AMP_ALIAS]
+# e.g. ./setup-amp-monitoring.sh eks-hypd-0512-b2ad us-west-2 sglang-kimi
+# then watch the agent leave CrashLoopBackOff:
+#   kubectl rollout status deployment/prometheus-agent
+```
+
+The agent scrapes every pod labeled `sglang-metrics=true` or `dcgm-metrics=true`
+and remote-writes via SigV4. (Requires `awscli`, `eksctl`, `kubectl`, `envsubst`
+and AWS creds with AMP + IAM permissions.)
+
+**2. GPU metrics** — [`dcgm-exporter-daemonset.yaml`](./dcgm-exporter-daemonset.yaml)
+runs a DCGM exporter DaemonSet on `:9400` (labeled `dcgm-metrics=true`, so the
+agent above picks it up automatically). Apply it:
+
+```bash
+kubectl apply -f dcgm-exporter-daemonset.yaml
+```
+
+The manifest schedules onto nodes labeled `nvidia.com/gpu.present=true`. This
+label is **not** present by default on SageMaker HyperPod nodes — it is the
+NVIDIA GPU Operator convention, and HyperPod doesn't run the Operator. So on a
+plain HyperPod cluster the DaemonSet comes up with `DESIRED 0` and never starts
+a pod. Two ways to fix it:
+
+- **Quick:** label the GPU nodes by hand —
+  `kubectl label nodes <node>... nvidia.com/gpu.present=true`. Simple, but the
+  label does **not** survive node replacement: if HyperPod swaps a node, the new
+  one won't carry it and no DCGM pod will schedule there until you re-label.
+- **Durable:** install the NVIDIA GPU Operator, or the NVIDIA device plugin with
+  GPU Feature Discovery enabled, which labels GPU nodes automatically (the
+  Operator can also manage DCGM itself).
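+
+For the quick option, the nodes can be selected by instance type rather than
+named one by one. A sketch, assuming a hypothetical helper `label_gpu_nodes` and
+that the GPU nodes carry the `node.kubernetes.io/instance-type` label the model
+downloader also selects on:
+
+```bash
+# Hypothetical helper: add the label to every node of one instance type.
+# Usage: label_gpu_nodes ml.p6-b300.48xlarge
+label_gpu_nodes() {
+    kubectl label nodes \
+        -l "node.kubernetes.io/instance-type=$1" \
+        nvidia.com/gpu.present=true --overwrite
+}
+```
+
+Re-running it after a node replacement re-applies the label; `--overwrite` makes
+the call safe to repeat even if a node already carries the label with a
+different value.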
+
+Verify the pods landed (one per GPU node) before checking Grafana:
+
+```bash
+kubectl get ds dcgm-exporter            # DESIRED should match your GPU node count
+kubectl get pods -l app=dcgm-exporter -o wide
+```
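+
+The `DESIRED 0` case can also be caught in a script. A sketch, assuming a
+hypothetical helper `ds_ready` that reads the standard DaemonSet status fields
+`desiredNumberScheduled` and `numberReady`:
+
+```bash
+# Hypothetical helper: succeed only if the DaemonSet wants at least one pod
+# and every desired pod is ready.
+# Usage: ds_ready dcgm-exporter
+ds_ready() {
+    status="$(kubectl get ds "$1" \
+        -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')"
+    desired="${status% *}"   # text before the space
+    ready="${status#* }"     # text after the space
+    echo "desired=${desired} ready=${ready}"
+    [ "${desired:-0}" -gt 0 ] && [ "$desired" = "$ready" ]
+}
+```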
+
+**3. Amazon Managed Grafana** — create an Amazon Managed Grafana workspace
+(console or `aws grafana create-workspace`) with the **Amazon Managed Service for
+Prometheus** data-source / IAM permission enabled. In the workspace, add a
+Prometheus data source pointing at the AMP query endpoint
+(`https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/`) with
+**SigV4 auth** turned on, then import an SGLang or DCGM dashboard.
+`setup-amp-monitoring.sh` prints the workspace id and remote-write URL when it
+finishes.
diff --git a/examples/inference/sglang/dcgm-exporter-daemonset.yaml b/examples/inference/sglang/dcgm-exporter-daemonset.yaml
@@ -0,0 +1,96 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+#
+# DCGM Exporter DaemonSet — collects NVIDIA GPU metrics (utilization, memory,
+# power, temperature, NVLink/PCIe bandwidth) on every GPU node and exposes
+# them on :9400 for Prometheus scraping. Generic; no per-model values.
+#
+#   kubectl apply -f dcgm-exporter-daemonset.yaml
+---
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: dcgm-exporter
+  labels:
+    app: dcgm-exporter
+spec:
+  selector:
+    matchLabels:
+      app: dcgm-exporter
+  template:
+    metadata:
+      labels:
+        app: dcgm-exporter
+        dcgm-metrics: "true"
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "9400"
+        prometheus.io/path: "/metrics"
+    spec:
+      # only schedule onto GPU nodes
+      nodeSelector:
+        nvidia.com/gpu.present: "true"
+      tolerations:
+      - key: nvidia.com/gpu
+        operator: Exists
+        effect: NoSchedule
+      containers:
+      - name: dcgm-exporter
+        image: nvidia/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
+        ports:
+        - name: metrics
+          containerPort: 9400
+          hostPort: 9400
+        securityContext:
+          runAsNonRoot: false
+          runAsUser: 0
+          capabilities:
+            add:
+            - SYS_ADMIN
+        env:
+        - name: DCGM_EXPORTER_KUBERNETES
+          value: "true"
+        - name: DCGM_EXPORTER_LISTEN
+          value: ":9400"
+        volumeMounts:
+        - name: pod-resources
+          mountPath: /var/lib/kubelet/pod-resources
+          readOnly: true
+        resources:
+          requests:
+            cpu: 100m
+            memory: 128Mi
+          limits:
+            cpu: 200m
+            memory: 256Mi
+        livenessProbe:
+          httpGet:
+            path: /health
+            port: 9400
+          initialDelaySeconds: 45
+          periodSeconds: 15
+        readinessProbe:
+          httpGet:
+            path: /health
+            port: 9400
+          initialDelaySeconds: 30
+          periodSeconds: 10
+      volumes:
+      - name: pod-resources
+        hostPath:
+          path: /var/lib/kubelet/pod-resources
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: dcgm-exporter
+  labels:
+    app: dcgm-exporter
+spec:
+  selector:
+    app: dcgm-exporter
+  ports:
+  - name: metrics
+    port: 9400
+    targetPort: 9400
+  clusterIP: None
diff --git a/examples/inference/sglang/download-model-daemonset.yaml b/examples/inference/sglang/download-model-daemonset.yaml
@@ -0,0 +1,69 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+#
+# Generic model pre-stager: downloads a Hugging Face repo to every matching
+# node's local NVMe (/opt/dlami/nvme) so the serving pods read weights from
+# fast local disk instead of pulling them at startup.
+#
+# Rendered and applied by download-model.sh, which fills INSTANCE_TYPE,
+# HF_REPO_ID, and LOCAL_DIR_NAME. To apply by hand instead:
+#   export INSTANCE_TYPE=ml.p5en.48xlarge HF_REPO_ID=moonshotai/Kimi-K2.5 \
+#          LOCAL_DIR_NAME=moonshotai-Kimi-K2.5
+#   envsubst '${INSTANCE_TYPE} ${HF_REPO_ID} ${LOCAL_DIR_NAME}' \
+#     < download-model-daemonset.yaml | kubectl apply -f -
+#   # wait until each pod logs "Download complete!", then deploy the engine
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: model-downloader
+  labels:
+    app: model-downloader
+spec:
+  selector:
+    matchLabels:
+      app: model-downloader
+  template:
+    metadata:
+      labels:
+        app: model-downloader
+    spec:
+      nodeSelector:
+        node.kubernetes.io/instance-type: ${INSTANCE_TYPE}
+      tolerations:
+        - key: nvidia.com/gpu
+          operator: Exists
+          effect: NoSchedule
+      containers:
+        - name: download
+          image: python:3.11-slim
+          command: ["/bin/bash", "-c"]
+          args:
+            - |
+              pip install -q huggingface_hub &&
+              python3 -c "
+              from huggingface_hub import snapshot_download
+              snapshot_download(
+                  repo_id='${HF_REPO_ID}',
+                  local_dir='/nvme/${LOCAL_DIR_NAME}',
+                  local_dir_use_symlinks=False,
+              )
+              print('Download complete!')
+              " &&
+              echo "Model downloaded to /opt/dlami/nvme/${LOCAL_DIR_NAME} on node $(hostname)" &&
+              sleep infinity
+          volumeMounts:
+            - name: nvme
+              mountPath: /nvme
+          resources:
+            requests:
+              cpu: "1"
+              memory: "4Gi"
+            limits:
+              cpu: "4"
+              memory: "8Gi"
+      volumes:
+        - name: nvme
+          hostPath:
+            path: /opt/dlami/nvme
+            type: DirectoryOrCreate
+      restartPolicy: Always
diff --git a/examples/inference/sglang/download-model.sh b/examples/inference/sglang/download-model.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+#
+# Render download-model-daemonset.yaml for a given model + node type and apply
+# it, pre-staging the weights to every matching node's NVMe (/opt/dlami/nvme).
+#
+# Usage:
+#   ./download-model.sh <HF_REPO_ID> <INSTANCE_TYPE> [LOCAL_DIR_NAME]
+#
+# Examples:
+#   ./download-model.sh moonshotai/Kimi-K2.5  ml.p5en.48xlarge
+#   ./download-model.sh deepseek-ai/DeepSeek-V4-Pro ml.p6-b300.48xlarge
+#
+# LOCAL_DIR_NAME defaults to the repo id with '/' replaced by '-'
+# (e.g. moonshotai/Kimi-K2.5 -> moonshotai-Kimi-K2.5). The weights land at
+# /opt/dlami/nvme/<LOCAL_DIR_NAME> on each node.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+if [[ $# -lt 2 ]]; then
+    echo "Usage: $0 <HF_REPO_ID> <INSTANCE_TYPE> [LOCAL_DIR_NAME]" >&2
+    exit 1
+fi
+
+export HF_REPO_ID="$1"
+export INSTANCE_TYPE="$2"
+export LOCAL_DIR_NAME="${3:-${HF_REPO_ID//\//-}}"
+
+echo "==> Pre-staging ${HF_REPO_ID}"
+echo "    nodes:  ${INSTANCE_TYPE}"
+echo "    target: /opt/dlami/nvme/${LOCAL_DIR_NAME}"
+
+envsubst '${INSTANCE_TYPE} ${HF_REPO_ID} ${LOCAL_DIR_NAME}' \
+    < "${SCRIPT_DIR}/download-model-daemonset.yaml" \
+    | kubectl apply -f -
+
+echo
+echo "==> Applied. Watch progress with:"
+echo "    kubectl logs -f -l app=model-downloader"
+echo "    Each node prints 'Download complete!' when its copy is staged."
+echo "    Remove the downloader once done: kubectl delete daemonset model-downloader"
diff --git a/examples/inference/sglang/dsv4pro-b300-single-node/README.md b/examples/inference/sglang/dsv4pro-b300-single-node/README.md
@@ -0,0 +1,66 @@
+<!--
+Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+SPDX-License-Identifier: MIT-0
+-->
+
+# DeepSeek V4 Pro — Unified on B300 (EKS / HyperPod)
+
+Single-node, non-disaggregated SGLang serving of **DeepSeek V4 Pro** on one
+B300 node. One engine spans all 8 GPUs (`tp=8, dp=8, --enable-dp-attention`,
+MXFP4 MoE, EAGLE speculative decoding).
+
+## Deploy
+
+```bash
+kubectl apply -f dsv4pro-deploy.yaml
+kubectl rollout status deploy/dsv4pro-unified
+```
+
+Targets `ml.p6-b300.48xlarge` nodes (`nodeSelector` in the manifest).
+
+OpenAI-compatible endpoint on `dsv4pro:30000` (`ClusterIP`) — port-forward to
+call it:
+
+```bash
+kubectl port-forward svc/dsv4pro 30000:30000
+curl http://localhost:30000/v1/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"model": "deepseek-ai/DeepSeek-V4-Pro", "prompt": "The capital of France is", "max_tokens": 32}'
+```
+
+Tear down with `kubectl delete -f dsv4pro-deploy.yaml`.
+
+## Benchmark
+
+```bash
+kubectl exec deploy/dsv4pro-unified -- \
+  python3 -m sglang.bench_serving --backend sglang \
+    --dataset-name random --num-prompts 1000 \
+    --random-input 2048 --random-output 256 \
+    --request-rate inf --max-concurrency 25
+```
+
+Reference numbers (`random`, input 2048 / output 256, `--request-rate inf`):
+
+| Concurrency | Req/s | Total tok/s | Output tok/s | Median TTFT | Median TPOT | Mean E2E |
+|---:|---:|---:|---:|---:|---:|---:|
+| 25  | 2.56  | 2,953  | 329.6   | 396 ms  | 56 ms  | 9.7 s  |
+| 50  | 4.28  | 4,946  | 552.1   | 407 ms  | 84 ms  | 11.6 s |
+| 75  | 5.2   | 6,003  | 670.1   | 418 ms  | 105 ms | 14.3 s |
+| 100 | 6.45  | 7,452  | 831.9   | 475 ms  | 119 ms | 15.3 s |
+| 150 | 7.77  | 8,974  | 1,001.8 | 500 ms  | 141 ms | 18.9 s |
+| 200 | 9.99  | 11,535 | 1,287.6 | 592 ms  | 158 ms | 19.5 s |
+| 300 | 12.95 | 14,954 | 1,669.3 | 4.4 s   | 143 ms | 22.0 s |
+| 500 | 14.16 | 16,347 | 1,824.7 | 16.8 s  | 135 ms | 30.5 s |
+
+Throughput keeps climbing to ~16k total tok/s at concurrency 500, but median TTFT
+degrades sharply beyond 200 concurrent requests on a single node (592 ms at 200,
+4.4 s at 300, 16.8 s at 500).
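+
+As a consistency check on the table, dividing total tok/s by req/s gives the
+average tokens handled per request. A sketch using `awk`; the helper name
+`tokens_per_request` is hypothetical:
+
+```bash
+# Hypothetical helper: average tokens per request = total tok/s / req/s.
+tokens_per_request() {
+    awk -v total="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", total / rate }'
+}
+
+tokens_per_request 2953 2.56     # concurrency 25
+tokens_per_request 16347 14.16   # concurrency 500
+```
+
+Both rows come out near 1,150 tokens per request, about half of the nominal
+2048 + 256 = 2304. That would fit request lengths being sampled up to the
+configured values rather than fixed at them, but that reading is an assumption,
+not something the benchmark output above states.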
+
+All model and tuning knobs (env vars + serve flags) live inline in
+[`dsv4pro-deploy.yaml`](./dsv4pro-deploy.yaml). Weights load from the node's
+NVMe at `/opt/dlami/nvme/huggingface` — optionally pre-stage them with the
+shared [`../download-model.sh`](../download-model.sh):
+
+```bash
+../download-model.sh deepseek-ai/DeepSeek-V4-Pro ml.p6-b300.48xlarge
+```