Add Sglang Inference Examples #1128
- Switch base image from the rolling `lmsysorg/sglang:dev-cu13` nightly to the pinned `lmsysorg/sglang:v0.5.12.post1-cu130` release. The nightly shipped NIXL 1.2.0, whose LIBFABRIC GPU HMEM path made prefill->decode KV cache transfer unreliable; the release pins NIXL 1.1.0, which transfers correctly.
- Fix build-image.sh: dockerfilename was `dockerfile` (lowercase), which fails to match `Dockerfile` on case-sensitive Linux, so the image never rebuilt against the edited Dockerfile.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Move the Prometheus-agent → Amazon Managed Prometheus (AMP) setup out of the kimi2.6-h200-1p1d deploy YAML and into shared, reusable files at the sglang/ level so any sample can opt in:

- prometheus-agent-amp.yaml: in-cluster Prometheus agent that scrapes sglang-metrics / dcgm-metrics pods and remote-writes to AMP via SigV4.
- setup-amp-monitoring.sh: idempotent one-shot script that creates or reuses the AMP workspace, enables OIDC, creates the ingest IAM role + ServiceAccount, then renders and applies the agent manifest.
- README: rewrite the "GPU metrics" section into a full Monitoring section (AMP agent, DCGM exporter, Amazon Managed Grafana data source). Document the nvidia.com/gpu.present node-label prerequisite: HyperPod nodes don't carry it by default, so the DCGM DaemonSet stays at DESIRED 0 until labeled.
- kimi2.6-h200-1p1d: drop the inlined Prometheus-agent block (-193 lines) and point the README at the shared manifests instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…issue

Add Indicative results (placeholder), Prerequisites (Cluster/Software), and a Quick start, matching the dsv3-uccl-nixl README structure. Correct the Dockerfile base image (v0.5.12.post1-cu130, not dev-cu13) and drop the stale download-model repo_id edit step. Note that the SGLang nightly's NIXL 1.2.0 breaks prefill->decode KV-cache transfer over EFA, which is why the image is pinned to v0.5.12.post1 (NIXL 1.1.0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
KeitaW
left a comment
Review Batch 1/5 — Structure & Repository Hygiene
Themed batch covering structure, reuse, and image hygiene. Two inline comments accompany this batch (the :latest tags); the rest are cross-cutting and live here in the body.
Observability duplicates an existing repo asset — reuse it instead of shipping a parallel stack
Files: prometheus-agent-amp.yaml, setup-amp-monitoring.sh, dcgm-exporter-daemonset.yaml, READMEs
This PR adds ~400 lines of bespoke monitoring (in-cluster Prometheus agent → AMP via SigV4, an IAM/workspace setup script, a DCGM DaemonSet). The repo already solves this at 4.validation_and_observability/4.prometheus-grafana/eks-managed-observability/ via an ADOT collector (adot-collector-prometheus.yaml, deploy-obs.sh/cleanup-obs.sh, DCGM configs, dashboards). Per the checklist's "Reuse existing assets," I'd drop the bespoke files and have the READMEs link to that as a prerequisite.
Two even-lower-maintenance managed paths, which I'd recommend over both:
- Plain EKS: the AWS managed collector / agentless scraper (`aws amp create-scraper`). AWS runs the scraper outside the cluster (no in-cluster agent to deploy/patch/HA). It's the only option in the EKS console's "Turn on Prometheus metrics" wizard, and covers everything this agent does for SGLang serving.
- HyperPod EKS: the SageMaker HyperPod observability add-on, which provides managed DCGM/metrics + AMP + Grafana dashboards (add one scrape job for SGLang's `:30000/metrics`).
Adopting any managed path also retires four findings on prometheus-agent-amp.yaml (the server-mode bug, the :latest pin, the DCGM root context, and the monitoring share of the privileged concern).
Use one shared SGLang image across all three models instead of three divergent ones
Files: dsv4pro-b300-single-node/dsv4pro-deploy.yaml, qwen3.5-27b-b300-intra-pd/qwen-pd-deploy.yaml, kimi2.6-h200-1p1d/{Dockerfile,build-image.sh,kimi-pd-deploy.yaml}
The three examples ship three different engine images for the same SGLang runtime: Qwen → lmsysorg/sglang:v0.5.12.post1-cu130 (pinned, no EFA layer); Kimi → a custom ECR build (pinned base + EFA installer); DeepSeek → lmsysorg/sglang:deepseek-v4-b300 (opaque, unversioned tag). The model is a runtime arg (--model-path), so one image serves all three.
deepseek-v4-b300 is worth flagging on its own: it is not a version tag; it was pushed 2026-04-29, a month before the v0.5.12.post1-cu130 release the PR pins to; and it is larger (18.1 GB vs 13.0 GB). So the DeepSeek example runs an older, non-pinned build with an unknown NIXL version, quietly contradicting the PR's "pinned for NIXL 1.1.0" rationale.
Suggestion: promote the Kimi Dockerfile (pinned base + EFA installer) to the shared examples/inference/sglang/ level, build once, and have all three manifests reference it — selecting the model via --model-path. Bonus: gives Qwen the EFA/LIBFABRIC path it currently lacks. If deepseek-v4-b300 carries a needed patch, please state what.
build-image.sh is missing its shebang and the MIT-0 license header
File: kimi2.6-h200-1p1d/build-image.sh
It starts straight at algorithm_name=sgl-dev-cu13 — no #!/usr/bin/env bash, no copyright header (and no set -euo pipefail, see the Deployment batch). The sibling download-model.sh / setup-amp-monitoring.sh get this right. Suggested top-of-file:
#!/usr/bin/env bash
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
set -euo pipefail

--enable-metrics
# ---- PD router (shares pod network → reaches engines on 127.0.0.1) ---
- name: router
  image: lmsysorg/sgl-model-gateway:latest
:latest image tag on the router sidecar
lmsysorg/sgl-model-gateway:latest (and prom/prometheus:latest, and the DeepSeek router) pulls a moving tag. The repo convention is fixed tags everywhere so deployments are reproducible and air-gapped clusters don't break — I'd pin to a released tag (ideally a digest). The SGLang engine image is already pinned correctly; only the sidecars float.
serviceAccountName: amp-iamproxy-ingest-service-account
containers:
  - name: prometheus
    image: prom/prometheus:latest
prom/prometheus:latest
Same :latest concern: pin Prometheus to a released tag so the agent is reproducible. (The floating tag is also the root cause of the agent-runs-in-server-mode bug flagged in the Deployment batch: :latest is now 3.x, where the --enable-feature=agent flag was removed.)
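Floating tags like the two flagged above can be caught before review with a simple grep over the manifests. The following is a minimal sketch, not part of the PR: the function name `find_latest_tags` is hypothetical, and the pattern only catches an explicit `:latest` (an image reference with no tag at all also resolves to `latest` and would need a separate check).

```shell
# Print every "image:" line that uses the floating :latest tag,
# prefixed with its line number (and file name when several files are given).
# Always returns 0, so it is safe to call under `set -e` when nothing matches.
find_latest_tags() {
  grep -nE 'image:[[:space:]]*[^[:space:]]+:latest([[:space:]]|$)' "$@" || true
}
```

For instance, `find_latest_tags kimi-pd-deploy.yaml prometheus-agent-amp.yaml` would list the router and Prometheus lines while leaving a pinned engine image unreported.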
KeitaW
left a comment
Review Batch 2/5 — Deployment Pipeline & K8s Operational Correctness
Operational-correctness batch. Three inline comments accompany it (the Prometheus server-mode bug, the Qwen UCX note, the Kimi imagePullPolicy); the cross-cutting items are below.
nodeSelector pins the ml.-prefixed instance type — only matches HyperPod, not plain EKS
Files: kimi-pd-deploy.yaml (91, 200), dsv4pro-deploy.yaml (48), qwen-pd-deploy.yaml (51), download-model-daemonset.yaml (31, via ${INSTANCE_TYPE})
Every serving manifest pins node.kubernetes.io/instance-type: ml.p5en.48xlarge / ml.p6-b300.48xlarge. That ml. prefix is the HyperPod instance-group form; a plain EKS managed nodegroup labels the same key with the bare EC2 type (p6-b300.48xlarge). I verified the bare form on a real B300 EKS node and a HyperPod EKS system node — the pod sits Pending on plain EKS until changed. Since the READMEs advertise "EKS / HyperPod EKS," it should run on both. A nodeSelector can't express OR; nodeAffinity with In can (same key, two values):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - p6-b300.48xlarge      # plain EKS managed nodegroup
                - ml.p6-b300.48xlarge   # SageMaker HyperPod instance group

(Use the p5en pair for Kimi.) I confirmed this scheduled and served on a plain-EKS B300 node. No single cross-environment label is reliable (sagemaker.amazonaws.com/instance-type is HyperPod-only; nvidia.com/gpu.product needs the GPU Operator, which HyperPod doesn't run by default).
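When scripting around the two label forms (for example, to decide which values a cluster needs), it can help to normalize them first. This is a hedged sketch: both function names are hypothetical, and it assumes the only difference between the HyperPod and plain-EKS values is the leading `ml.` prefix described above.

```shell
# Strip the HyperPod "ml." prefix so both label forms compare equal.
normalize_instance_type() {
  printf '%s\n' "${1#ml.}"
}

# Succeeds when two instance-type label values name the same EC2 type,
# regardless of which form (bare or ml.-prefixed) each one uses.
instance_type_matches() {
  [[ "$(normalize_instance_type "$1")" == "$(normalize_instance_type "$2")" ]]
}
```

To see which form a cluster's nodes actually carry, `kubectl get nodes -L node.kubernetes.io/instance-type` prints the label as an extra column.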
A DaemonSet is the wrong primitive for model-weight download
Files: download-model-daemonset.yaml, download-model.sh
A DaemonSet can't model a one-shot "download once and stop": its pods must use restartPolicy: Always, so a download that exits 0 is restarted in a loop, and on node scale-out or replacement it re-downloads, with no completion signal beyond a manual kubectl delete daemonset.

It's also novel here: on this base branch every model/dataset download is a Job or script (examples/training/{verl,nemo-rl,optimum-neuron}); the only DaemonSets are the EFA/health exporters. The sibling examples/inference/vllm/dsv3-uccl-nixl (which the Kimi README is modeled on) uses no downloader at all: the serving pod downloads on first start to HF_HOME on local NVMe, with FSx for Lustre documented as the alternative (~680 GB loads in ~5 min local; +1–2 min on FSx first node).

Recommend:
- Single-node: download-on-startup or an initContainer.
- 2-node Kimi: an initContainer or run-once Job, or FSx (download once, mount read-only on both nodes).

Local-NVMe staging itself is fine; it's the DaemonSet wrapper that's wrong. "Use a Job" is not the fix on its own, because a Job can't run one-pod-per-node.
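The "download once and stop" behavior recommended above can be sketched as a small guard that an initContainer or startup script might run. This is an illustration under stated assumptions, not the PR's `download-model.sh`: the function name `stage_once` and the `.download-complete` marker file are hypothetical, and the actual download command is passed in by the caller.

```shell
# Run the given download command only if the completion marker is absent.
# Usage: stage_once <target_dir> <command> [args...]
stage_once() {
  local target_dir="$1"
  shift
  local marker="${target_dir}/.download-complete"   # hypothetical marker name
  if [[ -f "$marker" ]]; then
    echo "weights already staged in ${target_dir}; skipping"
    return 0
  fi
  mkdir -p "$target_dir"
  # Write the marker only after the download command succeeds, so a
  # partial download is retried on the next start.
  "$@" || return 1
  touch "$marker"
}
```

A caller might wrap its real downloader this way, for example `stage_once /mnt/nvme/model huggingface-cli download <repo_id> --local-dir /mnt/nvme/model`, where the path is a placeholder. Because the function exits 0 immediately on a warm node, a pod restart does not trigger a re-download.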
build-image.sh lacks set -euo pipefail and quotes on some expansions
File: kimi2.6-h200-1p1d/build-image.sh
Without set -euo pipefail, a failed docker build/aws ecr doesn't stop the script — it continues to docker push a stale/missing image. A couple of expansions are unquoted (--region $region, -t ${algorithm_name}). Add strict mode (see the header suggestion in the Structure batch) and quote the expansions.
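A minimal sketch of what a strict-mode version could look like, with every expansion quoted. It is not the PR's script: the function names and the account/tag parameters are hypothetical, and only `algorithm_name` and `region` are taken from the existing file.

```shell
#!/usr/bin/env bash
# Strict mode: exit on any failed command, unset variable, or failed pipe stage.
set -euo pipefail

# Compose the ECR image URI from its parts; all expansions are quoted.
ecr_image_uri() {
  local account="$1" region="$2" algorithm_name="$3" tag="$4"
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' \
    "$account" "$region" "$algorithm_name" "$tag"
}

# Log in, build, and push. When called as a plain top-level command,
# `set -e` aborts at a failed login or build, so a stale image is never pushed.
build_and_push() {
  local account="$1" region="$2" algorithm_name="$3" tag="$4"
  local registry="${account}.dkr.ecr.${region}.amazonaws.com"
  local uri
  uri="$(ecr_image_uri "$account" "$region" "$algorithm_name" "$tag")"
  aws ecr get-login-password --region "$region" \
    | docker login --username AWS --password-stdin "$registry"
  docker build -t "$uri" -f Dockerfile .
  docker push "$uri"
  echo "$uri"
}
```

Calling `build_and_push "$account" "$region" sgl-dev-cu13 "$tag"` at the end of the script would print the URI to paste into the deploy manifest.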
No livenessProbe on the serving deployments
Files: all four serving manifests
They define a readinessProbe but no livenessProbe. A wedged engine (CUDA hang, NIXL stuck in KVPoll.WaitingForInput) stops passing readiness and drops out of rotation, but nothing restarts the pod — it sits Running forever. Add a livenessProbe hitting /health on the foreground engine port.
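One way to keep the probe consistent across the manifests is to render the stanza from a small helper. This is a sketch only: the function name is hypothetical, and the port, initial delay, period, and failure threshold are assumptions to tune per model, not values taken from the PR.

```shell
# Print a livenessProbe stanza for the given engine port.
# Arguments: <port> <initial_delay_seconds>
# The delay must exceed the model load time, or the kubelet will restart
# the engine while it is still loading weights.
render_liveness_probe() {
  local port="$1" initial_delay="$2"
  cat <<EOF
livenessProbe:
  httpGet:
    path: /health
    port: ${port}
  initialDelaySeconds: ${initial_delay}
  periodSeconds: 30
  failureThreshold: 3
EOF
}
```

For example, `render_liveness_probe 30000 600` assumes the engine serves `/health` on port 30000 and needs up to ten minutes to load; a Kubernetes startupProbe is an alternative to a long fixed delay.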
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=2h'
- '--enable-feature=agent'
Runs in server mode, not agent mode as written
prom/prometheus:latest is now 3.12.0, and Prometheus removed --enable-feature=agent at 3.0 (it became --agent). I ran it to confirm: on 3.12 the manifest logs WARN "Unknown option for --enable-feature" option=agent, then Starting Prometheus Server mode=server. So the flag is ignored with only a warning, and Prometheus starts as a full server with a local TSDB (written to the emptyDir), not the lightweight agent the file's header describes. It doesn't crash (remote_write still works in server mode), so it looks healthy while doing the wrong thing. Pin a released version (per the Structure batch) and fix the flag:
Current:   - '--enable-feature=agent'
Suggested: - '--agent'
The ConfigMap (scrape_configs + remote_write, no alerting/rules) is agent-mode-compatible. Adjacent notes: replicas: 1 is a SPOF (an HA pair needs cluster+__replica__ labels or AMP bills 2×), and the WAL on emptyDir loses buffered samples on eviction — both moot if you adopt a managed path.
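Because the correct flag depends on the major version, a render script could pick it from the pinned tag. The helper below is a hypothetical sketch: it assumes a `MAJOR.MINOR.PATCH` tag (optionally prefixed with `v`), assumes any 2.x tag in use is recent enough to support agent mode, and deliberately rejects floating tags such as `latest`.

```shell
# Print the agent-mode flag for a pinned Prometheus version tag.
# Fails (non-zero) on tags that do not start with a numeric major version.
prometheus_agent_flag() {
  local version="${1#v}"        # tolerate a leading "v"
  local major="${version%%.*}"  # text before the first dot
  if ! [[ "$major" =~ ^[0-9]+$ ]]; then
    echo "refusing floating or malformed tag: $1" >&2
    return 1
  fi
  if (( major >= 3 )); then
    echo '--agent'                 # 3.0 and later
  else
    echo '--enable-feature=agent'  # 2.x feature flag
  fi
}
```

Rejecting `latest` outright keeps the flag choice and the image pin from drifting apart again.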
  capabilities:
    add: ["SYS_NICE"]
env:
  - name: NIXL_LOG_LEVEL
Benign "use LIBFABRIC" warning — worth one README line (nit)
Not a correctness issue. No SGLANG_DISAGGREGATION_NIXL_BACKEND is set, so NIXL uses UCX, which is correct here (single-node intra-node PD; the KV hop never crosses EFA). When I ran this on a B300 node it served correctly, but the engine logs "16 Amazon EFA(s) were detected, but the UCX backend was configured ... recommended to use the LIBFABRIC backend instead." A one-line README note ("intra-node PD uses UCX; the EFA-detected warning is expected/benign") saves the next person from chasing it. No manifest change needed.
containers:
  - name: sglang
    image: <YOUR_ECR_IMAGE> # the URI build-image.sh printed, e.g. <account>.dkr.ecr.<region>.amazonaws.com/sgl-dev-cu13:<tag>
    imagePullPolicy: Always
imagePullPolicy: Always forces a pull on every start
Both the prefill and decode StatefulSets use Always (also line 211), which re-pulls a multi-GB CUDA image on every pod start and breaks air-gapped clusters. The image is a pinned ECR tag, so IfNotPresent is right.
Current:   imagePullPolicy: Always
Suggested: imagePullPolicy: IfNotPresent
KeitaW
left a comment
Review Batch 3/5 — Infrastructure, NCCL & Container Security
Container-security and NCCL batch. One inline comment (DCGM root rationale); the rest below.
privileged: true on all SGLang serving containers
Files: dsv4pro-deploy.yaml (54), qwen-pd-deploy.yaml (58), both Kimi StatefulSets
Every engine container runs fully privileged — all capabilities, host device access, isolation off. The workload needs some elevation (EFA/RDMA pinned memory, GPUDirect via gdrdrv, SYS_NICE), but privileged: true is a much bigger hammer. I'd scope to the specific capabilities (IPC_LOCK, SYS_NICE) plus the device mounts already declared, and drop privileged. If full privilege is genuinely required for the NIXL/EFA path, a one-line comment stating why makes it a deliberate, reviewable choice.
NCCL_SOCKET_IFNAME is unset (minor, multi-node Kimi only)
File: kimi2.6-h200-1p1d/kimi-pd-deploy.yaml
Moot for single-node (TP on NVLink) and the cross-node KV hop uses NIXL not NCCL — low priority. But for the 2-node Kimi setup, if NCCL ever initializes a cross-node control channel, the default interface pick can land on the wrong NIC. Cheap guard: set NCCL_SOCKET_IFNAME=^lo (exclusion form, never positive selection like eth0). See the EFA cheatsheet.
    hostPort: 9400
securityContext:
  runAsNonRoot: false
  runAsUser: 0
DCGM runs as root — add a one-line rationale
runAsNonRoot: false + runAsUser: 0 is the standard DCGM requirement (host GPU/driver access), so this is likely fine, but the checklist asks for a rationale comment so a future reader doesn't "fix" it. A short `# DCGM requires root for host GPU access` above the securityContext does it.
KeitaW
left a comment
Review Batch 4/5 — Documentation Consistency
Documentation batch (no inline comments — both items are PR-level).
PR description lists a CloudFormation change that isn't in the diff
The PR body's change #5 says "repoint 7 CloudFormation 1-click deploy templates at the renamed awsome-distributed-ai S3 bucket," but the diff touches only files under examples/inference/sglang/ (15 files) — no CloudFormation templates. Either that work was dropped from this branch or belongs to a different PR; I'd reconcile the description with the diff so reviewers and the changelog aren't misled.
Base branch is worktree-repo-reorg, not main
This PR targets worktree-repo-reorg. If that's intentional stacking onto an in-flight reorg, all good — just worth confirming the merge target is what you want.
KeitaW
left a comment
Review Batch 5/5 — Evaluation, Positives & Sources
Final batch — evaluation framing, what's great, and the source list.
Placeholder benchmark tables — framing is correct, just don't quote them yet
The per-model READMEs ship TBD latency/throughput tables but explicitly label them "Not yet measured" and the Kimi example "[draft-1]." That honest framing is exactly right — this is "validates the deployment runs," not "reproduces a published result," so no methodology obligation is triggered. Just make sure the TBD numbers aren't quoted downstream until filled. My e2e run confirms the Qwen example serves correctly end-to-end.
Things That Look Great
- Directory placement is correct: `examples/inference/sglang/<model>/` follows the RFC #1056 "library is the demo subject" rule, with shared helpers one level up.
- Excellent shared-helper reuse: `download-model.sh`, the DCGM daemonset, and the AMP monitoring are factored out and reused across all three examples; extracting the inlined Prometheus block into shared manifests was the right call.
- Exemplary engine-image pinning with rationale: `lmsysorg/sglang:v0.5.12.post1-cu130` is pinned and the README explains why (NIXL 1.1.0 vs the nightly's 1.2.0).
- `download-model.sh` / `setup-amp-monitoring.sh` have the shebang, MIT-0 header, `set -euo pipefail`, and a properly whitelisted `envsubst`.
- Honest results framing: "Not yet measured" / "[draft-1]" instead of fabricated numbers.
- Thorough READMEs: prerequisites, quick start, monitoring, and a genuinely useful "known issues" section (the DCGM `nvidia.com/gpu.present` node-label caveat is a real gotcha).
- It works: I deployed `qwen3.5-27b-b300-intra-pd` (adapted for plain EKS) on a real B300 node; it came up `2/2 Ready` in ~6 min and served a correct completion through the PD router. The intra-node prefill/decode + NIXL topology is sound.
Sources
Kubernetes: node affinity, probes, imagePullPolicy, SecurityContext, DaemonSet, Job, initContainers.
AWS: AMP ingest, managed scraper, HyperPod observability add-on, ADOT on EKS, FSx for Lustre CSI.
Prometheus: agent mode feature flag, 3.0 migration. Verified live: prom/prometheus:latest == 3.12.0 → mode=server on --enable-feature=agent (2026-06-11).
Repo precedent: eks-managed-observability, dsv3-uccl-nixl, download-Job precedents under examples/training/{verl,nemo-rl,optimum-neuron}.
Several findings here are evidence-backed by a live B300 deploy and a prom/prometheus:latest flag test, plus dedicated research on weight-staging and observability patterns.
Purpose
Provide more references for LLM inference on AWS: add a set of self-contained SGLang serving examples for Amazon EKS / SageMaker HyperPod EKS, covering both prefill/decode (PD) disaggregation and unified serving across H200 and B300 hardware.
Changes
fronted by the SGLang router. Image pinned to SGLang v0.5.12.post1. README documents the known NIXL 1.2.0 KV-transfer issue.
in-cluster Prometheus agent → Amazon Managed Prometheus, read by Amazon Managed Grafana.
Test Plan
Environment:
Test Results
⚠️ Not yet measured. The per-model READMEs currently carry placeholder (TBD) latency/throughput tables. Benchmark numbers (sglang.bench_serving sweep: burst RPS, P50 TTFT/TPOT, tok/s/GPU) need to be filled in before these are quoted downstream. The Kimi 1P1D example is also marked [draft-1] pending a working NIXL 1.2.0 KV-transfer path.
Directory Structure
examples/inference/
└── sglang/
    ├── README.md                      # engine overview + shared helpers
    ├── download-model.sh / *-daemonset.yaml
    ├── dcgm-exporter-daemonset.yaml
    ├── setup-amp-monitoring.sh / prometheus-agent-amp.yaml
    ├── kimi2.6-h200-1p1d/             # Dockerfile, build-image.sh, README, kimi-pd-deploy.yaml
    ├── dsv4pro-b300-single-node/      # README, dsv4pro-deploy.yaml
    └── qwen3.5-27b-b300-intra-pd/     # README, qwen-pd-deploy.yaml
Checklist