Add Sglang Inference Examples #1128

Open
ybalbert001 wants to merge 4 commits into awslabs:worktree-repo-reorg from ybalbert001:worktree-repo-reorg

@ybalbert001

Purpose

Provide more references for LLM inference on AWS — add a set of self-contained SGLang serving examples for Amazon EKS / SageMaker HyperPod EKS, covering both prefill/decode (PD) disaggregation and unified
serving across H200 and B300 hardware.

Changes

  1. Kimi2.6 on SGLang (node-level 1P1D) — examples/inference/sglang/kimi2.6-h200-1p1d/. Prefill + decode StatefulSets across 2× ml.p5en.48xlarge (16× H200), KV cache transferred over NIXL (LIBFABRIC/EFA),
    fronted by the SGLang router. Image pinned to SGLang v0.5.12.post1. README documents the known NIXL 1.2.0 KV-transfer issue.
  2. DeepSeek V4 Pro on B300 — examples/inference/sglang/dsv4pro-b300-single-node/. Unified (non-PD) baseline on a single B300 (8 GPU) node.
  3. Qwen3.5-27B on B300 (intra-node PD) — examples/inference/sglang/qwen3.5-27b-b300-intra-pd/. 6 prefill + 2 decode in one pod on a single B300 node, NIXL, with an SGLang router sidecar.
  4. Shared SGLang helpers — model pre-staging to local NVMe (download-model.sh + daemonset), DCGM exporter daemonset, and reusable AMP monitoring (setup-amp-monitoring.sh + prometheus-agent-amp.yaml):
    in-cluster Prometheus agent → Amazon Managed Prometheus, read by Amazon Managed Grafana.
  5. Infra fix — repoint 7 CloudFormation 1-click deploy templates at the renamed awsome-distributed-ai S3 bucket.

Test Plan

Environment:

  • AWS Service: SageMaker HyperPod EKS (1.33+)
  • Instance type: ml.p5en.48xlarge (H200), ml.p6-b300.48xlarge (B300)
  • Number of nodes: 2 (Kimi 1P1D); 1 (DeepSeek V4 Pro, Qwen3.5-27B)

Test Results

⚠️ Not yet measured. The per-model READMEs currently carry placeholder (TBD) latency/throughput tables. Benchmark numbers (sglang.bench_serving sweep: burst RPS, P50 TTFT/TPOT, tok/s/GPU) need to be filled in before these are quoted downstream. The Kimi 1P1D example is also marked [draft-1] pending a working NIXL 1.2.0 KV-transfer path.

Directory Structure

examples/inference/
└── sglang/
    ├── README.md                      # engine overview + shared helpers
    ├── download-model.sh / *-daemonset.yaml
    ├── dcgm-exporter-daemonset.yaml
    ├── setup-amp-monitoring.sh / prometheus-agent-amp.yaml
    ├── kimi2.6-h200-1p1d/             # Dockerfile, build-image.sh, README, kimi-pd-deploy.yaml
    ├── dsv4pro-b300-single-node/      # README, dsv4pro-deploy.yaml
    └── qwen3.5-27b-b300-intra-pd/     # README, qwen-pd-deploy.yaml

Checklist

  • I have read the contributing guidelines (https://github.com/awslabs/awsome-distributed-training/blob/main/CONTRIBUTING.md).
  • I am working against the latest main branch.
  • I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • The contribution is self-contained with documentation and scripts.
  • External dependencies are pinned to a specific version or tag (SGLang v0.5.12.post1; no latest).
  • A README is included or updated with prerequisites, instructions, and known issues.
  • New test cases follow the expected directory structure (#directory-structure).

Li and others added 4 commits June 9, 2026 08:06
- Switch base image from the rolling `lmsysorg/sglang:dev-cu13` nightly to
  the pinned `lmsysorg/sglang:v0.5.12.post1-cu130` release. The nightly
  shipped NIXL 1.2.0, whose LIBFABRIC GPU HMEM path made prefill->decode KV
  cache transfer unreliable; the release pins NIXL 1.1.0, which transfers
  correctly.
- Fix build-image.sh: dockerfilename was `dockerfile` (lowercase), which
  fails to match `Dockerfile` on case-sensitive Linux, so the image never
  rebuilt against the edited Dockerfile.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Move the Prometheus-agent → Amazon Managed Prometheus (AMP) setup out of the
kimi2.6-h200-1p1d deploy YAML and into shared, reusable files at the sglang/
level so any sample can opt in:

- prometheus-agent-amp.yaml: in-cluster Prometheus agent that scrapes
  sglang-metrics / dcgm-metrics pods and remote-writes to AMP via SigV4.
- setup-amp-monitoring.sh: idempotent one-shot — create/reuse AMP workspace,
  enable OIDC, create the ingest IAM role + ServiceAccount, render and apply
  the agent manifest.
- README: rewrite the "GPU metrics" section into a full Monitoring section
  (AMP agent, DCGM exporter, Amazon Managed Grafana data source). Document the
  nvidia.com/gpu.present node-label prerequisite — HyperPod nodes don't carry
  it by default, so the DCGM DaemonSet stays at DESIRED 0 until labeled.
- kimi2.6-h200-1p1d: drop the inlined Prometheus-agent block (-193 lines) and
  point the README at the shared manifests instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…issue

Add Indicative results (placeholder), Prerequisites (Cluster/Software), and
a Quick start, matching the dsv3-uccl-nixl README structure. Correct the
Dockerfile base image (v0.5.12.post1-cu130, not dev-cu13) and drop the stale
download-model repo_id edit step. Note that the SGLang nightly's NIXL 1.2.0
breaks prefill->decode KV-cache transfer over EFA, which is why the image is
pinned to v0.5.12.post1 (NIXL 1.1.0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@KeitaW KeitaW self-requested a review June 11, 2026 01:51

@KeitaW KeitaW left a comment

Collaborator

Review Batch 1/5 — Structure & Repository Hygiene

Themed batch covering structure, reuse, and image hygiene. Two inline comments accompany this batch (the :latest tags); the rest are cross-cutting and live here in the body.

Observability duplicates an existing repo asset — reuse it instead of shipping a parallel stack

Files: prometheus-agent-amp.yaml, setup-amp-monitoring.sh, dcgm-exporter-daemonset.yaml, READMEs

This PR adds ~400 lines of bespoke monitoring (in-cluster Prometheus agent → AMP via SigV4, an IAM/workspace setup script, a DCGM DaemonSet). The repo already solves this at 4.validation_and_observability/4.prometheus-grafana/eks-managed-observability/ via an ADOT collector (adot-collector-prometheus.yaml, deploy-obs.sh/cleanup-obs.sh, DCGM configs, dashboards). Per the checklist's "Reuse existing assets," I'd drop the bespoke files and have the READMEs link to that as a prerequisite.

Two even-lower-maintenance managed paths, which I'd recommend over both:

  • Plain EKS: the AWS managed collector / agentless scraper (aws amp create-scraper) — AWS runs the scraper outside the cluster (no in-cluster agent to deploy/patch/HA). It's the only option in the EKS console's "Turn on Prometheus metrics" wizard, and covers everything this agent does for SGLang serving.
  • HyperPod EKS: the SageMaker HyperPod observability add-on — managed DCGM/metrics + AMP + Grafana dashboards (add one scrape job for SGLang's :30000/metrics).
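
The extra scrape job mentioned for SGLang could look roughly like the following, written in standard Prometheus scrape_config syntax. This is a minimal sketch: the job name is hypothetical, port 30000 is taken from the bullet above, and how the managed scraper or the HyperPod add-on accepts a custom job is not covered here.

```yaml
scrape_configs:
- job_name: sglang                 # hypothetical job name
  metrics_path: /metrics
  kubernetes_sd_configs:
  - role: pod                      # discover every pod, then filter below
  relabel_configs:
  # Keep only targets whose declared container port is 30000.
  - source_labels: [__meta_kubernetes_pod_container_port_number]
    regex: "30000"
    action: keep
```

Filtering on the declared container port only works if the serving manifests list port 30000 under ports; otherwise a pod label would be the more reliable selector.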

Adopting any managed path also retires four findings on prometheus-agent-amp.yaml (the server-mode bug, the :latest pin, the DCGM root context, and the monitoring share of the privileged concern).

Use one shared SGLang image across all three models instead of three divergent ones

Files: dsv4pro-b300-single-node/dsv4pro-deploy.yaml, qwen3.5-27b-b300-intra-pd/qwen-pd-deploy.yaml, kimi2.6-h200-1p1d/{Dockerfile,build-image.sh,kimi-pd-deploy.yaml}

The three examples ship three different engine images for the same SGLang runtime: Qwen → lmsysorg/sglang:v0.5.12.post1-cu130 (pinned, no EFA layer); Kimi → a custom ECR build (pinned base + EFA installer); DeepSeek → lmsysorg/sglang:deepseek-v4-b300 (opaque, unversioned tag). The model is a runtime arg (--model-path), so one image serves all three.

deepseek-v4-b300 is worth flagging on its own: it's not a version tag, it was pushed 2026-04-29 (a month before the v0.5.12.post1-cu130 release the PR pins to), and it's larger (18.1 GB vs 13.0 GB). So the DeepSeek example runs an older, non-pinned build with an unknown NIXL version, quietly contradicting the PR's "pinned for NIXL 1.1.0" rationale.

Suggestion: promote the Kimi Dockerfile (pinned base + EFA installer) to the shared examples/inference/sglang/ level, build once, and have all three manifests reference it — selecting the model via --model-path. Bonus: gives Qwen the EFA/LIBFABRIC path it currently lacks. If deepseek-v4-b300 carries a needed patch, please state what.
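
To illustrate the suggestion, a container spec that reuses one image and changes only the model argument might look like this. It is a sketch under assumptions: <SHARED_ECR_IMAGE> and <MODEL_DIR> are placeholders, the /models mount path is hypothetical, port 30000 is assumed from the metrics endpoint mentioned earlier, and the per-model tuning flags each manifest already carries are omitted.

```yaml
# Fragment of a pod template; all three manifests would reference the same image.
containers:
- name: sglang
  image: <SHARED_ECR_IMAGE>        # placeholder: the single image built from the shared Dockerfile
  command: ["python3", "-m", "sglang.launch_server"]
  args:
  - --model-path
  - /models/<MODEL_DIR>            # placeholder: the only value that differs per model
  - --port
  - "30000"
  - --enable-metrics
```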

build-image.sh is missing its shebang and the MIT-0 license header

File: kimi2.6-h200-1p1d/build-image.sh

It starts straight at algorithm_name=sgl-dev-cu13 — no #!/usr/bin/env bash, no copyright header (and no set -euo pipefail, see the Deployment batch). The sibling download-model.sh / setup-amp-monitoring.sh get this right. Suggested top-of-file:

#!/usr/bin/env bash
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
set -euo pipefail

--enable-metrics
# ---- PD router (shares pod network → reaches engines on 127.0.0.1) ---
- name: router
  image: lmsysorg/sgl-model-gateway:latest

:latest image tag on the router sidecar

lmsysorg/sgl-model-gateway:latest (and prom/prometheus:latest, and the DeepSeek router) pulls a moving tag. The repo convention is fixed tags everywhere so deployments are reproducible and air-gapped clusters don't break — I'd pin to a released tag (ideally a digest). The SGLang engine image is already pinned correctly; only the sidecars float.
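
A pinned sidecar reference could be written as below. This is only a sketch: <RELEASED_TAG> and <DIGEST> are placeholders, since the review does not name a specific router release, and the real values have to be looked up in the registry.

```yaml
- name: router
  # Tag documents intent; the digest is what actually fixes the image content.
  image: lmsysorg/sgl-model-gateway:<RELEASED_TAG>@sha256:<DIGEST>
  imagePullPolicy: IfNotPresent
```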

serviceAccountName: amp-iamproxy-ingest-service-account
containers:
- name: prometheus
  image: prom/prometheus:latest

prom/prometheus:latest

Same :latest concern: pin Prometheus to a released tag so the agent is reproducible. (The floating tag is also the root cause of the agent-runs-in-server-mode bug flagged in the Deployment batch: :latest now resolves to 3.x, where the --enable-feature=agent flag was removed.)

@KeitaW KeitaW left a comment

Review Batch 2/5 — Deployment Pipeline & K8s Operational Correctness

Operational-correctness batch. Three inline comments accompany it (the Prometheus server-mode bug, the Qwen UCX note, the Kimi imagePullPolicy); the cross-cutting items are below.

nodeSelector pins the ml.-prefixed instance type — only matches HyperPod, not plain EKS

Files: kimi-pd-deploy.yaml (91, 200), dsv4pro-deploy.yaml (48), qwen-pd-deploy.yaml (51), download-model-daemonset.yaml (31, via ${INSTANCE_TYPE})

Every serving manifest pins node.kubernetes.io/instance-type: ml.p5en.48xlarge / ml.p6-b300.48xlarge. That ml. prefix is the HyperPod instance-group form; a plain EKS managed nodegroup labels the same key with the bare EC2 type (p6-b300.48xlarge). I verified the bare form on a real B300 EKS node and a HyperPod EKS system node — the pod sits Pending on plain EKS until changed. Since the READMEs advertise "EKS / HyperPod EKS," it should run on both. A nodeSelector can't express OR; nodeAffinity with In can (same key, two values):

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - p6-b300.48xlarge      # plain EKS managed nodegroup
                - ml.p6-b300.48xlarge   # SageMaker HyperPod instance group

(Use the p5en pair for Kimi.) I confirmed this scheduled+served on a plain-EKS B300 node. No single cross-environment label is reliable (sagemaker.amazonaws.com/instance-type is HyperPod-only; nvidia.com/gpu.product needs the GPU Operator, which HyperPod doesn't run by default).
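
For the Kimi manifests, the same pattern with the p5en pair would read as follows. Note the assumption: the bare p5en.48xlarge label value is inferred by analogy with the B300 case and was not verified separately in this review.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - p5en.48xlarge      # plain EKS managed nodegroup (assumed label value)
          - ml.p5en.48xlarge   # SageMaker HyperPod instance group
```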

A DaemonSet is the wrong primitive for model-weight download

Files: download-model-daemonset.yaml, download-model.sh

A DaemonSet can't model a one-shot "download once and stop": its pods must use restartPolicy: Always, so a download that exits 0 is restarted in a loop, and on node scale-out/replacement it re-downloads — with no completion signal beyond a manual kubectl delete daemonset. It's also novel here: on this base branch every model/dataset download is a Job or script (examples/training/{verl,nemo-rl,optimum-neuron}); the only DaemonSets are the EFA/health exporters. The sibling examples/inference/vllm/dsv3-uccl-nixl (which the Kimi README modeled on) uses no downloader at all — the serving pod downloads on first start to HF_HOME on local NVMe, with FSx for Lustre documented as the alternative (~680 GB loads in ~5 min local; +1–2 min on FSx first node). Recommend: single-node → download-on-startup or an initContainer; 2-node Kimi → initContainer/run-once Job, or FSx (download once, mount RO on both). Local-NVMe staging itself is fine — it's the DaemonSet wrapper that's wrong; "use a Job" is not the fix (a Job can't run one-pod-per-node).
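
The initContainer option could be sketched as below. Everything in angle brackets is a placeholder, the marker-file convention is an illustrative choice rather than something download-model.sh is known to do, and huggingface-cli is assumed to be present in whichever image is used.

```yaml
# Pod template fragment: stage weights before the engine container starts.
initContainers:
- name: stage-model
  image: <IMAGE_WITH_HF_CLI>        # placeholder: a pinned image that ships huggingface-cli
  command: ["/bin/sh", "-c"]
  args:
  - |
    # Skip the download when a previous run already completed it.
    if [ ! -f /models/.download-complete ]; then
      huggingface-cli download "$MODEL_ID" --local-dir /models/weights \
        && touch /models/.download-complete
    fi
  env:
  - name: MODEL_ID
    value: <HF_REPO_ID>             # placeholder
  volumeMounts:
  - name: model-store
    mountPath: /models
volumes:
- name: model-store
  hostPath:
    path: <LOCAL_NVME_PATH>         # placeholder: node-local NVMe mount
    type: DirectoryOrCreate
```

Because the init container exits once the weights are in place, the pod gets the completion signal a DaemonSet lacks, and a failed download surfaces as a failed init step instead of a restart loop.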

build-image.sh lacks set -euo pipefail and quotes on some expansions

File: kimi2.6-h200-1p1d/build-image.sh

Without set -euo pipefail, a failed docker build/aws ecr doesn't stop the script — it continues to docker push a stale/missing image. A couple of expansions are unquoted (--region $region, -t ${algorithm_name}). Add strict mode (see the header suggestion in the Structure batch) and quote the expansions.

No livenessProbe on the serving deployments

Files: all four serving manifests

They define a readinessProbe but no livenessProbe. A wedged engine (CUDA hang, NIXL stuck in KVPoll.WaitingForInput) stops passing readiness and drops out of rotation, but nothing restarts the pod — it sits Running forever. Add a livenessProbe hitting /health on the foreground engine port.
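
One possible shape for the probe is sketched below. Port 30000 is assumed from the metrics endpoint mentioned in the Structure batch, and all timing values are illustrative guesses that would need tuning per model. A startupProbe is paired with the livenessProbe so that a long weight load is not mistaken for a hang.

```yaml
# Container-level fragment for the engine container.
startupProbe:                # holds off liveness checks until the engine first answers
  httpGet:
    path: /health
    port: 30000
  periodSeconds: 10
  failureThreshold: 90       # allows up to 15 minutes of startup
livenessProbe:
  httpGet:
    path: /health
    port: 30000
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3        # restart after about 90 s of consecutive failures
```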

- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=2h'
- '--enable-feature=agent'

Runs in server mode, not agent mode as written

prom/prometheus:latest is now 3.12.0, and Prometheus removed --enable-feature=agent at 3.0 (it became --agent). I ran it to confirm: on 3.12 the manifest logs WARN "Unknown option for --enable-feature" option=agent, then Starting Prometheus Server mode=server. So the flag is ignored with only a warning and Prometheus starts as a full server with a local TSDB (written to the emptyDir), not the lightweight agent the file's header describes. It doesn't crash (remote_write still works in server mode), so it looks healthy while doing the wrong thing. Fix the flag, and pin the image to a released version per the Structure batch:

Suggested change
- - '--enable-feature=agent'
+ - '--agent'

The ConfigMap (scrape_configs + remote_write, no alerting/rules) is agent-mode-compatible. Adjacent notes: replicas: 1 is a SPOF (an HA pair needs cluster+__replica__ labels or AMP bills 2×), and the WAL on emptyDir loses buffered samples on eviction — both moot if you adopt a managed path.
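
A fuller args list for agent mode might look like the sketch below. One caveat worth verifying against the pinned version: Prometheus is understood to refuse to start in agent mode when server-only storage flags such as --storage.tsdb.path or --storage.tsdb.retention.time are present, so the sketch swaps them for --storage.agent.path. The image tag is a placeholder.

```yaml
containers:
- name: prometheus
  image: prom/prometheus:<PINNED_3X_TAG>    # placeholder: a released 3.x tag or digest
  args:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--agent'                               # replaces --enable-feature=agent
  - '--storage.agent.path=/prometheus'      # agent WAL path, used instead of --storage.tsdb.path
```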

capabilities:
  add: ["SYS_NICE"]
env:
- name: NIXL_LOG_LEVEL

Benign "use LIBFABRIC" warning — worth one README line (nit)

Not a correctness issue. No SGLANG_DISAGGREGATION_NIXL_BACKEND is set, so NIXL uses UCX — which is correct here (single-node intra-node PD; the KV hop never crosses EFA). When I ran this on a B300 node it served correctly, but the engine logs 16 Amazon EFA(s) were detected, but the UCX backend was configured ... recommended to use the LIBFABRIC backend instead. A one-line README note ("intra-node PD uses UCX; the EFA-detected warning is expected/benign") saves the next person from chasing it. No manifest change needed.

containers:
- name: sglang
  image: <YOUR_ECR_IMAGE> # the URI build-image.sh printed, e.g. <account>.dkr.ecr.<region>.amazonaws.com/sgl-dev-cu13:<tag>
  imagePullPolicy: Always

imagePullPolicy: Always forces a pull on every start

Both the prefill and decode StatefulSets use Always (also line 211), which re-pulls a multi-GB CUDA image on every pod start and breaks air-gapped clusters. The image is a pinned ECR tag, so IfNotPresent is right.

Suggested change
- imagePullPolicy: Always
+ imagePullPolicy: IfNotPresent

@KeitaW KeitaW left a comment

Review Batch 3/5 — Infrastructure, NCCL & Container Security

Container-security and NCCL batch. One inline comment (DCGM root rationale); the rest below.

privileged: true on all SGLang serving containers

Files: dsv4pro-deploy.yaml (54), qwen-pd-deploy.yaml (58), both Kimi StatefulSets

Every engine container runs fully privileged — all capabilities, host device access, isolation off. The workload needs some elevation (EFA/RDMA pinned memory, GPUDirect via gdrdrv, SYS_NICE), but privileged: true is a much bigger hammer. I'd scope to the specific capabilities (IPC_LOCK, SYS_NICE) plus the device mounts already declared, and drop privileged. If full privilege is genuinely required for the NIXL/EFA path, a one-line comment stating why makes it a deliberate, reviewable choice.
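
A scoped securityContext along those lines is sketched here. Whether these two capabilities are sufficient for the NIXL/EFA path was not tested in this review, so treat it as a starting point to validate, not a confirmed replacement.

```yaml
# Container-level fragment; the existing device mounts stay as declared.
securityContext:
  privileged: false
  capabilities:
    add: ["IPC_LOCK", "SYS_NICE"]   # pinned memory for EFA/RDMA, scheduling priority
```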

NCCL_SOCKET_IFNAME is unset (minor, multi-node Kimi only)

File: kimi2.6-h200-1p1d/kimi-pd-deploy.yaml

Moot for single-node (TP on NVLink) and the cross-node KV hop uses NIXL not NCCL — low priority. But for the 2-node Kimi setup, if NCCL ever initializes a cross-node control channel, the default interface pick can land on the wrong NIC. Cheap guard: set NCCL_SOCKET_IFNAME=^lo (exclusion form, never positive selection like eth0). See the EFA cheatsheet.
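
In the Kimi manifest this would be a single env entry on each engine container, for example:

```yaml
env:
- name: NCCL_SOCKET_IFNAME
  value: "^lo"   # exclusion form: any interface except loopback
```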

hostPort: 9400
securityContext:
  runAsNonRoot: false
  runAsUser: 0

DCGM runs as root — add a one-line rationale

runAsNonRoot: false + runAsUser: 0 is the standard DCGM requirement (host GPU/driver access), so this is likely fine — but the checklist asks for a rationale comment so a future reader doesn't "fix" it. A short # DCGM requires root for host GPU access above the securityContext does it.
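
With the comment in place, the block would read roughly as follows (the comment wording can of course be adjusted):

```yaml
# DCGM requires root for host GPU access
securityContext:
  runAsNonRoot: false
  runAsUser: 0
```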

@KeitaW KeitaW left a comment

Review Batch 4/5 — Documentation Consistency

Documentation batch (no inline comments — both items are PR-level).

PR description lists a CloudFormation change that isn't in the diff

The PR body's change #5 says "repoint 7 CloudFormation 1-click deploy templates at the renamed awsome-distributed-ai S3 bucket," but the diff touches only files under examples/inference/sglang/ (15 files) — no CloudFormation templates. Either that work was dropped from this branch or belongs to a different PR; I'd reconcile the description with the diff so reviewers and the changelog aren't misled.

Base branch is worktree-repo-reorg, not main

This PR targets worktree-repo-reorg. If that's intentional stacking onto an in-flight reorg, all good — just worth confirming the merge target is what you want.

@KeitaW KeitaW left a comment

Review Batch 5/5 — Evaluation, Positives & Sources

Final batch — evaluation framing, what's great, and the source list.

Placeholder benchmark tables — framing is correct, just don't quote them yet

The per-model READMEs ship TBD latency/throughput tables but explicitly label them "Not yet measured" and the Kimi example "[draft-1]." That honest framing is exactly right — this is "validates the deployment runs," not "reproduces a published result," so no methodology obligation is triggered. Just make sure the TBD numbers aren't quoted downstream until filled. My e2e run confirms the Qwen example serves correctly end-to-end.

Things That Look Great

  • Directory placement is correct: examples/inference/sglang/<model>/ follows the RFC #1056 "library is the demo subject" rule, with shared helpers one level up.
  • Excellent shared-helper reuse: download-model.sh, the DCGM daemonset, and the AMP monitoring are factored out and reused across all three examples; extracting the inlined Prometheus block into shared manifests was the right call.
  • Exemplary engine-image pinning with rationale: lmsysorg/sglang:v0.5.12.post1-cu130 is pinned and the README explains why (NIXL 1.1.0 vs the nightly's 1.2.0).
  • download-model.sh / setup-amp-monitoring.sh have the shebang, MIT-0 header, set -euo pipefail, and a properly whitelisted envsubst.
  • Honest results framing — "Not yet measured" / "[draft-1]" instead of fabricated numbers.
  • Thorough READMEs — prerequisites, quick start, monitoring, and a genuinely useful "known issues" section (the DCGM nvidia.com/gpu.present node-label caveat is a real gotcha).
  • It works — I deployed qwen3.5-27b-b300-intra-pd (adapted for plain EKS) on a real B300 node; it came up 2/2 Ready in ~6 min and served a correct completion through the PD router. The intra-node prefill/decode + NIXL topology is sound.

Sources

Kubernetes: node affinity, probes, imagePullPolicy, SecurityContext, DaemonSet, Job, initContainers.
AWS: AMP ingest, managed scraper, HyperPod observability add-on, ADOT on EKS, FSx for Lustre CSI.
Prometheus: agent mode feature flag, 3.0 migration. Verified live: prom/prometheus:latest == 3.12.0 → mode=server on --enable-feature=agent (2026-06-11).
Repo precedent: eks-managed-observability, dsv3-uccl-nixl, download-Job precedents under examples/training/{verl,nemo-rl,optimum-neuron}.

Several findings here are evidence-backed by a live B300 deploy and a prom/prometheus:latest flag test, plus dedicated research on weight-staging and observability patterns.

@KeitaW KeitaW left a comment

Few comments
