dsv4-fp4-b300-sglang: update image to nightly #1506
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring that they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.
f25519e to cf36b0c
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26109529858
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26109534591
d8ca8a8 to 09875d7
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26221509538
09875d7 to cfa7211
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26658560606
…e0b, refactor script, switch to megamoe
cfa7211 to 0ba92fd
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26658745339
The head node's /scratch is an NFS mount that can return a stale file handle. enroot's runtime/cache/data/temp dirs are pinned under /scratch by /etc/enroot/enroot.conf{,.d}, so on a stale mount `enroot import` cannot create its working dirs and produces no .sqsh. That surfaces downstream as a cryptic pyxis "No such file or directory: ...sqsh" on the compute node and fails the single-node canary (e.g. actions run 26658745339).
When /scratch is unusable, probe it and redirect enroot's paths to the healthy /data share for the import only. The exports stay inside the import subshell, so the salloc/srun below (and the compute node's own /scratch) are unaffected; on a healthy head node the probe passes and behavior is identical. Also fail fast if the import still can't produce a squash instead of proceeding to a doomed srun.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
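The probe-and-redirect idea in the commit message above could look roughly like the following. This is a minimal sketch under stated assumptions, not the PR's actual script: the function names `dir_is_usable` and `pick_enroot_base` are hypothetical, and probing by creating a temporary file is only one assumed way to detect an unusable mount.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names and the probe method are illustrative assumptions.

# Succeeds only if a file can be created and then removed inside the directory.
dir_is_usable() {
  local dir="$1" probe
  probe="$(mktemp "$dir/.probe.XXXXXX" 2>/dev/null)" || return 1
  rm -f "$probe"
}

# Print the preferred base directory if it is usable, otherwise the fallback.
pick_enroot_base() {
  local preferred="$1" fallback="$2"
  if dir_is_usable "$preferred"; then
    printf '%s\n' "$preferred"
  else
    printf '%s\n' "$fallback"
  fi
}
```

A caller might run `base="$(pick_enroot_base /scratch /data)"` inside the import subshell and export `ENROOT_TEMP_PATH`, `ENROOT_CACHE_PATH`, `ENROOT_DATA_PATH`, and `ENROOT_RUNTIME_PATH` under `$base` before invoking `enroot import`, so that the overrides do not leak into the later salloc/srun.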
…ort"

This reverts commit f584272. /scratch has been remounted on the head node and the stale NFS handle is cleared, so the enroot temp/cache/data redirect workaround is no longer needed. Restores the original import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26691343935
…PERM

The single-node import extracts under ENROOT_TEMP_PATH, which /etc/enroot/enroot.conf pins to NFS /scratch. enroot-aufs2ovlfs unpacks the image's root-owned AUFS whiteout markers into a sticky /tmp and then can't unlink them over NFS (root-squash strips the CAP_FOWNER it needs), failing with 'failed to remove aufs whiteout: Operation not permitted' and producing no .sqsh -- which then surfaces as a pyxis 'No such file or directory' on the compute node.

Run the import on local disk, where the extracted files are owned by the runner user and removable. Scoped to the import subshell and cleaned up on exit, so salloc/srun and the compute node's own /scratch are unaffected.

Proper fix is to point ENROOT_TEMP_PATH at local disk in enroot.conf cluster-wide; this is the no-root workaround.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
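The workaround described in this commit message (run the import with enroot's scratch paths on local disk, scoped to a subshell, cleaned up on exit, and fail fast if no squash file appears) can be sketched as follows. This is an illustration under assumptions rather than the contents of `launch_b300-nv.sh`: the helper name `import_on_local_disk` and the `/tmp/enroot-import.*` directory layout are hypothetical, and the import command is passed in as arguments so that the sketch does not depend on a particular image.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run an import command with enroot's temp and cache
# directories on local disk, then verify that a squash file was produced.
import_on_local_disk() {
  local squash_file="$1"; shift
  (
    # Assumed layout: a throwaway directory on local /tmp.
    local_dir="$(mktemp -d /tmp/enroot-import.XXXXXX)" || exit 1
    trap 'rm -rf "$local_dir"' EXIT           # cleaned up when the subshell exits
    export ENROOT_TEMP_PATH="$local_dir/tmp"
    export ENROOT_CACHE_PATH="$local_dir/cache"
    mkdir -p "$ENROOT_TEMP_PATH" "$ENROOT_CACHE_PATH"
    "$@"                                      # the import command itself
    rc=$?
    exit "$rc"
  ) || return 1
  # Fail fast instead of proceeding to an srun that cannot find the image.
  if [ ! -s "$squash_file" ]; then
    echo "import produced no squash file: $squash_file" >&2
    return 1
  fi
}
```

A call might look like `import_on_local_disk "$SQUASH_FILE" enroot import -o "$SQUASH_FILE" "docker://$IMAGE"`, where `SQUASH_FILE` and `IMAGE` are placeholder variables. Because the exports happen inside the subshell, the environment seen by any later salloc/srun is unchanged.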
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26701471985
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26701489318
/reuse-sweep-run
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26784885157
…, switch to megamoe

Migrate the 7 STP disagg recipes to the megamoe MoE backend (deepep -> megamoe, drop deepep-config) and strip obsolete SGLANG_OPT_*/SGLANG_DEEPEP env vars now defaulted upstream, mirroring the b300 migration (#1506).

Clean the 5 dynamo recipes: fix container to dsv4-grace-blackwell, remove personal extra_mount and hardcoded nodelist pins so they run on CI.
Summary
- `deepseek-v4-b300@sha256:2fec8d...` to `nightly-dev-cu13-20260518-c67b2870`
- `--moe-a2a-backend deepep` to `megamoe`
- `--deepep-config`: dropped (not needed by megamoe)
- `ep: 4` → `ep: 1` (flashinfer_mxfp4 doesn't set ep=tp)

Note
Medium Risk
Changes benchmark launch parameters and cluster image import behavior; affects reproducibility of B300 sglang results but not production serving paths.
Overview
Updates the DeepSeek-V4-Pro FP4 B300 sglang benchmark to a newer nightly container and aligns launch recipes with megamoe for high concurrency.
The nvidia-master config switches from the pinned `deepseek-v4-b300` image to `lmsysorg/sglang:nightly-dev-cu13-20260529-a8cfae0b`, documents CONC-based recipe selection, and corrects the CONC=512 search point from `ep: 4` to `ep: 1` so YAML matches flashinfer_mxfp4 (no implicit `ep=tp`).

`dsv4_fp4_b300_sglang.sh` is refactored to pick profiles by CONC (1/32 TP-only; 512 DP-attn + flashinfer; 2048–8192 DP-attn + `--moe-a2a-backend megamoe` instead of deepep). Stale or redundant `SGLANG_OPT_*` env vars, `--deepep-config`, and related DeepEP settings are dropped; the benchmark step upgrades transformers before serving.

`launch_b300-nv.sh` runs enroot import with temp/cache paths on local `/tmp` so squash import does not fail on NFS whiteout removal during image pull.

`perf-changelog.yaml` records the above for `dsv4-fp4-b300-sglang`.

Reviewed by Cursor Bugbot for commit 9f6043d. Bugbot is set up for automated code reviews on this repo.
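The CONC-based profile selection summarized above can be illustrated with a small dispatch function. This is a hedged sketch, not the code in `dsv4_fp4_b300_sglang.sh`: the function name `select_profile` and the profile labels are hypothetical, and treating 2048–8192 as an inclusive range is an assumption, since the summary does not list the exact search points.

```shell
#!/usr/bin/env bash
# Hypothetical dispatch: map a concurrency value (CONC) to a profile label.
# Labels are illustrative; only the CONC groupings come from the PR summary.
select_profile() {
  local conc="$1"
  case "$conc" in
    ''|*[!0-9]*)
      echo "CONC must be a positive integer, got: '$conc'" >&2
      return 1
      ;;
  esac
  if [ "$conc" -eq 1 ] || [ "$conc" -eq 32 ]; then
    echo "tp-only"                 # TP-only
  elif [ "$conc" -eq 512 ]; then
    echo "dp-attn-flashinfer"      # DP-attn + flashinfer
  elif [ "$conc" -ge 2048 ] && [ "$conc" -le 8192 ]; then
    echo "dp-attn-megamoe"         # DP-attn + --moe-a2a-backend megamoe
  else
    echo "no profile defined for CONC=$conc" >&2
    return 1
  fi
}
```

Rejecting unlisted CONC values, rather than falling back to a default profile, keeps a mistyped sweep point from silently benchmarking the wrong configuration.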