dsv4-fp4-b300-sglang: update image to nightly #1506

Merged
Oseltamivir merged 8 commits into main from yyh/update-dsv4-b300-sglang-image on Jun 1, 2026

@yhyang201 (Collaborator) commented May 18, 2026

Summary

  • Update image from deepseek-v4-b300@sha256:2fec8d... to nightly-dev-cu13-20260518-c67b2870
  • Refactor benchmark script to dispatch by CONC instead of nested DP_ATTENTION/CONC/EP_SIZE
  • Switch high-concurrency profiles (CONC 2048/4096/8192) from --moe-a2a-backend deepep to megamoe
  • Remove env vars deleted from sglang main or redundant with defaults
  • Remove --deepep-config (not needed by megamoe)
  • Fix CONC=512 yaml ep: 4 → ep: 1 (flashinfer_mxfp4 doesn't set ep=tp)

Note

Medium Risk
Changes benchmark launch parameters and cluster image import behavior; affects reproducibility of B300 sglang results but not production serving paths.

Overview
Updates the DeepSeek-V4-Pro FP4 B300 sglang benchmark to a newer nightly container and aligns launch recipes with megamoe for high concurrency.

The nvidia-master config switches from the pinned deepseek-v4-b300 image to lmsysorg/sglang:nightly-dev-cu13-20260529-a8cfae0b, documents CONC-based recipe selection, and corrects the CONC=512 search point from ep: 4 to ep: 1 so YAML matches flashinfer_mxfp4 (no implicit ep=tp).

dsv4_fp4_b300_sglang.sh is refactored to pick profiles by CONC (1/32 TP-only; 512 DP-attn + flashinfer; 2048–8192 DP-attn + --moe-a2a-backend megamoe instead of deepep). Stale or redundant SGLANG_OPT_* env vars, --deepep-config, and related DeepEP settings are dropped; the benchmark step upgrades transformers before serving.
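
The CONC-based selection described above could look roughly like the following. This is a minimal sketch, not the contents of dsv4_fp4_b300_sglang.sh: the function name `select_profile` and the labels it prints are hypothetical, and only the CONC values and backend names are taken from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map a CONC value to a launch profile label.
# The function name and the printed labels are illustrative assumptions.
select_profile() {
  local conc="$1"
  case "$conc" in
    1|32)           echo "tp-only" ;;                  # low concurrency: TP only
    512)            echo "dp-attn flashinfer_mxfp4" ;; # DP attention + flashinfer
    2048|4096|8192) echo "dp-attn megamoe" ;;          # DP attention + megamoe
    *)
      echo "unsupported CONC: $conc" >&2
      return 1
      ;;
  esac
}
```

A caller would branch on the printed label to build the actual server arguments. A single flat `case` keeps each concurrency point's recipe in one place, which is the stated motivation for replacing the nested DP_ATTENTION/CONC/EP_SIZE checks.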

launch_b300-nv.sh runs enroot import with temp/cache paths on local /tmp so squash import does not fail on NFS whiteout removal during image pull.

perf-changelog.yaml records the above for dsv4-fp4-b300-sglang.

Reviewed by Cursor Bugbot for commit 9f6043d.

@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from their respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@yhyang201 changed the title from "dsv4-fp4-b300-sglang: update image to nightly, switch to megamoe" to "dsv4-fp4-b300-sglang: update image to nightly" on May 18, 2026
@yhyang201 force-pushed the yyh/update-dsv4-b300-sglang-image branch from f25519e to cf36b0c on May 19, 2026 15:32
@yhyang201 force-pushed the yyh/update-dsv4-b300-sglang-image branch from d8ca8a8 to 09875d7 on May 21, 2026 10:52
yhyang201 added a commit that referenced this pull request May 29, 2026
@yhyang201 force-pushed the yyh/update-dsv4-b300-sglang-image branch from 09875d7 to cfa7211 on May 29, 2026 19:45
@yhyang201 force-pushed the yyh/update-dsv4-b300-sglang-image branch from cfa7211 to 0ba92fd on May 29, 2026 19:49
The head node's /scratch is an NFS mount that can return a stale file handle. enroot's runtime/cache/data/temp dirs are pinned under /scratch by /etc/enroot/enroot.conf{,.d}, so on a stale mount `enroot import` cannot create its working dirs and produces no .sqsh. That surfaces downstream as a cryptic pyxis "No such file or directory: ...sqsh" on the compute node and fails the single-node canary (e.g. actions run 26658745339).

Probe /scratch and, when it is unusable, redirect enroot's paths to the healthy /data share for the import only. The exports stay inside the import subshell, so the salloc/srun below (and the compute node's own /scratch) are unaffected; on a healthy head node the probe passes and behavior is identical. Also fail fast if the import still can't produce a squash file, instead of proceeding to a doomed srun.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
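
The probe-and-redirect idea in this commit message can be illustrated with a small sketch. It is not the reverted commit's code: the helper names `usable_dir`, `pick_base`, and `require_squash` are hypothetical, and the probe (creating and removing a temp file) is only one assumed way to detect an unusable mount.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names and the probing method are assumptions.

# usable_dir DIR: succeed only if a file can be created and removed in DIR.
usable_dir() {
  local dir="$1" probe
  probe="$(mktemp "$dir/.probe.XXXXXX" 2>/dev/null)" || return 1
  rm -f "$probe"
}

# pick_base PRIMARY FALLBACK: print the first directory that passes the probe.
pick_base() {
  local dir
  for dir in "$1" "$2"; do
    if usable_dir "$dir"; then
      echo "$dir"
      return 0
    fi
  done
  echo "no usable base directory among: $1 $2" >&2
  return 1
}

# require_squash FILE: fail fast when the import left no non-empty squash file.
require_squash() {
  [ -s "$1" ] || { echo "missing squash file: $1" >&2; return 1; }
}
```

Under these assumptions, a launch script would call `pick_base /scratch /data`, export enroot's path variables beneath the result inside the import subshell only, and then run `require_squash` on the expected .sqsh before any salloc/srun.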
Oseltamivir and others added 2 commits May 30, 2026 10:21
…ort"

This reverts commit f584272. /scratch has been remounted on the head node and the stale NFS handle is cleared, so the enroot temp/cache/data redirect workaround is no longer needed. Restores the original import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@SemiAnalysisAI deleted a comment from github-actions bot on May 30, 2026

…PERM

The single-node import extracts under ENROOT_TEMP_PATH, which /etc/enroot/enroot.conf pins to NFS /scratch. enroot-aufs2ovlfs unpacks the image's root-owned AUFS whiteout markers into a sticky /tmp and then can't unlink them over NFS (root-squash strips the CAP_FOWNER it needs), failing with 'failed to remove aufs whiteout: Operation not permitted' and producing no .sqsh -- which then surfaces as a pyxis 'No such file or directory' on the compute node.

Run the import on local disk, where the extracted files are owned by the runner user and removable. Scoped to the import subshell and cleaned up on exit, so salloc/srun and the compute node's own /scratch are unaffected. Proper fix is to point ENROOT_TEMP_PATH at local disk in enroot.conf cluster-wide; this is the no-root workaround.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
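
A minimal sketch of the workaround this commit describes: run the import with enroot's temp and cache directories on local disk, scoped to a subshell and cleaned up on exit. The helper name `with_local_enroot_dirs` and the directory layout are hypothetical; `ENROOT_TEMP_PATH` and `ENROOT_CACHE_PATH` are enroot's own settings, but the values launch_b300-nv.sh actually uses are not shown here.

```shell
#!/usr/bin/env bash
# with_local_enroot_dirs CMD...: run CMD with enroot's temp and cache
# directories on local disk. Hypothetical helper, not the PR's code.
with_local_enroot_dirs() {
  (
    # Private working area on local disk (TMPDIR, defaulting to /tmp).
    work="$(mktemp -d "${TMPDIR:-/tmp}/enroot-import.XXXXXX")" || exit 1
    # Remove it when the subshell exits, whether CMD succeeded or not.
    trap 'rm -rf "$work"' EXIT
    # The exports live only inside this subshell.
    export ENROOT_TEMP_PATH="$work/tmp" ENROOT_CACHE_PATH="$work/cache"
    mkdir -p "$ENROOT_TEMP_PATH" "$ENROOT_CACHE_PATH"
    "$@"
  )
}
```

A call such as `with_local_enroot_dirs enroot import -o "$SQUASH_FILE" "docker://$IMAGE"` (with `SQUASH_FILE` and `IMAGE` as placeholder variables) would then extract on local disk while leaving the caller's environment, and anything launched afterwards, untouched.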

@Oseltamivir (Collaborator)

/reuse-sweep-run

@Oseltamivir (Collaborator) left a comment

lgtm

@Oseltamivir merged commit ef75a2a into main on Jun 1, 2026
18 of 19 checks passed
@Oseltamivir deleted the yyh/update-dsv4-b300-sglang-image branch on June 1, 2026 22:05
yhyang201 added a commit that referenced this pull request Jun 2, 2026
…, switch to megamoe

Migrate the 7 STP disagg recipes to the megamoe MoE backend (deepep ->
megamoe, drop deepep-config) and strip obsolete SGLANG_OPT_*/SGLANG_DEEPEP
env vars now defaulted upstream, mirroring the b300 migration (#1506).
Clean the 5 dynamo recipes: fix container to dsv4-grace-blackwell, remove
personal extra_mount and hardcoded nodelist pins so they run on CI.
yhyang201 added 2 more commits with the same message that referenced this pull request Jun 4, 2026