ci: add real training jobs to nightly workflow #1313

Draft
garrett4wade wants to merge 6 commits into main from fw/nightly-ci

@garrett4wade
Collaborator

Summary

  • Replace dummy placeholder with actual gsm8k GRPO training inside Docker containers (dev-sglang / dev-vllm)
  • Run via persistent containers + docker exec (matching test-areal.yml pattern)
  • Round-robin training backend (fsdp → megatron → archon) by day-of-year
  • Add ref input to workflow_dispatch for running arbitrary branches
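The round-robin choice can be sketched as a small shell function. This is only an illustration: the function name `pick_backend` and the mapping (day-of-year modulo 3, starting at fsdp) are assumptions, and the offset used by the actual workflow may differ.

```shell
# Sketch: choose a training backend from the day of the year.
# Assumption: index = day-of-year mod 3, in the order fsdp, megatron, archon.
pick_backend() {
  # $1: day of year (1-366); defaults to today's value in UTC.
  local doy="${1:-$(date -u +%j)}"
  # date +%j is zero-padded ("009"); strip up to two leading zeros so the
  # shell does not interpret the value as octal.
  doy="${doy#0}"
  doy="${doy#0}"
  case $(( doy % 3 )) in
    0) echo fsdp ;;
    1) echo megatron ;;
    2) echo archon ;;
  esac
}
```

In a workflow step, the result could be exposed to later steps with something like `echo "backend=$(pick_backend)" >> "$GITHUB_OUTPUT"`.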

Replace dummy test with actual gsm8k GRPO training inside Docker
containers (dev-sglang/dev-vllm). Add workflow_dispatch input to
run arbitrary branches.

Key changes:
- Pull latest runtime images, run via docker exec in persistent containers
- Install AReaL from source (uv pip install -e . --no-deps)
- Round-robin training backend (fsdp/megatron/archon) by day
- Add 'ref' input for branch/tag/SHA override on manual dispatch
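A rough sketch of the persistent-container pattern, not the workflow's actual script: the helper names, the `/workspace` mount point, and the container and image names passed in are all hypothetical. GPU flags and other runtime options are omitted. Setting `DOCKER=echo` gives a dry run that only prints the command.

```shell
# DOCKER can be overridden (e.g. DOCKER=echo) to print commands instead of running them.
DOCKER="${DOCKER:-docker}"

# Start a long-lived container with the checkout bind-mounted.
# Assumption: /workspace is a hypothetical mount point, not taken from the PR.
start_container() {
  local name="$1" image="$2"
  $DOCKER run -d --name "$name" -v "$PWD:/workspace" -w /workspace "$image" sleep infinity
}

# Run one command inside the already-running container.
exec_in() {
  local name="$1"
  shift
  $DOCKER exec "$name" "$@"
}
```

With a container started this way, the source install from the commit message would look like `exec_in <container> uv pip install -e . --no-deps`.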
@gemini-code-assist
Contributor

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

Docker containers run as root, creating .pyc files owned by
root:root on the bind-mounted workspace. The next matrix job
fails when actions/checkout tries to git clean these files.

Key changes:
- Add sudo rm cleanup before checkout
- Clean root-owned files in teardown step
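As an illustration of the cleanup, the following sketch removes Python bytecode caches from a workspace. The function name `clean_bytecode` is hypothetical. Because the files are owned by root, it would need to run with sufficient privilege, for example inside the container where the process is already root, or via sudo where that is available.

```shell
# Sketch: remove Python bytecode caches left behind in a bind-mounted workspace.
clean_bytecode() {
  local root="${1:-.}"
  # Remove whole __pycache__ directories; -prune keeps find from descending
  # into a directory that is about to be deleted.
  find "$root" -type d -name __pycache__ -prune -exec rm -rf {} +
  # Remove any stray .pyc files that live outside __pycache__.
  find "$root" -type f -name '*.pyc' -exec rm -f {} +
}
```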
Key changes:
- Run sglang/vllm sequentially via shared run_variant function
- Pre-download model to persistent /opt/hf_cache volume
- Enable wandb online logging with nightly environment secret
- Trial name includes backend+variant shortcodes (e.g., m.s-2026-05-09)
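The trial naming can be sketched as follows. It assumes each shortcode is simply the first letter of the backend and variant names, which is inferred from the `m.s-2026-05-09` example rather than confirmed by the PR. The function name `trial_name` is hypothetical.

```shell
# Sketch: build a trial name such as "m.s-2026-05-09".
# Assumption: shortcode = first letter of the backend / variant name.
trial_name() {
  local backend="$1" variant="$2"
  # $3: date in YYYY-MM-DD form; defaults to today's date in UTC.
  local day="${3:-$(date -u +%Y-%m-%d)}"
  local b v
  b=$(printf '%s' "$backend" | cut -c1)
  v=$(printf '%s' "$variant" | cut -c1)
  printf '%s.%s-%s\n' "$b" "$v" "$day"
}
```

Running both variants in sequence would then amount to a loop such as `for variant in sglang vllm; do trial_name "$backend" "$variant"; done`, with the real training call in place of the name printout.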
Runner user lacks passwordless sudo. The cache dir is
pre-created on the instance.

Pass model ID to actor.path and let from_pretrained resolve
from the HF cache. Pre-download just populates the cache.
@garrett4wade garrett4wade marked this pull request as draft May 11, 2026 02:35