Pull requests: radixark/miles
Pull requests list
refactor(session): per-session in-flight gate, response passthrough, bounded CPU offload
#1468
opened Jun 23, 2026 by
guapisolo
Collaborator
Bump Megatron-LM to miles-main-20260622 (latest NVIDIA dev)
run-ci-image
run-ci-megatron
#1466
opened Jun 23, 2026 by
yueming-yuan
Collaborator
[AMD] Add AMD MI350X/MI355X (gfx950) blockwise FP8 support for run_qwen3_30b_a3b
#1465
opened Jun 22, 2026 by
JessicaJiang-123
Contributor
Draft
Enable observe training entropy without computing entropy loss
#1464
opened Jun 22, 2026 by
zyzshishui
Contributor
perf(session): drop superseded routed_experts/indexer_topk blobs from…
#1463
opened Jun 22, 2026 by
guapisolo
Collaborator
Add opt-in periodic py-spy dumper for hang debugging
#1461
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Add FT random and realistic-gsm8k e2e scenarios with periodic fault injection
#1460
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Add FT e2e test framework (conftest_ft harness)
#1456
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Add debug-exit-after-rollout to train entrypoints
#1455
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Always save rollout debug data regardless of rollout_global_dataset
#1454
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Start HTTP control server and mini FT controller in train entrypoints
#1453
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Add HTTP control server for cell suspend/resume and fault injection
#1452
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Wire FT event logging and component gating into RolloutManager
#1450
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Add CI rollout-data injection with recorded-data metadata round-trip
#1449
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Add FT test-action hooks to the train group
#1448
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Log train-group step-end and analysis events
#1447
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Inject witness ids into the Megatron forward and train step
#1446
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Bracket Megatron actor methods with the with_logs decorator
#1445
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Track rollout-engine connection staleness on the weight updater
#1444
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Kill failed cells immediately on execute failure
#1443
opened Jun 22, 2026 by
fzyzcjy
Collaborator
Wire per-cell heartbeat health monitoring into the train group
#1442
opened Jun 22, 2026 by
fzyzcjy
Collaborator