feat: Add multi-stage data training support #37
Open
xrsrke wants to merge 7 commits into dev-updated-again from
docs/data_stages.md

Multi-Stage Data Training
Multi-stage training allows switching between different data mixtures at specified training steps, similar to approaches used in Qwen3, DeepSeek-V3, and Llama 3.
Quick Start
Data stages are optional. If no [[training.data_stages]] are defined, a single stage is auto-created from [training] data fields (backward compatible). When stages ARE defined, they override [training] data fields completely.

Multi-Stage Example
Define [[training.data_stages]] sections for multi-stage training.

Configuration Fields
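To make the multi-stage layout concrete, here is a minimal two-stage sketch. Only the field names come from this document; the dataset id, folder paths, step counts, and weights are hypothetical, and the list types for dataset_folders and dataset_weights are assumptions. end_step is left out because its exact semantics are not stated here.

```toml
# Illustrative only: values are hypothetical, field names are from this doc.
[training]
steps = 10000                       # every start_step must be below this

[[training.data_stages]]
name = "general"
start_step = 0
dataset_type = "huggingface"
dataset = "my-org/general-web"      # hypothetical dataset id
seq_len = 2048

[[training.data_stages]]
name = "high_quality"
start_step = 8000
dataset_type = "nanoset"
dataset_folders = ["/data/curated/code", "/data/curated/math"]  # hypothetical paths
dataset_weights = [0.6, 0.4]        # assumed: one weight per folder
seq_len = 2048
```

Each stage repeats its own data fields rather than inheriting them, which matches the requirement that every stage defines its data-related fields explicitly.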
Each [[training.data_stages]] section must define all data-related fields explicitly:

name
start_step
end_step
dataset
dataset_path
dataset_type: "huggingface", "nanoset", "preprocessed", "packed_memmap"
dataset_folders
dataset_weights
dataset_random_seed (... training.dataset_random_seed)
seq_len

*Required based on dataset_type: dataset for huggingface, dataset_folders for nanoset.

Single-Stage Training (Backward Compatible)
For single-stage training, you can simply use the [training] data fields; no [[training.data_stages]] sections are needed. A single stage named "default" is auto-created internally. This maintains full backward compatibility with existing configs.
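A single-stage config of this kind might look like the sketch below. It assumes the [training] data fields use the same names as the per-stage fields; the dataset id and numeric values are hypothetical.

```toml
# Illustrative only: no [[training.data_stages]] sections at all.
[training]
steps = 1000
dataset_type = "huggingface"
dataset = "my-org/general-web"   # hypothetical dataset id
dataset_random_seed = 42
seq_len = 2048
```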
Alternatively, you can explicitly define a single stage:
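A sketch of the explicit form, under the same assumptions (hypothetical dataset id and values). The stage is named "default" here only to mirror the auto-created stage; any name should be acceptable as long as the required fields are present.

```toml
# Illustrative only: one explicitly declared stage covering the whole run.
[training]
steps = 1000

[[training.data_stages]]
name = "default"
start_step = 0
dataset_type = "huggingface"
dataset = "my-org/general-web"   # hypothetical dataset id
dataset_random_seed = 42
seq_len = 2048
```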
Validation
The following validations are performed at startup:
name, start_step, dataset_type, seq_len must be defined
dataset required for huggingface, dataset_folders required for nanoset
seq_len > 0, dataset_random_seed >= 0, start_step < training.steps

Common Patterns
Pattern 1: Change Data Mixture
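One way this pattern could be written is to keep the same folders and change only the weights at a chosen step. The paths, weights, and step values below are hypothetical, and the list form of dataset_folders and dataset_weights is an assumption.

```toml
# Illustrative only: same sources, different mixture after step 5000.
[training]
steps = 10000

[[training.data_stages]]
name = "web_heavy"
start_step = 0
dataset_type = "nanoset"
dataset_folders = ["/data/web", "/data/code"]   # hypothetical paths
dataset_weights = [0.9, 0.1]
seq_len = 2048

[[training.data_stages]]
name = "code_heavy"
start_step = 5000
dataset_type = "nanoset"
dataset_folders = ["/data/web", "/data/code"]
dataset_weights = [0.5, 0.5]
seq_len = 2048
```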
Pattern 2: Context Extension
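A possible shape for context extension is to raise seq_len in a later stage. The sequence lengths, step values, and folder paths here are hypothetical; whether the long-context stage needs separately prepared data depends on the dataset type and is not covered by this sketch.

```toml
# Illustrative only: short context first, longer context near the end.
[training]
steps = 10000

[[training.data_stages]]
name = "short_context"
start_step = 0
dataset_type = "nanoset"
dataset_folders = ["/data/web"]        # hypothetical path
seq_len = 4096

[[training.data_stages]]
name = "long_context"
start_step = 9000
dataset_type = "nanoset"
dataset_folders = ["/data/long_docs"]  # hypothetical path
seq_len = 32768
```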
Pattern 3: Different Random Seeds (Multi-Epoch)
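For multi-epoch runs, a sketch of this pattern repeats the same dataset in each stage and varies only dataset_random_seed, on the assumption that the seed controls the sampling order. The dataset id, seeds, and step values are hypothetical.

```toml
# Illustrative only: same data in both stages, different seed per stage.
[training]
steps = 2000

[[training.data_stages]]
name = "epoch_1"
start_step = 0
dataset_type = "huggingface"
dataset = "my-org/general-web"   # hypothetical dataset id
dataset_random_seed = 1
seq_len = 2048

[[training.data_stages]]
name = "epoch_2"
start_step = 1000
dataset_type = "huggingface"
dataset = "my-org/general-web"
dataset_random_seed = 2          # must be >= 0 per the validation rules
seq_len = 2048
```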
Pattern 4: Mid-Training Ablation
For ablation studies where you want to test different data mixtures from a checkpoint, you can add stages that start mid-training. The system will auto-create a "default" stage from [training] fields for the gap.

Ablation config (start new mixture at step 5):
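A minimal sketch of such an ablation config follows. The stage name "ablation_stage" and the step-5 start come from this document; the dataset ids, total step count, and the assumption that [training] data fields share the per-stage field names are illustrative.

```toml
# Illustrative only: [training] supplies the data for the gap before step 5.
[training]
steps = 10
dataset_type = "huggingface"
dataset = "my-org/general-web"     # hypothetical dataset id
seq_len = 2048

[[training.data_stages]]
name = "ablation_stage"
start_step = 5                     # no stage is declared for steps before 5
dataset_type = "huggingface"
dataset = "my-org/ablation-mix"    # hypothetical dataset id
seq_len = 2048
```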
The system auto-creates "default" for steps 0-5 from [training], then transitions to "ablation_stage" at step 5.

Logging
At training start, a stage plan is logged:
At each transition:
Checkpoint & Resume
Stage state is automatically saved in checkpoints:
stage_idx: Current stage index
stage_name: Current stage name
dataloader_state: Position within the dataset

On resume, the exact stage and dataloader position are restored. No manual intervention needed.
Testing
Test Configs
Test configs are located in docs/data_stages/configs/:
data_stages_test.toml
data_stages_backcompat_test.toml
data_stages_ablation_test.toml

Automated Test Suite
Run the test script to verify all functionality:
The test suite runs 5 tests:
Manual Testing
What the Tests Verify
Configs without [[training.data_stages]] still work.
If [[training.data_stages]] starts after step 0 (e.g., step 5), the system auto-creates a "default" stage from [training] fields to cover the gap (steps 0-5). This lets you train initially with [training] only, then later add stages mid-training to test different data mixtures from a checkpoint.