
MerlinRaptor (Collaborator)

Aside from the mesh_shape assignment at line 96 of src/lm_saes/runners/train.py (train_sae), there should be no changes outside of molt, so this is expected to be a safe merge. Please let me know if you notice any unintended changes outside of molt.

Things to do:
1. a better rank-distribution config strategy for pivoting model size easily;
2. distributed training logic;
3. aligning the output of prepare_input;
4. unit tests.

Implement low-rank decomposed matrix multiplication; implement a tiny kernel fusion (a minimal sketch follows below).
There is a bug that makes distributed and non-distributed training misaligned; it will be fixed later.
…s done right
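
For reference, a minimal sketch of what low-rank decomposed matrix multiplication with a tiny fused step can look like in PyTorch; `LowRankLinear`, `d_in`, `d_out`, and `rank` are illustrative names, not the actual molt implementation.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """y = x @ (A @ B): a rank-r factorization of a (d_in, d_out) weight."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_in, rank) / d_in**0.5)
        self.B = nn.Parameter(torch.randn(rank, d_out) / rank**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Associating left-to-right, (x @ A) @ B costs O(n*d_in*r + n*r*d_out)
        # instead of O(n*d_in*d_out) for the equivalent dense matmul.
        return (x @ self.A) @ self.B

# One simple form of kernel fusion: let torch.compile fold the elementwise
# activation into the matmul epilogue rather than launching a separate kernel.
@torch.compile
def lowrank_relu(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    return torch.relu((x @ A) @ B)
```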

Calling from_local(reconstruction) directly will not all-reduce reconstructions across devices; reconstruction now also supports data parallelism and is aligned with the SAE.
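
To make the from_local pitfall concrete, here is a minimal sketch using PyTorch DTensor; `local_partial` is a random stand-in for a per-rank partial reconstruction, and the mesh setup assumes a process group already initialized via torchrun.

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Partial, Replicate

mesh = init_device_mesh("cuda", (dist.get_world_size(),), mesh_dim_names=("model",))

# Stand-in for this rank's partial sum of the reconstruction (in practice,
# feature activations times a column shard of the decoder).
local_partial = torch.randn(8, 512, device="cuda")

# Pitfall: from_local defaults to Replicate placements, performs no
# communication, and silently leaves every rank with a different tensor.
wrong = DTensor.from_local(local_partial, mesh)

# Fix: declare the local tensor as a Partial sum; redistributing to Replicate
# then triggers the all-reduce across the model-parallel group.
right = DTensor.from_local(local_partial, mesh, placements=[Partial()])
right = right.redistribute(mesh, placements=[Replicate()])
```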

There may be minor bugs in decoder_norm and initialization (one plausible pitfall is sketched below); they will be fixed later if necessary.
…t inference and dist training
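
One plausible source of such a mismatch, sketched with plain collectives; `sharded_decoder_norm` and `W_dec_local` are hypothetical names, assuming the decoder weight is sharded along d_model.

```python
import torch
import torch.distributed as dist

def sharded_decoder_norm(W_dec_local: torch.Tensor) -> torch.Tensor:
    """Per-feature decoder norms when each rank holds a [d_model_shard,
    n_features] slice of W_dec: local norms alone are wrong; the squared
    sums must be combined across the shard group before the square root."""
    local_sq = W_dec_local.pow(2).sum(dim=0)          # this rank's squared sums
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM)   # full squared norms
    return local_sq.sqrt()
```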

Distinguish model_parallel_size_training from model_parallel_size_running.
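
A sketch of how keeping the two knobs distinct might look in a config; the dataclass and the `train_mesh_shape` helper are illustrative, not the actual lm_saes config.

```python
from dataclasses import dataclass

@dataclass
class ParallelismConfig:
    # Model-parallel degree for the SAE while it is being trained.
    model_parallel_size_training: int = 1
    # Model-parallel degree for the base model when running it
    # to generate activations (inference).
    model_parallel_size_running: int = 1

    def train_mesh_shape(self, world_size: int) -> tuple[int, int]:
        # Whatever is left after model parallelism becomes data parallelism.
        assert world_size % self.model_parallel_size_training == 0
        return (world_size // self.model_parallel_size_training,
                self.model_parallel_size_training)
```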
@MerlinRaptor requested review from Frankstein73, Hzfinfdu, and dest1n1s and removed the request for Hzfinfdu and dest1n1s · August 11, 2025 12:37
