Example: Debug training runs #11
…en and project name
…calculate gradient norms for batch (step) rather than epoch
…debugging when building LLM's
…sses using the data loader
Hey @LeoRoccoBreedt - I've reviewed your changes - here's some feedback:
- The README still has TODO placeholders for the `[debug]` and `[debug-example]` links. Please update these to point at the final docs and Neptune project URLs before merging.
- The `num_layers` parameter in `params` isn't actually used when building the `SimpleModel`. Either wire it into your layer loop or remove it to avoid confusion.
- The `run_examples.sh` script installs `requirements.txt` from the current directory. Consider using a path relative to the script (e.g. `-r $(dirname "$0")/requirements.txt`) to ensure it runs correctly in CI.
Here's what I looked at during the review
- 🟡 General issues: 7 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
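The `num_layers` feedback above could be addressed along the following lines. This is only a sketch: the PR's `SimpleModel` is not shown here, so the helper `build_layer_shapes` and the keys `input_size`, `hidden_size`, and `output_size` are hypothetical names chosen for illustration. It derives one `(in_features, out_features)` pair per layer from `num_layers`, so the parameter actually controls the depth of the model.

```python
from typing import Dict, List, Tuple


def build_layer_shapes(params: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return one (in_features, out_features) pair per layer, driven by num_layers."""
    num_layers = params["num_layers"]
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    # Input width, then (num_layers - 1) hidden widths, then the output width.
    sizes = (
        [params["input_size"]]
        + [params["hidden_size"]] * (num_layers - 1)
        + [params["output_size"]]
    )
    # Pair consecutive widths so each pair describes one layer.
    return list(zip(sizes[:-1], sizes[1:]))


# Hypothetical parameter values, not taken from the PR.
params = {"input_size": 8, "hidden_size": 16, "output_size": 2, "num_layers": 3}
shapes = build_layer_shapes(params)
```

In a PyTorch model, each pair would typically be passed to `torch.nn.Linear` inside the layer loop, so changing `num_layers` in `params` changes the number of layers that get built.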
README.md
[resume-run]: https://docs.neptune.ai/resume_run
[runs-table]: https://docs.neptune.ai/runs_table
[runs-table-example]: https://scale.neptune.ai/o/examples/org/LLM-Pretraining/runs/table?viewId=9e746462-f045-4ff2-9ac4-e41fa349b04d&detailsTab=dashboard&dash=table&type=run&compare=auto-5
[debug]: TODO - Add link to docs
issue: Placeholder for documentation link needs to be resolved.
Replace the 'TODO' with the correct documentation link for debugging.
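A check like the one below could catch leftover placeholders before merging. It is a minimal sketch, assuming every link reference definition in the README should point at an absolute `http(s)` URL; the function name `find_placeholder_links` and the sample text are illustrative, not part of the PR.

```python
import re
from typing import List

# Matches Markdown link reference definitions such as "[debug]: TODO - Add link to docs".
LINK_DEF = re.compile(r"^\[(?P<label>[^\]]+)\]:\s*(?P<target>.*)$")


def find_placeholder_links(markdown: str) -> List[str]:
    """Return the labels of link definitions whose target is not an http(s) URL."""
    labels = []
    for line in markdown.splitlines():
        match = LINK_DEF.match(line.strip())
        if match and not match.group("target").startswith(("http://", "https://")):
            labels.append(match.group("label"))
    return labels


# Two definitions copied from the README excerpt in this review thread.
sample = "\n".join(
    [
        "[runs-table]: https://docs.neptune.ai/runs_table",
        "[debug]: TODO - Add link to docs",
    ]
)
unresolved = find_placeholder_links(sample)
```

Here `unresolved` contains only the `debug` label. A README that legitimately uses relative link targets would need a looser rule than this one.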
Co-authored-by: Edyta <[email protected]> Signed-off-by: Leo Breedt <[email protected]>
…ging_model_training
Co-authored-by: Edyta <[email protected]> Signed-off-by: Leo Breedt <[email protected]>
@LeoRoccoBreedt - I am unsubscribing from this. Please let me know once it is ready for review.
Co-authored-by: Edyta <[email protected]> Signed-off-by: Leo Breedt <[email protected]>
Remove reference to Scale Co-authored-by: Edyta <[email protected]> Signed-off-by: Leo Breedt <[email protected]>
Description
Include a summary of the changes and the related issue.
Related to: <ClickUp/JIRA task name>
Any expected test failures?
Yes, tests on Python 3.13 will fail because PyTorch is not yet compatible with that version.
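One way to make this expected failure explicit is a small version guard, sketched below. The constant `FIRST_UNSUPPORTED` and the helper `pytorch_example_supported` are hypothetical names, and treating 3.13 and everything after it as unsupported is an assumption based only on the statement above.

```python
import sys
from typing import Tuple

# Assumption: 3.13 is the first Python version without a compatible PyTorch build.
FIRST_UNSUPPORTED: Tuple[int, int] = (3, 13)


def pytorch_example_supported(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True when the interpreter version is older than the first unsupported one."""
    return tuple(version) < FIRST_UNSUPPORTED


if not pytorch_example_supported():
    print("Skipping the debug example: PyTorch is not expected to work on this Python version.")
```

In a pytest suite, the same helper could feed `pytest.mark.skipif`, so the run reports a skip with a reason instead of a failure.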
Add a `[X]` to relevant checklist items
❔ This change
✔️ Pre-merge checklist
🧪 Test Configuration
Summary by Sourcery
Introduce a new “Debug training runs” example for Neptune that demonstrates tracking layer-wise gradient norms to diagnose model training issues, and integrate it into documentation and CI testing.
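The idea of tracking layer-wise gradient norms at each batch step can be illustrated with a dependency-free sketch. It is not the code from this PR: the function `layer_gradient_norms`, the `debug/grad_norm` prefix, and the toy gradient values are all assumptions made for illustration. In a PyTorch training loop, the per-layer values would typically come from `param.grad` for each entry of `model.named_parameters()` after `loss.backward()`.

```python
import math
from typing import Dict, Iterable


def layer_gradient_norms(
    grads: Dict[str, Iterable[float]], prefix: str = "debug/grad_norm"
) -> Dict[str, float]:
    """Compute the L2 norm of each layer's flattened gradient."""
    return {
        f"{prefix}/{name}": math.sqrt(sum(g * g for g in values))
        for name, values in grads.items()
    }


# Toy gradients for two layers at two consecutive batch steps (hypothetical values).
steps = [
    {"fc1.weight": [3.0, 4.0], "fc2.weight": [0.0, 0.0]},
    {"fc1.weight": [0.6, 0.8], "fc2.weight": [1.0, 0.0]},
]

# One metrics dictionary per step, ready to hand to a metric logger.
history = [(step, layer_gradient_norms(g)) for step, g in enumerate(steps)]
```

Computing the norms per step rather than per epoch keeps short-lived spikes or collapses visible. In the toy data, a norm of exactly zero for `fc2.weight` at step 0 is the kind of signal that would be averaged away in an epoch-level summary.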
New Features:
Documentation:
Tests:
Summary by Sourcery
Add a complete “Debug training runs” example for Neptune, encompassing notebook and script versions, update documentation to reference the new tutorial, and extend CI workflows to test the new example.
New Features:
CI:
Documentation: