@LeoRoccoBreedt LeoRoccoBreedt commented Apr 16, 2025

Description

Include a summary of the changes and the related issue.

Related to: <ClickUp/JIRA task name>

Any expected test failures?
Yes, tests on Python 3.13 will fail because PyTorch is not yet compatible with that version.


Add a [X] to relevant checklist items

❔ This change

  • adds a new feature
  • fixes breaking code
  • is cosmetic (refactoring/reformatting)

✔️ Pre-merge checklist

  • Refactored code (sourcery)
  • Tested code locally
  • Precommit installed and run before pushing changes
  • Added code to GitHub tests (notebooks, scripts)
  • Updated GitHub README
  • Updated the projects overview page on Notion

🧪 Test Configuration

  • OS: Windows
  • Python version: 3.11
  • Neptune version: neptune-scale 0.13.0
  • Affected libraries with version: torch

Summary by Sourcery

Introduce a new “Debug training runs” example for Neptune that demonstrates tracking layer-wise gradient norms to diagnose model training issues, and integrate it into documentation and CI testing.

New Features:

  • Add a tutorial notebook for debugging model training runs with Neptune
  • Integrate the debug training runs example into the main README and example table

Documentation:

  • Update README with links and placeholders for the new debug training runs example

Tests:

  • Add the debug_training_runs notebook to the test-notebooks CI workflow

Summary by Sourcery

Add a complete “Debug training runs” example for Neptune, encompassing notebook and script versions, update documentation to reference the new tutorial, and extend CI workflows to test the new example.

New Features:

  • Introduce a new “Debug training runs” tutorial demonstrating layer-wise gradient norm tracking with Neptune
  • Provide both a Jupyter notebook and a Python script implementation of the debug training runs example

CI:

  • Include the debug training runs notebook and script in the GitHub Actions test-notebooks and test-scripts workflows

Documentation:

  • Update the README to include the debug training runs example with placeholder links to documentation and projects
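
For readers unfamiliar with the technique named in the summary, here is a minimal, framework-free sketch of layer-wise gradient norm tracking. It is not the code from this PR: the function name, the input format, and the `debug/grad_norm/` metric prefix are assumptions made for illustration. In PyTorch the values would typically be read from each parameter's `.grad` after the backward pass, and the resulting dictionary would be logged once per step.

```python
import math


def layer_grad_norms(grads_by_layer):
    """Return the L2 norm of each layer's gradient values.

    grads_by_layer maps a layer name to a flat list of gradient values.
    This input format is hypothetical; in PyTorch the values would come
    from param.grad for each named parameter.
    """
    return {
        # Metric name prefix is an assumption, not taken from the PR.
        f"debug/grad_norm/{name}": math.sqrt(sum(g * g for g in grads))
        for name, grads in grads_by_layer.items()
    }


# Toy gradients for a single training step (made-up numbers).
step_grads = {"fc1": [3.0, 4.0], "fc2": [0.0, 0.0, 0.0]}
norms = layer_grad_norms(step_grads)

# Layers whose norm is effectively zero are candidates for vanishing gradients.
vanishing = [name for name, value in norms.items() if value < 1e-8]
```

Computing the norms per batch (step) rather than per epoch, as the later commit in this PR does, keeps short-lived spikes or collapses visible instead of averaging them away.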

…calculate gradient norms for batch (step) rather than epoch
@LeoRoccoBreedt LeoRoccoBreedt marked this pull request as ready for review June 2, 2025 11:54
@LeoRoccoBreedt LeoRoccoBreedt requested a review from a team June 2, 2025 11:54
@sourcery-ai sourcery-ai bot left a comment

Hey @LeoRoccoBreedt - I've reviewed your changes - here's some feedback:

  • The README still has TODO placeholders for [debug] and [debug-example] links—please update these to point at the final docs and Neptune project URLs before merging.
  • The num_layers parameter in params isn’t actually used when building the SimpleModel—either wire it into your layer loop or remove it to avoid confusion.
  • The run_examples.sh script installs requirements.txt from the current directory. Consider using a path relative to the script (e.g. -r "$(dirname "$0")/requirements.txt") so that it runs correctly in CI regardless of the working directory.
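
One way to wire num_layers into model construction, as the second point suggests, is sketched below. It only derives the layer shapes; the keys input_size, hidden_size, and output_size and the helper name are hypothetical and not taken from the PR, and the sketch assumes num_layers counts hidden layers. In PyTorch each pair could be passed to nn.Linear inside the SimpleModel layer loop.

```python
def build_layer_sizes(params):
    """Return (in_features, out_features) pairs, one per linear layer.

    Assumes params["num_layers"] is the number of hidden layers, so the
    result has num_layers + 1 pairs (hidden layers plus the output layer).
    All keys except "num_layers" are hypothetical names for this sketch.
    """
    sizes = (
        [params["input_size"]]
        + [params["hidden_size"]] * params["num_layers"]
        + [params["output_size"]]
    )
    # Pair each size with the next one to get consecutive layer shapes.
    return list(zip(sizes[:-1], sizes[1:]))


params = {"input_size": 8, "hidden_size": 16, "output_size": 2, "num_layers": 3}
layer_sizes = build_layer_sizes(params)
```

With this shape, changing num_layers in params actually changes the depth of the model, so the logged parameter and the trained architecture cannot drift apart.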
Here's what I looked at during the review
  • 🟡 General issues: 7 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


README.md Outdated
[resume-run]: https://docs.neptune.ai/resume_run
[runs-table]: https://docs.neptune.ai/runs_table
[runs-table-example]: https://scale.neptune.ai/o/examples/org/LLM-Pretraining/runs/table?viewId=9e746462-f045-4ff2-9ac4-e41fa349b04d&detailsTab=dashboard&dash=table&type=run&compare=auto-5
[debug]: TODO - Add link to docs

issue: Placeholder for documentation link needs to be resolved.

Replace the 'TODO' with the correct documentation link for debugging.

Co-authored-by: Edyta <[email protected]>
Signed-off-by: Leo Breedt <[email protected]>
@SiddhantSadangi
Member

@LeoRoccoBreedt - I am unsubscribing from this. Please let me know once it is ready for review.

szaganek previously approved these changes Jun 26, 2025
Remove reference to Scale

Co-authored-by: Edyta <[email protected]>
Signed-off-by: Leo Breedt <[email protected]>