@hanwen-cluster hanwen-cluster commented Sep 19, 2025

Description of changes

The new NCCL version brings performance improvements on Blackwell.

This upgrade improves NCCL performance on two p6-b200 instances by 15%.

Therefore, this PR also updates the baseline numbers.

See the commit descriptions for details.

Tests

  • The NCCL test has passed on RHEL9 with better performance. NCCL tests on the other OSes are still running.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

See NCCL release note: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-28-3.html#rel_2-28-3

@hanwen-cluster hanwen-cluster requested review from a team as code owners September 19, 2025 18:12
@hanwen-cluster hanwen-cluster changed the title from "[integ-tests] Upgrade NCCL versions" to "[integ-tests] Upgrade NCCL versions and increase baseline numbers" Sep 19, 2025
gmarciani previously approved these changes Sep 19, 2025
The baselines are 90% of the current performance
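
As a rough illustration of the 90% rule above, the sketch below derives a baseline from a measured value and checks a later run against it. The function names and the bandwidth figures are hypothetical and are not taken from this PR or from the integration test code; only the 0.90 factor comes from the comment above.

```python
def baseline_from_measurement(measured: float, fraction: float = 0.90) -> float:
    """Return a pass threshold set to a fraction of the measured performance.

    `fraction` defaults to 0.90, matching the "90% of current performance"
    rule stated in the PR. Everything else here is an illustrative assumption.
    """
    if measured <= 0:
        raise ValueError("measured performance must be positive")
    return measured * fraction


def meets_baseline(observed: float, baseline: float) -> bool:
    """A run passes when its observed performance is at least the baseline."""
    return observed >= baseline


# Hypothetical numbers, for illustration only (not real test results).
measured_bandwidth = 400.0
baseline = baseline_from_measurement(measured_bandwidth)

run_ok = meets_baseline(395.0, baseline)    # small dip, still above 90%
run_slow = meets_baseline(300.0, baseline)  # regression below 90%
```

Leaving a 10% margin below the measured value keeps the test tolerant of normal run-to-run variation while still catching larger regressions.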

codecov bot commented Sep 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release-3.14@25ff751). Learn more about missing BASE report.

Additional details and impacted files
@@               Coverage Diff               @@
##             release-3.14    #7012   +/-   ##
===============================================
  Coverage                ?   90.18%           
===============================================
  Files                   ?      182           
  Lines                   ?    16472           
  Branches                ?        0           
===============================================
  Hits                    ?    14856           
  Misses                  ?     1616           
  Partials                ?        0           
Flag Coverage Δ
unittests 90.18% <ø> (?)

@hanwen-cluster hanwen-cluster enabled auto-merge (rebase) September 19, 2025 19:34