[Bug] Preventing side effects in distributed tests by destroying process group … #10314
base: master
…up in `test_molecule_gpt_datase`.
for more information, see https://pre-commit.ci
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10314      +/-   ##
==========================================
- Coverage   86.11%   85.58%   -0.53%
==========================================
  Files         496      498       +2
  Lines       33655    34146     +491
==========================================
+ Hits        28981    29223     +242
- Misses       4674     4923     +249
# Optional cleanup if distributed is initialized
if dist.is_initialized():
    dist.destroy_process_group()
How about moving this to test/conftest.py?
Done.
@xnuohz: Thanks, it was a good suggestion!
This PR addresses a test isolation issue by explicitly destroying the distributed process group at the end of `test_molecule_gpt_dataset`. This is necessary because:
- The test initializes a distributed environment (e.g., implicitly via the `torch.distributed` backend or via PyG internals).
- If the process group is not properly destroyed, subsequent tests may fail with errors like:
This issue was observed specifically on GB200 machines when running multiple CI tests in sequence.
For instance, this bug manifests when running:
but does not appear when running the second test (test_link_pred_metric.py) independently or before the first.
This change ensures proper resource cleanup in CI and avoids test interference across modules that may share distributed state.
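For reference, here is a minimal sketch of how the cleanup discussed in the review could look once it lives in test/conftest.py. It assumes pytest and `torch.distributed` are installed. The helper `destroy_process_group_if_initialized` and the fixture `cleanup_distributed` are hypothetical names chosen for illustration and are not necessarily what was merged.

```python
import pytest
import torch.distributed as dist


def destroy_process_group_if_initialized() -> bool:
    """Destroy the default process group if one exists.

    Returns True if a group was destroyed, False otherwise.
    (Hypothetical helper; the name is an assumption.)
    """
    # is_available() guards against builds without distributed support.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
        return True
    return False


@pytest.fixture(autouse=True)
def cleanup_distributed():
    """Run after every test so no process group leaks into the next one."""
    yield  # the test body runs here
    destroy_process_group_if_initialized()
```

Because the fixture is `autouse`, every test gets the teardown without requesting it. The cleanup is a no-op when no process group was initialized, so tests that never touch `torch.distributed` are unaffected.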