Skip to content

[Bug] Preventing side effects in distributed tests by destroying process group … #10314

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

drivanov
Copy link
Contributor

@drivanov drivanov commented Jun 9, 2025

in test_molecule_gpt_datase.

This PR addresses a test isolation issue by explicitly destroying the distributed process group at the end of the test_molecule_gpt_dataset. This is necessary because:

The test initializes a distributed environment (e.g., via torch.distributed backend implicitly or via PyG internals).

If the process group is not properly destroyed, subsequent tests may fail with errors like:

        group = group or _get_default_group()
        opts = AllgatherOptions()
        opts.asyncOp = async_op
>       work = group.allgather([tensor_list], [tensor], opts)
E       RuntimeError: No backend type associated with device type cpu

This issue was observed specifically on GB200 machines when running multiple CI tests in sequence.

For instance, this bug manifests when running:

pytest /opt/pyg/pytorch_geometric/test/datasets/test_molecule_gpt_dataset.py::test_molecule_gpt_dataset \
        /opt/pyg/pytorch_geometric/test/metrics/test_link_pred_metric.py

but does not appear when running the second test (test_link_pred_metric.py) independently or before the first.

This change ensures proper resource cleanup in CI and avoids test interference across modules that may share distributed state.

@drivanov drivanov requested a review from wsad1 as a code owner June 9, 2025 23:06
Copy link

codecov bot commented Jun 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.58%. Comparing base (c211214) to head (d955001).
Report is 55 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10314      +/-   ##
==========================================
- Coverage   86.11%   85.58%   -0.53%     
==========================================
  Files         496      498       +2     
  Lines       33655    34146     +491     
==========================================
+ Hits        28981    29223     +242     
- Misses       4674     4923     +249     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@drivanov drivanov changed the title [Bug] Preventing side effects in distributed tests by desroying process group … [Bug] Preventing side effects in distributed tests by destroying process group … Jun 9, 2025
Comment on lines 14 to 16
# Optional cleanup if distributed is initialized
if dist.is_initialized():
dist.destroy_process_group()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about moving to test/conftest.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
@xnuohz: Thanks, it was a good suggestion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants