
Conversation

@bkryu (Collaborator) commented Dec 5, 2025

📌 Description

We currently have unit tests failing with:

==========================================
Running: pytest --continue-on-collection-errors -s --junitxml=/junit/tests/comm/test_trtllm_mnnvl_allreduce.py.xml "tests/comm/test_trtllm_mnnvl_allreduce.py"
==========================================
Abort(1090447) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(192)........:
MPID_Init(1665)..............:
MPIDI_OFI_mpi_init_hook(1586):
(unknown)(): Unknown error class
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090447
:
system msg for write_line failure : Bad file descriptor
Abort(1090447) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(192)........:
MPID_Init(1665)..............:
MPIDI_OFI_mpi_init_hook(1586):
...
Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, tvm_ffi.core, markupsafe._speedups, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, mpi4py.MPI (total: 22)
!!!!!!! Segfault encountered !!!!!!!
...

❌ FAILED: tests/comm/test_trtllm_mnnvl_allreduce.py

These tests should be skipped in a single-GPU environment, but they fail instead, which indicates that the failure occurs at MPI module load time rather than inside the tests themselves.
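For context, importing mpi4py.MPI initializes MPI (via MPI_Init_thread) at import time by default, which is why the abort above happens during test collection, before any skip condition can run. A minimal repro sketch, assuming the same py312 conda environment:

    # Hypothetical one-liner: with a broken MPI runtime, this import
    # aborts exactly as in the log above, before any skip logic runs.
    python -c "import mpi4py.MPI"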

The current Dockerfile.cuXXX files install MPI via RUN conda install -n py312 -y mpi4py. Investigating the Docker build logs shows that a month ago (Nov. 4),

#17 13.68     mpi-1.0.1                  |            mpich           6 KB  conda-forge
#17 13.68     mpi4py-4.1.1               |py312hd0af0b3_100         866 KB  conda-forge
#17 13.68     mpich-4.3.2                |     h79b1c89_100         5.4 MB  conda-forge

was being installed, but yesterday:

#17 13.59     impi_rt-2021.13.1          |     ha770c72_769        41.7 MB  conda-forge
#17 13.59     mpi-1.0                    |             impi           6 KB  conda-forge
#17 13.59     mpi4py-4.1.1               |py312h18f78f0_102         864 KB  conda-forge

is being installed.

mpich and impi are two different implementations of the MPI standard: MPICH and Intel MPI, respectively. This unpinned switch of implementations is the suspected cause of the MPI load failures.
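One way to confirm which implementation a given build picked up is to query the MPI library version string (Get_library_version is a standard mpi4py call):

    # Diagnostic sketch: MPICH builds typically report "MPICH Version: ...",
    # while Intel MPI builds report "Intel(R) MPI Library ...".
    python -c "from mpi4py import MPI; print(MPI.Get_library_version())"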

This PR pins the MPI implementation via RUN conda install -n py312 -y mpi4py mpich. With this change, the build log shows:

#15 14.63     mpi-1.0.1                  |            mpich           6 KB  conda-forge
#15 14.63     mpi4py-4.1.1               |py312hd0af0b3_102         865 KB  conda-forge
#15 14.63     mpich-4.3.2                |     h79b1c89_100         5.4 MB  conda-forge

which matches what was being installed before.
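In Dockerfile terms, the change is a one-word addition per file (a sketch built from the two install lines quoted above):

    # before: conda's solver is free to pick any MPI provider (e.g. Intel MPI)
    RUN conda install -n py312 -y mpi4py
    # after: pin the MPI provider to MPICH explicitly
    RUN conda install -n py312 -y mpi4py mpich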

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

coderabbitai bot (Contributor) commented Dec 5, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Across all CUDA version Dockerfiles (cu126, cu128, cu129, cu130), mpich is now installed alongside mpi4py in the py312 conda environment. This change is applied consistently to both standard and development-variant Dockerfiles, expanding the MPI backend availability.

Changes

MPI Installation Enhancement
Files: docker/Dockerfile.cu126, docker/Dockerfile.cu126.dev, docker/Dockerfile.cu128, docker/Dockerfile.cu128.dev, docker/Dockerfile.cu129, docker/Dockerfile.cu129.dev, docker/Dockerfile.cu130, docker/Dockerfile.cu130.dev
Summary: Added mpich to the conda install command alongside mpi4py in the py312 environment, changing the dependency set from mpi4py alone to mpi4py and mpich.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

This is a repetitive configuration change applied uniformly across 8 Dockerfiles with identical patterns. Each modification simply adds mpich to an existing conda install line with no logic, build flow, or behavioral changes.

Poem

🐰 Through Docker lanes where images grow,
We hopped in mpich, don't you know!
With mpi4py paired, both sides aligned,
Eight Dockerfiles enhanced, perfectly signed! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Docstring Coverage ✅ Passed — No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Title check ✅ Passed — The PR title 'ci: Specify MPI implementation to mpich' clearly summarizes the main change across all modified Dockerfiles, which consistently add mpich as an MPI implementation alongside mpi4py.
  • Description check ✅ Passed — The PR description comprehensively explains the MPI implementation issue, provides specific evidence from build logs, and clearly documents the problem and solution.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @bkryu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request standardizes the MPI environment within the project's Docker images by explicitly specifying mpich as the MPI implementation during the mpi4py installation. This modification is applied across all CUDA-specific and development Dockerfiles, aiming to ensure consistent behavior and prevent potential issues arising from differing or default MPI backends. The 'DO NOT MERGE' tag indicates this is likely a work-in-progress or experimental change.

Highlights

  • Explicit MPI Implementation: The mpich package is now explicitly installed alongside mpi4py in the py312 conda environment across all Dockerfiles.
  • Consistent Docker Environment: This change is applied uniformly to all Dockerfiles, covering various CUDA versions (cu126, cu128, cu129, cu130) and their development variants, ensuring a standardized MPI setup.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request consistently specifies mpich as the MPI implementation when installing mpi4py across all Dockerfiles. This is a good practice for ensuring reproducible environments. My main feedback is to also clean the conda cache after installation to optimize the Docker image sizes. I've added specific suggestions for this on each of the changed lines.
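A sketch of that cache-cleanup suggestion, assuming the cleanup is chained onto the same install layer so the freed space actually shrinks the image:

    # Hypothetical variant of the install line with cache cleanup;
    # "conda clean -afy" removes package tarballs, caches, and index files.
    RUN conda install -n py312 -y mpi4py mpich && \
        conda clean -afy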

Additionally, I've noticed significant duplication across the various Dockerfiles (cu126, cu128, etc., and their .dev variants). While a full refactor is outside the scope of this PR, you might consider consolidating them in the future using a single base Dockerfile with build arguments to handle the CUDA version differences. This would greatly improve maintainability.

@bkryu changed the title from "[DO NOT MERGE] Specify MPI implementation to mpich" to "ci: Specify MPI implementation to mpich" on Dec 5, 2025
@yzh119 (Collaborator) left a comment:

LGTM overall.

In the long term, I wonder whether we should consider installing MPI via apt-get, as sglang does: https://github.com/sgl-project/sglang/blob/cee93a6f26023d978b5187725bcb3c15ba604343/docker/Dockerfile#L474
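For reference, a hedged sketch of what that apt-get route might look like (package names assume a Debian/Ubuntu base image; not taken from the sglang Dockerfile itself):

    # Hypothetical apt-get alternative: install MPICH from the distro repos,
    # then build mpi4py against it via pip.
    RUN apt-get update && \
        apt-get install -y --no-install-recommends mpich libmpich-dev && \
        rm -rf /var/lib/apt/lists/* && \
        pip install --no-cache-dir mpi4py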

@yzh119 merged commit 185d63a into flashinfer-ai:main on Dec 6, 2025; 15 checks passed.
@bkryu deleted the specify_mpi_impl branch on Dec 8, 2025.