[rocm, test] feat: add ROCm controller and test tooling #56
Conversation
**Summary of Changes**

Hello @Wangmerlyn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the project's hardware compatibility by introducing comprehensive support for AMD ROCm GPUs. It provides a dedicated controller for managing ROCm devices, integrates this new capability into the overarching global GPU management system, and updates the testing infrastructure and documentation to facilitate development and testing on ROCm platforms. This change enables users with AMD GPUs to leverage the application's GPU-keeping functionalities.
Code Review
This pull request introduces support for ROCm GPUs by adding a RocmGPUController and integrating it into the GlobalGPUController. It also adds corresponding test infrastructure, including a pytest marker and command-line option to run ROCm-specific tests. The implementation is solid, but there are a few areas with code duplication that could be refactored for better maintainability. Specifically, the GlobalGPUController has duplicated logic for CUDA and ROCm platforms, and the new RocmGPUController duplicates some initialization logic from other controllers. I've also noted a performance improvement opportunity in how rocm-smi is used and a minor redundancy in the test configuration. Overall, this is a great addition. The changes are well-structured and the tests are thoughtful. Addressing the feedback will help improve the long-term health of the codebase.
```python
elif self.computing_platform == ComputingPlatform.ROCM:
    from keep_gpu.single_gpu_controller.rocm_gpu_controller import (
        RocmGPUController,
    )

    if gpu_ids is None:
        self.gpu_ids = list(range(torch.cuda.device_count()))
    else:
        self.gpu_ids = gpu_ids

    self.controllers = [
        RocmGPUController(
            rank=i,
            interval=interval,
            vram_to_keep=vram_to_keep,
            busy_threshold=busy_threshold,
        )
        for i in self.gpu_ids
    ]
```
This elif block for ROCm is very similar to the existing if block for CUDA, leading to code duplication. To improve maintainability, you could refactor this. The logic for setting self.gpu_ids is identical and can be moved out of the conditional blocks. Then, you can use the conditional to select the appropriate controller class (CudaGPUController or RocmGPUController) and have a single list comprehension to create the self.controllers list. This would make the code DRY (Don't Repeat Yourself) and easier to extend.
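Concretely, the suggested refactor could look like the following sketch. The stub controller classes and the trimmed `__init__` signature are stand-ins for the real ones, so this is illustrative rather than the project's actual code:

```python
# Sketch of the DRY refactor: shared gpu_ids logic hoisted out of the
# branches, and the conditional reduced to selecting a controller class.
from dataclasses import dataclass
from enum import Enum, auto


class ComputingPlatform(Enum):  # stand-in for keep_gpu's platform enum
    CUDA = auto()
    ROCM = auto()


@dataclass
class _StubController:  # stand-in for the real single-GPU controllers
    rank: int
    interval: float
    vram_to_keep: int
    busy_threshold: int


class CudaGPUController(_StubController):
    pass


class RocmGPUController(_StubController):
    pass


class GlobalGPUController:
    def __init__(self, platform, gpu_ids=None, device_count=0,
                 interval=1.0, vram_to_keep=0, busy_threshold=50):
        # Identical logic moved out of the per-platform branches.
        self.gpu_ids = list(range(device_count)) if gpu_ids is None else gpu_ids

        # The conditional now only picks the controller class.
        if platform == ComputingPlatform.CUDA:
            controller_cls = CudaGPUController
        elif platform == ComputingPlatform.ROCM:
            controller_cls = RocmGPUController
        else:
            raise ValueError(f"unsupported platform: {platform}")

        # A single list comprehension covers both platforms.
        self.controllers = [
            controller_cls(
                rank=i,
                interval=interval,
                vram_to_keep=vram_to_keep,
                busy_threshold=busy_threshold,
            )
            for i in self.gpu_ids
        ]
```

Adding a third platform then only requires one more `elif` arm, not a duplicated construction block.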
```python
rocm_smi.rsmi_init()
# rsmi_dev_busy_percent_get returns percent (0-100)
util = rocm_smi.rsmi_dev_busy_percent_get(index)
rocm_smi.rsmi_shut_down()
```
Calling rocm_smi.rsmi_init() and rocm_smi.rsmi_shut_down() on every utilization query is inefficient. According to rocm-smi documentation, the library should be initialized once and shut down when no longer needed. Consider managing the rocm_smi lifecycle within the RocmGPUController itself, for example by calling rsmi_init() in __init__ or keep(), and rsmi_shut_down() in release(). The _query_rocm_utilization function could then be turned into a method of the class that doesn't handle initialization/shutdown.
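One possible shape for that lifecycle change is sketched below, with the SMI object injected and a `FakeSMI` stand-in so the idea can be exercised without ROCm hardware. Method names follow the snippet above, but this is not the project's actual implementation:

```python
class FakeSMI:
    """Stand-in for the real rocm_smi bindings, used only for illustration."""

    def __init__(self):
        self.inits = 0
        self.shutdowns = 0

    def rsmi_init(self, init_flags):
        self.inits += 1

    def rsmi_dev_busy_percent_get(self, index):
        return 42  # pretend the GPU is 42% busy

    def rsmi_shut_down(self):
        self.shutdowns += 1


class RocmGPUController:
    def __init__(self, rank, rocm_smi=None):
        self.rank = rank
        self._rocm_smi = rocm_smi
        self._smi_ready = False

    def keep(self):
        # Initialize the SMI library once, up front.
        if self._rocm_smi and not self._smi_ready:
            self._rocm_smi.rsmi_init(0)
            self._smi_ready = True

    def _query_utilization(self):
        # No init/shutdown here anymore; just the busy-percent query.
        if not self._smi_ready:
            return None
        return self._rocm_smi.rsmi_dev_busy_percent_get(self.rank)

    def release(self):
        # Shut down exactly once, when the controller is released.
        if self._rocm_smi and self._smi_ready:
            self._rocm_smi.rsmi_shut_down()
            self._smi_ready = False
```

With this shape, repeated utilization queries pay no init/shutdown cost, and the `keep()`/`release()` pair owns the library lifecycle.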
```python
if isinstance(vram_to_keep, str):
    vram_to_keep = self.parse_size(vram_to_keep)
elif not isinstance(vram_to_keep, int):
    raise TypeError(
        f"vram_to_keep must be str or int, got {type(vram_to_keep)}"
    )
```
This logic for parsing vram_to_keep appears to be duplicated in other controller classes. To adhere to the DRY principle, consider moving this logic to the BaseGPUController.__init__. The base class could then be responsible for handling both str and int types for vram_to_keep, simplifying the initializers of child controllers like this one.
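A minimal sketch of that consolidation follows. The toy `parse_size` and its unit table are assumptions for illustration; the project would keep using its own humanized-size parser:

```python
# Sketch: vram_to_keep normalization hoisted into the base class so child
# controllers no longer duplicate the str/int handling.
import re
from typing import Union

_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}  # illustrative


def parse_size(text: str) -> int:
    """Parse strings like '1000 MB' into a byte count (toy implementation)."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?B)\s*", text, re.IGNORECASE)
    if not m:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2).upper()])


class BaseGPUController:
    def __init__(self, vram_to_keep: Union[int, str]):
        # Centralized here, exactly once, for every subclass.
        if isinstance(vram_to_keep, str):
            vram_to_keep = parse_size(vram_to_keep)
        elif not isinstance(vram_to_keep, int):
            raise TypeError(
                f"vram_to_keep must be str or int, got {type(vram_to_keep)}"
            )
        self.vram_to_keep = vram_to_keep


class RocmGPUController(BaseGPUController):
    def __init__(self, vram_to_keep: Union[int, str] = 0):
        # No parsing needed here anymore; the base class normalizes it.
        super().__init__(vram_to_keep)
```

Child initializers then shrink to a plain `super().__init__(...)` call.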
```python
@staticmethod
def parse_size(text: str) -> int:
    return parse_size(text)
```
```python
def pytest_configure(config):
    config.addinivalue_line("markers", "rocm: tests that require ROCm stack")
    config.addinivalue_line("markers", "large_memory: tests that use large VRAM")
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
rocm_smi.rsmi_init()
# rsmi_dev_busy_percent_get returns percent (0-100)
util = rocm_smi.rsmi_dev_busy_percent_get(index)
```
Initialize ROCm SMI with required flags
In _query_rocm_utilization the call to rocm_smi.rsmi_init() omits the required flags argument (ROCm SMI expects rsmi_init(0)); with a real rocm_smi present this raises TypeError, gets swallowed by the broad except, and the function always returns None. On ROCm systems this means busy-threshold gating never runs and the controller will keep firing workloads even when the GPU is already busy.
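This failure mode can be reproduced without real hardware. The sketch below uses a hypothetical `StrictSMI` stand-in whose `rsmi_init` requires a flags argument, showing how the broad `except` turns the `TypeError` into a silent `None`:

```python
class StrictSMI:
    """Stand-in that, like real ROCm SMI bindings, requires an init flags arg."""

    def rsmi_init(self, init_flags):  # required positional argument
        pass

    def rsmi_dev_busy_percent_get(self, index):
        return 35

    def rsmi_shut_down(self):
        pass


def query_rocm_utilization(rocm_smi, index):
    # Mirrors the buggy shape: the missing flags argument raises TypeError,
    # which the blanket except converts into a None return.
    try:
        rocm_smi.rsmi_init()  # BUG: no flags -> TypeError on strict bindings
        util = rocm_smi.rsmi_dev_busy_percent_get(index)
        rocm_smi.rsmi_shut_down()
        return util
    except Exception:
        return None  # the TypeError disappears here


def query_rocm_utilization_fixed(rocm_smi, index):
    # Same flow with an explicit flags value, as the review suggests.
    try:
        rocm_smi.rsmi_init(0)
        util = rocm_smi.rsmi_dev_busy_percent_get(index)
        rocm_smi.rsmi_shut_down()
        return util
    except Exception:
        return None
```

Because callers treat `None` as "utilization unknown", the buggy variant never trips the busy-threshold gate even though the query itself is broken.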
> **Note:** CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

**Walkthrough**

The PR adds ROCm GPU controller support alongside refactored base initialization. Changes include a new RocmGPUController class with tensor allocation and batch operations, platform-aware controller selection, base GPU controller parameter extension to accept humanized string input, pytest markers for ROCm/large_memory tests, and developer documentation with installation and testing guidance.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant GGC as GlobalGPUController
    participant BC as BaseGPUController<br/>(vram_to_keep parsing)
    participant CudaC as CudaGPUController
    participant RocmC as RocmGPUController
    GGC->>GGC: Detect platform (CUDA/ROCM)
    alt CUDA platform
        GGC->>GGC: controller_cls = CudaGPUController
    else ROCm platform
        GGC->>GGC: controller_cls = RocmGPUController
    end
    GGC->>BC: __init__(vram_to_keep: Union[int, str], ...)
    BC->>BC: if isinstance(vram_to_keep, str):<br/>parse_size() → int
    BC->>BC: validate type → raise if invalid
    alt via CudaGPUController
        BC->>CudaC: super().__init__()
    else via RocmGPUController
        BC->>RocmC: super().__init__()
        RocmC->>RocmC: start daemon thread<br/>for keep_loop()
    end
    GGC->>GGC: Instantiate self.controllers<br/>using controller_cls
```
```mermaid
sequenceDiagram
    participant User
    participant RocmC as RocmGPUController
    participant RT as Worker Thread<br/>(_keep_loop)
    participant GPU as ROCm GPU
    participant RSMI as rocm_smi
    User->>RocmC: keep()
    RocmC->>RT: Start daemon thread
    RT->>GPU: Allocate tensor
    loop every interval seconds
        RT->>RSMI: _query_utilization()
        alt GPU busy > threshold
            RSMI-->>RT: utilization %
            RT->>RT: Sleep
        else GPU idle
            RSMI-->>RT: utilization %
            RT->>GPU: _run_batch() - ReLU ops
            GPU-->>RT: Complete
        end
    end
    User->>RocmC: release()
    RocmC->>RT: Stop thread + clear cache
    RT-->>RocmC: Shutdown complete
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
**Pre-merge checks and finishing touches**

❌ Failed checks (1 warning) · ✅ Passed checks (2 passed)
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/conftest.py (1)
**14-17:** Remove redundant marker registration. These markers are already defined in `pyproject.toml` (lines 144-147). Pytest automatically reads marker definitions from `[tool.pytest.ini_options]`, making this dynamic registration redundant. Duplicate registration can lead to maintenance issues if the definitions diverge. Apply this diff to remove the redundant registration:

```diff
-def pytest_configure(config):
-    config.addinivalue_line("markers", "rocm: tests that require ROCm stack")
-    config.addinivalue_line("markers", "large_memory: tests that use large VRAM")
-
```

src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (1)
**49-53:** `rsmi_init()` likely requires a flags argument. The ROCm SMI library typically expects `rsmi_init(0)` with an explicit flags parameter. Without it, this may raise a `TypeError` on systems with real `rocm_smi` installed, causing the exception to be silently caught and utilization monitoring to fail.

```diff
 if self._rocm_smi:
     try:
-        self._rocm_smi.rsmi_init()
+        self._rocm_smi.rsmi_init(0)
     except Exception as exc:  # pragma: no cover - env-specific
         logger.debug("rsmi_init failed: %s", exc)
```
🧹 Nitpick comments (3)
tests/rocm_controller/test_rocm_utilization.py (1)
**8-36:** Consider testing through the public API instead of a private function. The test directly invokes `rgc._query_rocm_utilization(1)`, which is a private function (underscore prefix). While the test structure is sound and properly mocks the ROCm SMI interface, testing private implementation details can make the test suite brittle to refactoring. Consider refactoring to test through the public API of `RocmGPUController` instead. For example, you could test the `keep()` method or a public monitoring method that internally calls `_query_rocm_utilization`. This would verify the behavior from a user's perspective while still validating the ROCm integration. If direct testing of the private function is intentional for unit-testing purposes, the current approach is acceptable but may require updates if the private API changes.

Note: Line 30's `DummyRocmSMI.calls = 0` appears defensive but redundant, since it is the first operation on the counter in this test.
**7-24:** Excellent refactor centralizing type handling! The updated `__init__` signature accepting `Union[int, str]` with runtime parsing successfully consolidates the `vram_to_keep` handling logic, eliminating duplication in subclasses like `CudaGPUController`. The type validation and error messaging are clear and appropriate.

Optional: the static analysis hint (TRY003) suggests extracting the exception message to a constant or using a custom exception class. However, the current inline message is clear and concise, making this an optional style improvement rather than a necessary change.
src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (1)
**107-112:** Consider a graceful exit when stopped during the allocation retry. If `_stop_evt` is set while retrying tensor allocation (e.g., during `release()`), the loop exits with `tensor = None` and raises `RuntimeError`. This exception propagates even though the controller is intentionally stopping.

```diff
 if tensor is None:
-    logger.error("rank %s: failed to allocate tensor, exiting loop", self.rank)
-    raise RuntimeError("Failed to allocate tensor for ROCm GPU keeping")
+    if self._stop_evt.is_set():
+        logger.debug("rank %s: stop requested during allocation, exiting", self.rank)
+        return
+    logger.error("rank %s: failed to allocate tensor, exiting loop", self.rank)
+    raise RuntimeError("Failed to allocate tensor for ROCm GPU keeping")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- README.md (1 hunks)
- docs/getting-started.md (1 hunks)
- pyproject.toml (1 hunks)
- src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1 hunks)
- src/keep_gpu/single_gpu_controller/base_gpu_controller.py (1 hunks)
- src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py (1 hunks)
- src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (1 hunks)
- tests/conftest.py (1 hunks)
- tests/rocm_controller/test_rocm_utilization.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (2)
- src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py (1): `CudaGPUController` (16-195)
- src/keep_gpu/utilities/platform_manager.py (1): `ComputingPlatform` (12-15)

tests/rocm_controller/test_rocm_utilization.py (1)
- tests/conftest.py (1): `rocm_available` (30-31)

src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (5)
- src/keep_gpu/single_gpu_controller/base_gpu_controller.py (2): `BaseGPUController` (6-59), `keep` (33-38)
- src/keep_gpu/utilities/logger.py (1): `setup_logger` (67-98)
- src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py (3): `keep` (83-96), `_keep_loop` (123-162), `release` (98-110)
- src/keep_gpu/global_gpu_controller/global_gpu_controller.py (2): `keep` (60-62), `release` (68-75)
- tests/rocm_controller/test_rocm_utilization.py (3): `rsmi_init` (17-18), `rsmi_shut_down` (26-27), `rsmi_dev_busy_percent_get` (21-23)

src/keep_gpu/single_gpu_controller/base_gpu_controller.py (2)
- src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py (1): `parse_size` (76-77)
- src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1): `parse_size` (65-66)
🪛 Ruff (0.14.5)
src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py
41-41: Do not catch blind exception: Exception
(BLE001)
52-52: Do not catch blind exception: Exception
(BLE001)
74-74: Do not catch blind exception: Exception
(BLE001)
91-91: Do not catch blind exception: Exception
(BLE001)
112-112: Avoid specifying long messages outside the exception class
(TRY003)
src/keep_gpu/single_gpu_controller/base_gpu_controller.py
20-22: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Run pre-commit checks
- GitHub Check: Build documentation
- GitHub Check: build
🔇 Additional comments (13)
pyproject.toml (1)
**143-147:** LGTM! Pytest markers are properly configured. The marker definitions follow pytest conventions and provide clear descriptions for ROCm and large-memory tests.
src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py (2)
**64-64:** Good refactor delegating parsing to the base class. Passing `vram_to_keep` directly to the base class eliminates duplicate parsing logic and centralizes type handling in `BaseGPUController.__init__`.

**180-180:** LGTM! Log message formatting improved. The change from en dash to hyphen improves consistency in log messages.
README.md (1)
**85-91:** Excellent developer documentation! The new section provides clear guidance for contributors, including development setup, fast validation checks, and ROCm-specific test execution. This aligns well with the test infrastructure changes in the PR.
docs/getting-started.md (1)
**42-53:** LGTM! Contributor documentation is well-structured. The contributor section provides clear setup instructions and aligns with the developer guidance in README.md. The editable-install example is practical and helpful.
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)
**28-58:** Excellent refactor addressing code duplication! The refactored code consolidates controller instantiation logic using `controller_cls`, successfully eliminating the duplication flagged in the previous review. The unified approach with the `gpu_ids` assignment and controller list comprehension makes the code more maintainable and easier to extend for additional platforms.

tests/conftest.py (3)
**5-11:** LGTM! ROCm CLI option follows pytest best practices. The `--run-rocm` option enables selective execution of ROCm tests, which is appropriate for optional hardware-specific testing.
**19-26:** LGTM! Test skipping logic is well-implemented. The collection modifier appropriately skips ROCm tests by default, requiring explicit opt-in via `--run-rocm`. This prevents accidental execution on systems without ROCm.
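The opt-in pattern being praised can be sketched roughly as follows. The hook bodies are assumptions about how tests/conftest.py implements it, with the skip decision factored into a plain helper so it can be exercised without a pytest run:

```python
def should_skip_rocm(run_rocm: bool, marker_names) -> bool:
    """ROCm-marked tests are skipped unless --run-rocm was passed."""
    return "rocm" in marker_names and not run_rocm


# In tests/conftest.py, the helper would back the standard hooks:
def pytest_addoption(parser):
    parser.addoption(
        "--run-rocm", action="store_true", default=False,
        help="run tests that require the ROCm stack",
    )


def pytest_collection_modifyitems(config, items):
    import pytest  # imported lazily; only needed when pytest drives this hook
    run_rocm = config.getoption("--run-rocm")
    skip_marker = pytest.mark.skip(reason="needs --run-rocm")
    for item in items:
        if should_skip_rocm(run_rocm, item.keywords):
            item.add_marker(skip_marker)
```

With this shape, `pytest` skips ROCm tests by default, and `pytest --run-rocm` runs them on machines that have the stack.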
**29-31:** LGTM! ROCm detection logic is correct. The fixture properly detects ROCm availability by checking both CUDA API availability and the presence of `torch.version.hip`, which is specific to ROCm builds.

src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (4)
**19-43:** LGTM! Clean initialization with proper delegation to the base class. The lazy import pattern for `rocm_smi` with graceful fallback is appropriate. The broad `Exception` catch (line 41) is acceptable here since import failures can manifest as various exception types depending on the environment.
**64-76:** LGTM! Clean shutdown sequence with proper thread synchronization and resource cleanup. The structure correctly mirrors `CudaGPUController.release()`.
**85-93:** Good refactoring: SMI lifecycle now managed at the controller level. This addresses the prior feedback about inefficient per-query `rsmi_init()`/`rsmi_shut_down()` calls. The SMI is now initialized once in `keep()` and shut down in `release()`.
**130-143:** LGTM! Good use of `@torch.no_grad()` to avoid gradient-tracking overhead. The early exit check on `_stop_evt` ensures responsive shutdown, and `torch.cuda.synchronize()` guarantees accurate timing measurements.
**Summary by CodeRabbit**

New Features
- `vram_to_keep` parameter now accepts human-readable string values (e.g., "1000 MB") in addition to integers.

Documentation

Tests