[mcp] feat: add JSON-RPC server, GPU info listing, and contribution guide #59
Wangmerlyn merged 13 commits into main
Conversation
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds an experimental stdin/stdout JSON-RPC MCP server and console script, a GPU info aggregation utility (NVML/ROCm/Torch), documentation updates (README, getting-started, contributing), a mkdocs nav entry, new tests for the server and GPU info, and removes pytest marker registrations from conftest.
```mermaid
sequenceDiagram
autonumber
participant Client as MCP Client (stdin/stdout)
participant Server as KeepGPUServer
participant Controller as GlobalGPUController (factory)
participant Backend as GPU backends (NVML/ROCm/Torch)
Client->>Server: JSON-RPC request (start_keep / status / stop_keep / list_gpus)
alt start_keep
Server->>Controller: controller_factory(...) -> controller instance
Controller-->>Server: controller started
Server->>Server: store Session(params, controller)
Server-->>Client: JSON-RPC response (job_id, status)
else list_gpus
Server->>Backend: get_gpu_info()
Backend-->>Server: GPU list
Server-->>Client: JSON-RPC response (gpu list)
else status/stop_keep
Server->>Server: lookup Session(s)
Server->>Controller: call stop/release (if stopping)
Server-->>Client: JSON-RPC response (status or stopped_ids)
end
```
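To make the exchange concrete, here is an illustrative Python walkthrough that mirrors the tests quoted later in this thread; `dummy_factory` is a stand-in controller factory (a test double), and any response fields beyond those the tests assert on are assumptions:

```python
# Illustrative JSON-RPC exchange against KeepGPUServer; dummy_factory is a
# hypothetical test double, and response shapes beyond the asserted fields
# are assumptions.
from keep_gpu.mcp.server import KeepGPUServer, _handle_request

server = KeepGPUServer(controller_factory=dummy_factory)

# start_keep: creates a session and returns its job_id
start_resp = _handle_request(
    server, {"id": 1, "method": "start_keep", "params": {"gpu_ids": [0]}}
)
job_id = start_resp["result"]["job_id"]

# status without job_id: lists all active jobs
status_resp = _handle_request(server, {"id": 2, "method": "status", "params": {}})
assert "active_jobs" in status_resp["result"]

# stop_keep: releases the controller and reports the stopped ids
stop_resp = _handle_request(
    server, {"id": 3, "method": "stop_keep", "params": {"job_id": job_id}}
)
assert job_id in stop_resp["result"]["stopped"]
```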
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Summary of Changes

Hello @Wangmerlyn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the KeepGPU project by introducing an experimental JSON-RPC server, enabling external automation clients to manage GPU sessions and query GPU details. Alongside this, a new utility for comprehensive GPU information retrieval has been added, supporting both NVIDIA (CUDA) and AMD (ROCm) platforms. To foster community involvement, a detailed contribution guide has also been integrated into the documentation.
Code Review
This pull request introduces a significant new feature: a JSON-RPC server for controlling KeepGPU, along with a utility for querying GPU information. The changes also include comprehensive documentation and tests for the new functionality.
My review has identified a critical bug in the status method of the new server that would cause it to crash due to a serialization error. I've also suggested a minor refactoring for cleaner code in the server and an additional test case to improve coverage and prevent similar bugs in the future.
Overall, this is a great addition to the project. Addressing the identified issues will make the new server robust and reliable.
```python
# stop_keep
stop_req = {"id": 3, "method": "stop_keep", "params": {"job_id": job_id}}
stop_resp = _handle_request(server, stop_req)
assert job_id in stop_resp["result"]["stopped"]
```
The test suite is missing a case for the status() method when called without a job_id to list all active jobs. Adding such a test would improve coverage and would have helped catch the serialization bug in the status method.
Please consider adding a test case like this:
```python
def test_status_all():
    server = KeepGPUServer(controller_factory=dummy_factory)
    job_a = server.start_keep(gpu_ids=[0])["job_id"]
    job_b = server.start_keep(gpu_ids=[1])["job_id"]
    status = server.status()
    assert "active_jobs" in status
    assert len(status["active_jobs"]) == 2
    job_statuses = {job["job_id"]: job for job in status["active_jobs"]}
    assert job_a in job_statuses
    assert job_b in job_statuses
    assert job_statuses[job_a]["params"]["gpu_ids"] == [0]
    assert job_statuses[job_b]["params"]["gpu_ids"] == [1]
    # Check that controller object is not in the response
    assert "controller" not in job_statuses[job_a]
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
return {
    "active_jobs": [
        {"job_id": jid, **asdict(sess)} for jid, sess in self._sessions.items()
    ]
}
```
status without job_id includes non-serializable controller
The list-all branch of status builds each job entry with asdict(sess), which includes the GlobalGPUController instance. When a client calls status without a job_id after any start_keep has run, main() will try to json.dumps the response and raise TypeError: Object of type GlobalGPUController is not JSON serializable, crashing the MCP server instead of returning active jobs.
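A condensed sketch of the failure and the shape of the eventual fix; the Session fields follow this thread's description, and the code below is illustrative rather than the repo's exact implementation:

```python
# Condensed reproduction of the serialization failure described above,
# plus the fixed response shape; Session mirrors the dataclass in this thread.
import json
from dataclasses import asdict, dataclass
from typing import Any

@dataclass
class Session:
    controller: Any  # e.g. a GlobalGPUController instance
    params: dict

sessions = {"job-1": Session(controller=object(), params={"gpu_ids": [0]})}

# Buggy list-all branch: asdict() pulls the controller object into the
# payload, and json.dumps raises a TypeError for the non-serializable value.
buggy = {"active_jobs": [{"job_id": j, **asdict(s)} for j, s in sessions.items()]}
try:
    json.dumps(buggy)
except TypeError as exc:
    print(exc)  # "Object of type object is not JSON serializable"

# Fixed shape: expose only the serializable params field.
fixed = {"active_jobs": [{"job_id": j, "params": s.params} for j, s in sessions.items()]}
print(json.dumps(fixed))
```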
Actionable comments posted: 2
🧹 Nitpick comments (3)
tests/mcp/test_server.py (1)
4-78: MCP server tests are solid; consider asserting controller behavior

The tests exercise the public API (start_keep, status, stop_keep, list_gpus) and the JSON-RPC handler in a realistic way, which is great. If you want slightly stronger guarantees, you could also assert that DummyController.kept and DummyController.released flip as expected when sessions are started/stopped, to validate that the server is actually calling keep()/release() on the controller instances, not just mutating internal bookkeeping. This is optional but would make refactors in KeepGPUServer safer.

docs/getting-started.md (1)

48-78: MCP endpoint docs are accurate and consistent

The MCP section correctly reflects the CLI entry (keep-gpu-mcp-server), supported methods, and JSON-RPC usage, and it matches the server implementation/tests. If you find the MCP section drifting between README and here in future changes, consider centralizing details (e.g., in this doc) and linking from README to reduce duplication, but it's fine as-is.

tests/utilities/test_gpu_info.py (1)

8-126: GPU info tests are well-targeted; small lint and robustness tweaks possible

The NVML and ROCm tests exercise the intended code paths in get_gpu_info() nicely, including import behavior via sys.modules and the expected fields in the returned dicts. Two minor follow-ups you might consider:

- Unused dummy arguments: Static analysis is right that handle in nvmlDeviceGetMemoryInfo/nvmlDeviceGetUtilizationRates/nvmlDeviceGetName and idx in DummyTorchCuda.set_device are unused. Renaming them to _handle/_idx (or just _) will keep Ruff quiet without changing behavior.
- Optionally assert cleanup for NVML: Since DummyNVML tracks shutdown_calls, you could assert dummy_nvml.shutdown_calls == 1 in test_get_gpu_info_nvml to verify that _query_nvml always shuts down the NVML backend, which is exactly what the production code intends (see the sketch below).

These are polish-only; the current tests are functionally correct.
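A minimal, self-contained illustration of the optional cleanup assertion mentioned above; this DummyNVML is a stub written for the sketch, not the test suite's actual fixture:

```python
# Hypothetical standalone illustration of the suggested cleanup check;
# this DummyNVML is a minimal stub, not the real test fixture.
class DummyNVML:
    def __init__(self) -> None:
        self.shutdown_calls = 0

    def nvmlInit(self) -> None:
        pass  # pretend NVML initialized successfully

    def nvmlShutdown(self) -> None:
        self.shutdown_calls += 1

def query_with_cleanup(nvml) -> list:
    nvml.nvmlInit()
    try:
        return []  # real code would enumerate devices here
    finally:
        nvml.nvmlShutdown()  # always runs, even if enumeration raises

dummy_nvml = DummyNVML()
query_with_cleanup(dummy_nvml)
assert dummy_nvml.shutdown_calls == 1  # the suggested assertion
```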
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- README.md (1 hunks)
- docs/contributing.md (1 hunks)
- docs/getting-started.md (1 hunks)
- mkdocs.yml (1 hunks)
- pyproject.toml (1 hunks)
- src/keep_gpu/mcp/server.py (1 hunks)
- src/keep_gpu/utilities/gpu_info.py (1 hunks)
- tests/conftest.py (0 hunks)
- tests/mcp/test_server.py (1 hunks)
- tests/utilities/test_gpu_info.py (1 hunks)
💤 Files with no reviewable changes (1)
- tests/conftest.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/mcp/test_server.py (1)
src/keep_gpu/mcp/server.py (6)
KeepGPUServer (33-117), _handle_request (120-138), start_keep (42-71), status (91-105), stop_keep (73-89), list_gpus (107-110)
tests/utilities/test_gpu_info.py (1)
src/keep_gpu/utilities/gpu_info.py (1)
get_gpu_info (125-144)
src/keep_gpu/mcp/server.py (3)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)
GlobalGPUController (10-82)
src/keep_gpu/utilities/gpu_info.py (1)
get_gpu_info (125-144)
src/keep_gpu/utilities/logger.py (1)
setup_logger (67-98)
🪛 Ruff (0.14.8)
tests/utilities/test_gpu_info.py
42-42: Unused static method argument: handle
(ARG004)
46-46: Unused static method argument: handle
(ARG004)
50-50: Unused static method argument: handle
(ARG004)
91-91: Unused static method argument: idx
(ARG004)
src/keep_gpu/utilities/gpu_info.py
39-40: try-except-pass detected, consider logging the exception
(S110)
39-39: Do not catch blind exception: Exception
(BLE001)
47-47: Do not catch blind exception: Exception
(BLE001)
60-60: Do not catch blind exception: Exception
(BLE001)
67-67: Do not catch blind exception: Exception
(BLE001)
72-72: Do not catch blind exception: Exception
(BLE001)
88-89: try-except-pass detected, consider logging the exception
(S110)
88-88: Do not catch blind exception: Exception
(BLE001)
104-104: Do not catch blind exception: Exception
(BLE001)
108-108: Do not catch blind exception: Exception
(BLE001)
120-120: Do not catch blind exception: Exception
(BLE001)
134-134: Do not catch blind exception: Exception
(BLE001)
141-141: Do not catch blind exception: Exception
(BLE001)
src/keep_gpu/mcp/server.py
52-52: Avoid specifying long messages outside the exception class
(TRY003)
115-115: Do not catch blind exception: Exception
(BLE001)
134-134: Abstract raise to an inner function
(TRY301)
134-134: Avoid specifying long messages outside the exception class
(TRY003)
135-135: Consider moving this statement to an else block
(TRY300)
150-150: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build
- GitHub Check: Run pre-commit checks
- GitHub Check: Build documentation
🔇 Additional comments (13)
docs/contributing.md (1)
1-62: Contributing guide is clear and aligned with the test/doc structure. The setup, test markers, MCP notes, and PR expectations look consistent with the rest of the repo and this PR's changes. No issues from my side.
mkdocs.yml (1)
26-27: MkDocs nav entry for Contributing looks correct. The new Project > Contributing item points to contributing.md and matches the new doc. Indentation and structure are valid.
README.md (1)
93-119: README MCP section and contributing link look good. The MCP endpoint description (methods, example request, stdio config) matches the implementation, and the new link to docs/contributing.md nicely surfaces the contributor guide. No changes needed.
pyproject.toml (1)
35-38: Entry point keep_gpu.mcp.server:main is properly defined and CLI-safe. The main() function exists at line 141 in src/keep_gpu/mcp/server.py with the correct signature def main() -> None:. It implements a JSON-RPC server that reads from stdin, processes requests, and writes responses to stdout, a standard pattern for console script entry points. The implementation includes proper error handling and output flushing.
src/keep_gpu/mcp/server.py (9)
1-24: LGTM! Clean module structure and imports. The module docstring clearly explains the JSON-RPC protocol, and the imports are well-organized. Logger setup follows the project's utility patterns.
27-30: LGTM! Simple and appropriate session structure. The Session dataclass effectively encapsulates the controller and parameters for each active session.
33-40: LGTM! Good use of dependency injection and cleanup registration. The constructor properly initializes state and registers cleanup via atexit. The controller_factory parameter enables testing with mock controllers.
73-89: LGTM! Proper resource cleanup for both single and bulk stop. The method correctly handles stopping a specific session or all sessions, properly releasing controller resources in each case.
107-110: LGTM! Clean delegation to GPU info utility. The method properly wraps get_gpu_info() and returns the result in a consistent format.
112-117: LGTM! Appropriate defensive cleanup for shutdown. The broad exception catch is justified for the atexit handler to prevent errors during interpreter teardown.
120-138: LGTM! Standard JSON-RPC request handling. The function properly dispatches methods and returns standardized JSON-RPC responses. The broad exception catch with logging is appropriate for an RPC handler.
141-157: LGTM! Robust stdin/stdout JSON-RPC loop. The main function correctly implements a line-based JSON-RPC server over stdin/stdout, with proper error handling and output flushing for reliable communication.
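For readers unfamiliar with the pattern, a generic line-based stdin/stdout JSON-RPC loop looks roughly like this; handle_request here is a placeholder dispatcher, not the project's function:

```python
# Illustrative sketch of a line-based JSON-RPC loop over stdin/stdout;
# handle_request is a placeholder for the server's dispatcher.
import json
import sys

def handle_request(request: dict) -> dict:
    # Stand-in dispatcher; the real server routes to start_keep/status/etc.
    return {"id": request.get("id"), "result": {"echo": request.get("method")}}

def main() -> None:
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue  # skip blank lines between requests
        try:
            response = handle_request(json.loads(line))
        except Exception as exc:  # malformed JSON or handler failure
            response = {"id": None, "error": str(exc)}
        # One JSON document per line; flush so the client sees it immediately.
        sys.stdout.write(json.dumps(response) + "\n")
        sys.stdout.flush()
```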
42-71: No issue identified with busy_threshold default. The -1 default is intentional. In the comparison logic (if gpu_utilization > self.busy_threshold), a value of -1 means the condition is always true (any utilization exceeds -1), effectively disabling utilization-based throttling. This is the correct semantic for an API-level parameter: -1 signals "no threshold enforcement," while the underlying GlobalGPUController and individual GPU controllers use 10 as their concrete default when an actual threshold is needed. The design is consistent and correct.
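A toy illustration of these threshold semantics; gpu_is_busy is a hypothetical helper, but the comparison mirrors the quoted logic:

```python
# Toy illustration of the busy_threshold semantics described above.
# gpu_is_busy is a hypothetical helper, not the project's actual API.
def gpu_is_busy(gpu_utilization: float, busy_threshold: float) -> bool:
    # Mirrors `gpu_utilization > self.busy_threshold` from the review:
    # with busy_threshold == -1, any utilization (>= 0) compares as busy,
    # which effectively disables utilization-based throttling.
    return gpu_utilization > busy_threshold

assert gpu_is_busy(0.0, -1)      # -1 sentinel: always "busy", throttling off
assert not gpu_is_busy(5.0, 10)  # concrete default of 10: 5% counts as idle
assert gpu_is_busy(42.0, 10)     # 42% exceeds the 10% threshold
```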
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/keep_gpu/mcp/server.py (1)
119-137: Consider addressing Ruff style hints (optional). The implementation is functionally correct. However, Ruff flags a few style issues:
- Line 133: Long error message specified inline (TRY003)
- Lines 133-134: Consider restructuring the try-except flow (TRY300, TRY301)
These are minor style suggestions that could improve code consistency with linting rules, but are not critical.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/keep_gpu/mcp/server.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/keep_gpu/mcp/server.py (4)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)
GlobalGPUController (10-82)
src/keep_gpu/utilities/gpu_info.py (1)
get_gpu_info (125-144)
src/keep_gpu/utilities/logger.py (1)
setup_logger (67-98)
tests/mcp/test_server.py (2)
keep (13-14), release (16-17)
🪛 Ruff (0.14.8)
src/keep_gpu/mcp/server.py
52-52: Avoid specifying long messages outside the exception class
(TRY003)
114-114: Do not catch blind exception: Exception
(BLE001)
133-133: Abstract raise to an inner function
(TRY301)
133-133: Avoid specifying long messages outside the exception class
(TRY003)
134-134: Consider moving this statement to an else block
(TRY300)
149-149: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build
- GitHub Check: Build documentation
- GitHub Check: Run pre-commit checks
🔇 Additional comments (1)
src/keep_gpu/mcp/server.py (1)
100-104: LGTM: Correctly excludes non-serializable controller. The implementation correctly returns only the serializable params field for each session, avoiding the controller object. This fixes the critical JSON serialization issue flagged in previous reviews.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/mcp/test_server.py (1)
4-21: Consider asserting controller lifecycle state. DummyController tracks kept and released flags, but no test verifies them. Adding assertions would confirm the controller's keep() and release() methods are actually invoked.

Example assertion in test_start_status_stop_cycle:

```python
# After start_keep
session = server._sessions[job_id]
assert session.controller.kept is True

# After stop_keep
assert session.controller.released is True
```

src/keep_gpu/mcp/server.py (1)
140-152: Main loop correctly implements JSON-RPC over stdin/stdout. The line-by-line processing with immediate flush is appropriate for an MCP-style server. Consider adding signal handling (SIGINT/SIGTERM) for graceful shutdown in production use; a sketch follows below.
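A minimal sketch of that suggestion; the server object and its shutdown() method are assumptions standing in for the server's atexit cleanup:

```python
# Illustrative SIGINT/SIGTERM handling for graceful shutdown; the server
# object and its shutdown() method are assumptions in this sketch.
import signal
import sys

def install_signal_handlers(server) -> None:
    def _handle(signum, frame):
        # Release all kept GPU sessions before exiting the process.
        server.shutdown()
        sys.exit(0)

    signal.signal(signal.SIGINT, _handle)
    signal.signal(signal.SIGTERM, _handle)
```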
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/keep_gpu/mcp/server.py (1 hunks)
- src/keep_gpu/utilities/gpu_info.py (1 hunks)
- tests/mcp/test_server.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/mcp/test_server.py (1)
src/keep_gpu/mcp/server.py (6)
KeepGPUServer (33-116), _handle_request (119-137), start_keep (42-71), status (90-104), stop_keep (73-88), list_gpus (106-109)
src/keep_gpu/mcp/server.py (4)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)
GlobalGPUController (10-82)
src/keep_gpu/utilities/gpu_info.py (1)
get_gpu_info (139-158)
src/keep_gpu/utilities/logger.py (1)
setup_logger (67-98)
tests/mcp/test_server.py (2)
keep (13-14), release (16-17)
🪛 GitHub Actions: pre-commit
src/keep_gpu/mcp/server.py
[error] 99-100: Black formatting check failed in pre-commit. 1 file reformatted (src/keep_gpu/mcp/server.py); please re-run 'pre-commit run --all-files' to apply formatting.
🪛 Ruff (0.14.8)
src/keep_gpu/mcp/server.py
52-52: Avoid specifying long messages outside the exception class
(TRY003)
114-114: Do not catch blind exception: Exception
(BLE001)
133-133: Abstract raise to an inner function
(TRY301)
133-133: Avoid specifying long messages outside the exception class
(TRY003)
134-134: Consider moving this statement to an else block
(TRY300)
149-149: Do not catch blind exception: Exception
(BLE001)
src/keep_gpu/utilities/gpu_info.py
39-40: try-except-pass detected, consider logging the exception
(S110)
39-39: Do not catch blind exception: Exception
(BLE001)
47-47: Do not catch blind exception: Exception
(BLE001)
63-63: Do not catch blind exception: Exception
(BLE001)
70-70: Do not catch blind exception: Exception
(BLE001)
75-75: Do not catch blind exception: Exception
(BLE001)
92-93: try-except-pass detected, consider logging the exception
(S110)
92-92: Do not catch blind exception: Exception
(BLE001)
96-97: try-except-pass detected, consider logging the exception
(S110)
96-96: Do not catch blind exception: Exception
(BLE001)
113-113: Do not catch blind exception: Exception
(BLE001)
117-117: Do not catch blind exception: Exception
(BLE001)
129-129: Do not catch blind exception: Exception
(BLE001)
134-135: try-except-pass detected, consider logging the exception
(S110)
134-134: Do not catch blind exception: Exception
(BLE001)
148-148: Do not catch blind exception: Exception
(BLE001)
155-155: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (10)
tests/mcp/test_server.py (1)
81-95: Good addition of test_status_all per previous feedback. This test case properly validates the status() method when called without a job_id, covering the serialization of active jobs and ensuring the controller field is excluded from the response.
src/keep_gpu/utilities/gpu_info.py (4)
44-98: Device state restoration is properly implemented. The _query_rocm function now correctly captures current_device before iteration and restores it in the finally block, preventing side effects on global CUDA state. This addresses the concern from the previous review.
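The capture/restore pattern being verified, sketched against torch's public CUDA API (torch.cuda.current_device/set_device/mem_get_info are real calls; the surrounding function is a simplified stand-in for _query_rocm, not the repo's code):

```python
# Simplified capture/restore sketch; torch.cuda.current_device() and
# torch.cuda.set_device() are real torch APIs, the rest is illustrative.
import torch

def query_all_devices() -> list[dict]:
    infos: list[dict] = []
    original = torch.cuda.current_device()  # capture before iterating
    try:
        for idx in range(torch.cuda.device_count()):
            torch.cuda.set_device(idx)
            free, total = torch.cuda.mem_get_info()  # bytes on device idx
            infos.append({"index": idx, "free": free, "total": total})
    finally:
        torch.cuda.set_device(original)  # restore global device selection
    return infos
```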
101-136: LGTM - Torch query function properly preserves device state. The function captures current_device at line 105 and restores it in the finally block, ensuring no side effects on global CUDA device selection.
12-41: NVML query is correctly implemented. The function properly initializes/shuts down NVML in a try/finally block and handles the bytes-to-string conversion for device names. Since it doesn't use torch.cuda.set_device, no device restoration is needed.
139-158: Well-structured fallback chain for GPU info retrieval. The priority order (NVML → ROCm → Torch) is sensible, with NVML providing the most detailed information. Debug-level logging for failures is appropriate since these are expected on systems without specific GPU backends; the pattern is sketched below.
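The fallback-chain pattern, sketched with stub backends (the _query_* names mirror the helpers discussed in this review, but the bodies here are placeholders, not the module's real code):

```python
# Illustrative NVML -> ROCm -> Torch fallback chain. The _query_* stubs
# below stand in for the module's real helpers and always fail here.
import logging

logger = logging.getLogger(__name__)

def _query_nvml() -> list[dict]:
    raise RuntimeError("pynvml not available")  # stub

def _query_rocm() -> list[dict]:
    raise RuntimeError("ROCm torch backend not available")  # stub

def _query_torch() -> list[dict]:
    raise RuntimeError("torch not available")  # stub

def get_gpu_info_sketch() -> list[dict]:
    for name, query in (("nvml", _query_nvml),
                        ("rocm", _query_rocm),
                        ("torch", _query_torch)):
        try:
            info = query()
            if info:
                return info  # first backend that yields data wins
        except Exception as exc:
            # Expected on systems without this backend; log at debug level.
            logger.debug("%s backend unavailable: %s", name, exc)
    return []  # no GPU backend produced data
```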
src/keep_gpu/mcp/server.py (5)
42-71: LGTM - Session creation is well implemented. The method correctly handles optional job_id, checks for duplicates, and properly initializes the controller before storing the session.
73-88: Stop logic is correctly implemented. The method properly handles both single-job and all-jobs cases, using pop() to atomically remove and retrieve sessions before releasing them (see the sketch below).
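A simplified sketch of that pop-and-release pattern; the Session fields, method signature, and return shape are assumptions based on this review:

```python
# Simplified pop-and-release sketch of the stop logic described above;
# Session fields and the return shape are assumptions, not the repo's code.
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class _Session:
    controller: Any
    params: dict = field(default_factory=dict)

class _MiniServer:
    def __init__(self) -> None:
        self._sessions: dict[str, _Session] = {}

    def stop_keep(self, job_id: Optional[str] = None) -> dict:
        if job_id is not None:
            # pop() removes and returns the session in one step, so a
            # repeated stop for the same id cannot release the controller twice.
            session = self._sessions.pop(job_id)
            session.controller.release()
            return {"stopped": [job_id]}
        stopped = list(self._sessions)
        for jid in stopped:
            self._sessions.pop(jid).controller.release()
        return {"stopped": stopped}
```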
90-104: Status serialization bug is fixed. The method now correctly excludes the non-serializable controller field by only including params in the response, addressing the critical issue from previous reviews.
111-116: Shutdown handler is appropriately defensive. The broad exception catch is intentional here to avoid noisy errors during interpreter teardown when resources may already be partially cleaned up.
119-137: Request handler correctly implements JSON-RPC dispatch. The function properly routes methods, handles errors, and returns standardized responses. The static analysis suggestions (TRY300/TRY301) are style preferences that don't affect correctness.
Summary by CodeRabbit
New Features
Documentation
Tests