[mcp] feat: add JSON-RPC server, GPU info listing, and contribution guide #59

Merged
Wangmerlyn merged 13 commits into main from mcp-support on Dec 9, 2025

Conversation

@Wangmerlyn Wangmerlyn (Owner) commented Dec 9, 2025

Summary by CodeRabbit

  • New Features

    • Experimental MCP JSON‑RPC endpoint for stdin/stdout GPU session management
    • New CLI command to run the MCP server
    • GPU info utility aggregating CUDA, ROCm, and PyTorch backends
  • Documentation

    • Comprehensive contributing/development guide added and linked in navigation
    • Getting‑started and README updated with MCP endpoint details and examples
    • Credits expanded to note project template origin
  • Tests

    • Added test suites for the MCP endpoint and GPU info retrieval

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai coderabbitai bot (Contributor) commented Dec 9, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds an experimental stdin/stdout JSON‑RPC MCP server and console script, a GPU info aggregation utility (NVML/ROCm/Torch), documentation updates (README, getting‑started, contributing), a mkdocs nav entry, new tests for server and GPU info, and removes pytest marker registrations from conftest.
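
A minimal client sketch of the flow described above (hedged: it assumes only what the walkthrough and tests state, namely that the keep-gpu-mcp-server console script speaks line-delimited JSON-RPC over stdin/stdout with start_keep, stop_keep, status, and list_gpus methods, and that responses expose result fields such as job_id and stopped; the call helper and any other names are illustrative):

# Hedged sketch: drive the MCP server over stdin/stdout from Python.
import json
import subprocess

proc = subprocess.Popen(
    ["keep-gpu-mcp-server"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)


def call(method, params=None, req_id=1):
    """Send one JSON-RPC request line and read one response line."""
    request = {"id": req_id, "method": method, "params": params or {}}
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())


print(call("list_gpus", req_id=1))                        # enumerate visible GPUs
start = call("start_keep", {"gpu_ids": [0]}, req_id=2)    # begin keeping GPU 0
job_id = start["result"]["job_id"]
print(call("status", {"job_id": job_id}, req_id=3))       # inspect the session
print(call("stop_keep", {"job_id": job_id}, req_id=4))    # release the GPU
proc.terminate()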

Changes

Cohort / File(s) Summary
Documentation
README.md, docs/contributing.md, docs/getting-started.md
Added experimental MCP endpoint docs and example JSON‑RPC payloads; comprehensive contributing/development guide; Cookiecutter attribution; MCP notes appear in README and getting-started.
Site Navigation
mkdocs.yml
Added a top‑level Contributing page entry under Project in the site navigation.
MCP Server Implementation
src/keep_gpu/mcp/server.py
New KeepGPUServer and Session dataclass implementing JSON‑RPC handlers (start_keep, stop_keep, status, list_gpus), session lifecycle via controller factory, stdin/stdout IO loop, request dispatching, error handling, and shutdown; public console entrypoint referenced by pyproject.
GPU Information Utility
src/keep_gpu/utilities/gpu_info.py
New get_gpu_info() probing NVML, ROCm, and Torch backends; returns per‑GPU id, platform, name, memory totals/used, and utilization with backend fallbacks and debug logging.
Project Configuration
pyproject.toml
Added console script entry: keep-gpu-mcp-server = "keep_gpu.mcp.server:main".
Test Configuration
tests/conftest.py
Removed pytest marker registrations for rocm and large_memory (collection-skip logic remains).
Test Suites
tests/mcp/test_server.py, tests/utilities/test_gpu_info.py
New tests for KeepGPUServer JSON‑RPC flows (start/stop/status/list) using a dummy controller; tests for get_gpu_info() NVML and ROCm paths using monkeypatch/mocks.
sequenceDiagram
  autonumber
  participant Client as MCP Client (stdin/stdout)
  participant Server as KeepGPUServer
  participant Controller as GlobalGPUController (factory)
  participant Backend as GPU backends (NVML/ROCm/Torch)

  Client->>Server: JSON-RPC request (start_keep / status / stop_keep / list_gpus)
  alt start_keep
    Server->>Controller: controller_factory(...) -> controller instance
    Controller-->>Server: controller started
    Server->>Server: store Session(params, controller)
    Server-->>Client: JSON-RPC response (job_id, status)
  else list_gpus
    Server->>Backend: get_gpu_info()
    Backend-->>Server: GPU list
    Server-->>Client: JSON-RPC response (gpu list)
  else status/stop_keep
    Server->>Server: lookup Session(s)
    Server->>Controller: call stop/release (if stopping)
    Server-->>Client: JSON-RPC response (status or stopped_ids)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Inspect session lifecycle and concurrency in src/keep_gpu/mcp/server.py (job_id uniqueness, session storage, shutdown safety during interpreter teardown).
  • Verify backend initialization/shutdown and exception isolation in src/keep_gpu/utilities/gpu_info.py (pynvml/rocm_smi/torch interactions).
  • Confirm tests in tests/mcp/test_server.py and tests/utilities/test_gpu_info.py properly mock external libs and cover edge cases.
  • Ensure the console script in pyproject.toml matches the implemented main() in src/keep_gpu/mcp/server.py.

"I nibble bytes and hum a tune,
stdin and stdout beneath the moon,
I hold the GPUs snug and warm,
NVML, ROCm, Torch perform,
a rabbit hops — the servers bloom 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 4.88%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately summarizes the main changes: adding a JSON-RPC server (KeepGPUServer), GPU info listing (get_gpu_info), and a contribution guide (docs/contributing.md).
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch mcp-support

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dfcb31f and 999b19b.

📒 Files selected for processing (1)
  • src/keep_gpu/mcp/server.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/keep_gpu/mcp/server.py (4)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)
  • GlobalGPUController (10-82)
src/keep_gpu/utilities/gpu_info.py (1)
  • get_gpu_info (139-158)
src/keep_gpu/utilities/logger.py (1)
  • setup_logger (67-98)
tests/mcp/test_server.py (2)
  • keep (13-14)
  • release (16-17)
🪛 Ruff (0.14.8)
src/keep_gpu/mcp/server.py

52-52: Avoid specifying long messages outside the exception class

(TRY003)


115-115: Do not catch blind exception: Exception

(BLE001)


134-134: Abstract raise to an inner function

(TRY301)


134-134: Avoid specifying long messages outside the exception class

(TRY003)


135-135: Consider moving this statement to an else block

(TRY300)


150-150: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build documentation
  • GitHub Check: Run pre-commit checks
  • GitHub Check: build
🔇 Additional comments (11)
src/keep_gpu/mcp/server.py (11)

1-24: LGTM! Clean module structure and documentation.

The module docstring clearly describes the JSON-RPC interface, imports are minimal and necessary, and logger setup follows project patterns.


27-31: LGTM! Appropriate session state structure.

The Session dataclass cleanly separates the non-serializable controller from the JSON-serializable params, which correctly addresses the serialization issue flagged in previous reviews.


33-41: LGTM! Good use of dependency injection and cleanup registration.

The optional controller_factory enables test mocking (as seen in test_server.py), and atexit registration ensures GPU resources are released on process termination.


42-72: LGTM! Correct session initialization and GPU allocation.

The method properly validates duplicate job_ids, instantiates controllers with the provided parameters, starts the keep operation, and maintains session state. Parameter validation is appropriately delegated to GlobalGPUController.


73-89: LGTM! Correct implementation of single and batch stop operations.

The stop-all logic correctly uses pop() in the iteration to avoid modifying the dictionary during traversal, and each controller's release() method is called to free GPU resources. This implementation addresses the optimization suggested in previous reviews.
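For readers skimming the diff, the pattern being praised is roughly the following sketch (not the actual method body; _sessions, controller.release(), and the stopped result key are taken from the review and tests, everything else is illustrative):

# Sketch of the stop-all pattern: snapshot the keys, pop each session
# before releasing its controller, so the dict is never mutated while
# being traversed.
def stop_all(self) -> dict:
    stopped = []
    for job_id in list(self._sessions.keys()):  # snapshot keys first
        session = self._sessions.pop(job_id)    # remove, then release
        session.controller.release()            # free the held GPU memory
        stopped.append(job_id)
    return {"stopped": stopped}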


90-106: LGTM! Critical serialization issue resolved.

The list-all branch (lines 101-105) now correctly excludes the non-serializable controller by building response dicts with only job_id and params. This fix resolves the TypeError: Object of type GlobalGPUController is not JSON serializable bug flagged by multiple reviewers in previous commits.
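The shape of the fix, as described here, is roughly the following (a sketch; attribute names follow the Session dataclass and the stated job_id/params fields, not the literal source):

# Sketch of the serialization-safe list-all branch: each entry carries
# only job_id and the JSON-serializable params, never the controller.
def status_all(self) -> dict:
    return {
        "active_jobs": [
            {"job_id": jid, "params": sess.params}
            for jid, sess in self._sessions.items()
        ]
    }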


107-111: LGTM! Clean integration with GPU info utility.

The method correctly delegates to get_gpu_info() which aggregates GPU information from NVML, ROCm, or Torch backends, returning JSON-serializable data.


112-118: LGTM! Appropriate defensive cleanup for interpreter teardown.

The broad exception handling is necessary here to prevent noisy errors during process shutdown when Python's runtime state may be partially torn down. The atexit registration in __init__ ensures this cleanup runs automatically.


120-139: LGTM! Robust JSON-RPC request handling with proper error propagation.

The function correctly dispatches to server methods, handles missing/null parameters defensively (line 122), and catches exceptions to return JSON-RPC error responses. The broad exception handling is appropriate for ensuring clients receive error messages rather than silent failures.


141-154: LGTM! Correct JSON-RPC server loop over stdin/stdout.

The main loop properly handles line-by-line JSON requests, flushes responses immediately for low-latency communication, and catches parse errors to maintain server availability. The broad exception handling on line 150 is necessary to prevent the server from crashing on malformed input.
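A condensed sketch of that loop, under the assumptions stated in this review (line-delimited JSON on stdin, flushed responses on stdout); KeepGPUServer and _handle_request are names this PR defines, but the body and error shape below are illustrative rather than the merged code:

# Hedged sketch of the stdin/stdout JSON-RPC loop described above.
import json
import sys

from keep_gpu.mcp.server import KeepGPUServer, _handle_request


def serve() -> None:
    server = KeepGPUServer()
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            request = json.loads(line)
            response = _handle_request(server, request)
        except Exception as exc:  # keep serving on malformed input
            response = {"id": None, "error": {"message": str(exc)}}
        sys.stdout.write(json.dumps(response) + "\n")
        sys.stdout.flush()  # flush each response for low-latency clients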


156-157: LGTM! Standard Python entry point.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist bot (Contributor)

Summary of Changes

Hello @Wangmerlyn, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the KeepGPU project by introducing an experimental JSON-RPC server, enabling external automation clients to manage GPU sessions and query GPU details. Alongside this, a new utility for comprehensive GPU information retrieval has been added, supporting both NVIDIA (CUDA) and AMD (ROCm) platforms. To foster community involvement, a detailed contribution guide has also been integrated into the documentation.

Highlights

  • JSON-RPC Server: An experimental JSON-RPC server (keep-gpu-mcp-server) has been introduced, allowing external clients to programmatically interact with GPU management functionalities like starting/stopping GPU keeping sessions.
  • GPU Information Listing: A new utility has been added to list detailed GPU information, including name, memory usage, and utilization, with support for both NVIDIA (NVML/CUDA) and AMD (ROCm SMI) platforms.
  • Contribution Guide: A comprehensive docs/contributing.md file has been added to provide clear guidelines for developers on setup, testing, linting, and pull request best practices.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a significant new feature: a JSON-RPC server for controlling KeepGPU, along with a utility for querying GPU information. The changes also include comprehensive documentation and tests for the new functionality.

My review has identified a critical bug in the status method of the new server that would cause it to crash due to a serialization error. I've also suggested a minor refactoring for cleaner code in the server and an additional test case to improve coverage and prevent similar bugs in the future.

Overall, this is a great addition to the project. Addressing the identified issues will make the new server robust and reliable.

# stop_keep
stop_req = {"id": 3, "method": "stop_keep", "params": {"job_id": job_id}}
stop_resp = _handle_request(server, stop_req)
assert job_id in stop_resp["result"]["stopped"]

medium

The test suite is missing a case for the status() method when called without a job_id to list all active jobs. Adding such a test would improve coverage and would have helped catch the serialization bug in the status method.

Please consider adding a test case like this:

def test_status_all():
    server = KeepGPUServer(controller_factory=dummy_factory)
    job_a = server.start_keep(gpu_ids=[0])["job_id"]
    job_b = server.start_keep(gpu_ids=[1])["job_id"]

    status = server.status()
    assert "active_jobs" in status
    assert len(status["active_jobs"]) == 2
    
    job_statuses = {job["job_id"]: job for job in status["active_jobs"]}
    assert job_a in job_statuses
    assert job_b in job_statuses
    assert job_statuses[job_a]["params"]["gpu_ids"] == [0]
    assert job_statuses[job_b]["params"]["gpu_ids"] == [1]
    # Check that controller object is not in the response
    assert "controller" not in job_statuses[job_a]

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 101 to 104
        return {
            "active_jobs": [
                {"job_id": jid, **asdict(sess)} for jid, sess in self._sessions.items()
            ]


P1: status without job_id includes non-serializable controller

The list-all branch of status builds each job entry with asdict(sess), which includes the GlobalGPUController instance. When a client calls status without a job_id after any start_keep has run, main() will try to json.dumps the response and raise TypeError: Object of type GlobalGPUController is not JSON serializable, crashing the MCP server instead of returning active jobs.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
tests/mcp/test_server.py (1)

4-78: MCP server tests are solid; consider asserting controller behavior

The tests exercise the public API (start_keep, status, stop_keep, list_gpus) and the JSON-RPC handler in a realistic way, which is great.

If you want slightly stronger guarantees, you could also assert that DummyController.kept and DummyController.released flip as expected when sessions are started/stopped to validate that the server is actually calling keep()/release() on the controller instances, not just mutating internal bookkeeping. This is optional but would make refactors in KeepGPUServer safer.

docs/getting-started.md (1)

48-78: MCP endpoint docs are accurate and consistent

The MCP section correctly reflects the CLI entry (keep-gpu-mcp-server), supported methods, and JSON-RPC usage, and it matches the server implementation/tests.

If you find the MCP section drifting between README and here in future changes, consider centralizing details (e.g., in this doc) and linking from README to reduce duplication, but it’s fine as-is.

tests/utilities/test_gpu_info.py (1)

8-126: GPU info tests are well-targeted; small lint and robustness tweaks possible

The NVML and ROCm tests exercise the intended code paths in get_gpu_info() nicely, including import behavior via sys.modules and the expected fields in the returned dicts.

Two minor follow-ups you might consider:

  • Unused dummy arguments: Static analysis is right that handle in nvmlDeviceGetMemoryInfo / nvmlDeviceGetUtilizationRates / nvmlDeviceGetName and idx in DummyTorchCuda.set_device are unused. Renaming them to _handle / _idx (or just _) will keep Ruff quiet without changing behavior (see the sketch after this list).

  • Optionally assert cleanup for NVML: Since DummyNVML tracks shutdown_calls, you could assert dummy_nvml.shutdown_calls == 1 in test_get_gpu_info_nvml to verify that _query_nvml always shuts down the NVML backend, which is exactly what the production code intends.

These are polish-only; the current tests are functionally correct.
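For the first point, the change is as small as it sounds; a hypothetical before/after on one of the dummy static methods (the DummyNVML name comes from the tests, the body here is illustrative):

# Before: Ruff flags ARG004 because the handle argument is unused.
# After: an underscore-prefixed name tells Ruff the argument is
# intentionally ignored, with no behavior change.
class DummyNVML:
    @staticmethod
    def nvmlDeviceGetName(_handle):
        return b"Dummy GPU"  # pynvml-style bytes name, decoded by the code under test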

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2565dcf and b73e532.

📒 Files selected for processing (10)
  • README.md (1 hunks)
  • docs/contributing.md (1 hunks)
  • docs/getting-started.md (1 hunks)
  • mkdocs.yml (1 hunks)
  • pyproject.toml (1 hunks)
  • src/keep_gpu/mcp/server.py (1 hunks)
  • src/keep_gpu/utilities/gpu_info.py (1 hunks)
  • tests/conftest.py (0 hunks)
  • tests/mcp/test_server.py (1 hunks)
  • tests/utilities/test_gpu_info.py (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/conftest.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/mcp/test_server.py (1)
src/keep_gpu/mcp/server.py (6)
  • KeepGPUServer (33-117)
  • _handle_request (120-138)
  • start_keep (42-71)
  • status (91-105)
  • stop_keep (73-89)
  • list_gpus (107-110)
tests/utilities/test_gpu_info.py (1)
src/keep_gpu/utilities/gpu_info.py (1)
  • get_gpu_info (125-144)
src/keep_gpu/mcp/server.py (3)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)
  • GlobalGPUController (10-82)
src/keep_gpu/utilities/gpu_info.py (1)
  • get_gpu_info (125-144)
src/keep_gpu/utilities/logger.py (1)
  • setup_logger (67-98)
🪛 Ruff (0.14.8)
tests/utilities/test_gpu_info.py

42-42: Unused static method argument: handle

(ARG004)


46-46: Unused static method argument: handle

(ARG004)


50-50: Unused static method argument: handle

(ARG004)


91-91: Unused static method argument: idx

(ARG004)

src/keep_gpu/utilities/gpu_info.py

39-40: try-except-pass detected, consider logging the exception

(S110)


39-39: Do not catch blind exception: Exception

(BLE001)


47-47: Do not catch blind exception: Exception

(BLE001)


60-60: Do not catch blind exception: Exception

(BLE001)


67-67: Do not catch blind exception: Exception

(BLE001)


72-72: Do not catch blind exception: Exception

(BLE001)


88-89: try-except-pass detected, consider logging the exception

(S110)


88-88: Do not catch blind exception: Exception

(BLE001)


104-104: Do not catch blind exception: Exception

(BLE001)


108-108: Do not catch blind exception: Exception

(BLE001)


120-120: Do not catch blind exception: Exception

(BLE001)


134-134: Do not catch blind exception: Exception

(BLE001)


141-141: Do not catch blind exception: Exception

(BLE001)

src/keep_gpu/mcp/server.py

52-52: Avoid specifying long messages outside the exception class

(TRY003)


115-115: Do not catch blind exception: Exception

(BLE001)


134-134: Abstract raise to an inner function

(TRY301)


134-134: Avoid specifying long messages outside the exception class

(TRY003)


135-135: Consider moving this statement to an else block

(TRY300)


150-150: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build
  • GitHub Check: Run pre-commit checks
  • GitHub Check: Build documentation
🔇 Additional comments (13)
docs/contributing.md (1)

1-62: Contributing guide is clear and aligned with the test/doc structure

The setup, test markers, MCP notes, and PR expectations look consistent with the rest of the repo and this PR’s changes. No issues from my side.

mkdocs.yml (1)

26-27: MkDocs nav entry for Contributing looks correct

The new Project > Contributing item points to contributing.md and matches the new doc. Indentation and structure are valid.

README.md (1)

93-119: README MCP section and contributing link look good

The MCP endpoint description (methods, example request, stdio config) matches the implementation, and the new link to docs/contributing.md nicely surfaces the contributor guide. No changes needed.

pyproject.toml (1)

35-38: Entry point keep_gpu.mcp.server:main is properly defined and CLI-safe

The main() function exists at line 141 in src/keep_gpu/mcp/server.py with the correct signature def main() -> None:. It implements a JSON-RPC server that reads from stdin, processes requests, and writes responses to stdout—a standard pattern for console script entry points. The implementation includes proper error handling and output flushing.

src/keep_gpu/mcp/server.py (9)

1-24: LGTM! Clean module structure and imports.

The module docstring clearly explains the JSON-RPC protocol, and the imports are well-organized. Logger setup follows the project's utility patterns.


27-30: LGTM! Simple and appropriate session structure.

The Session dataclass effectively encapsulates the controller and parameters for each active session.


33-40: LGTM! Good use of dependency injection and cleanup registration.

The constructor properly initializes state and registers cleanup via atexit. The controller_factory parameter enables testing with mock controllers.


73-89: LGTM! Proper resource cleanup for both single and bulk stop.

The method correctly handles stopping a specific session or all sessions, properly releasing controller resources in each case.


107-110: LGTM! Clean delegation to GPU info utility.

The method properly wraps get_gpu_info() and returns the result in a consistent format.


112-117: LGTM! Appropriate defensive cleanup for shutdown.

The broad exception catch is justified for the atexit handler to prevent errors during interpreter teardown.


120-138: LGTM! Standard JSON-RPC request handling.

The function properly dispatches methods and returns standardized JSON-RPC responses. The broad exception catch with logging is appropriate for an RPC handler.


141-157: LGTM! Robust stdin/stdout JSON-RPC loop.

The main function correctly implements a line-based JSON-RPC server over stdin/stdout, with proper error handling and output flushing for reliable communication.


42-71: No issue identified with busy_threshold default.

The -1 default is intentional. In the comparison logic (if gpu_utilization > self.busy_threshold), a value of -1 means the condition is always true (any utilization exceeds -1), effectively disabling utilization-based throttling. This is the correct semantic for an API-level parameter: -1 signals "no threshold enforcement," while the underlying GlobalGPUController and individual GPU controllers use 10 as their concrete default when an actual threshold is needed. The design is consistent and correct.
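A tiny illustration of that semantic, using only the comparison quoted above (the function here is illustrative, not the controller's code; utilization is assumed to be a non-negative percentage):

def is_busy(gpu_utilization: float, busy_threshold: float = -1) -> bool:
    # With the -1 default, any non-negative utilization exceeds the
    # threshold, so utilization-based throttling is effectively disabled.
    return gpu_utilization > busy_threshold


assert is_busy(0.0) is True                      # -1 default: always "busy"
assert is_busy(5.0, busy_threshold=10) is False  # concrete threshold re-enables gating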

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/keep_gpu/mcp/server.py (1)

119-137: Consider addressing Ruff style hints (optional).

The implementation is functionally correct. However, Ruff flags a few style issues:

  • Line 133: Long error message specified inline (TRY003)
  • Lines 133-134: Consider restructuring the try-except flow (TRY300, TRY301)

These are minor style suggestions that could improve code consistency with linting rules, but are not critical.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b73e532 and 69c64da.

📒 Files selected for processing (1)
  • src/keep_gpu/mcp/server.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/keep_gpu/mcp/server.py (4)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)
  • GlobalGPUController (10-82)
src/keep_gpu/utilities/gpu_info.py (1)
  • get_gpu_info (125-144)
src/keep_gpu/utilities/logger.py (1)
  • setup_logger (67-98)
tests/mcp/test_server.py (2)
  • keep (13-14)
  • release (16-17)
🪛 Ruff (0.14.8)
src/keep_gpu/mcp/server.py

52-52: Avoid specifying long messages outside the exception class

(TRY003)


114-114: Do not catch blind exception: Exception

(BLE001)


133-133: Abstract raise to an inner function

(TRY301)


133-133: Avoid specifying long messages outside the exception class

(TRY003)


134-134: Consider moving this statement to an else block

(TRY300)


149-149: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build
  • GitHub Check: Build documentation
  • GitHub Check: Run pre-commit checks
🔇 Additional comments (1)
src/keep_gpu/mcp/server.py (1)

100-104: LGTM: Correctly excludes non-serializable controller.

The implementation correctly returns only the serializable params field for each session, avoiding the controller object. This fixes the critical JSON serialization issue flagged in previous reviews.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/mcp/test_server.py (1)

4-21: Consider asserting controller lifecycle state.

DummyController tracks kept and released flags, but no test verifies them. Adding assertions would confirm the controller's keep() and release() methods are actually invoked.

Example assertion in test_start_status_stop_cycle:

# After start_keep
session = server._sessions[job_id]
assert session.controller.kept is True

# After stop_keep
assert session.controller.released is True
src/keep_gpu/mcp/server.py (1)

140-152: Main loop correctly implements JSON-RPC over stdin/stdout.

The line-by-line processing with immediate flush is appropriate for an MCP-style server. Consider adding signal handling (SIGINT/SIGTERM) for graceful shutdown in production use.
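One possible shape for that signal-handling suggestion, sketched with the standard library (the shutdown hook name is hypothetical; it stands in for whatever cleanup the server's atexit handler performs):

# Illustrative only: install SIGINT/SIGTERM handlers that run the server's
# cleanup path before exiting, instead of relying solely on atexit.
import signal
import sys


def install_signal_handlers(server) -> None:
    def _handle(signum, frame):
        server.shutdown()  # hypothetical hook for releasing held GPU sessions
        sys.exit(0)

    signal.signal(signal.SIGINT, _handle)
    signal.signal(signal.SIGTERM, _handle)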

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69c64da and dfcb31f.

📒 Files selected for processing (3)
  • src/keep_gpu/mcp/server.py (1 hunks)
  • src/keep_gpu/utilities/gpu_info.py (1 hunks)
  • tests/mcp/test_server.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/mcp/test_server.py (1)
src/keep_gpu/mcp/server.py (6)
  • KeepGPUServer (33-116)
  • _handle_request (119-137)
  • start_keep (42-71)
  • status (90-104)
  • stop_keep (73-88)
  • list_gpus (106-109)
src/keep_gpu/mcp/server.py (4)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)
  • GlobalGPUController (10-82)
src/keep_gpu/utilities/gpu_info.py (1)
  • get_gpu_info (139-158)
src/keep_gpu/utilities/logger.py (1)
  • setup_logger (67-98)
tests/mcp/test_server.py (2)
  • keep (13-14)
  • release (16-17)
🪛 GitHub Actions: pre-commit
src/keep_gpu/mcp/server.py

[error] 99-100: Black formatting check failed in pre-commit. 1 file reformatted (src/keep_gpu/mcp/server.py); please re-run 'pre-commit run --all-files' to apply formatting.

🪛 Ruff (0.14.8)
src/keep_gpu/mcp/server.py

52-52: Avoid specifying long messages outside the exception class

(TRY003)


114-114: Do not catch blind exception: Exception

(BLE001)


133-133: Abstract raise to an inner function

(TRY301)


133-133: Avoid specifying long messages outside the exception class

(TRY003)


134-134: Consider moving this statement to an else block

(TRY300)


149-149: Do not catch blind exception: Exception

(BLE001)

src/keep_gpu/utilities/gpu_info.py

39-40: try-except-pass detected, consider logging the exception

(S110)


39-39: Do not catch blind exception: Exception

(BLE001)


47-47: Do not catch blind exception: Exception

(BLE001)


63-63: Do not catch blind exception: Exception

(BLE001)


70-70: Do not catch blind exception: Exception

(BLE001)


75-75: Do not catch blind exception: Exception

(BLE001)


92-93: try-except-pass detected, consider logging the exception

(S110)


92-92: Do not catch blind exception: Exception

(BLE001)


96-97: try-except-pass detected, consider logging the exception

(S110)


96-96: Do not catch blind exception: Exception

(BLE001)


113-113: Do not catch blind exception: Exception

(BLE001)


117-117: Do not catch blind exception: Exception

(BLE001)


129-129: Do not catch blind exception: Exception

(BLE001)


134-135: try-except-pass detected, consider logging the exception

(S110)


134-134: Do not catch blind exception: Exception

(BLE001)


148-148: Do not catch blind exception: Exception

(BLE001)


155-155: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (10)
tests/mcp/test_server.py (1)

81-95: Good addition of test_status_all per previous feedback.

This test case properly validates the status() method when called without a job_id, covering the serialization of active jobs and ensuring the controller field is excluded from the response.

src/keep_gpu/utilities/gpu_info.py (4)

44-98: Device state restoration is properly implemented.

The _query_rocm function now correctly captures current_device before iteration and restores it in the finally block, preventing side effects on global CUDA state. This addresses the concern from the previous review.
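The save/restore pattern being approved looks roughly like this sketch (illustrative, not the PR's _query_rocm; torch.cuda.mem_get_info is one way to read memory on recent PyTorch builds and may differ from what the utility actually calls):

# Sketch: remember the active device, probe each GPU, always restore.
import torch


def probe_devices() -> list:
    results = []
    current_device = torch.cuda.current_device()  # save global state
    try:
        for idx in range(torch.cuda.device_count()):
            torch.cuda.set_device(idx)
            free, total = torch.cuda.mem_get_info()
            results.append({"id": idx, "memory_free": free, "memory_total": total})
    finally:
        torch.cuda.set_device(current_device)  # restore even on errors
    return results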


101-136: LGTM - Torch query function properly preserves device state.

The function captures current_device at line 105 and restores it in the finally block, ensuring no side effects on global CUDA device selection.


12-41: NVML query is correctly implemented.

The function properly initializes/shuts down NVML in a try/finally block and handles the bytes-to-string conversion for device names. Since it doesn't use torch.cuda.set_device, no device restoration is needed.


139-158: Well-structured fallback chain for GPU info retrieval.

The priority order (NVML → ROCm → Torch) is sensible, with NVML providing the most detailed information. Debug-level logging for failures is appropriate since these are expected on systems without specific GPU backends.
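The fallback shape described above can be expressed generically; this is an illustration of the idea, not the utility's private functions:

import logging

logger = logging.getLogger(__name__)


def first_available(backends) -> list:
    """backends: (name, zero-arg callable) pairs, in priority order."""
    for backend_name, query in backends:
        try:
            gpus = query()
            if gpus:
                return gpus
        except Exception as exc:
            # Expected on machines without this backend installed.
            logger.debug("%s backend unavailable: %s", backend_name, exc)
    return []

# e.g. first_available([("nvml", query_nvml), ("rocm", query_rocm), ("torch", query_torch)])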

src/keep_gpu/mcp/server.py (5)

42-71: LGTM - Session creation is well implemented.

The method correctly handles optional job_id, checks for duplicates, and properly initializes the controller before storing the session.


73-88: Stop logic is correctly implemented.

The method properly handles both single-job and all-jobs cases, using pop() to atomically remove and retrieve sessions before releasing them.


90-104: Status serialization bug is fixed.

The method now correctly excludes the non-serializable controller field by only including params in the response, addressing the critical issue from previous reviews.


111-116: Shutdown handler is appropriately defensive.

The broad exception catch is intentional here to avoid noisy errors during interpreter teardown when resources may already be partially cleaned up.


119-137: Request handler correctly implements JSON-RPC dispatch.

The function properly routes methods, handles errors, and returns standardized responses. The static analysis suggestions (TRY300/TRY301) are style preferences that don't affect correctness.

@Wangmerlyn Wangmerlyn merged commit 1dd2b1f into main Dec 9, 2025
5 checks passed