20 changes: 20 additions & 0 deletions README.md
@@ -90,9 +90,29 @@ with GlobalGPUController(gpu_ids=[0, 1], vram_to_keep="750MB", interval=90, busy
- ROCm-only tests carry `@pytest.mark.rocm`; run with `pytest --run-rocm tests/rocm_controller`.
- Markers: `rocm` (needs ROCm stack) and `large_memory` (opt-in locally).

### MCP endpoint (experimental)

- Start a simple JSON-RPC server on stdin/stdout:
```bash
keep-gpu-mcp-server
```
- Example request (one per line; response shapes are shown below):
```json
{"id": 1, "method": "start_keep", "params": {"gpu_ids": [0], "vram": "512MB", "interval": 60, "busy_threshold": 20}}
```
- Methods: `start_keep`, `stop_keep` (optional `job_id`; omitting it stops all jobs), `status` (optional `job_id`), `list_gpus` (basic info).
- Minimal client config (stdio MCP):
```yaml
servers:
keepgpu:
command: ["keep-gpu-mcp-server"]
adapter: stdio
```
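- Responses are single JSON lines that echo the request `id`; a successful `start_keep` returns the generated `job_id` (shown as a placeholder below), and a failure carries an `error` object instead:
```json
{"id": 1, "result": {"job_id": "<generated-uuid>"}}
{"id": 2, "error": {"message": "Unknown method: foo"}}
```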

## Contributing

Contributions are welcome—especially around ROCm support, platform fallbacks, and scheduler-specific recipes. Open an issue or PR if you hit edge cases on your cluster.
See `docs/contributing.md` for dev setup, test commands, and PR tips.

## Credits

62 changes: 62 additions & 0 deletions docs/contributing.md
@@ -0,0 +1,62 @@
# Contributing & Development

Thanks for helping improve KeepGPU! This page collects the key commands and
expectations so you can get productive quickly and avoid surprises in CI.

## Setup

- Clone and install dev extras:
```bash
git clone https://github.com/Wangmerlyn/KeepGPU.git
cd KeepGPU
pip install -e ".[dev]" # add .[rocm] if you need ROCm SMI
```
- Ensure you have the right torch build for your platform (CUDA/ROCm/CPU).
- Optional: install `nvidia-ml-py` (CUDA) or `rocm-smi` (ROCm) for telemetry.
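- Optional sanity check that the installed torch build sees your GPU (ROCm wheels also report through the `torch.cuda` API); a minimal sketch:
```python
import torch

# True on working CUDA/ROCm builds; False usually means a CPU-only wheel
print(torch.cuda.is_available())
# CUDA builds populate torch.version.cuda; ROCm builds populate torch.version.hip
print(torch.version.cuda or getattr(torch.version, "hip", None))
```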

## Tests

- Fast CUDA suite:
```bash
pytest tests/cuda_controller tests/global_controller \
tests/utilities/test_platform_manager.py tests/test_cli_thresholds.py
```
- ROCm-only tests are marked `rocm` and skipped by default; run with:
```bash
pytest --run-rocm tests/rocm_controller
```
- MCP + utilities:
```bash
pytest tests/mcp tests/utilities/test_gpu_info.py
```
- All tests honor markers `rocm` and `large_memory`; avoid enabling
`large_memory` in CI.
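- For reference, a ROCm-gated test follows this shape (the test name is illustrative):
```python
import pytest


# Collected everywhere, but skipped unless pytest is invoked with --run-rocm
@pytest.mark.rocm
def test_rocm_stack_smoke():
    ...
```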

## Lint/format

- Run pre-commit hooks locally before pushing:
```bash
pre-commit run --all-files
```

## MCP server (experimental)

- Start: `keep-gpu-mcp-server` (stdin/stdout JSON-RPC)
- Methods: `start_keep`, `stop_keep`, `status`, `list_gpus`
- Example request:
```json
{"id":1,"method":"start_keep","params":{"gpu_ids":[0],"vram":"512MB","interval":60,"busy_threshold":20}}
```
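- Example response (the `job_id` is generated server-side; shown as a placeholder):
```json
{"id":1,"result":{"job_id":"<generated-uuid>"}}
```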

## Pull requests

- Keep changesets focused; small commits are welcome.
- Add/adjust tests for new behavior; gate GPU-specific tests behind markers so CI can skip them.
- Update docs/README when behavior or interfaces change.
- Stick to the existing style (Typer CLI, Rich logging) and keep code paths
simple—avoid over-engineering.

## Support

- Issues/PRs: https://github.com/Wangmerlyn/KeepGPU
- Code of Conduct: see `CODE_OF_CONDUCT.rst`
32 changes: 32 additions & 0 deletions docs/getting-started.md
@@ -45,6 +45,38 @@ understand the minimum knobs you need to keep a GPU occupied.
- Fast CUDA checks: `pytest tests/cuda_controller tests/global_controller tests/utilities/test_platform_manager.py tests/test_cli_thresholds.py`
- ROCm-only tests are marked `rocm`; run with `pytest --run-rocm tests/rocm_controller`.

## MCP endpoint (experimental)

For automation clients that speak JSON-RPC (MCP-style), KeepGPU ships a tiny
stdin/stdout server:

```bash
keep-gpu-mcp-server
# each request is a single JSON line; example:
echo '{"id":1,"method":"start_keep","params":{"gpu_ids":[0],"vram":"512MB","interval":60,"busy_threshold":20}}' | keep-gpu-mcp-server
```

Supported methods:
- `start_keep(gpu_ids?, vram?, interval?, busy_threshold?, job_id?)`
- `status(job_id?)`
- `stop_keep(job_id?)` (no job_id stops all)
- `list_gpus()` (basic info)
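
For example, a `status` call with no `job_id` lists every active job, and a bare
`stop_keep` then tears them all down. A sketch of that exchange, with the
server-generated UUID shown as a placeholder:

```json
{"id": 2, "method": "status"}
{"id": 2, "result": {"active_jobs": [{"job_id": "<uuid>", "params": {"gpu_ids": [0], "vram": "512MB", "interval": 60, "busy_threshold": 20}}]}}
{"id": 3, "method": "stop_keep"}
{"id": 3, "result": {"stopped": ["<uuid>"]}}
```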

### Example MCP client config (stdio)

If your agent expects an MCP server definition, a minimal stdio config looks like:

```yaml
servers:
keepgpu:
description: "KeepGPU MCP server"
command: ["keep-gpu-mcp-server"]
adapter: stdio
```

Tools exposed: `start_keep`, `stop_keep`, `status`, `list_gpus`. Each request is
a single JSON line; see above for an example payload.

=== "Editable dev install"
```bash
git clone https://github.com/Wangmerlyn/KeepGPU.git
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -23,6 +23,8 @@ nav:
- Reference:
- CLI Reference: reference/cli.md
- API Reference: reference/api.md
- Project:
- Contributing: contributing.md

plugins:
- search
1 change: 1 addition & 0 deletions pyproject.toml
@@ -34,6 +34,7 @@ dependencies = [

[project.scripts]
keep-gpu = "keep_gpu.cli:app"
keep-gpu-mcp-server = "keep_gpu.mcp.server:main"

[project.optional-dependencies]
dev = [
157 changes: 157 additions & 0 deletions src/keep_gpu/mcp/server.py
@@ -0,0 +1,157 @@
"""
Minimal MCP-style JSON-RPC server for KeepGPU.

The server reads JSON lines from stdin and writes JSON responses to stdout.
Supported methods:
- start_keep(gpu_ids, vram, interval, busy_threshold, job_id)
- stop_keep(job_id=None) # None stops all
- status(job_id=None) # None lists all
- list_gpus() # basic per-GPU info
"""

from __future__ import annotations

import atexit
import json
import sys
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

from keep_gpu.global_gpu_controller.global_gpu_controller import GlobalGPUController
from keep_gpu.utilities.gpu_info import get_gpu_info
from keep_gpu.utilities.logger import setup_logger

logger = setup_logger(__name__)


@dataclass
class Session:
controller: GlobalGPUController
params: Dict[str, Any]


class KeepGPUServer:
def __init__(
self,
controller_factory: Optional[Callable[..., GlobalGPUController]] = None,
) -> None:
self._sessions: Dict[str, Session] = {}
self._controller_factory = controller_factory or GlobalGPUController
atexit.register(self.shutdown)

def start_keep(
self,
gpu_ids: Optional[List[int]] = None,
vram: str = "1GiB",
interval: int = 300,
busy_threshold: int = -1,
job_id: Optional[str] = None,
) -> Dict[str, Any]:
job_id = job_id or str(uuid.uuid4())
if job_id in self._sessions:
raise ValueError(f"job_id {job_id} already exists")

controller = self._controller_factory(
gpu_ids=gpu_ids,
interval=interval,
vram_to_keep=vram,
busy_threshold=busy_threshold,
)
controller.keep()
self._sessions[job_id] = Session(
controller=controller,
params={
"gpu_ids": gpu_ids,
"vram": vram,
"interval": interval,
"busy_threshold": busy_threshold,
},
)
logger.info("Started keep session %s on GPUs %s", job_id, gpu_ids)
return {"job_id": job_id}

def stop_keep(self, job_id: Optional[str] = None) -> Dict[str, Any]:
if job_id:
session = self._sessions.pop(job_id, None)
if session:
session.controller.release()
logger.info("Stopped keep session %s", job_id)
return {"stopped": [job_id]}
return {"stopped": [], "message": "job_id not found"}

stopped_ids = list(self._sessions.keys())
for job_id in stopped_ids:
session = self._sessions.pop(job_id)
session.controller.release()
if stopped_ids:
logger.info("Stopped sessions: %s", stopped_ids)
return {"stopped": stopped_ids}

def status(self, job_id: Optional[str] = None) -> Dict[str, Any]:
if job_id:
session = self._sessions.get(job_id)
if not session:
return {"active": False, "job_id": job_id}
return {
"active": True,
"job_id": job_id,
"params": session.params,
}
return {
"active_jobs": [
{"job_id": jid, "params": sess.params}
for jid, sess in self._sessions.items()
]
Comment on lines 100 to 104

P1: status without job_id includes a non-serializable controller

The list-all branch of status builds each job entry with asdict(sess), which includes the GlobalGPUController instance. When a client calls status without a job_id after any start_keep has run, main() will try to json.dumps the response and raise TypeError: Object of type GlobalGPUController is not JSON serializable, crashing the MCP server instead of returning the active jobs.
}

def list_gpus(self) -> Dict[str, Any]:
"""Return detailed GPU info (id, name, memory, utilization)."""
infos = get_gpu_info()
return {"gpus": infos}

def shutdown(self) -> None:
try:
self.stop_keep(None)
except Exception: # pragma: no cover - defensive
# Avoid noisy errors during interpreter teardown
return


def _handle_request(server: KeepGPUServer, payload: Dict[str, Any]) -> Dict[str, Any]:
method = payload.get("method")
params = payload.get("params", {}) or {}
req_id = payload.get("id")
try:
if method == "start_keep":
result = server.start_keep(**params)
elif method == "stop_keep":
result = server.stop_keep(**params)
elif method == "status":
result = server.status(**params)
elif method == "list_gpus":
result = server.list_gpus()
else:
raise ValueError(f"Unknown method: {method}")
return {"id": req_id, "result": result}
except Exception as exc: # pragma: no cover - defensive
logger.exception("Request failed")
return {"id": req_id, "error": {"message": str(exc)}}


def main() -> None:
server = KeepGPUServer()
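    # Newline-delimited JSON: read one request per line, write one response per line.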
for line in sys.stdin:
line = line.strip()
if not line:
continue
try:
payload = json.loads(line)
response = _handle_request(server, payload)
except Exception as exc:
response = {"error": {"message": str(exc)}}
sys.stdout.write(json.dumps(response) + "\n")
sys.stdout.flush()


if __name__ == "__main__":
main()