Add a Torc MCP server #59
Pull request overview
This PR adds an MCP (Model Context Protocol) server for Torc that enables AI assistants like Claude and GitHub Copilot to interact with workflows, plus an AI-powered failure recovery system via the torc watch command.
Key changes:
- New torc-mcp-server binary exposing workflow management tools via MCP
- New torc watch command for AI-driven autonomous failure recovery using Claude
- Failure pattern caching system to reduce API costs
- Comprehensive documentation for both features
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| torc-mcp-server/src/tools.rs | Implements 9 MCP tools for workflow operations (status, logs, restart, etc.) |
| torc-mcp-server/src/server.rs | MCP server setup with tool registration and parameter definitions |
| torc-mcp-server/src/main.rs | Binary entry point with stdio transport for MCP communication |
| torc-mcp-server/Cargo.toml | Dependencies for the MCP server package |
| src/client/watch/watcher.rs | Main watch loop monitoring workflows and handling failures |
| src/client/watch/recovery.rs | Recovery action execution (restart, adjust resources, cancel, skip) |
| src/client/watch/claude_client.rs | Claude API integration for failure diagnosis |
| src/client/watch/failure_cache.rs | SQLite-based caching of failure patterns and diagnoses |
| src/client/watch/audit.rs | JSON Lines audit logging of all AI decisions |
| src/main.rs | Integration of Watch command into CLI |
| src/cli.rs | Watch command argument parsing |
| src/config/client.rs | Configuration structure for watch settings |
| docs/src/tutorials/mcp-server.md | Tutorial for setting up and using the MCP server |
| docs/src/tutorials/ai-failure-recovery.md | Tutorial for autonomous failure recovery |
| Cargo.toml | Added torc-mcp-server workspace member and rmcp/schemars dependencies |
    let response = default_api::list_jobs(
        &self.config,
        self.workflow_id,
        Some(JobStatus::Failed),
        None, // needs_file_id
        None, // upstream_job_id
        None, // offset
        Some(10000), // limit
Copilot AI, Dec 22, 2025
The hardcoded limit of 10000 jobs may silently truncate results. This appears in multiple functions that list jobs. Consider documenting this limitation or implementing pagination.
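One way to act on the pagination suggestion is sketched below. It is a minimal, self-contained illustration rather than Torc's actual client code: `fetch_all` and the `fetch_page(offset, limit)` closure are hypothetical stand-ins for a paged call such as `list_jobs`, and the sketch assumes the server returns a short page once the results are exhausted.

```rust
/// Collect every item by requesting fixed-size pages until a short page arrives.
/// `fetch_page(offset, limit)` is a hypothetical stand-in for a paged API call.
fn fetch_all<T, E>(
    page_size: usize,
    mut fetch_page: impl FnMut(usize, usize) -> Result<Vec<T>, E>,
) -> Result<Vec<T>, E> {
    assert!(page_size > 0, "page_size must be positive");
    let mut all = Vec::new();
    let mut offset = 0;
    loop {
        let page = fetch_page(offset, page_size)?;
        let received = page.len();
        all.extend(page);
        // A short (or empty) page means there is nothing left to fetch.
        if received < page_size {
            return Ok(all);
        }
        offset += received;
    }
}

fn main() {
    // Simulated backend holding 25 failed job IDs.
    let jobs: Vec<u32> = (0..25).collect();
    let result: Result<Vec<u32>, String> = fetch_all(10, |offset, limit| {
        Ok(jobs.iter().skip(offset).take(limit).copied().collect())
    });
    let fetched = result.expect("the mock fetch cannot fail");
    println!("fetched {} jobs", fetched.len());
}
```

Looping until a short page is returned avoids any fixed upper bound, at the cost of one extra request when the total is an exact multiple of the page size.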
0d04280 to 5776767
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.
    match default_api::manage_status_change(
        config,
        *job_id,
        JobStatus::Canceled,
        0, // run_id
Copilot AI, Dec 22, 2025
In the cancel_jobs function, the run_id is hardcoded to 0. This should likely be the actual run_id for the job. Consider fetching the current run_id from the workflow or job status, or document why 0 is the correct value here.
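The lookup-then-update order described in this comment could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `WorkflowApi` trait, its method names, the `Job` struct, and `MockApi` are hypothetical and do not mirror the real Torc client API.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified job record; the field name is an assumption.
#[derive(Debug, Clone)]
struct Job {
    workflow_id: i64,
}

/// Hypothetical client interface standing in for the real API calls.
trait WorkflowApi {
    fn get_job(&self, job_id: i64) -> Result<Job, String>;
    fn current_run_id(&self, workflow_id: i64) -> Result<i64, String>;
    fn set_canceled(&mut self, job_id: i64, run_id: i64) -> Result<(), String>;
}

/// Cancel a job using the workflow's current run_id instead of a hardcoded 0.
fn cancel_job<A: WorkflowApi>(api: &mut A, job_id: i64) -> Result<(), String> {
    let job = api.get_job(job_id)?;
    let run_id = api.current_run_id(job.workflow_id)?;
    api.set_canceled(job_id, run_id)
}

/// In-memory mock used only to exercise the sketch.
struct MockApi {
    jobs: HashMap<i64, Job>,
    run_ids: HashMap<i64, i64>, // workflow_id -> current run_id
    canceled: Vec<(i64, i64)>,  // recorded (job_id, run_id) pairs
}

impl WorkflowApi for MockApi {
    fn get_job(&self, job_id: i64) -> Result<Job, String> {
        self.jobs
            .get(&job_id)
            .cloned()
            .ok_or_else(|| format!("job {job_id} not found"))
    }
    fn current_run_id(&self, workflow_id: i64) -> Result<i64, String> {
        self.run_ids
            .get(&workflow_id)
            .copied()
            .ok_or_else(|| format!("workflow {workflow_id} not found"))
    }
    fn set_canceled(&mut self, job_id: i64, run_id: i64) -> Result<(), String> {
        self.canceled.push((job_id, run_id));
        Ok(())
    }
}

fn main() {
    let mut api = MockApi {
        jobs: HashMap::from([(7, Job { workflow_id: 1 })]),
        run_ids: HashMap::from([(1, 3)]),
        canceled: Vec::new(),
    };
    cancel_job(&mut api, 7).expect("job 7 exists in the mock");
    println!("{:?}", api.canceled);
}
```

Keeping the lookup inside one helper means every caller that cancels a job gets the same run_id handling, and a missing job or workflow surfaces as an error before any status change is sent.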
    // Reset job status to uninitialized
    default_api::manage_status_change(
        config,
        job_id,
        JobStatus::Uninitialized,
        0, // run_id
        None,
    )
    .map_err(|e| format!("Failed to reset job status: {}", e))?;

    // Get workflow ID to reinitialize
    let job =
        default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
Copilot AI, Dec 22, 2025
The hardcoded run_id of 0 in manage_status_change may not be correct. When resetting a job to Uninitialized or setting to Canceled, the run_id should typically be the current run_id of the workflow. Consider fetching the workflow status first to get the correct run_id.
Suggested change:
    - // Reset job status to uninitialized
    - default_api::manage_status_change(
    -     config,
    -     job_id,
    -     JobStatus::Uninitialized,
    -     0, // run_id
    -     None,
    - )
    - .map_err(|e| format!("Failed to reset job status: {}", e))?;
    - // Get workflow ID to reinitialize
    - let job =
    -     default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
    + // Get job details to obtain the correct run_id and workflow_id
    + let job =
    +     default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
    + // Reset job status to uninitialized for the current run
    + default_api::manage_status_change(
    +     config,
    +     job_id,
    +     JobStatus::Uninitialized,
    +     job.run_id, // use the job's current run_id
    +     None,
    + )
    + .map_err(|e| format!("Failed to reset job status: {}", e))?;
    default_api::manage_status_change(
        config,
        job_id,
        JobStatus::Canceled,
        0, // run_id
Copilot AI, Dec 22, 2025
The hardcoded run_id of 0 in cancel_job may not be correct. Consider fetching the current workflow run_id before calling manage_status_change to ensure the correct run is canceled.
Suggested change:
    - default_api::manage_status_change(
    -     config,
    -     job_id,
    -     JobStatus::Canceled,
    -     0, // run_id
    + // Get the job to find workflow_id
    + let job =
    +     default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
    + // Get the workflow status to find the current run_id
    + let workflow_status = default_api::get_workflow_status(config, job.workflow_id)
    +     .map_err(|e| format!("Failed to get workflow status: {}", e))?;
    + default_api::manage_status_change(
    +     config,
    +     job_id,
    +     JobStatus::Canceled,
    +     workflow_status.run_id,
    let spec_json = params.spec_json;
    let user = params
        .user
        .unwrap_or_else(|| std::env::var("USER").unwrap_or_else(|_| "unknown".to_string()));
Copilot AI, Dec 22, 2025
Using unwrap_or_else with a closure that always returns the same string is unnecessary. Use unwrap_or("unknown".to_string()) instead, since the fallback is constant and doesn't depend on the error.
Suggested change:
    - .unwrap_or_else(|| std::env::var("USER").unwrap_or_else(|_| "unknown".to_string()));
    + .unwrap_or_else(|| std::env::var("USER").unwrap_or("unknown".to_string()));
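The two fallback styles discussed here can be compared with the small sketch below. The function names `resolve_user_lazy` and `resolve_user_eager` are hypothetical. Both return the same value; the difference is that `unwrap_or` evaluates its argument eagerly, so the fallback `String` is allocated even when `USER` is set, while `unwrap_or_else` builds it only on error. Note also that `std::env::var` returns `Result<String, VarError>`, so the fallback must be a `String`; a bare `"unknown"` string slice would not type-check.

```rust
use std::env;

/// Lazy fallback: the "unknown" String is only built if USER is missing.
fn resolve_user_lazy(explicit: Option<String>) -> String {
    explicit.unwrap_or_else(|| env::var("USER").unwrap_or_else(|_| "unknown".to_string()))
}

/// Eager fallback: the "unknown" String is built before USER is checked.
fn resolve_user_eager(explicit: Option<String>) -> String {
    explicit.unwrap_or_else(|| env::var("USER").unwrap_or("unknown".to_string()))
}

fn main() {
    // An explicit value always wins, regardless of the environment.
    println!("{}", resolve_user_lazy(Some("alice".to_string())));
    // Without an explicit value, both variants agree on the result.
    println!("{}", resolve_user_lazy(None) == resolve_user_eager(None));
}
```

For a one-off lookup like this the cost difference is negligible, so the choice is mostly about readability.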
torc watch CLI command to automatically recover from errors like OOM, walltime timeouts, etc. Fixes Workflow watchdog command #55