Add a Torc MCP server #59

daniel-thom · 2025-12-22T20:01:02Z

Add an MCP server so that users can interact with the torc server through AI assistants such as Claude Code and Copilot.
Add a torc watch CLI command to automatically recover from errors such as OOM kills and walltime timeouts. Fixes Workflow watchdog command #55

Test the watch command on an HPC

Copilot

Pull request overview

This PR adds an MCP (Model Context Protocol) server for Torc that enables AI assistants like Claude and GitHub Copilot to interact with workflows, plus an AI-powered failure recovery system via the torc watch command.

Key changes:

New torc-mcp-server binary exposing workflow management tools via MCP
New torc watch command for AI-driven autonomous failure recovery using Claude
Failure pattern caching system to reduce API costs
Comprehensive documentation for both features
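
The failure pattern cache listed above can be pictured with a small sketch. This is not the PR's implementation (the file summary describes that one as SQLite-based); it is an in-memory illustration that assumes failures are keyed by a normalized error message. `failure_signature`, `DiagnosisCache`, and `get_or_diagnose` are hypothetical names, and the `diagnose` closure stands in for an API request to the model.

```rust
use std::collections::HashMap;

/// Reduces an error message to a coarse signature by replacing every run of
/// ASCII digits with `#`, so messages that differ only in job ids or sizes
/// share one cache key. (Hypothetical normalization, not taken from the PR.)
fn failure_signature(message: &str) -> String {
    let mut signature = String::with_capacity(message.len());
    let mut in_digits = false;
    for ch in message.chars() {
        if ch.is_ascii_digit() {
            if !in_digits {
                signature.push('#');
            }
            in_digits = true;
        } else {
            signature.push(ch);
            in_digits = false;
        }
    }
    signature
}

/// In-memory cache of diagnoses keyed by failure signature (hypothetical).
struct DiagnosisCache {
    entries: HashMap<String, String>,
    misses: usize,
}

impl DiagnosisCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), misses: 0 }
    }

    /// Returns the cached diagnosis, or calls `diagnose` (standing in for an
    /// expensive API request) once and stores its result.
    fn get_or_diagnose<F>(&mut self, message: &str, diagnose: F) -> String
    where
        F: FnOnce(&str) -> String,
    {
        let key = failure_signature(message);
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone();
        }
        self.misses += 1;
        let diagnosis = diagnose(message);
        self.entries.insert(key, diagnosis.clone());
        diagnosis
    }
}
```

With this sketch, two out-of-memory messages that differ only in job id and memory size resolve to the same signature, so only the first one triggers the `diagnose` call.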

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.


| File | Description |
| --- | --- |
| `torc-mcp-server/src/tools.rs` | Implements 9 MCP tools for workflow operations (status, logs, restart, etc.) |
| `torc-mcp-server/src/server.rs` | MCP server setup with tool registration and parameter definitions |
| `torc-mcp-server/src/main.rs` | Binary entry point with stdio transport for MCP communication |
| `torc-mcp-server/Cargo.toml` | Dependencies for the MCP server package |
| `src/client/watch/watcher.rs` | Main watch loop monitoring workflows and handling failures |
| `src/client/watch/recovery.rs` | Recovery action execution (restart, adjust resources, cancel, skip) |
| `src/client/watch/claude_client.rs` | Claude API integration for failure diagnosis |
| `src/client/watch/failure_cache.rs` | SQLite-based caching of failure patterns and diagnoses |
| `src/client/watch/audit.rs` | JSON Lines audit logging of all AI decisions |
| `src/main.rs` | Integration of Watch command into CLI |
| `src/cli.rs` | Watch command argument parsing |
| `src/config/client.rs` | Configuration structure for watch settings |
| `docs/src/tutorials/mcp-server.md` | Tutorial for setting up and using the MCP server |
| `docs/src/tutorials/ai-failure-recovery.md` | Tutorial for autonomous failure recovery |
| `Cargo.toml` | Added torc-mcp-server workspace member and rmcp/schemars dependencies |


Copilot · 2025-12-22T20:06:07Z

src/client/watch/watcher.rs

+        let response = default_api::list_jobs(
+            &self.config,
+            self.workflow_id,
+            Some(JobStatus::Failed),
+            None,        // needs_file_id
+            None,        // upstream_job_id
+            None,        // offset
+            Some(10000), // limit


The hardcoded limit of 10000 jobs may silently truncate results. This appears in multiple functions that list jobs. Consider documenting this limitation or implementing pagination.
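
The pagination option mentioned above could look roughly like the following sketch. It assumes the listing call accepts an offset and a limit, as the `// offset` and `// limit` arguments in the snippet suggest. `fetch_all` and the `fetch_page` closure are hypothetical names; the closure stands in for a call such as `default_api::list_jobs`, whose full signature and response type are not shown in this excerpt.

```rust
/// Collects every item from a paginated source by requesting fixed-size pages
/// until a short (or empty) page signals the end.
///
/// `fetch_page(offset, limit)` is a hypothetical stand-in for the real
/// listing call; it must return at most `limit` items starting at `offset`.
fn fetch_all<T, E, F>(page_size: usize, mut fetch_page: F) -> Result<Vec<T>, E>
where
    F: FnMut(usize, usize) -> Result<Vec<T>, E>,
{
    assert!(page_size > 0, "page_size must be positive");
    let mut all = Vec::new();
    let mut offset = 0;
    loop {
        let page = fetch_page(offset, page_size)?;
        let received = page.len();
        all.extend(page);
        // A page shorter than requested means the source is exhausted.
        if received < page_size {
            return Ok(all);
        }
        offset += received;
    }
}
```

A caller would wrap the real API call in the closure and map its response into a `Vec` of jobs. The loop stops on the first short page, so a result set whose size is an exact multiple of `page_size` costs one extra, empty request.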


Copilot

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.


Copilot · 2025-12-22T23:20:53Z

torc-mcp-server/src/tools.rs

+        match default_api::manage_status_change(
+            config,
+            *job_id,
+            JobStatus::Canceled,
+            0, // run_id


In the cancel_jobs function, the run_id is hardcoded to 0. This should likely be the actual run_id for the job. Consider fetching the current run_id from the workflow or job status, or document why 0 is the correct value here.

Copilot · 2025-12-22T23:20:53Z

src/client/watch/recovery.rs

+    // Reset job status to uninitialized
+    default_api::manage_status_change(
+        config,
+        job_id,
+        JobStatus::Uninitialized,
+        0, // run_id
+        None,
+    )
+    .map_err(|e| format!("Failed to reset job status: {}", e))?;
+
+    // Get workflow ID to reinitialize
+    let job =
+        default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
+


The hardcoded run_id of 0 in manage_status_change may not be correct. When resetting a job to Uninitialized or setting to Canceled, the run_id should typically be the current run_id of the workflow. Consider fetching the workflow status first to get the correct run_id.

Suggested change

-    // Reset job status to uninitialized
-    default_api::manage_status_change(
-        config,
-        job_id,
-        JobStatus::Uninitialized,
-        0, // run_id
-        None,
-    )
-    .map_err(|e| format!("Failed to reset job status: {}", e))?;
-
-    // Get workflow ID to reinitialize
-    let job =
-        default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
+    // Get job details to obtain the correct run_id and workflow_id
+    let job =
+        default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
+
+    // Reset job status to uninitialized for the current run
+    default_api::manage_status_change(
+        config,
+        job_id,
+        JobStatus::Uninitialized,
+        job.run_id, // use the job's current run_id
+        None,
+    )
+    .map_err(|e| format!("Failed to reset job status: {}", e))?;

Copilot · 2025-12-22T23:20:53Z

src/client/watch/recovery.rs

+    default_api::manage_status_change(
+        config,
+        job_id,
+        JobStatus::Canceled,
+        0, // run_id


The hardcoded run_id of 0 in cancel_job may not be correct. Consider fetching the current workflow run_id before calling manage_status_change to ensure the correct run is canceled.

Suggested change

-    default_api::manage_status_change(
-        config,
-        job_id,
-        JobStatus::Canceled,
-        0, // run_id
+    // Get the job to find workflow_id
+    let job =
+        default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
+
+    // Get the workflow status to find the current run_id
+    let workflow_status = default_api::get_workflow_status(config, job.workflow_id)
+        .map_err(|e| format!("Failed to get workflow status: {}", e))?;
+
+    default_api::manage_status_change(
+        config,
+        job_id,
+        JobStatus::Canceled,
+        workflow_status.run_id,

Copilot · 2025-12-22T23:20:53Z

torc-mcp-server/src/server.rs

+        let spec_json = params.spec_json;
+        let user = params
+            .user
+            .unwrap_or_else(|| std::env::var("USER").unwrap_or_else(|_| "unknown".to_string()));


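The inner unwrap_or_else closure ignores its error argument and always returns the same constant string. It can be shortened to unwrap_or("unknown".to_string()). Note that unwrap_or evaluates its argument eagerly, so the fallback string is allocated even when USER is set; this makes the change a readability tweak rather than an efficiency gain.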

Suggested change

-            .unwrap_or_else(|| std::env::var("USER").unwrap_or_else(|_| "unknown".to_string()));
+            .unwrap_or_else(|| std::env::var("USER").unwrap_or("unknown".to_string()));
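
The difference between the two forms in this suggestion can be seen in a small standalone sketch. It is not code from the PR: `fallback`, `eager`, and `lazy` are hypothetical helpers, and a `Cell` counter records how often the fallback string is built. In Rust, the argument to `unwrap_or` is evaluated before the call, while the closure passed to `unwrap_or_else` runs only on the `Err` path.

```rust
use std::cell::Cell;

/// Builds the fallback string and records that it was built.
fn fallback(counter: &Cell<u32>) -> String {
    counter.set(counter.get() + 1);
    "unknown".to_string()
}

/// Eager: the argument to `unwrap_or` is evaluated before the call,
/// so the fallback is built even when `value` is `Ok`.
fn eager(value: Result<String, ()>, counter: &Cell<u32>) -> String {
    value.unwrap_or(fallback(counter))
}

/// Lazy: the closure runs only when `value` is `Err`.
fn lazy(value: Result<String, ()>, counter: &Cell<u32>) -> String {
    value.unwrap_or_else(|_| fallback(counter))
}
```

Calling `eager` with an `Ok` value still increments the counter, whereas `lazy` leaves it untouched until an `Err` is passed. For a one-off lookup of `USER` the cost of a single small allocation is negligible either way, so the choice here is mostly stylistic.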

daniel-thom requested a review from Copilot December 22, 2025 20:01

Copilot started reviewing on behalf of daniel-thom December 22, 2025 20:01

Copilot AI reviewed Dec 22, 2025

Add a Torc MCP server

5776767

daniel-thom force-pushed the feat/mcp-server branch from 0d04280 to 5776767 Compare December 22, 2025 23:17

daniel-thom requested a review from Copilot December 22, 2025 23:18

Copilot started reviewing on behalf of daniel-thom December 22, 2025 23:18

Copilot AI reviewed Dec 22, 2025

daniel-thom marked this pull request as draft December 22, 2025 23:23

Reduce job limit to 10,000

96686c6

daniel-thom marked this pull request as ready for review December 24, 2025 21:44

daniel-thom merged commit 1bb35c2 into main Dec 24, 2025
1 of 2 checks passed

daniel-thom deleted the feat/mcp-server branch December 24, 2025 21:44


2 participants
