@daniel-thom
Collaborator

@daniel-thom daniel-thom commented Dec 22, 2025

  1. Add an MCP server so that users can interact with the torc server through AI assistants such as Claude Code and Copilot.
  2. Add a torc watch CLI command to automatically recover from failures such as OOM errors and walltime timeouts. Fixes Workflow watchdog command #55
  • Test the watch command on an HPC

Contributor

Copilot AI left a comment

Pull request overview

This PR adds an MCP (Model Context Protocol) server for Torc that enables AI assistants like Claude and GitHub Copilot to interact with workflows, plus an AI-powered failure recovery system via the torc watch command.

Key changes:

  • New torc-mcp-server binary exposing workflow management tools via MCP
  • New torc watch command for AI-driven autonomous failure recovery using Claude
  • Failure pattern caching system to reduce API costs
  • Comprehensive documentation for both features

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| torc-mcp-server/src/tools.rs | Implements 9 MCP tools for workflow operations (status, logs, restart, etc.) |
| torc-mcp-server/src/server.rs | MCP server setup with tool registration and parameter definitions |
| torc-mcp-server/src/main.rs | Binary entry point with stdio transport for MCP communication |
| torc-mcp-server/Cargo.toml | Dependencies for the MCP server package |
| src/client/watch/watcher.rs | Main watch loop monitoring workflows and handling failures |
| src/client/watch/recovery.rs | Recovery action execution (restart, adjust resources, cancel, skip) |
| src/client/watch/claude_client.rs | Claude API integration for failure diagnosis |
| src/client/watch/failure_cache.rs | SQLite-based caching of failure patterns and diagnoses |
| src/client/watch/audit.rs | JSON Lines audit logging of all AI decisions |
| src/main.rs | Integration of Watch command into CLI |
| src/cli.rs | Watch command argument parsing |
| src/config/client.rs | Configuration structure for watch settings |
| docs/src/tutorials/mcp-server.md | Tutorial for setting up and using the MCP server |
| docs/src/tutorials/ai-failure-recovery.md | Tutorial for autonomous failure recovery |
| Cargo.toml | Added torc-mcp-server workspace member and rmcp/schemars dependencies |

Comment on lines +154 to +161
let response = default_api::list_jobs(
    &self.config,
    self.workflow_id,
    Some(JobStatus::Failed),
    None, // needs_file_id
    None, // upstream_job_id
    None, // offset
    Some(10000), // limit

Copilot AI Dec 22, 2025

The hardcoded limit of 10000 jobs may silently truncate results. This appears in multiple functions that list jobs. Consider documenting this limitation or implementing pagination.
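
One way to avoid silent truncation is to page through results until a short page comes back. The sketch below is a minimal illustration, not torc's API: `fetch_all` and the `fetch_page(offset, limit)` closure are hypothetical names standing in for a list call such as list_jobs, and it assumes the server returns at most `limit` items per request and an empty or short page once the data is exhausted.

```rust
/// Collects every item by requesting pages of `page_size` until a page
/// shorter than `page_size` is returned.
///
/// `fetch_page(offset, limit)` is a hypothetical stand-in for a paginated
/// list call; it is not part of the torc client.
fn fetch_all<T, E, F>(page_size: usize, mut fetch_page: F) -> Result<Vec<T>, E>
where
    F: FnMut(usize, usize) -> Result<Vec<T>, E>,
{
    assert!(page_size > 0, "page_size must be positive");
    let mut all = Vec::new();
    loop {
        // The number of items collected so far doubles as the next offset.
        let page = fetch_page(all.len(), page_size)?;
        let received = page.len();
        all.extend(page);
        // A short (or empty) page means there is nothing left to fetch.
        if received < page_size {
            return Ok(all);
        }
    }
}
```

With a loop of this shape the per-request limit becomes a tuning knob rather than a cap on the total number of jobs returned. Any error from a page request is propagated immediately by the `?` operator.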

Copilot uses AI. Check for mistakes.
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.


Comment on lines +432 to +436
match default_api::manage_status_change(
    config,
    *job_id,
    JobStatus::Canceled,
    0, // run_id

Copilot AI Dec 22, 2025

In the cancel_jobs function, the run_id is hardcoded to 0. This should likely be the actual run_id for the job. Consider fetching the current run_id from the workflow or job status, or document why 0 is the correct value here.

Comment on lines +56 to +69
// Reset job status to uninitialized
default_api::manage_status_change(
    config,
    job_id,
    JobStatus::Uninitialized,
    0, // run_id
    None,
)
.map_err(|e| format!("Failed to reset job status: {}", e))?;

// Get workflow ID to reinitialize
let job =
    default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;


Copilot AI Dec 22, 2025

The hardcoded run_id of 0 in manage_status_change may not be correct. When resetting a job to Uninitialized or setting to Canceled, the run_id should typically be the current run_id of the workflow. Consider fetching the workflow status first to get the correct run_id.

Suggested change
- // Reset job status to uninitialized
- default_api::manage_status_change(
-     config,
-     job_id,
-     JobStatus::Uninitialized,
-     0, // run_id
-     None,
- )
- .map_err(|e| format!("Failed to reset job status: {}", e))?;
- // Get workflow ID to reinitialize
- let job =
-     default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
+ // Get job details to obtain the correct run_id and workflow_id
+ let job =
+     default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
+ // Reset job status to uninitialized for the current run
+ default_api::manage_status_change(
+     config,
+     job_id,
+     JobStatus::Uninitialized,
+     job.run_id, // use the job's current run_id
+     None,
+ )
+ .map_err(|e| format!("Failed to reset job status: {}", e))?;
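
The ordering proposed here (look up the job first, then pass its run_id to the status change) can be sketched with stand-in types. Everything below is hypothetical: the `JobApi` trait, the `Job` struct, and `FakeApi` are illustrative only, and the real client signatures may differ, including whether the job record exposes a `run_id` field at all.

```rust
// Hypothetical stand-ins for the client types; not the real torc API.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobStatus {
    Uninitialized,
    Canceled,
}

struct Job {
    workflow_id: i64,
    run_id: i64,
}

trait JobApi {
    fn get_job(&self, job_id: i64) -> Result<Job, String>;
    fn manage_status_change(
        &mut self,
        job_id: i64,
        status: JobStatus,
        run_id: i64,
    ) -> Result<(), String>;
}

/// Resets a job for its current run and returns the owning workflow ID.
fn reset_job<A: JobApi>(api: &mut A, job_id: i64) -> Result<i64, String> {
    // Fetch the job first so the status change can carry its run_id.
    let job = api
        .get_job(job_id)
        .map_err(|e| format!("Failed to get job: {}", e))?;
    api.manage_status_change(job_id, JobStatus::Uninitialized, job.run_id)
        .map_err(|e| format!("Failed to reset job status: {}", e))?;
    Ok(job.workflow_id)
}

/// In-memory fake that records every status change it receives.
struct FakeApi {
    workflow_id: i64,
    run_id: i64,
    calls: Vec<(i64, JobStatus, i64)>,
}

impl JobApi for FakeApi {
    fn get_job(&self, _job_id: i64) -> Result<Job, String> {
        Ok(Job { workflow_id: self.workflow_id, run_id: self.run_id })
    }

    fn manage_status_change(
        &mut self,
        job_id: i64,
        status: JobStatus,
        run_id: i64,
    ) -> Result<(), String> {
        self.calls.push((job_id, status, run_id));
        Ok(())
    }
}
```

Recording calls in a fake like this makes it straightforward to check in a unit test that the run_id sent with the status change is the one read from the job rather than a hardcoded 0.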

Comment on lines +142 to +146
default_api::manage_status_change(
    config,
    job_id,
    JobStatus::Canceled,
    0, // run_id

Copilot AI Dec 22, 2025

The hardcoded run_id of 0 in cancel_job may not be correct. Consider fetching the current workflow run_id before calling manage_status_change to ensure the correct run is canceled.

Suggested change
- default_api::manage_status_change(
-     config,
-     job_id,
-     JobStatus::Canceled,
-     0, // run_id
+ // Get the job to find workflow_id
+ let job =
+     default_api::get_job(config, job_id).map_err(|e| format!("Failed to get job: {}", e))?;
+ // Get the workflow status to find the current run_id
+ let workflow_status = default_api::get_workflow_status(config, job.workflow_id)
+     .map_err(|e| format!("Failed to get workflow status: {}", e))?;
+ default_api::manage_status_change(
+     config,
+     job_id,
+     JobStatus::Canceled,
+     workflow_status.run_id,

let spec_json = params.spec_json;
let user = params
    .user
    .unwrap_or_else(|| std::env::var("USER").unwrap_or_else(|_| "unknown".to_string()));

Copilot AI Dec 22, 2025

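The closure passed to the inner unwrap_or_else always returns the same constant string, so it can be written more compactly as unwrap_or("unknown".to_string()). Note that the argument must be a String (a bare "unknown" would not type-check against Result<String, VarError>), and that unwrap_or builds the fallback eagerly even when USER is set, so the change is a matter of brevity rather than efficiency.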

Suggested change
- .unwrap_or_else(|| std::env::var("USER").unwrap_or_else(|_| "unknown".to_string()));
+ .unwrap_or_else(|| std::env::var("USER").unwrap_or("unknown".to_string()));
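
For context on the trade-off between the two forms: in Rust, the argument to unwrap_or is evaluated before the call, while the closure given to unwrap_or_else runs only when the fallback is needed. The sketch below illustrates this with a counter; `fallback`, `user_lazy`, and `user_eager` are hypothetical helper names, not code from this PR.

```rust
use std::cell::Cell;

/// Builds the fallback user name and counts how often it is constructed.
fn fallback(counter: &Cell<u32>) -> String {
    counter.set(counter.get() + 1);
    "unknown".to_string()
}

/// Lazy form: the fallback is built only when `explicit` is `None`.
fn user_lazy(explicit: Option<String>, counter: &Cell<u32>) -> String {
    explicit.unwrap_or_else(|| fallback(counter))
}

/// Eager form: the fallback is built on every call, even if it is unused.
fn user_eager(explicit: Option<String>, counter: &Cell<u32>) -> String {
    explicit.unwrap_or(fallback(counter))
}
```

Calling `user_eager` with a `Some` value still increments the counter, whereas `user_lazy` does not. For a short string allocated once per request the cost is negligible either way, so the choice here is mostly about readability.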

@daniel-thom daniel-thom marked this pull request as draft December 22, 2025 23:23
@daniel-thom daniel-thom marked this pull request as ready for review December 24, 2025 21:44
@daniel-thom daniel-thom merged commit 1bb35c2 into main Dec 24, 2025
1 of 2 checks passed
@daniel-thom daniel-thom deleted the feat/mcp-server branch December 24, 2025 21:44

Development

Successfully merging this pull request may close these issues.

Workflow watchdog command

2 participants