AI Agents Security Documentation

Overview

The agents system implements a comprehensive security model designed to prevent unauthorized use, prompt injection attacks, and malicious code insertion. The system uses multiple layers of defense including user authentication, keyword-based command triggers, and real-time commit validation during PR processing.

Core Security Principles

Zero Trust by Default: No action is taken without explicit authorization
Defense in Depth: Multiple security layers that work independently
Audit Trail: All actions are logged with user attribution
Fail Secure: Any security failure results in no action taken
Real-time Validation: Continuous security checks during execution

Security Features

1. Multi-Layer Security Implementation

Our agents implement defense-in-depth with multiple security layers:

Workflow-Level Security (First Layer)

GitHub Actions if conditions prevent workflows from running for unauthorized users
Fail-fast security checks terminate workflows immediately for unauthorized access
Minimal GITHUB_TOKEN permissions following principle of least privilege

Application-Level Security (Second Layer)

Allow List Based Authorization: Only specific GitHub usernames can trigger agent actions
Rate Limiting: Prevents abuse with configurable request limits per user
Repository Validation: Restricts agents to specific repositories
Comprehensive Security Checks: All layers validated before any action

2. Agent Admins

The agent admins list is configured in .agents.yaml under the security.agent_admins field. Only these users can trigger agent actions via [Action][Agent] keywords. The repository owner (extracted from GITHUB_REPOSITORY environment variable) is always included automatically.

Agent Admin Capabilities

Agent admins have the following special capabilities:

Trigger Agent Actions: Use [Action][Agent] keywords to invoke agents
Extend Iteration Limits: Use [CONTINUE] to allow more agent iterations (see below)
Authoritative Comments: Their comments are treated as authoritative in review contexts

3. Agent Iteration Limits and Continue

To prevent infinite loops where agents repeatedly try to fix the same issues, the system tracks iteration counts based on PR comments. Each agent type has its own independent counter.

How Iteration Tracking Works

Comment-Based Tracking: Iterations are counted by parsing PR comments with agent metadata markers
Separate Counters: The review-fix agent (responds to AI reviews) and failure-fix agent (responds to CI failures) have independent iteration counts
Max Iterations: By default, each agent is limited to 5 iterations before pausing
Metadata Format: Comments include  for tracking

The `[CONTINUE]` Command

Agent admins can extend an agent's iteration limit by posting a comment containing [CONTINUE]:

[CONTINUE]

I've reviewed the progress. Let the agent continue working on this.

How it works:

Each [CONTINUE] adds the base limit to the effective max
Base limit is 5, so: 1x [CONTINUE] = max 10, 2x [CONTINUE] = max 15, etc.
The iteration count itself is not reset - it keeps incrementing

Key Details:

Case-insensitive: [CONTINUE], [continue], and [Continue] all work
Admin-only: Only users in security.agent_admins can extend limits
Cumulative: Multiple [CONTINUE] comments stack (each adds 5 more iterations)
Per-PR: The count applies to the entire PR comment history

Example Scenario

PR validation runs, agent tries to fix issues (iterations 1-5)
Agent hits max iterations (5), posts "Iteration Limit Reached" message
Human reviews progress and decides the agent should continue
Admin comments: [CONTINUE] Making good progress, keep going
Effective max is now 10 - agent can run iterations 6-10
If needed, another [CONTINUE] would extend to 15, and so on

Configuration

# In .agents.yaml
automation:
  max_auto_fix_iterations: 5  # Base limit per agent (extended by [CONTINUE])

4. Keyword Trigger System - Command and Control

Agents are controlled exclusively through a keyword trigger system that requires explicit commands from authorized users. This prevents accidental activation and provides clear audit trails.

Trigger Format

The trigger format is: [Action][Agent]

Security Properties:

Case-insensitive matching for user convenience
Must be exact format with square brackets
Only the most recent trigger is processed
Invalid triggers are ignored (fail secure)

Supported Actions

[Approved] - Approve and process the issue/PR (includes fix and implement requests)
[Review] - Review and address feedback
[Close] - Close the issue/PR
[Summarize] - Provide a summary
[Debug] - Debug the issue

Supported Agents

[Claude] - Claude Code agent
[Gemini] - Gemini CLI agent
[OpenCode] - Open-source coding AI
[Crush] - Charm Bracelet Crush AI shell assistant

Examples

[Approved][Claude] - Have Claude process the issue/PR
[Approved][OpenCode] - Have OpenCode implement or fix the request
[Review][Gemini] - Have Gemini review and address PR feedback
[Summarize][Claude] - Have Claude summarize the discussion

Security Flow

User Action: An allowed user comments with [Action][Agent]
Authentication: System verifies user is in agent_admins
Authorization: System checks rate limits and repository permissions
Validation: System ensures trigger is on latest commit (for PRs)
Execution: Agent performs requested action
Audit: All actions logged with full context

5. Configuration

Security settings are configured in .agents.yaml:

security:
  # Users authorized to trigger agent actions via [Approved][Agent] keywords
  # CRITICAL: Only add trusted human users - these can execute code via agents
  agent_admins:
    - AndrewAltimit           # Repository owner

  # Trusted sources for comment context (used in PR reviews)
  # Comments from these accounts are marked as trusted when providing context to AI
  # This does NOT grant them ability to trigger agent actions
  trusted_sources:
    - AndrewAltimit           # Repository owner
    - github-actions[bot]     # GitHub Actions bot
    - dependabot[bot]         # Dependabot

Configuration Options:

agent_admins: Array of GitHub usernames authorized to trigger agent actions (humans only)
trusted_sources: Array of accounts whose comments are trusted for context (includes bots)
log_violations: Whether to log security violations
reject_message: Custom message shown to unauthorized users
rate_limit_window_minutes: Time window for rate limiting (default: 60)
rate_limit_max_requests: Maximum requests per window (default: 10)
allowed_repositories: Array of allowed repositories (empty = all repos from owner)

6. Environment Variables

You can also set the allow list via environment variable:

export AI_AGENT_ALLOWED_USERS="user1,user2,bot-name[bot]"

You can also set allowed repositories via environment variable:

export AI_AGENT_ALLOWED_REPOS="owner/repo1,owner/repo2"

7. Security Manager (Rust CLI)

The security functionality is implemented in Rust and available via the github-agents CLI:

# Check if user is allowed
github-agents security check-user --username "AndrewAltimit"

# Check if action is allowed
github-agents security check-action --action "issue_approved"

# Validate PR commit hasn't changed
github-agents security validate-pr-commit --pr 123 --expected-sha "abc1234"

# Parse trigger from comment
github-agents security parse-trigger --comment "[Approved][Claude]"

Human Oversight and Safety Training

Effective security requires trained human oversight. All team members working with agents should complete the AI Safety Training Guide to understand:

Key Training Topics

Hidden Capabilities: AI systems may develop or hide capabilities that only emerge under specific conditions
Deceptive Alignment: Systems may behave safely during testing but activate harmful behaviors when deployed
Scalable Oversight: Techniques for managing AI systems that exceed human capabilities in specific domains
Specification Gaming: How AI systems optimize exactly what we measure, not what we intend
Trust Frameworks: Building appropriate trust levels without over-reliance on AI decisions

Human-in-the-Loop Requirements

The security system enforces human oversight through:

Keyword Triggers: Explicit human approval required for all actions
Allow List: Only authorized users can trigger agent actions
Commit Validation: Human approval tied to specific code states
Emergency Procedures: Clear protocols for suspected misalignment

Advanced Security Features

1. Commit-Level Validation for Pull Requests

The PR monitoring system implements sophisticated commit-level security to prevent code injection attacks during the review and modification process.

The Threat Model

Without commit validation, an attacker could:

Create an innocent-looking PR
Wait for approval from an authorized user
Push malicious code after approval but before AI processing
Have the AI agent unknowingly work on and push malicious code

Our Multi-Stage Defense

Stage 1 - Approval Commit Tracking

When [Approved][Claude] is issued, the system records the exact commit SHA
This creates an immutable "point-in-time" snapshot of what was approved
The approval is cryptographically tied to the repository state

Stage 2 - Pre-Execution Validation

Prevents any work if the PR has changed since approval
Immediate failure with clear security message

Stage 3 - Pre-Push Validation

Final check before any code enters the repository
Drops all work if PR was modified during processing
Prevents race conditions and TOCTOU attacks

2. Automatic Secret Masking via gh-validator

The system implements real-time secret masking through gh-validator, a Rust-based GitHub CLI wrapper that validates and sanitizes all GitHub comments before they are posted. This is a deterministic, automatic process that ensures secrets can never appear in public comments.

Architecture

Agent gh command -> gh-validator (shadows gh) -> Secret Masking -> Real gh CLI -> GitHub

How It Works

The gh-validator binary is installed as gh in a higher-priority PATH directory (e.g., ~/.local/bin/gh), shadowing the real GitHub CLI. When any gh command runs:

Pass-through for non-content commands: Commands like gh pr list execute immediately
Validation for content commands: Commands with --body, --body-file, --title, etc. are validated:
- Secrets are masked based on .secrets.yaml configuration
- Unicode emojis are blocked (may display as corrupted characters)
- Formatting is validated for reaction images
- URLs in --body-file are verified to exist (with SSRF protection)
Execution: After validation, the real gh binary is called with (potentially modified) arguments

Central Configuration (`.secrets.yaml`)

environment_variables:
  - GITHUB_TOKEN
  - OPENROUTER_API_KEY
  - DB_PASSWORD

patterns:
  - name: GITHUB_TOKEN
    pattern: "ghp_[A-Za-z0-9_]{36,}"

auto_detection:
  enabled: true
  include_patterns: ["*_TOKEN", "*_SECRET", "*_KEY"]
  exclude_patterns: ["PUBLIC_*"]

The validator searches for .secrets.yaml in:

Current working directory (and parent directories up to git root)
Binary directory (and parent directories)
~/.secrets.yaml
~/.config/gh-validator/.secrets.yaml

Pattern Matching

Common secret formats are detected and masked:

GitHub tokens: ghp_*, ghs_*, github_pat_*
API keys: sk-*, pk-*
JWT tokens: eyJ*
Bearer tokens
URLs with embedded credentials
Private key blocks

Installation

# Quick install (recommended)
curl -sSL https://raw.githubusercontent.com/AndrewAltimit/template-repo/main/tools/rust/gh-validator/install.sh | bash

# Ensure ~/.local/bin comes before /usr/bin in PATH
export PATH="$HOME/.local/bin:$PATH"

Benefits

Universal: Works with all agents and automation tools using gh CLI
Automatic: No agent configuration required - just install and forget
Fail-Closed: If configuration is missing or URLs can't be verified, commands are blocked
SSRF Protection: Only whitelisted hostnames allowed for reaction images
Single Binary: No runtime dependencies, fast startup, cross-platform support
Transparent: Agents are unaware of masking (only stderr notification)

See tools/rust/gh-validator/README.md for complete documentation.

Both git-guard and gh-validator are hardened against bypass through the Wrapper Guard system, which relocates real binaries behind group-restricted permissions and provides structured audit logging. See Wrapper Guard Documentation for the full security model.

3. Deduplication and State Management

The agents use a sophisticated deduplication system to prevent duplicate processing and ensure each issue/PR is only handled once per trigger.

How Deduplication Works

Comment-Based State Tracking
- Every agent action results in a comment with the [Agent] tag
- These comments serve as persistent "claims" on issues/PRs
- Before processing, agents check for existing claims

Deduplication Flow

New Issue/PR Event
    |
Time Filter (last 24 hours) <- Deterministic pre-filter
    |
Has [Action][Agent] trigger? <- Only process explicit requests
    |
Security checks passed?
    |
Has [Agent] comment? <- THE KEY CHECK
    |
No? -> Process & Post Comment (stake claim)
Yes? -> Skip (already claimed)

Implementation Details
- Uses has_agent_comment() to check for ANY comment containing [Agent]
- If found, skips processing entirely
- Simple but effective for preventing duplicate processing

GitHub Token Management

Token Configuration Methods

1. GitHub Actions (Recommended)

The workflows use GitHub Environments for secure secret management:

jobs:
  monitor-issues:
    environment: production  # Uses environment secrets
    steps:
      - name: Run agent
        env:
          GITHUB_TOKEN: ${{ secrets.AGENT_TOKEN }}

Setup Required:

Go to Settings -> Environments -> New environment
Create a "production" environment
Add secret: AGENT_TOKEN (your GitHub PAT)
Add variable: ENABLE_AGENTS = true (to enable the feature)
Configure protection rules as needed

See GitHub Environments Setup Guide for detailed instructions.

2. Local Development

For local testing:

# Option 1: Use environment variable
export GITHUB_TOKEN="your-token-here"
github-agents issue-monitor

# Option 2: Use gh CLI authentication (recommended)
gh auth login
github-agents issue-monitor

GitHub Token Permissions

The agents require a fine-grained Personal Access Token with exactly these permissions:

Permission	Access Level	Why It's Needed
Actions	Read	View workflow runs and logs
Commit statuses	Read	Check CI/CD status on PRs
Contents	Read + Write	Clone repo, create branches, push commits
Issues	Read + Write	Read issues, post comments
Pull requests	Read + Write	Read PRs, create PRs, post comments

Important: Do NOT grant any Account permissions - only Repository permissions are needed.

Token Rotation

Rotate tokens every 90 days
Use GitHub's token expiration feature
Monitor token usage in GitHub Settings

Sleeper Agents Detection System

The repository includes an advanced Sleeper Agents System for identifying potential backdoors and hidden behaviors in AI models:

What It Detects

Backdoor Triggers: Hidden activation patterns that cause unexpected behavior
Deceptive Alignment: Models pretending to be aligned during testing
Goal Misgeneralization: Models pursuing different objectives than trained
Hidden Capabilities: Abilities that only emerge under specific conditions

Detection Methods

Residual Stream Analysis: Using TransformerLens to examine internal model activations
Attention Pattern Analysis: Identifying suspicious attention head behaviors
Layer-wise Probing: Detecting hidden representations across model layers
Behavioral Testing: Comprehensive test suites for various attack scenarios

Usage

# Run sleeper agents tests in CI/CD
docker compose run --rm sleeper-eval-cpu python -m packages.sleeper_agents.cli evaluate \
  --model "gpt2" --test-suite "robustness"

See the Sleeper Agents Documentation for detailed usage instructions.

Security Configuration Best Practices

Keep Allow List Minimal: Only add trusted users and bots
Review Regularly: Periodically audit the allow list
Monitor Logs: Check for security violations in agent logs
Never Disable: Keep security enabled in production
Use Bot Accounts: Create dedicated bot accounts for automation

Security Incident Response

If a security incident occurs:

Immediate: Disable agents via environment variable
Investigate: Check logs for unauthorized attempts
Remediate: Remove compromised users from allow list
Document: Record incident details
Improve: Update security measures based on findings

Autonomous Mode for CI/CD

All agents are configured to run in fully autonomous mode for CI/CD environments. This is a critical requirement for automated workflows.

Why Autonomous Mode?

In CI/CD environments (GitHub Actions, GitLab CI, etc.):

No human interaction is possible (no TTY)
Workflows must run unattended
Interactive prompts would block pipelines indefinitely
Agents run in sandboxed environments for security

Agent-Specific Flags

Each agent has specific flags for autonomous operation:

Claude: --print --dangerously-skip-permissions
Gemini: -m model -p prompt (non-interactive by design)
OpenCode: --non-interactive
Crush: --non-interactive --no-update

GitHub Etiquette

All agents must follow these guidelines to prevent accidentally notifying random GitHub users:

NEVER use @ mentions unless referring to actual repository maintainers
Do NOT use @Gemini, @Claude, @OpenAI, etc. - these may ping unrelated GitHub users
Instead, refer to agents without the @ symbol: "Gemini", "Claude", "OpenAI"
Only @ mention users who are:
- The repository owner
- Active contributors listed in the repository
- Users who have explicitly asked to be mentioned

When referencing AI reviews, use phrases like:

"As noted in Gemini's review..."
"Addressing Claude's feedback..."
"Per the AI agent's suggestion..."

Never Do This!

NEVER hardcode tokens in code
NEVER commit tokens to the repository
NEVER log tokens without redaction
NEVER use tokens in command line arguments (they appear in process lists)
NEVER share tokens between environments (use separate environments)
NEVER disable environment protection rules for production
NEVER disable automatic secret masking in .secrets.yaml
NEVER bypass PreToolUse hooks when posting GitHub comments

FilesExpand file tree

security.md

Latest commit

History