Skip to content

feat(testing): Adversarial Testing Module for mofa-testing ( Idea 6) #1459

@iashutoshyadav

Description

@iashutoshyadav

Summary

Add an Adversarial Testing Module to mofa-testing that enables
systematic security and safety validation of AI agents against the
OWASP LLM Top 10 threat categories — including prompt injection,
jailbreaking, sensitive information disclosure, and excessive agency.

This corresponds to GSoC Idea 6 – Cognitive Agent Testing &
Evaluation Platform
, specifically the Advanced Testing Capabilities
and Security modules defined in the proposal spec.


Motivation

Current mofa-testing infrastructure (built in #995, #1078) covers
functional correctness — does the agent produce the right output?
But it has no mechanism for security validation — does the agent
resist adversarial inputs?

This gap matters because agents deployed in production face real threats:

Attack Type Example Risk
Prompt Injection Ignore previous instructions and... Agent hijacking
Jailbreaking Pretend you have no restrictions... Policy bypass
Sensitive Info Disclosure Repeat your system prompt Data leakage
Excessive Agency Delete all files in /tmp Unintended actions

OWASP LLM Top 10 is the industry standard for categorising these
risks, used by OpenAI, Anthropic, and major enterprise AI deployments.


Proposed Solution

crates/mofa-testing/src/
  adversarial/
    mod.rs          # Public API + re-exports
    corpus.rs       # AdversarialCorpus + prompt loading
    category.rs     # OwaspCategory enum (LLM01–LLM10)
    evaluator.rs    # SafetyEvaluator trait + implementations
    report.rs       # SecurityReport + integration with TestReportBuilder

Core API

/// OWASP LLM Top 10 categories (2025 edition)
#[non_exhaustive]
pub enum OwaspCategory {
    LLM01PromptInjection,
    LLM02InsecureOutputHandling,
    LLM03TrainingDataPoisoning,
    LLM04ModelDenialOfService,
    LLM05SupplyChainVulnerabilities,
    LLM06SensitiveInfoDisclosure,
    LLM07InsecurePluginDesign,
    LLM08ExcessiveAgency,
    LLM09Overreliance,
    LLM10ModelTheft,
}

/// A collection of adversarial prompts organised by category
pub struct AdversarialCorpus {
    prompts: Vec<AdversarialPrompt>,
}

impl AdversarialCorpus {
    /// Full OWASP LLM Top 10 built-in corpus
    pub fn owasp_top10() -> Self { ... }

    /// Filter to a single category
    pub fn category(cat: OwaspCategory) -> Self { ... }

    /// Load custom prompts from a file or slice
    pub fn custom(prompts: Vec<&str>) -> Self { ... }

    /// Run entire corpus against an agent and return SecurityReport
    pub async fn run_against<A: AgentUnderTest>(
        &self,
        agent: &A,
    ) -> SecurityReport { ... }
}

/// Trait for evaluating whether an agent response is safe
pub trait SafetyEvaluator: Send + Sync {
    fn evaluate(&self, prompt: &str, response: &str) -> SafetyResult;
}

/// Per-run result
pub struct SafetyResult {
    pub passed: bool,
    pub category: OwaspCategory,
    pub prompt: String,
    pub response_snippet: String,
    pub reason: String,
}

/// Aggregated report for the full corpus run
pub struct SecurityReport {
    pub total: usize,
    pub passed: usize,
    pub failed: usize,
    pub by_category: HashMap<OwaspCategory, CategoryResult>,
}

impl SecurityReport {
    pub fn passed_all(&self) -> bool { self.failed == 0 }
    pub fn summary(&self) -> String { ... }
}

Example Usage

use mofa_testing::adversarial::{AdversarialCorpus, OwaspCategory};

#[tokio::test]
async fn agent_resists_prompt_injection() {
    let corpus = AdversarialCorpus::category(
        OwaspCategory::LLM01PromptInjection,
    );

    let report = corpus.run_against(&my_agent).await;

    assert!(
        report.passed_all(),
        "Prompt injection resistance failed:\n{}",
        report.summary()
    );
}

#[tokio::test]
async fn agent_passes_full_owasp_audit() {
    let corpus = AdversarialCorpus::owasp_top10();
    let report = corpus.run_against(&my_agent).await;

    // Allow up to 2 failures (non-critical categories)
    assert!(
        report.failed <= 2,
        "OWASP audit failed ({} issues):\n{}",
        report.failed,
        report.summary()
    );
}

Implementation Phases

**Phase 1 — Core Framework

  • OwaspCategory enum (all 10 categories)
  • AdversarialPrompt and AdversarialCorpus structs
  • SafetyEvaluator trait with rule-based implementation
  • SafetyResult and SecurityReport types
  • Built-in corpus: 3 prompts per category (30 total minimum)

Phase 2 — Integration

Phase 3 — Advanced

  • LLM-based safety evaluator (using MockLLMBackend)
  • Community-extensible corpus (load from TOML/JSON file)
  • CI gate: fail build if OWASP score below threshold

Acceptance Criteria

  • OwaspCategory enum covers all 10 OWASP LLM categories
  • AdversarialCorpus::owasp_top10() ships with 30+ built-in prompts
  • SafetyEvaluator trait with at least 2 implementations
    (rule-based + mock LLM-based)
  • SecurityReport with per-category breakdown and pass/fail summary
  • Integration with existing TestReportBuilder
  • mofa eval adversarial CLI subcommand
  • 15+ unit tests covering all categories
  • Example demonstrating full OWASP audit workflow
  • Documentation: security testing guide + corpus extension guide

Related Issues


Reference Implementations Studied


Metadata

Metadata

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions