Improve prompt-injection multimodal and gateway gates #1616

Open
ASUKAAAAA1204 wants to merge 1 commit into UnitOneAI:main from ASUKAAAAA1204:improve/prompt-injection-multimodal-gateway
@ASUKAAAAA1204

Skill Improvement ($50-150 Bounty)

Skill Modified

Skill name: prompt-injection
Skill path: skills/ai-security/prompt-injection/

What Was Wrong

The skill covered direct and indirect text prompt injection well, but it did not require reviewers to inventory multimodal inputs, OCR/transcript-derived instructions, cross-agent handoffs, or LLM gateway/firewall evidence. That left two gaps from #1437:

  • Vision/audio/document inputs could carry instructions that bypass text-only sanitizers.
  • Reviewers had no explicit way to rate missing heavyweight gateway evidence differently for a trusted internal read-only workflow than for a high-risk external agentic workflow.

What This PR Fixes

Refs #1437.

This PR updates SKILL.md to add:

  • Source modality and workflow-risk-tier fields to the interaction map.
  • Cross-agent context handoff checks in indirect injection review.
  • A new 4.6 Multimodal Injection category with evidence requirements for image/audio/video/OCR/transcript paths.
  • 5.8 LLM Gateway / AI Firewall Evidence and 5.9 Cross-Agent and Tool-Output Taint Controls defense gates.
  • Output-report fields for extended category, source modality, and workflow risk tier.
  • Common pitfalls for text-only testing, guardrails-name-only evidence, and over-severity on trusted internal read-only workflows.
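As an illustration of the first bullet, an interaction-map entry carrying the two new fields could be modeled as below. This is only a sketch: the class names, enum values, and the `derived_via` and `reaches_tools` fields are hypothetical and are not taken from SKILL.md, which records these fields in prose.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceModality(Enum):
    # Hypothetical value set; SKILL.md may name modalities differently.
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    DOCUMENT = "document"


class RiskTier(Enum):
    # Only the two ends of the range described in this PR are modeled.
    INTERNAL_READ_ONLY = "internal-read-only"
    EXTERNAL_AGENTIC = "external-agentic"


@dataclass(frozen=True)
class InteractionEntry:
    name: str
    modality: SourceModality
    risk_tier: RiskTier
    derived_via: Optional[str] = None  # e.g. "ocr", "transcript", "caption"
    reaches_tools: bool = False

    def needs_multimodal_review(self) -> bool:
        """True for non-text input or for text derived from non-text input.

        Both cases can carry instructions that a text-only sanitizer,
        applied to the original user text, never sees.
        """
        return (
            self.modality is not SourceModality.TEXT
            or self.derived_via is not None
        )
```

For example, `InteractionEntry("receipt-upload", SourceModality.IMAGE, RiskTier.EXTERNAL_AGENTIC, derived_via="ocr", reaches_tools=True)` would be flagged for multimodal review, while a plain text chat entry would not.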

Evidence

Before (skill misses this / false positive on this):

A user uploads an image, screenshot, or audio clip containing hidden instructions.
The app passes OCR/transcript/vision output into the model and then into tools.
The existing skill focuses on text direct/indirect injection and does not force
reviewers to mark multimodal-derived content as untrusted instruction-bearing data.

A trusted internal read-only summarizer has no LLM firewall. The prior skill had
no explicit risk-tier calibration, so the missing firewall could be rated too
severely relative to the same gap in an external tool-using agent.

After (now correctly handled):

The interaction map records source modality and workflow risk tier.
The review asks for OCR/transcription/captioning pipeline details, benign and
adversarial multimodal fixtures, gateway/firewall coverage, fallback/retry
coverage, and cross-agent/tool-output taint controls.
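One way to picture the taint controls the review now asks about is a wrapper type that marks OCR, transcript, agent-handoff, or tool-output text as untrusted and forces an explicit decision before it reaches a tool. The sketch below is an assumption for illustration only: `Tainted`, `TaintedInputError`, and `call_tool` are hypothetical names, and the skill asks for evidence of such controls rather than prescribing an implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    """Text that came from an untrusted or model-derived source."""
    text: str
    origin: str  # e.g. "ocr", "transcript", "agent:planner", "tool:web_fetch"


class TaintedInputError(Exception):
    """Raised when tainted text reaches a tool without approval."""


def call_tool(tool_name: str, argument, *, approved: bool = False) -> str:
    """Hypothetical dispatcher that refuses tainted arguments by default.

    A real system would route the unapproved case to a policy check or a
    human approval step instead of simply raising.
    """
    if isinstance(argument, Tainted) and not approved:
        raise TaintedInputError(
            f"{tool_name}: argument from {argument.origin} needs approval"
        )
    value = argument.text if isinstance(argument, Tainted) else argument
    # Stand-in for the real tool invocation.
    return f"{tool_name}({value!r})"
```

The point of the design is that the taint label travels with the content across OCR, transcription, and agent handoffs, so a downstream component cannot mistake derived text for a trusted instruction.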

Missing gateway evidence is calibrated: internal read-only trusted workflows are
control-gap/informational candidates, while external agentic workflows with
untrusted files, RAG, tools, memory, or sensitive data receive stronger severity.
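The calibration described above can be sketched as a small decision function. The rating labels ("informational", "medium", "high"), the `Workflow` fields, and the middle tier for mixed cases are assumptions made for this example; the PR itself only states that internal read-only trusted workflows are control-gap/informational candidates and that exposed external agentic workflows receive stronger severity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    # Hypothetical attributes a reviewer would read off the interaction map.
    external_users: bool
    untrusted_files: bool = False
    uses_rag: bool = False
    uses_tools: bool = False
    uses_memory: bool = False
    sensitive_data: bool = False


def missing_gateway_rating(wf: Workflow) -> str:
    """Rate the absence of LLM gateway/firewall evidence for one workflow."""
    exposure = [
        wf.untrusted_files,
        wf.uses_rag,
        wf.uses_tools,
        wf.uses_memory,
        wf.sensitive_data,
    ]
    if not wf.external_users and not any(exposure):
        # Trusted internal read-only workflow: a control gap, not a finding.
        return "informational"
    if wf.external_users and any(exposure):
        # External workflow with at least one exposure factor.
        return "high"
    # Mixed cases; this middle tier is an assumption, not stated in the PR.
    return "medium"
```

Under these assumptions, the internal summarizer from the "Before" example rates as "informational", while an external agent that calls tools rates as "high".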

Test Cases Added/Updated

  • Added inline vulnerable/benign evidence requirements for multimodal fixtures in SKILL.md.
  • Existing documentation checks still pass.
  • Did not add a separate tests/ directory because this existing skill currently consists of a single SKILL.md file.

Validation

  • git diff --check -> no whitespace errors (Windows Git emitted only the existing LF/CRLF warning).
  • Marker grep confirmed Multimodal Injection, LLM Gateway / AI Firewall Evidence, Cross-Agent and Tool-Output Taint Controls, Source modality, and Workflow risk tier are present.
  • Markdown fence balance check -> fence_count=2 and even.
  • Frontmatter smoke check confirmed name, version, allowed-tools, and injection-hardened remain present.
  • git diff --stat -> one file changed, 57 insertions.
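The marker, fence-balance, and frontmatter checks above were run by hand. A rough scripted equivalent might look like the following; the function name `check_skill`, the exact matching rules, and the assumption that frontmatter is delimited by `---` lines are illustrative and not part of the repository.

```python
import re
from typing import List

FENCE = "`" * 3  # a Markdown code-fence marker

REQUIRED_MARKERS = [
    "Multimodal Injection",
    "LLM Gateway / AI Firewall Evidence",
    "Cross-Agent and Tool-Output Taint Controls",
    "Source modality",
    "Workflow risk tier",
]
REQUIRED_FRONTMATTER_KEYS = ["name", "version", "allowed-tools", "injection-hardened"]


def check_skill(text: str) -> List[str]:
    """Return a list of problems found in the SKILL.md text (empty if none)."""
    problems = []

    # Marker grep: each new section or field name must appear verbatim.
    for marker in REQUIRED_MARKERS:
        if marker not in text:
            problems.append(f"missing marker: {marker}")

    # Fence balance: count lines that open or close a fenced block.
    fence_count = sum(
        1 for line in text.splitlines() if line.lstrip().startswith(FENCE)
    )
    if fence_count % 2:
        problems.append(f"unbalanced fences: fence_count={fence_count}")

    # Frontmatter smoke check: keys must appear between the leading '---' lines.
    match = re.match(r"---\n(.*?)\n---\n", text, re.DOTALL)
    frontmatter = match.group(1) if match else ""
    for key in REQUIRED_FRONTMATTER_KEYS:
        if not re.search(rf"^{re.escape(key)}\s*:", frontmatter, re.MULTILINE):
            problems.append(f"missing frontmatter key: {key}")

    return problems
```

Calling `check_skill` on the contents of skills/ai-security/prompt-injection/SKILL.md and asserting that the result is empty would make these checks repeatable in CI.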

Bounty Tier

  • Minor ($50) - Doc update, small logic tweak, typo fix
  • Moderate ($100) - New edge case coverage, FP reduction with evidence
  • Substantial ($150) - Rewritten detection logic, major coverage expansion

Bounty Info

  • I have read and agree to the CONTRIBUTING.md bounty terms
  • Preferred payment method: Crypto; details can be provided privately after maintainer acceptance.
