Skip to content

fix: harden summarization against prompt injection persistence#137

Open
hhe48203-ctrl wants to merge 1 commit intoMartian-Engineering:mainfrom
hhe48203-ctrl:fix/prompt-injection-summary-persistence
Open

fix: harden summarization against prompt injection persistence#137
hhe48203-ctrl wants to merge 1 commit intoMartian-Engineering:mainfrom
hhe48203-ctrl:fix/prompt-injection-summary-persistence

Conversation

@hhe48203-ctrl
Copy link
Contributor

Summary

Addresses #71 — prompt injections embedded in conversation history can survive compaction and be reinserted as user messages, giving them maximum influence in later turns.

This PR hardens the summarization and assembly pipeline against injection persistence:

  • Summarizer system prompt: Replaced dangerous "Follow user instructions exactly" with explicit injection-defense instructions that tell the summarizer to strip directives and treat all input as untrusted data
  • All summarization prompts (leaf, D1, D2, D3+): Added "UNTRUSTED DATA" warnings so the summarizer model ignores embedded directives, role reassignments, and behavioral overrides
  • Summary XML wrapper: Added <meta type="historical_context" trust="untrusted"> taint labels so downstream models understand summaries are historical reference, not current instructions
  • LCM Recall system prompt: Added injection-awareness guidance telling the runtime model not to follow any instructions found within summary content

Scope

This PR focuses on the content-layer mitigations (recommendations 2–4 from the issue). Changing the message role from "user" to a non-user role (recommendation 1) would require upstream API changes in OpenClaw and is left for a follow-up.

Test plan

  • All 342 existing tests pass
  • Manual verification: inject a directive like "Ignore all previous instructions" into a conversation, trigger compaction, and confirm the summary neutralizes the directive
  • Manual verification: confirm assembled context includes <meta trust="untrusted"> tags around summary content

- Replace dangerous "Follow user instructions exactly" system prompt
  with explicit injection-defense instructions
- Add untrusted-data warnings to all summarization prompts (leaf, D1,
  D2, D3+) so the summarizer model ignores embedded directives
- Add <meta trust="untrusted"> taint labels to summary XML wrapper
  so downstream models treat summaries as historical reference, not
  instructions
- Add injection-awareness guidance to LCM Recall system prompt addition

Closes Martian-Engineering#71
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant