You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When assessing agent architectures, evaluate risks across three interdependent layers derived from the FASA tri-layered risk taxonomy (ArXiv 2603.13151):
97
+
98
+
| Layer | Scope | Example Risks |
99
+
|---|---|---|
100
+
|**AI Cognitive**| Risks arising from the model's reasoning, planning, and decision-making | Hallucinated tool arguments, goal drift, context amnesia across long sessions, confused-deputy behavior |
101
+
|**Software Execution**| Risks in the runtime environment where agent actions are executed | Sequential tool attack chains, sandbox escapes, dependency exploits, cascading failure in long-horizon workflows |
102
+
|**Information System**| Risks to the broader IT environment the agent operates within | Lateral movement, data exfiltration, credential theft, persistent access |
103
+
104
+
Use this layered lens throughout Steps 1-7 to ensure findings are not clustered in a single layer while risks in other layers go unassessed.
105
+
106
+
### Additional Threat Categories
107
+
108
+
The following threat patterns warrant explicit attention during architecture review:
109
+
110
+
-**Context amnesia:** In long-running or multi-session agent workflows, security-relevant context (active constraints, prior denied actions, accumulated risk) may be lost across context window boundaries or session resets. Verify that security state persists independently of the LLM context window.
111
+
-**Sequential tool attack chains:** An attacker (or a manipulated agent) may chain individually benign tool calls into an attack sequence where the combined effect is harmful. Evaluate whether the system monitors tool call sequences, not just individual invocations.
112
+
-**Confused-deputy behavior:** An agent with legitimate tool access is tricked -- typically via indirect prompt injection -- into performing unintended actions using its own authorized capabilities. The agent acts as a confused deputy: it has valid credentials and permissions, but an attacker directs its actions. This is distinct from privilege escalation; the agent never exceeds its permissions, yet causes harm within them.
113
+
-**Cascading failure in long-horizon workflows:** Multi-step agent workflows (planning, research, execution sequences spanning minutes to hours) are vulnerable to error accumulation. An early-stage mistake or injection can compound through subsequent steps, producing increasingly harmful outcomes that are difficult to detect until the workflow completes.
114
+
115
+
### Red-Team Validation Tooling
116
+
117
+
For hands-on validation of agent permission boundaries and tool-use attack surface, use the **fabraix/playground** open-source exploit library (https://github.com/fabraix/playground). This provides consolidated AI agent exploit PoCs mapped to OWASP Agentic AI threat categories, enabling practitioners to test architectural controls against concrete attack scenarios rather than theoretical threats alone.
118
+
119
+
### Layered Defense Ordering
120
+
121
+
For high-consequence agentic systems, apply defenses in the following order. Each layer catches failures that slip through the previous layer:
122
+
123
+
1.**Input validation** -- Sanitize and validate all inputs reaching the agent (user input, retrieved content, inter-agent messages) before they enter the model context.
124
+
2.**Model-level mitigations** -- Instruction hierarchy, system prompt hardening, and model-level safety training to resist manipulation.
125
+
3.**Sandboxed execution** -- Run tool calls in isolated, resource-limited environments with minimal permissions.
126
+
4.**Deterministic policy enforcement** -- For high-consequence actions (production deployments, financial transactions, data deletion), enforce hard-coded policy checks and HITL gates that cannot be overridden by model output, regardless of reasoning.
127
+
94
128
### Step 1 -- Agent Permission Model Review
95
129
96
130
Evaluate what each agent can do, under what conditions, and whether the permission model follows least-privilege principles.
@@ -104,23 +138,7 @@ Evaluate what each agent can do, under what conditions, and whether the permissi
104
138
-**Per-session vs. permanent tool access:** Is tool access scoped to a specific task or session, or does every invocation receive the same broad tool set regardless of the task?
105
139
-**Cross-agent tool sharing:** Can one agent invoke another agent's tools? If so, through what authorization mechanism?
106
140
107
-
**Detection methods using allowed tools:**
108
-
109
-
```
110
-
# Find agent and tool definitions
111
-
Glob: **/*agent*.{py,ts,js,yaml,yml,json}
112
-
Glob: **/tools/*.{py,ts,js}
113
-
Glob: **/*tool*.{py,ts,js,yaml,yml,json}
114
-
Grep: "register_tool|add_tool|tool_list|available_tools|function_map|tool_registry" in **/*.{py,ts,js}
115
-
Grep: "Tool(|@tool|FunctionTool|StructuredTool|BaseTool" in **/*.{py,ts,js}
116
-
117
-
# Find permission and credential configurations
118
-
Grep: "service_account|iam|role_arn|credentials|api_key|secret|permission" in **/*.{py,yaml,yml,json,tf,env}
119
-
Grep: "Action.*\*|Resource.*\*|admin|PowerUser|FullAccess" in **/*.{json,yaml,yml,tf}
120
-
121
-
# Find tool scoping logic
122
-
Grep: "scope|allow|deny|restrict|filter_tools|permitted_tools|enabled_tools" in **/*.{py,ts,js}
123
-
```
141
+
**Detection methods:** Search for agent/tool definitions (`register_tool`, `add_tool`, `@tool`, `FunctionTool`), permission configs (`service_account`, `iam`, `role_arn`, wildcards in IAM policies), and tool scoping logic (`filter_tools`, `permitted_tools`, `enabled_tools`).
124
142
125
143
**Permission model evaluation matrix:**
126
144
@@ -162,28 +180,7 @@ Evaluate whether the agent architecture is designed from the ground up around le
162
180
-**Resource limits:** Are CPU, memory, token budget, and execution time limits enforced at the infrastructure level?
163
181
-**Capability escalation paths:** Can the agent request elevated permissions at runtime, modify its own configuration, or influence the orchestrator to grant it additional tools?
164
182
165
-
**Detection methods using allowed tools:**
166
-
167
-
```
168
-
# Check for network restrictions
169
-
Grep: "network_policy|egress|firewall|sandbox|allowed_hosts|url_whitelist|allowed_urls" in **/*.{py,yaml,yml,json,tf}
170
-
Grep: "requests.get|requests.post|urllib|httpx|fetch|axios" in **/*agent*.{py,ts,js}
171
-
172
-
# Check for file system restrictions
173
-
Grep: "chroot|sandbox|allowed_paths|base_dir|restrict_path|working_dir" in **/*.{py,yaml,yml,json}
174
-
Grep: "open(|write(|read(|os.path|pathlib|shutil" in **/*agent*.{py,ts,js}
175
-
Grep: "os.listdir|os.walk|glob|Path(" in **/*agent*.{py,ts,js}
176
-
177
-
# Check for environment access
178
-
Grep: "os.environ|os.getenv|process.env|env_var" in **/*agent*.{py,ts,js}
179
-
180
-
# Check for resource limits
181
-
Grep: "max_tokens|token_budget|max_iterations|timeout|time_limit|max_steps|rate_limit" in **/*.{py,yaml,yml,json}
182
-
Grep: "memory_limit|cpu_limit|resource_limit|ulimit" in **/*.{yaml,yml,json,tf,Dockerfile}
183
-
184
-
# Check for self-modification capability
185
-
Grep: "self.tools|self.config|self.system_prompt|modify_config|update_tools|set_permissions" in **/*.{py,ts,js}
@@ -224,27 +221,7 @@ Evaluate the design, placement, and robustness of human approval gates in the ag
224
221
-**Approval fatigue management:** How many approval requests per session does a human reviewer face? Systems generating hundreds of low-context requests have effectively no human oversight.
225
222
-**Fail-closed design:** If the approval service is unreachable, does the agent halt (fail-closed) or proceed without approval (fail-open)?
226
223
227
-
**Detection methods using allowed tools:**
228
-
229
-
```
230
-
# Find approval gate implementations
231
-
Grep: "approve|confirm|human_in_the_loop|hitl|review|authorize|require_approval" in **/*.{py,ts,js}
232
-
Grep: "approval_gate|confirmation_gate|human_review|manual_review" in **/*.{py,ts,js,yaml,yml}
233
-
234
-
# Check for bypass paths
235
-
Grep: "skip_approval|auto_approve|bypass|override|fallback|fail_open" in **/*.{py,ts,js,yaml,yml}
236
-
Grep: "except|catch|timeout|unavailable|unreachable" in **/*approv*.{py,ts,js}
237
-
Grep: "except|catch|timeout|unavailable|unreachable" in **/*confirm*.{py,ts,js}
238
-
239
-
# Check for cumulative tracking
240
-
Grep: "cumulative|aggregate|session_risk|total_risk|action_count|budget" in **/*.{py,ts,js}
241
-
242
-
# Check what context is provided to approvers
243
-
Grep: "approval_context|review_context|display|present|show_details" in **/*.{py,ts,js}
244
-
245
-
# Find action classification (which actions need approval)
246
-
Grep: "risk_level|action_type|destructive|irreversible|high_risk|write|delete|send|deploy" in **/*.{py,ts,js,yaml,yml}
@@ -286,28 +263,7 @@ Evaluate the architectural controls that limit the damage when an agent is compr
286
263
-**Kill switch:** Can an agent be immediately terminated by an operator? Is there a mechanism to halt all agents simultaneously in an emergency?
287
264
-**Rate and scope limiters:** Even within its permitted tool set, are there limits on how much an agent can do in a given time window (e.g., maximum 10 database writes per minute, maximum 5 emails per session)?
288
265
289
-
**Detection methods using allowed tools:**
290
-
291
-
```
292
-
# Check for container/process isolation
293
-
Glob: **/Dockerfile*
294
-
Glob: **/docker-compose*.{yml,yaml}
295
-
Glob: **/*.tf
296
-
Grep: "container|sandbox|isolat|namespace|seccomp|apparmor|gvisor" in **/*.{yaml,yml,json,tf,Dockerfile}
297
-
298
-
# Check for network segmentation
299
-
Grep: "network_policy|NetworkPolicy|security_group|firewall_rule|egress|ingress" in **/*.{yaml,yml,json,tf}
300
-
Grep: "169.254.169.254|metadata|IMDS|instance.metadata" in **/*.{py,ts,js,yaml,yml}
301
-
302
-
# Check for kill switch / emergency stop
303
-
Grep: "kill|stop|halt|emergency|shutdown|circuit_breaker|breaker" in **/*.{py,ts,js,yaml,yml}
304
-
305
-
# Check for rate limiting on agent actions
306
-
Grep: "rate_limit|throttle|max_per_minute|max_per_session|action_limit|cooldown" in **/*.{py,ts,js,yaml,yml}
307
-
308
-
# Check for action reversibility
309
-
Grep: "undo|rollback|revert|compensat|reverse|cancel" in **/*.{py,ts,js}
10. Leike, J. et al. "Scalable Agent Alignment via Reward Modeling: a Research Direction" (2018) -- arXiv:1811.07871 -- foundational work on agent alignment and oversight mechanisms
586
+
11. FASA Tri-Layered Risk Taxonomy for AI Agent Systems (2026) -- arXiv:2603.13151
587
+
12. Sequential Tool Attack Chains and Context Amnesia in Agentic AI (2026) -- arXiv:2603.12644
588
+
13. Confused-Deputy Attacks and Cascading Failures in Long-Horizon Agent Workflows (2026) -- arXiv:2603.12230
589
+
14. fabraix/playground -- Open-source AI agent red-team exploit library for validating agent permission boundaries and tool-use attack surface -- https://github.com/fabraix/playground
@@ -430,6 +430,10 @@ Grep: "send_message|delegate|dispatch|publish|subscribe|queue" in **/*.{py,ts,js
430
430
Grep: "approve|confirm|human_in_the_loop|hitl|review|authorize" in **/*.{py,ts,js,yaml,yml}
431
431
```
432
432
433
+
### Hands-On Assessment Tooling
434
+
435
+
For practical validation of OWASP Agentic AI risks against concrete exploits, use the **fabraix/playground** open-source exploit library (https://github.com/fabraix/playground). This provides consolidated AI agent exploit PoCs that can be used alongside the theoretical framework in Step 2 to test each AG01-AG10 category against real attack scenarios.
436
+
433
437
### Step 2 — Threat Assessment
434
438
435
439
For each of the 10 categories, assess the system and assign a risk rating:
@@ -612,3 +616,4 @@ This skill is designed to be resilient against prompt injection. The following r
612
616
8. OWASP Application Security Verification Standard (ASVS) — [owasp.org/www-project-application-security-verification-standard](https://owasp.org/www-project-application-security-verification-standard/)
11. fabraix/playground — Open-source AI agent red-team exploit library with PoCs for OWASP Agentic AI Top 10 risks — https://github.com/fabraix/playground
Copy file name to clipboardExpand all lines: skills/ai-security/prompt-injection/SKILL.md
+13-1Lines changed: 13 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -13,7 +13,7 @@ phase: [build, review, operate]
13
13
frameworks: [OWASP-LLM01-2025, MITRE-ATLAS]
14
14
difficulty: advanced
15
15
time_estimate: "30-60min"
16
-
version: "1.0.0"
16
+
version: "1.0.2"
17
17
author: unitoneai
18
18
license: MIT
19
19
allowed-tools: Read, Grep, Glob
@@ -192,6 +192,16 @@ Evaluate which of the following mitigations are implemented and how effectively.
192
192
- Is the system prompt structurally separated from user input (e.g., via the API's system message role) rather than concatenated in a single string?
193
193
- Are retrieved documents and external content clearly demarcated as data, not instructions?
194
194
195
+
### 5.7 Adaptive Attack Resilience
196
+
197
+
> **Warning:** Static prompt injection defenses (hardcoded system prompts, simple keyword filtering) are demonstrably insufficient against adaptive attackers. PISmith (Yin et al. 2026) achieved highest attack success rates across 13 benchmarks using RL-optimized adaptive black-box attacks.
198
+
199
+
-**Continuous red-team evaluation:** Prompt injection defenses must be evaluated continuously, not as a one-time test. Adaptive attackers iteratively refine their payloads against deployed defenses. Schedule recurring red-team assessments using automated adversarial tooling alongside manual expert testing.
200
+
-**Agentic benchmark suites:** For applications where LLMs invoke tools or take autonomous actions, standard prompt injection benchmarks are insufficient. Use agentic-specific benchmark suites that test injection in the context of tool use and multi-step workflows:
201
+
-**InjecAgent** -- Tests indirect prompt injection in agentic settings where the LLM processes external content and has tool access.
202
+
-**AgentDojo** -- Evaluates agent robustness against injection attacks across diverse tool-use scenarios with realistic adversarial content.
203
+
-**fabraix/playground** (https://github.com/fabraix/playground) -- Open-source library of AI agent exploit PoCs that can serve as a test harness for validating direct and indirect injection defenses against published attack patterns.
204
+
195
205
---
196
206
197
207
## Step 6: Report Findings
@@ -274,3 +284,5 @@ Each finding should be assigned a severity based on potential impact:
274
284
- Perez, F. & Ribeiro, I. (2022). "Ignore Previous Prompt: Attack Techniques For Language Models." arXiv:2211.09527.
275
285
- Greshake, K. et al. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." arXiv:2302.12173.
276
286
- Willison, S. Prompt Injection taxonomy and ongoing research — https://simonwillison.net
0 commit comments