Merged
5 changes: 5 additions & 0 deletions .changeset/fix-redteam-categories-show-ids.md
@@ -0,0 +1,5 @@
---
"@cdot65/prisma-airs-cli": patch
---

fix(redteam): `airs redteam categories` pretty output now prints each category ID inline (e.g. `Jailbreak (JAILBREAK)`) so operators no longer need `--debug` to discover the IDs required by `airs redteam scan --type STATIC --categories '{…}'`. Fixes [#200](https://github.com/cdot65/prisma-airs-cli/issues/200).
36 changes: 19 additions & 17 deletions docs/cli/examples/redteam.yaml
@@ -56,33 +56,35 @@

"redteam categories":
examples:
- note: List attack categories
- note: |
List attack categories. IDs in parens are what `--categories` wants on a STATIC scan,
e.g. `--categories '{"SECURITY":["JAILBREAK","PROMPT_INJECTION"]}'`.
input: airs redteam categories
output: |
Attack Categories:

Security — Select categories for adversarial testing of security vulnerabilities and potential exploits.
• Adversarial Suffix — Adversarial suffix attacks
• Jailbreak — Jailbreak attempts
• Prompt Injection — Direct prompt injection attacks
• Remote Code Execution — Remote code execution attempts
• System Prompt leak — System prompt extraction
Security (SECURITY) — Select categories for adversarial testing of security vulnerabilities and potential exploits.
• Adversarial Suffix (ADVERSARIAL_SUFFIX) — Adversarial suffix attacks
• Jailbreak (JAILBREAK) — Jailbreak attempts
• Prompt Injection (PROMPT_INJECTION) — Direct prompt injection attacks
• Remote Code Execution (REMOTE_CODE_EXECUTION) — Remote code execution attempts
• System Prompt leak (SYSTEM_PROMPT_LEAK) — System prompt extraction
...

Safety — Select categories for testing harmful or toxic content and ethical misuse scenarios.
• Bias — Bias-related content
• CBRN — Chemical, Biological, Radiological, Nuclear content
• Hate / Toxic / Abuse — Hate speech, toxic, or abusive content
Safety (SAFETY) — Select categories for testing harmful or toxic content and ethical misuse scenarios.
• Bias (BIAS) — Bias-related content
• CBRN (CBRN) — Chemical, Biological, Radiological, Nuclear content
• Hate / Toxic / Abuse (HATE_TOXIC_ABUSE) — Hate speech, toxic, or abusive content
...

Brand Reputation — Select categories for testing off-brand content.
• Competitor Endorsements — Content endorsing competitor brands
Brand Reputation (BRAND_REPUTATION) — Select categories for testing off-brand content.
• Competitor Endorsements (COMPETITOR_ENDORSEMENTS) — Content endorsing competitor brands
...

Compliance — Select framework to understand compliance across security and safety standards.
• OWASP Top 10 for LLMs 2025 — Open Web Application Security Project 2025 Edition
• MITRE ATLAS — MITRE Adversarial Tactics, Techniques, and Common Knowledge
• NIST AI-RMF — National Institute of Standards and Technology Cybersecurity Framework
Compliance (COMPLIANCE) — Select framework to understand compliance across security and safety standards.
• OWASP Top 10 for LLMs 2025 (OWASP_TOP_10_LLM_2025) — Open Web Application Security Project 2025 Edition
• MITRE ATLAS (MITRE_ATLAS) — MITRE Adversarial Tactics, Techniques, and Common Knowledge
• NIST AI-RMF (NIST_AI_RMF) — National Institute of Standards and Technology Cybersecurity Framework
...

"redteam status":
36 changes: 19 additions & 17 deletions docs/cli/redteam/categories.md
@@ -10,7 +10,9 @@ airs redteam categories [options]

### Examples

*List attack categories*
*List attack categories. IDs in parens are what `--categories` wants on a STATIC scan,
e.g. `--categories '{"SECURITY":["JAILBREAK","PROMPT_INJECTION"]}'`.
*

```bash
airs redteam categories
@@ -19,27 +21,27 @@ airs redteam categories
```text
Attack Categories:

Security — Select categories for adversarial testing of security vulnerabilities and potential exploits.
• Adversarial Suffix — Adversarial suffix attacks
• Jailbreak — Jailbreak attempts
• Prompt Injection — Direct prompt injection attacks
• Remote Code Execution — Remote code execution attempts
• System Prompt leak — System prompt extraction
Security (SECURITY) — Select categories for adversarial testing of security vulnerabilities and potential exploits.
• Adversarial Suffix (ADVERSARIAL_SUFFIX) — Adversarial suffix attacks
• Jailbreak (JAILBREAK) — Jailbreak attempts
• Prompt Injection (PROMPT_INJECTION) — Direct prompt injection attacks
• Remote Code Execution (REMOTE_CODE_EXECUTION) — Remote code execution attempts
• System Prompt leak (SYSTEM_PROMPT_LEAK) — System prompt extraction
...

Safety — Select categories for testing harmful or toxic content and ethical misuse scenarios.
• Bias — Bias-related content
• CBRN — Chemical, Biological, Radiological, Nuclear content
• Hate / Toxic / Abuse — Hate speech, toxic, or abusive content
Safety (SAFETY) — Select categories for testing harmful or toxic content and ethical misuse scenarios.
• Bias (BIAS) — Bias-related content
• CBRN (CBRN) — Chemical, Biological, Radiological, Nuclear content
• Hate / Toxic / Abuse (HATE_TOXIC_ABUSE) — Hate speech, toxic, or abusive content
...

Brand Reputation — Select categories for testing off-brand content.
• Competitor Endorsements — Content endorsing competitor brands
Brand Reputation (BRAND_REPUTATION) — Select categories for testing off-brand content.
• Competitor Endorsements (COMPETITOR_ENDORSEMENTS) — Content endorsing competitor brands
...

Compliance — Select framework to understand compliance across security and safety standards.
• OWASP Top 10 for LLMs 2025 — Open Web Application Security Project 2025 Edition
• MITRE ATLAS — MITRE Adversarial Tactics, Techniques, and Common Knowledge
• NIST AI-RMF — National Institute of Standards and Technology Cybersecurity Framework
Compliance (COMPLIANCE) — Select framework to understand compliance across security and safety standards.
• OWASP Top 10 for LLMs 2025 (OWASP_TOP_10_LLM_2025) — Open Web Application Security Project 2025 Edition
• MITRE ATLAS (MITRE_ATLAS) — MITRE Adversarial Tactics, Techniques, and Common Knowledge
• NIST AI-RMF (NIST_AI_RMF) — National Institute of Standards and Technology Cybersecurity Framework
...
```
30 changes: 9 additions & 21 deletions docs/redteam/end-to-end-walkthrough.md
@@ -99,28 +99,16 @@ Two things to note:
airs redteam categories
```

The pretty renderer prints display names (`Jailbreak`, `Prompt Injection`, …) but a STATIC scan's `--categories` flag wants the **IDs** (`JAILBREAK`, `PROMPT_INJECTION`, …).
The pretty renderer prints both the display name and the ID inline; the parenthesized value is what `--categories` wants:

!!! warning "Gotcha: category IDs are hidden in pretty mode"
`airs redteam categories` shows display names only. To see the IDs you need for `--categories`, re-run with `airs --debug redteam categories` and read the raw response from `~/.prisma-airs/debug-api-*.jsonl`. The raw shape is:

```json
[
{ "id": "SECURITY", "display_name": "Security",
"sub_categories": [
{ "id": "JAILBREAK", "preselect": true },
{ "id": "PROMPT_INJECTION", "preselect": true },
{ "id": "ADVERSARIAL_SUFFIX", "preselect": true },
"..."
]
},
{ "id": "SAFETY", "sub_categories": [ "..." ] },
{ "id": "BRAND_REPUTATION", "sub_categories": [ "..." ] },
{ "id": "COMPLIANCE", "sub_categories": [ "..." ] }
]
```
```text
Security (SECURITY) — …
• Jailbreak (JAILBREAK) — Jailbreak attempts
• Prompt Injection (PROMPT_INJECTION) — Direct prompt injection attacks
```

Top-level groups: `SECURITY`, `SAFETY`, `BRAND_REPUTATION`, `COMPLIANCE`. Each carries an array of `sub_categories` whose `id` values are the strings you put into `--categories`.
Top-level groups: `SECURITY`, `SAFETY`, `BRAND_REPUTATION`, `COMPLIANCE`. The `id` strings you see in parens are exactly what you put into `--categories`, e.g. `--categories '{"SECURITY":["JAILBREAK","PROMPT_INJECTION"]}'`. For the raw `/v1/categories` JSON (preselect flags etc.) use `airs --debug redteam categories` and read `~/.prisma-airs/debug-api-*.jsonl`.

---

@@ -332,7 +320,7 @@ A STATIC scan walks the AIRS-maintained attack library against your target. Pick
}
```

This shape is not in `airs redteam scan --help` today — use `airs redteam categories` (with `--debug` per [phase 1.5](#15-list-attack-categories)) to find the IDs.
This shape is not in `airs redteam scan --help` today — run `airs redteam categories` and use the IDs shown in parens (see [phase 1.5](#15-list-attack-categories)).

```bash
airs redteam scan --name "litellm-mistral-7b-static-1" \
72 changes: 37 additions & 35 deletions docs/redteam/scanning.md
@@ -27,43 +27,45 @@ airs redteam categories
```
Attack Categories:

Security -- Select categories for adversarial testing of security vulnerabilities
* Adversarial Suffix -- Adversarial suffix attacks
* Evasion -- Evasion techniques
* Indirect Prompt Injection -- Indirect prompt injection attacks
* Jailbreak -- Jailbreak attempts
* Multi-turn -- Multi-turn conversation exploits
* Prompt Injection -- Direct prompt injection attacks
* Remote Code Execution -- Remote code execution attempts
* System Prompt leak -- System prompt extraction
* Tool Leak -- Tool information leakage
* Malware Generation -- Malware generation requests

Safety -- Select categories for testing harmful or toxic content
* Bias -- Bias-related content
* CBRN -- Chemical, Biological, Radiological, Nuclear content
* Cybercrime -- Cybercrime-related content
* Drugs -- Drug-related content
* Hate / Toxic / Abuse -- Hate speech, toxic, or abusive content
* Non Violent Crimes -- Non-violent criminal activities
* Political -- Political content
* Self Harm -- Self-harm related content
* Sexual -- Sexual content
* Violent Crimes / Weapons -- Violent crimes and weapons

Brand Reputation -- Select categories for testing off-brand content
* Competitor Endorsements
* Brand Tarnishing / Self-Criticism
* Discriminating Claims
* Political Endorsements

Compliance -- Select framework for compliance across security and safety standards
* OWASP Top 10 for LLMs 2025
* MITRE ATLAS
* NIST AI-RMF
* DASF V2.0
Security (SECURITY) — Select categories for adversarial testing of security vulnerabilities
Adversarial Suffix (ADVERSARIAL_SUFFIX) — Adversarial suffix attacks
Evasion (EVASION) — Evasion techniques
Indirect Prompt Injection (INDIRECT_PROMPT_INJECTION) — Indirect prompt injection attacks
Jailbreak (JAILBREAK) — Jailbreak attempts
Multi-turn (MULTI_TURN) — Multi-turn conversation exploits
Prompt Injection (PROMPT_INJECTION) — Direct prompt injection attacks
Remote Code Execution (REMOTE_CODE_EXECUTION) — Remote code execution attempts
System Prompt leak (SYSTEM_PROMPT_LEAK) — System prompt extraction
Tool Leak (TOOL_LEAK) — Tool information leakage
Malware Generation (MALWARE_GENERATION) — Malware generation requests

Safety (SAFETY) — Select categories for testing harmful or toxic content
Bias (BIAS) — Bias-related content
CBRN (CBRN) — Chemical, Biological, Radiological, Nuclear content
Cybercrime (CYBERCRIME) — Cybercrime-related content
Drugs (DRUGS) — Drug-related content
Hate / Toxic / Abuse (HATE_TOXIC_ABUSE) — Hate speech, toxic, or abusive content
Non Violent Crimes (NON_VIOLENT_CRIMES) — Non-violent criminal activities
Political (POLITICAL) — Political content
Self Harm (SELF_HARM) — Self-harm related content
Sexual (SEXUAL) — Sexual content
Violent Crimes / Weapons (VIOLENT_CRIMES_WEAPONS) — Violent crimes and weapons

Brand Reputation (BRAND_REPUTATION) — Select categories for testing off-brand content
Competitor Endorsements (COMPETITOR_ENDORSEMENTS)
Brand Tarnishing / Self-Criticism (BRAND_TARNISHING)
Discriminating Claims (DISCRIMINATING_CLAIMS)
Political Endorsements (POLITICAL_ENDORSEMENTS)

Compliance (COMPLIANCE) — Select framework for compliance across security and safety standards
OWASP Top 10 for LLMs 2025 (OWASP_TOP_10_LLM_2025)
MITRE ATLAS (MITRE_ATLAS)
NIST AI-RMF (NIST_AI_RMF)
DASF V2.0 (DASF_V2)
```

The parenthesized values are the category IDs you pass to `--categories` on a STATIC scan, e.g. `--categories '{"SECURITY":["JAILBREAK","PROMPT_INJECTION"]}'`.
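
As a rough illustration of how those IDs map onto the flag value, the sketch below assembles the JSON string from a selection of group and sub-category IDs. `buildCategoriesArg` is a hypothetical helper, not part of the CLI, and the uppercase/digit/underscore ID pattern is an assumption inferred from the IDs listed above rather than a documented rule.

```typescript
// Hypothetical helper (not part of prisma-airs-cli): builds the JSON value
// for `--categories` from top-level group IDs and their sub-category IDs.
type CategorySelection = Record<string, string[]>;

function buildCategoriesArg(selection: CategorySelection): string {
  // Assumption: IDs consist of uppercase letters, digits, and underscores,
  // as in the listing above (e.g. OWASP_TOP_10_LLM_2025, DASF_V2).
  const idPattern = /^[A-Z0-9_]+$/;
  for (const [group, subIds] of Object.entries(selection)) {
    for (const id of [group, ...subIds]) {
      if (!idPattern.test(id)) {
        throw new Error(`"${id}" looks like a display name, not a category ID`);
      }
    }
  }
  return JSON.stringify(selection);
}

const arg = buildCategoriesArg({ SECURITY: ['JAILBREAK', 'PROMPT_INJECTION'] });
// Single quotes keep the JSON intact in a POSIX shell.
console.log(`--categories '${arg}'`);
```

Rejecting anything that does not look like an ID is meant to catch the easy mistake of passing a display name such as `Jailbreak` where `JAILBREAK` is required.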

## Launch a Scan

### Static Scan (Full Attack Library)
4 changes: 2 additions & 2 deletions src/cli/renderer/redteam.ts
@@ -311,11 +311,11 @@ export function renderCategories(
   console.log(chalk.bold('\n Attack Categories:\n'));
   for (const c of categories) {
     console.log(
-      ` ${chalk.bold(c.displayName)}${c.description ? chalk.dim(` — ${c.description}`) : ''}`,
+      ` ${chalk.bold(c.displayName)} ${chalk.cyan(`(${c.id})`)}${c.description ? chalk.dim(` — ${c.description}`) : ''}`,
     );
     for (const sc of c.subCategories) {
       console.log(
-        ` ${chalk.dim('•')} ${sc.displayName}${sc.description ? chalk.dim(` — ${sc.description}`) : ''}`,
+        ` ${chalk.dim('•')} ${sc.displayName} ${chalk.cyan(`(${sc.id})`)}${sc.description ? chalk.dim(` — ${sc.description}`) : ''}`,
       );
     }
     console.log();
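
To make the resulting line layout easier to see without the chalk styling, here is a minimal sketch of the same "display name, ID in parentheses, optional description" pattern. `formatCategoryLabel` and `CategoryLike` are illustrative names that do not exist in the repository, and indentation and bullets are left out.

```typescript
// Illustrative only: mirrors the label layout of renderCategories without colors.
// CategoryLike and formatCategoryLabel are hypothetical names.
interface CategoryLike {
  id: string;
  displayName: string;
  description?: string;
}

// The renderer separates the label from the description with an em dash.
const EM_DASH = '\u2014';

function formatCategoryLabel(c: CategoryLike): string {
  // The description suffix is optional, matching the ternary in the renderer.
  const suffix = c.description ? ` ${EM_DASH} ${c.description}` : '';
  return `${c.displayName} (${c.id})${suffix}`;
}

const withDescription = formatCategoryLabel({
  id: 'JAILBREAK',
  displayName: 'Jailbreak',
  description: 'Jailbreak attempts',
});
const withoutDescription = formatCategoryLabel({ id: 'TOXICITY', displayName: 'Toxicity' });

console.log(withDescription);
console.log(withoutDescription);
```

Keeping the formatting in a pure function like this would also let it be tested without intercepting `console.log`, though the change here keeps the formatting inline.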
60 changes: 60 additions & 0 deletions tests/unit/cli/redteam-categories-renderer.spec.ts
@@ -0,0 +1,60 @@
import { afterEach, describe, expect, it } from 'vitest';

let output: string[];
const originalLog = console.log;

describe('renderCategories', () => {
  afterEach(() => {
    output = [];
    console.log = originalLog;
  });

  it('prints empty state when no categories', async () => {
    output = [];
    console.log = (...args: unknown[]) => output.push(args.join(' '));
    const { renderCategories } = await import('../../../src/cli/renderer/redteam.js');
    renderCategories([]);
    expect(output.join('\n')).toContain('No categories found.');
  });

  it('prints parent and sub-category IDs inline with display names', async () => {
    output = [];
    console.log = (...args: unknown[]) => output.push(args.join(' '));
    const { renderCategories } = await import('../../../src/cli/renderer/redteam.js');
    renderCategories([
      {
        id: 'SECURITY',
        displayName: 'Security',
        description: 'Select categories for adversarial testing of security vulnerabilities',
        subCategories: [
          { id: 'JAILBREAK', displayName: 'Jailbreak', description: 'Jailbreak attempts' },
          {
            id: 'PROMPT_INJECTION',
            displayName: 'Prompt Injection',
            description: 'Direct prompt injection attacks',
          },
        ],
      },
    ]);
    const text = output.join('\n');
    expect(text).toContain('Security (SECURITY)');
    expect(text).toContain('Jailbreak (JAILBREAK)');
    expect(text).toContain('Prompt Injection (PROMPT_INJECTION)');
  });

  it('renders categories without descriptions', async () => {
    output = [];
    console.log = (...args: unknown[]) => output.push(args.join(' '));
    const { renderCategories } = await import('../../../src/cli/renderer/redteam.js');
    renderCategories([
      {
        id: 'SAFETY',
        displayName: 'Safety',
        subCategories: [{ id: 'TOXICITY', displayName: 'Toxicity' }],
      },
    ]);
    const text = output.join('\n');
    expect(text).toContain('Safety (SAFETY)');
    expect(text).toContain('Toxicity (TOXICITY)');
  });
});
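
Each test above swaps `console.log` by hand and relies on `afterEach` to restore it. One possible alternative, sketched here as an assumption rather than anything the spec file does, is a small capture helper that restores the original function in a `finally` block. `captureLog` is a hypothetical name.

```typescript
// Hypothetical helper: collects console.log output while fn runs, then restores it.
function captureLog(fn: () => void): string[] {
  const lines: string[] = [];
  const original = console.log;
  console.log = (...args: unknown[]) => {
    // Join arguments with a space, the same way the spec above does.
    lines.push(args.join(' '));
  };
  try {
    fn();
  } finally {
    // Restored even if fn throws, so later output is not swallowed.
    console.log = original;
  }
  return lines;
}

const captured = captureLog(() => {
  console.log('Security (SECURITY)');
  console.log('  Jailbreak', '(JAILBREAK)');
});
```

With a helper of this shape, a test body would reduce to calling the renderer inside `captureLog` and asserting on the joined lines.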