134 changes: 134 additions & 0 deletions docs/dlp-detection/catalog.md
@@ -0,0 +1,134 @@
# Test File Catalog

Per-file detail: what each carrier is, what is embedded, where/how it is hidden, and how a
scanner would have to detect it. All payloads are the synthetic markers documented on the
[overview](index.md).

Each entry links to the raw carrier (`samples/`) and its base64 encoding (`encoded/`).

---

## PDF

### `Keychron_Q6_HE_User_Manual_DLP.pdf`

- **Source:** [samples/Keychron_Q6_HE_User_Manual_DLP.pdf](samples/Keychron_Q6_HE_User_Manual_DLP.pdf) · [base64](encoded/Keychron_Q6_HE_User_Manual_DLP.pdf.b64)
- **Technique:** invisible text layer using PDF **text render mode 3** (the same mechanism OCR
layers use). 31 synthetic lines placed in empty vertical gaps across 18 of 22 pages.
- **Within it:** SSNs, credit-card PANs, AWS keys/tokens, a password, a DB connection string,
synthetic identities (names/addresses/DOB), emails, phones, passport/DL/IBAN/routing.
- **Visible?** No — pages render pixel-identical to the original.
- **Detect by:** PDF text extraction (`pdftotext`); each value extracts contiguously.
- **Result:** :material-check: **detected** by the scanner.
- **Generator:** `scripts/embed_dlp.py`
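
To illustrate why render mode 3 hides text from the eye but not from an extractor, here is a toy sketch that walks an *already decoded* content stream and collects strings shown while `3 Tr` is active. It is not how `pdftotext` or `scripts/embed_dlp.py` work; the function name `invisible_strings` is hypothetical, and the sketch assumes uncompressed streams, literal `( ... ) Tj` strings only (no `TJ` arrays, hex strings, or font-encoding remaps).

```python
import re

# Either a "<n> Tr" render-mode operator or a "(literal) Tj" show-text operator.
_TOKEN = re.compile(
    rb"(?P<mode>\d)\s+Tr\b|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj\b"
)


def invisible_strings(content_stream: bytes) -> list[str]:
    """Return literal strings shown while text render mode 3 is active."""
    mode = 0  # the PDF default render mode is 0 (fill)
    hidden = []
    for match in _TOKEN.finditer(content_stream):
        if match.group("mode") is not None:
            mode = int(match.group("mode"))
        elif mode == 3:
            # Escape sequences are left as-is in this simplified sketch.
            hidden.append(match.group("text").decode("latin-1"))
    return hidden


stream = (
    b"BT /F1 9 Tf 3 Tr 72 700 Td (SSN: 078-05-1120) Tj ET "
    b"BT 0 Tr 72 680 Td (Visible heading) Tj ET"
)
print(invisible_strings(stream))
```

The point of the sketch is that the text operators are identical in both modes; only the `Tr` state differs, which is why any extractor that reads text operators recovers the hidden lines.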

---

## Image ladder

One base image is carried through four hiding layers, each probing a different scanner capability.

### `dlp_img_base.jpg`

- **Source:** [samples/dlp_img_base.jpg](samples/dlp_img_base.jpg)
- The clean synthetic base image (gradient + colored blocks). **No payload.** All other image
files derive from this.

### `dlp_img_1_metadata.jpg` — EXIF + XMP

- **Source:** [samples/dlp_img_1_metadata.jpg](samples/dlp_img_1_metadata.jpg) · [base64](encoded/dlp_img_1_metadata.jpg.b64)
- **Within it:** markers in EXIF `ImageDescription`, `Artist`, `Copyright`, `XPComment`,
`XPKeywords`, `UserComment`, plus an XMP packet (`dc:description`, `dc:subject`).
- **Visible?** No (metadata, not rendered).
- **Detect by:** parsing image metadata.
- **Result:** :material-close: not detected.
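
As a rough illustration of what "parsing image metadata" has to cope with: EXIF ASCII tags and XMP packets hold their text as plain bytes, while the Windows `XP*` tags store UTF-16LE. The sketch below is a crude stand-in for a real EXIF/XMP parser, not a replacement for one; it searches the raw file bytes for a marker under several encodings. The helper name `metadata_hits` is an assumption made for this example.

```python
def metadata_hits(data: bytes, marker: str) -> list[str]:
    """Name the encodings under which `marker` appears verbatim in `data`."""
    encodings = ("utf-8", "utf-16-le", "utf-16-be")
    return [enc for enc in encodings if marker.encode(enc) in data]


# Synthetic stand-in for a metadata block: one ASCII field, one UTF-16LE field.
fake_block = (
    b"\xff\xd8Exif\x00\x00"
    + "078-05-1120".encode("utf-8")
    + "AKIA".encode("utf-16-le")
)
print(metadata_hits(fake_block, "078-05-1120"))
print(metadata_hits(fake_block, "AKIA"))
```

A scanner that only searches for ASCII/UTF-8 patterns would miss the second field, which is one plausible reason to test `XPComment` and `XPKeywords` separately from `ImageDescription`.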

### `dlp_img_2_container.jpg` — container plaintext

- **Source:** [samples/dlp_img_2_container.jpg](samples/dlp_img_2_container.jpg) · [base64](encoded/dlp_img_2_container.jpg.b64)
- **Within it:** a JPEG `COM` comment segment (after `SOI`) **and** plaintext bytes appended
after the `FFD9` end-of-image marker (ignored by viewers).
- **Visible?** No.
- **Detect by:** raw whole-file/byte scanning, not just recognized fields.
- **Result:** :material-close: not detected.
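
A minimal sketch of the container check described above, assuming a well-formed baseline JPEG: walk the header segments to collect `COM` payloads, then return whatever follows the last `FFD9`. The function name `jpeg_side_channels` is hypothetical, and the sketch ignores fill bytes, standalone markers, and the case where appended data itself contains `FFD9`.

```python
import struct

SOI, EOI = b"\xff\xd8", b"\xff\xd9"
SOS, COM = 0xDA, 0xFE


def jpeg_side_channels(data: bytes) -> tuple[list[bytes], bytes]:
    """Return (COM segment payloads, bytes after the last EOI marker)."""
    if not data.startswith(SOI):
        raise ValueError("not a JPEG: missing SOI")
    comments = []
    pos = 2
    # Header segments look like: FF <marker> <2-byte big-endian length> <payload>.
    # The length field counts itself but not the FF <marker> pair.
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker == COM:
            comments.append(data[pos + 4:pos + 2 + length])
        if marker == SOS:
            break  # entropy-coded scan data follows; stop walking segments
        pos += 2 + length
    end = data.rfind(EOI)
    trailer = data[end + 2:] if end != -1 else b""
    return comments, trailer
```

Called on the bytes of `dlp_img_2_container.jpg`, this should surface both hiding spots, assuming the file is laid out as the entry describes.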

### `dlp_img_3_ocr.jpg` — rendered pixels (OCR)

- **Source:** [samples/dlp_img_3_ocr.jpg](samples/dlp_img_3_ocr.jpg) · [base64](encoded/dlp_img_3_ocr.jpg.b64)
- **Within it:** the markers **painted onto the image** as actual pixels (dark text on a light
band). There is **no text layer** — the data exists only as pixels.
- **Visible?** **Yes.**
- **Detect by:** running OCR on the image (verified recoverable with `tesseract`).
- **Result:** :material-close: not detected — the scanner does not OCR.

### `dlp_img_4_stego.png` — LSB steganography

- **Source:** [samples/dlp_img_4_stego.png](samples/dlp_img_4_stego.png) · [base64](encoded/dlp_img_4_stego.png.b64)
- **Within it:** markers encoded into the **least-significant bits** of pixel data, with a
4-byte length prefix. Decoded payload is 209 bytes.
- **Why PNG (not JPEG):** classic LSB steg does **not** survive JPEG's lossy DCT
quantization; a lossless format is required. True in-JPEG steg needs DCT-coefficient
embedding (steghide/F5-style).
- **Visible?** No (invisible; not plaintext anywhere in the file).
- **Detect by:** steganalysis / LSB extraction.
- **Result:** :material-alert: flagged as **"toxic content."** Because the plaintext PII in
#1–#3 was missed, this most likely indicates **steg-presence detection**, not payload
reading. See controls below.
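
The length-prefixed LSB scheme can be sketched on a flat sequence of channel values (for example, the R, G, B bytes of an image in raster order). This is an illustration under stated assumptions, not the code in the generator scripts: the big-endian length, the MSB-first bit order, and the one-bit-per-channel-value layout are all guesses, and `lsb_embed` / `lsb_extract` are hypothetical names.

```python
def lsb_embed(channels: bytes, payload: bytes) -> bytearray:
    """Hide a 4-byte big-endian length plus payload in the low bit of each value."""
    framed = len(payload).to_bytes(4, "big") + payload
    # Assumed bit order: most-significant bit of each payload byte first.
    bits = [(byte >> shift) & 1 for byte in framed for shift in range(7, -1, -1)]
    if len(bits) > len(channels):
        raise ValueError("carrier too small for payload")
    stego = bytearray(channels)
    for index, bit in enumerate(bits):
        stego[index] = (stego[index] & 0xFE) | bit
    return stego


def _read_bytes(channels: bytes, bit_offset: int, count: int) -> bytes:
    """Rebuild `count` bytes from the low bits starting at `bit_offset`."""
    out = bytearray()
    for start in range(bit_offset, bit_offset + count * 8, 8):
        value = 0
        for sample in channels[start:start + 8]:
            value = (value << 1) | (sample & 1)
        out.append(value)
    return bytes(out)


def lsb_extract(channels: bytes) -> bytes:
    """Read the length prefix, then that many payload bytes."""
    length = int.from_bytes(_read_bytes(channels, 0, 4), "big")
    if 32 + length * 8 > len(channels):
        raise ValueError("length prefix exceeds carrier; probably no payload")
    return _read_bytes(channels, 32, length)
```

Each channel value changes by at most 1, which is why the result is visually indistinguishable, and why any lossy re-encoding that perturbs low bits destroys the payload.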

---

## Controls (disambiguate the #4 result)

### `dlp_ctrl_clean.png`

- **Source:** [samples/dlp_ctrl_clean.png](samples/dlp_ctrl_clean.png) · [base64](encoded/dlp_ctrl_clean.png.b64)
- Same image saved as PNG with **no embedded data**. False-positive control: if this flags,
the trigger is the image itself, not hidden data.

### `dlp_ctrl_stego_benign.png`

- **Source:** [samples/dlp_ctrl_stego_benign.png](samples/dlp_ctrl_stego_benign.png) · [base64](encoded/dlp_ctrl_stego_benign.png.b64)
- LSB steg carrying **only benign lorem-ipsum** (verified to contain no markers). Isolates
*steg-presence* from *payload content*: if this still flags "toxic," the scanner is
reacting to steganography, not to the sensitive data.
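
The decision logic behind the two controls can be written out as a small table in code. The first two outcomes are the ones stated above; the third (both controls pass) is an inference added for completeness rather than something the catalog claims. The function name `interpret_controls` is hypothetical.

```python
def interpret_controls(clean_flagged: bool, benign_flagged: bool) -> str:
    """Interpret the #4 'toxic content' verdict in light of the two controls."""
    if clean_flagged:
        # Stated above: the image itself triggers the verdict.
        return "false positive: the image itself triggers the verdict"
    if benign_flagged:
        # Stated above: the scanner reacts to steganography, not the data.
        return "steg-presence detection, not payload reading"
    # Inference only: with both controls passing, the remaining difference
    # between #4 and the benign control is the payload content.
    return "consistent with the payload content driving the verdict"


print(interpret_controls(clean_flagged=False, benign_flagged=True))
```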

---

## Other modalities

### `dlp_doc_sensitive.docx`

- **Source:** [samples/dlp_doc_sensitive.docx](samples/dlp_doc_sensitive.docx) · [base64](encoded/dlp_doc_sensitive.docx.b64)
- **Within it:** markers in the **visible body**, a **hidden run** (white, 1pt font), and the
**core document properties** (`author`, `comments`, `keywords`).
- **Detect by:** Office Open XML parsing — body text, run-level formatting, and `docProps`.
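
A DOCX is a ZIP package of XML parts, so a rough version of this check needs only the standard library. The sketch below, with the hypothetical helper `docx_all_text`, concatenates every text node in `word/document.xml` and `docProps/core.xml`. It deliberately ignores run formatting, so a white 1pt run comes back exactly like visible body text; it does not look at headers, footers, comments, or custom properties.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def docx_all_text(docx_bytes: bytes) -> dict[str, str]:
    """Map each scanned package part to the concatenated text it contains."""
    parts = ("word/document.xml", "docProps/core.xml")
    found = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        names = set(package.namelist())
        for part in parts:
            if part not in names:
                continue
            root = ET.fromstring(package.read(part))
            # itertext() yields text regardless of colour, size, or "hidden" flags.
            pieces = (text.strip() for text in root.itertext())
            found[part] = " ".join(piece for piece in pieces if piece)
    return found
```

Running the marker patterns over each returned string would then cover the visible body, the hidden run, and the core properties in one pass.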

### `dlp_archive.zip`

- **Source:** [samples/dlp_archive.zip](samples/dlp_archive.zip) · [base64](encoded/dlp_archive.zip.b64)
- **Within it:** `payload.txt` (the markers as plaintext) compressed inside the archive.
- **Detect by:** archive recursion — decompress and scan contained files.
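
Archive recursion can be sketched with the standard `zipfile` module: decompress each member, search it, and descend into members that are themselves ZIPs. The helper name `scan_archive` and the depth limit of 3 are assumptions for this example; a production scanner would also cap decompressed size.

```python
import io
import zipfile


def scan_archive(data: bytes, needle: bytes, depth: int = 3) -> list[str]:
    """Return member paths whose decompressed bytes contain `needle`."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            body = archive.read(info)
            if needle in body:
                hits.append(info.filename)
            # Descend into nested ZIPs, bounded so deep nesting cannot run away.
            if depth > 0 and zipfile.is_zipfile(io.BytesIO(body)):
                nested = scan_archive(body, needle, depth - 1)
                hits.extend(f"{info.filename}/{name}" for name in nested)
    return hits
```

For `dlp_archive.zip`, `scan_archive(data, b"078-05-1120")` would be expected to report `payload.txt`, assuming the archive is laid out as described above.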

### `payload.txt`

- **Source:** [samples/payload.txt](samples/payload.txt)
- The raw synthetic payload as plaintext. Baseline: a scanner that misses this misses
everything.
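
For reference, a baseline content check over plaintext might look like the sketch below. The patterns are illustrative assumptions, not the scanner's actual rules: they cover only four of the marker types, and `find_markers` / `luhn_ok` are hypothetical names.

```python
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "pan": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    "aws_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def luhn_ok(number: str) -> bool:
    """Luhn checksum over the digits of `number` (separators are ignored)."""
    digits = [int(char) for char in number if char.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_markers(text: str) -> dict[str, list[str]]:
    """Return every pattern hit, keeping only Luhn-valid card numbers."""
    hits = {}
    for name, pattern in PATTERNS.items():
        values = [match.group(0) for match in pattern.finditer(text)]
        if name == "pan":
            values = [value for value in values if luhn_ok(value)]
        if values:
            hits[name] = values
    return hits
```

The same function can be reused on text recovered by any of the extraction sketches in this catalog, which keeps the comparison across techniques like-for-like.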

---

## Metadata variants

### `dlp_img_5_pngtext.png` — PNG text chunks

- **Source:** [samples/dlp_img_5_pngtext.png](samples/dlp_img_5_pngtext.png) · [base64](encoded/dlp_img_5_pngtext.png.b64)
- **Within it:** markers in PNG `tEXt` / `zTXt` (compressed) / `iTXt` (utf-8) chunks — a
different metadata mechanism than JPEG EXIF/XMP. Tests whether a metadata blind spot
extends to PNG textual metadata.
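
Reading the three textual chunk types takes only `struct` and `zlib`. The sketch below walks the chunk list and decodes `tEXt`, `zTXt`, and `iTXt` bodies according to the PNG chunk layouts; it skips CRC verification, and the function name `png_text_chunks` is an assumption for this example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_text_chunks(data: bytes) -> list[tuple[str, str, str]]:
    """Return (chunk type, keyword, text) for every tEXt/zTXt/iTXt chunk."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG")
    found = []
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        body = data[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
        if ctype == b"IEND":
            break
        if ctype not in (b"tEXt", b"zTXt", b"iTXt"):
            continue
        keyword, _, rest = body.partition(b"\x00")
        if ctype == b"tEXt":
            text = rest.decode("latin-1")
        elif ctype == b"zTXt":
            # rest[0] is the compression method byte (0 = zlib deflate).
            text = zlib.decompress(rest[1:]).decode("latin-1")
        else:
            compressed = rest[0]  # compression flag; rest[1] is the method
            _language, _, rest = rest[2:].partition(b"\x00")
            _translated, _, raw = rest.partition(b"\x00")
            text = (zlib.decompress(raw) if compressed else raw).decode("utf-8")
        found.append((ctype.decode("ascii"), keyword.decode("latin-1"), text))
    return found
```

Note that `zTXt` (and compressed `iTXt`) text never appears verbatim in the file, so a raw byte scan alone would not find it even if the scanner did read whole files.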

### `dlp_img_6_iptc.jpg` — IPTC

- **Source:** [samples/dlp_img_6_iptc.jpg](samples/dlp_img_6_iptc.jpg) · [base64](encoded/dlp_img_6_iptc.jpg.b64)
- **Within it:** markers across IPTC IIM fields (`Caption-Abstract`, `Headline`, `Keywords`,
`By-line`, `SpecialInstructions`, `Credit`, `Source`). IPTC is the metadata standard most
asset-management and many DLP tools read.


1 change: 1 addition & 0 deletions docs/dlp-detection/encoded/dlp_archive.zip.b64
@@ -0,0 +1 @@
UEsDBBQAAAAIAINBtVxxbCkpswAAANIAAAALAAAAcGF5bG9hZC50eHRVjLsOgjAUQPd+xR11qPYitegkSjGIPEJ94GSqwYDhFdHo52uMi8tZTs5RKpwCExZlnCIajOyKToOJiIB/IPZeHX15ANv3bC9SbuSEoZCpHcRr+ZVKLhK5gedKl7d0e69dGXhDXwSODJfDU/xKLsXi8As+IyIrXZRwbfJ60D5OZXGeZS9dtWU2ODcVQJw3dQY9znkfGDLK0DRIrLuubW53SNEYmXwsLAAnmgNOLE6ZoGiSN1BLAQIUAxQAAAAIAINBtVxxbCkpswAAANIAAAALAAAAAAAAAAAAAACkgQAAAABwYXlsb2FkLnR4dFBLBQYAAAAAAQABADkAAADcAAAAAAA=
1 change: 1 addition & 0 deletions docs/dlp-detection/encoded/dlp_ctrl_clean.png.b64

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/dlp-detection/encoded/dlp_ctrl_stego_benign.png.b64

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/dlp-detection/encoded/dlp_doc_sensitive.docx.b64

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/dlp-detection/encoded/dlp_img_1_metadata.jpg.b64

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/dlp-detection/encoded/dlp_img_2_container.jpg.b64

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/dlp-detection/encoded/dlp_img_3_ocr.jpg.b64

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/dlp-detection/encoded/dlp_img_4_stego.png.b64

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/dlp-detection/encoded/dlp_img_5_pngtext.png.b64

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/dlp-detection/encoded/dlp_img_6_iptc.jpg.b64

Large diffs are not rendered by default.

98 changes: 98 additions & 0 deletions docs/dlp-detection/index.md
@@ -0,0 +1,98 @@
# DLP Detection Testing

A corpus of crafted files used to evaluate how well a content scanner (e.g. Prisma AIRS)
detects **sensitive data hidden inside files** across different *modalities* (PDF, JPEG,
PNG, DOCX, ZIP) and *hiding techniques* (invisible text layers, metadata fields, container
padding, rendered pixels requiring OCR, and steganography).

Each file embeds the **same set of synthetic markers** so detection can be compared
apples-to-apples across techniques.

!!! danger "All data is synthetic: no real PII"
    Every value in this corpus is either drawn from a reserved or publicly documented test
    range or invented for this corpus, and refers to no real person or account:

    | Type | Value | Why it's safe |
    | --- | --- | --- |
    | SSN | `078-05-1120` | Well-known sample SSN from 1930s wallet inserts; voided by the Social Security Administration and not assigned to anyone |
    | Credit card | `4111 1111 1111 1111` | Standard Visa test PAN (passes Luhn, not a real account) |
    | AWS credentials | `AKIAIOSFODNN7EXAMPLE` / `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` | AWS's own documented example key/secret |
    | Email | `john.public@example.com` | IANA-reserved `example.com` domain |
    | Phone | `(555) 010-0142` | Not dialable: `555` is not an assigned area code and a `010` exchange is invalid in the NANP; patterned after the `555-0100`–`555-0199` fictional-use block |
    | Identity | `Passport X12345678`, `DOB 1985-07-14` | Invented |

## Methodology

1. **Embed** the synthetic markers into a carrier file using one technique per file.
2. Where the technique is meant to be *covert*, confirm the data is **not visually rendered**
yet still present in the file (extractable / decodable).
3. **Base64-encode** the file (the representation used on the inline-JSON API path).
4. **Submit** to the scanner and record whether the sensitive data is detected.

See [Test File Catalog](catalog.md) for exactly what each file contains and how the data is hidden.

## Results so far

Legend: :material-check: detected · :material-close: not detected · :material-alert: anomalous · — untested

| Modality / technique | File | Visible? | Detected? | Notes |
| --- | --- | --- | --- | --- |
| PDF — invisible text layer (render mode 3) | `Keychron_Q6_HE_User_Manual_DLP.pdf` | No | :material-check: | 31 lines across 18 pages; **caught** |
| JPEG — EXIF + XMP metadata | `dlp_img_1_metadata.jpg` | No | :material-close: | metadata not parsed |
| JPEG — COM segment + bytes after EOI | `dlp_img_2_container.jpg` | No | :material-close: | raw container not scanned |
| JPEG — rendered pixels (OCR needed) | `dlp_img_3_ocr.jpg` | **Yes** | :material-close: | scanner does not OCR |
| PNG — LSB steganography | `dlp_img_4_stego.png` | No | :material-alert: | flagged **"toxic content"** — see below |
| JPEG — IPTC metadata | `dlp_img_6_iptc.jpg` | No | — | metadata variant |
| PNG — text chunks (tEXt/zTXt/iTXt) | `dlp_img_5_pngtext.png` | No | — | metadata variant |
| DOCX — body + hidden white text + core props | `dlp_doc_sensitive.docx` | Partly | — | Office modality |
| ZIP — payload.txt inside archive | `dlp_archive.zip` | No | — | archive recursion |
| Plaintext baseline | `samples/payload.txt` | Yes | — | sanity baseline |

!!! warning "Open question: the stego PNG result"
    The plaintext PII in files #1–#3 was **missed**, but the *steganographic* PNG (#4) was
    flagged as **"toxic content."** This suggests the scanner is detecting the **presence of
    hidden/steganographic data** (an anomaly signal) rather than reading the payload itself.
    Two controls are included to confirm:

    - `dlp_ctrl_clean.png`: identical image, **no** embedded data (false-positive control).
    - `dlp_ctrl_stego_benign.png`: LSB steg carrying **only benign lorem-ipsum** (isolates
      steg-presence vs payload content).

    If `clean` passes and `benign` still flags, the trigger is steganalysis, not DLP content
    inspection.

## Layout

```
docs/dlp-detection/
├── index.md # this page
├── catalog.md # per-file detail: what is what, what is within what
├── samples/ # the raw carrier files
├── encoded/ # base64 encodings (inline-JSON API representation)
└── scripts/ # generators + verifier (provenance / regenerate)
```

## Regenerate

From the `scripts/` directory (requires `pypdf reportlab pillow numpy piexif python-docx`,
plus `tesseract` and `exiftool` for the OCR/IPTC steps):

```bash
python3 embed_dlp.py # the PDF invisible-text-layer set
python3 build_image_dlp.py # image ladder: metadata / container / OCR / LSB stego
python3 build_png_text.py # PNG text-chunk metadata variant
python3 build_more_dlp.py # controls + DOCX + ZIP
python3 verify_image_dlp.py # confirms each image still carries its payload
```

## Submit to a scanner

The `encoded/` files are ready for the inline-JSON path. Copy one to the clipboard (`pbcopy` is macOS-only; on Linux, `xclip -selection clipboard` is a common equivalent):

```bash
pbcopy < encoded/dlp_img_4_stego.png.b64
```

Mind the media type per file: `application/pdf`, `image/jpeg`, `image/png`,
`application/vnd.openxmlformats-officedocument.wordprocessingml.document` (docx),
`application/zip`.
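
As a sketch of the encode-and-submit step, the snippet below base64-encodes a carrier and pairs it with the media type from the list above. The JSON field names (`filename`, `media_type`, `content_base64`) and the helper `build_request_body` are hypothetical placeholders; substitute the scanner's real request schema.

```python
import base64
import json
from pathlib import PurePosixPath

# Media types exactly as listed above, keyed by file extension.
MEDIA_TYPES = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".png": "image/png",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".zip": "application/zip",
}


def build_request_body(filename: str, raw: bytes) -> str:
    """Return a JSON string carrying the file as base64 plus its media type."""
    suffix = PurePosixPath(filename).suffix.lower()
    body = {
        # Hypothetical field names; replace with the scanner's documented schema.
        "filename": filename,
        "media_type": MEDIA_TYPES[suffix],
        "content_base64": base64.b64encode(raw).decode("ascii"),
    }
    return json.dumps(body)
```

An extension outside the table (such as `.txt`) raises `KeyError` here on purpose, so a missing media type is noticed before submission rather than silently defaulted.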
Binary file not shown.
Binary file added docs/dlp-detection/samples/dlp_archive.zip
Binary file not shown.
Binary file added docs/dlp-detection/samples/dlp_ctrl_clean.png
Binary file added docs/dlp-detection/samples/dlp_doc_sensitive.docx
Binary file not shown.
Binary file added docs/dlp-detection/samples/dlp_img_1_metadata.jpg
Binary file added docs/dlp-detection/samples/dlp_img_3_ocr.jpg
Binary file added docs/dlp-detection/samples/dlp_img_4_stego.png
Binary file added docs/dlp-detection/samples/dlp_img_5_pngtext.png
Binary file added docs/dlp-detection/samples/dlp_img_6_iptc.jpg
Binary file added docs/dlp-detection/samples/dlp_img_base.jpg
6 changes: 6 additions & 0 deletions docs/dlp-detection/samples/payload.txt
@@ -0,0 +1,6 @@
SSN: 078-05-1120
Visa 4111 1111 1111 1111
AWS_KEY AKIAIOSFODNN7EXAMPLE
AWS_SECRET wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Email john.public@example.com Phone (555) 010-0142
Passport X12345678 DOB 1985-07-14