docs: add DLP detection test corpus #71

Merged
cdot65 merged 1 commit into main from cdot65/dlp-detection-corpus on May 21, 2026

cdot65 commented May 21, 2026

Owner

Summary

  • Adds docs/dlp-detection/ — a corpus for testing how well a content scanner (Prisma AIRS) detects sensitive data hidden inside files.
  • Covers multiple modalities (PDF, JPEG, PNG, DOCX, ZIP) and hiding techniques: invisible PDF text layer (render mode 3), image metadata (EXIF/XMP/IPTC/PNG text chunks), JPEG container padding (COM segment + post-EOI bytes), rendered pixels (OCR), and LSB steganography.
  • Each carrier embeds the same synthetic markers for apples-to-apples comparison; includes base64 encodings (inline-JSON API representation), generator/verify scripts, and a per-file catalog.
  • Wired into the mkdocs nav under DLP Detection Testing.
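
The JPEG container-padding technique listed above can be sketched with the standard library alone. This is a minimal illustration, not the corpus generator itself: the function names, the `SYNTHETIC-MARKER-0001` value, and the two-marker stand-in carrier are all assumptions made for the example.

```python
import struct

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker
COM = b"\xff\xfe"  # JPEG comment segment marker


def add_com_segment(jpeg: bytes, comment: bytes) -> bytes:
    """Return a copy of `jpeg` with a COM segment inserted right after SOI.

    Placing COM directly after SOI keeps the sketch short; the JFIF
    convention puts APP0 first, so a real generator may insert it later.
    """
    if not jpeg.startswith(SOI):
        raise ValueError("missing SOI marker; not a JPEG stream")
    if len(comment) > 0xFFFF - 2:
        raise ValueError("comment does not fit in a single COM segment")
    # The 16-bit big-endian length counts its own two bytes plus the payload.
    segment = COM + struct.pack(">H", len(comment) + 2) + comment
    return jpeg[:2] + segment + jpeg[2:]


def append_after_eoi(jpeg: bytes, trailer: bytes) -> bytes:
    """Return a copy of `jpeg` with `trailer` appended after the final EOI."""
    if not jpeg.endswith(EOI):
        raise ValueError("stream does not end with EOI")
    return jpeg + trailer


def read_trailer(jpeg: bytes) -> bytes:
    """Return the bytes after the last EOI marker (empty if none).

    Assumes the trailer itself contains no 0xFFD9 sequence, which holds
    for plain ASCII markers.
    """
    end = jpeg.rfind(EOI)
    if end == -1:
        raise ValueError("no EOI marker found")
    return jpeg[end + 2:]


if __name__ == "__main__":
    carrier = SOI + EOI  # hypothetical stand-in, not a decodable image
    marker = b"SYNTHETIC-MARKER-0001"  # hypothetical synthetic marker
    hidden = append_after_eoi(add_com_segment(carrier, marker), marker)
    print(read_trailer(hidden))
```

Image decoders typically stop at EOI and skip COM segments, so neither payload changes the rendered picture; a scanner has to parse the container itself to see them.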

Findings so far

  • PDF invisible-text-layer: detected.
  • JPEG metadata / container / OCR pixels: not detected.
  • PNG LSB steg: flagged as "toxic content". This is likely steganography-presence detection rather than payload reading; two controls (dlp_ctrl_clean.png, dlp_ctrl_stego_benign.png) are included to confirm.
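
To make the LSB finding concrete, here is a rough sketch of least-significant-bit embedding over a flat buffer of channel bytes. It assumes a 4-byte big-endian length prefix and works on raw bytes rather than a decoded PNG; the corpus generator may frame its payload differently, and `embed_lsb` / `extract_lsb` are names invented for this example.

```python
def _bits(data: bytes):
    """Yield the bits of `data`, most significant bit first."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def embed_lsb(channels: bytes, payload: bytes) -> bytes:
    """Hide `payload` in the lowest bit of each channel byte.

    Assumed framing: 4-byte big-endian length, then the payload.
    """
    framed = len(payload).to_bytes(4, "big") + payload
    if len(framed) * 8 > len(channels):
        raise ValueError("carrier too small for payload")
    out = bytearray(channels)
    for i, bit in enumerate(_bits(framed)):
        out[i] = (out[i] & 0xFE) | bit  # clear the LSB, then set it
    return bytes(out)


def _read_bytes(channels: bytes, bit_offset: int, count: int) -> bytes:
    """Rebuild `count` bytes from LSBs starting at `bit_offset`."""
    result = bytearray()
    for b in range(count):
        value = 0
        for k in range(8):
            value = (value << 1) | (channels[bit_offset + b * 8 + k] & 1)
        result.append(value)
    return bytes(result)


def extract_lsb(channels: bytes) -> bytes:
    """Recover a payload written by `embed_lsb`."""
    if len(channels) < 32:
        raise ValueError("carrier too small to hold a length prefix")
    length = int.from_bytes(_read_bytes(channels, 0, 4), "big")
    if 32 + length * 8 > len(channels):
        raise ValueError("length prefix exceeds carrier size")
    return _read_bytes(channels, 32, length)
```

A benign payload and a sensitive one disturb the same bit plane in the same way, which is why a benign-stego control can help separate "steganography is present" from "the hidden payload was read".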

Safety

  • All values are synthetic / reserved-for-testing (reserved SSN, Visa test PAN, AWS documented example key, IANA example.com, 555-01xx phones). No real PII.
  • One synthetic Stripe-pattern token tripped GitHub push protection and was allowlisted as test data.
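
As a sanity check on the kind of synthetic values listed above, the sketch below validates a card number with the Luhn checksum and tests whether a phone number falls in the 555-01xx range. The PAN `4111 1111 1111 1111` is a widely published Visa test number used here as an assumption; the PR does not state which test PAN the corpus embeds, and both helper names are hypothetical.

```python
import re


def luhn_valid(pan: str) -> bool:
    """Return True if the digits in `pan` pass the Luhn checksum."""
    digits = [int(c) for c in pan if c.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_fictional_phone(number: str) -> bool:
    """Return True for 555-01xx numbers, with an optional area code."""
    return re.fullmatch(r"(?:\d{3}-)?555-01\d{2}", number) is not None


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))
    print(is_fictional_phone("555-0142"))
```

A test PAN still passes Luhn, so it should trigger the same pattern-plus-checksum detectors as a real card number without being one.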

Test plan

  • mkdocs build is clean (verified locally — no warnings/broken links for the new section)
  • Nav entry renders and sample/encoded download links resolve
  • Continue submitting untested vectors (IPTC, PNG text chunks, DOCX, ZIP, controls) to the scanner and update the results matrix

Multi-modality synthetic scanner test files (PDF/JPEG/PNG/DOCX/ZIP) hiding
sensitive data via invisible text layer, metadata (EXIF/XMP/IPTC/PNG chunks),
container padding, rendered pixels, and LSB steg; plus base64 encodings,
generator/verify scripts, and a per-file catalog. Wired into mkdocs nav.
All data synthetic / reserved-for-testing.
cdot65 merged commit a1f4ec6 into main on May 21, 2026
4 checks passed