
Multi-modal watermark dataset contribution: 1,633 files across image/audio/video (SynthID + C2PA) #19

@biangacila


Multi-Modal Watermark Dataset: 1,633 Files

Download: dataset.tar.gz (1.3 GB)


Summary

Modality  Model                                  Watermark      Reference  Diverse  Total
Image     gemini/nano-banana-pro-preview         SynthID              390      100    490
Image     gemini/gemini-3.1-flash-image-preview  SynthID              489      100    589
Image     openai/dall-e-3                        C2PA metadata        180      100    280
Audio     gemini/gemini-2.5-flash-preview-tts    SynthID              187       30    217
Video     gemini/veo-3.1-generate-preview        SynthID               18       10     28
Video     gemini/veo-3.0-fast-generate-001       SynthID                6        0      6
Video     openai/sora-2                          C2PA metadata         13       10     23
Total                                                               1,283      350  1,633

Only models with confirmed watermarking were used. Models without confirmed watermarks (OpenAI TTS, gpt-image-1/1.5, FLUX, DashScope) were intentionally excluded.


Image — 1,359 files

Nano Banana Pro SynthID (490 files)

This is the model the project specifically requests contributions for. Reference images were generated via image-to-image: a pure-color PNG is sent to NBP, which recreates it with SynthID embedded.
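
The pure-color inputs do not need an imaging library. Below is a minimal sketch using only Python's standard library. It assumes plain 8-bit RGB PNGs are acceptable inputs. The helper names, lowercase color keys and output filenames are hypothetical. The colors come from the reference table in this issue, and the sizes come from the "Input Size" column of its resolution table.

```python
import struct
import zlib
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Hypothetical names: the six reference colors and seven input sizes
# listed in this issue's tables.
COLORS = {
    "black": (0x00, 0x00, 0x00), "white": (0xFF, 0xFF, 0xFF),
    "red": (0xFF, 0x00, 0x00), "green": (0x00, 0xFF, 0x00),
    "blue": (0x00, 0x00, 0xFF), "gray": (0x80, 0x80, 0x80),
}
INPUT_SIZES = [(1024, 1024), (640, 480), (768, 512), (1920, 1080),
               (2048, 1024), (1024, 2048), (1080, 1920)]


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Pack one PNG chunk: big-endian length, type, data, CRC-32 of type+data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def solid_png(width: int, height: int, rgb: tuple[int, int, int]) -> bytes:
    """Return an 8-bit truecolor PNG in which every pixel equals rgb."""
    # IHDR fields: width, height, bit depth 8, color type 2 (RGB),
    # compression 0, filter method 0, no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline is a filter-type byte (0 = None) followed by RGB triples.
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height, 9)
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", idat) + _chunk(b"IEND", b""))


def write_inputs(out_dir: str = "inputs") -> None:
    """Write one input PNG per color and size, e.g. inputs/green_640x480.png."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rgb in COLORS.items():
        for w, h in INPUT_SIZES:
            (out / f"{name}_{w}x{h}.png").write_bytes(solid_png(w, h, rgb))
```

Because every scanline is identical, zlib compresses even the 2048×1024 inputs to a few kilobytes.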

Reference (390): 6 colors × 7 resolutions × ~10 each.

Color  RGB      Purpose
Black  #000000  Watermark ≈ entire pixel content
White  #FFFFFF  Cross-validation via phase inversion
Red    #FF0000  R-channel carrier isolation
Green  #00FF00  G-channel carrier isolation (strongest per project findings)
Blue   #0000FF  B-channel carrier isolation
Gray   #808080  Mid-tone carrier behavior
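
Each reference color implies a simple per-channel comparison. Subtract the nominal color from every output pixel and summarize the result for R, G and B separately. Below is a minimal sketch of that comparison. Pixel loading is left out. With Pillow, for example, Image.open(path).convert("RGB").getdata() yields the expected (R, G, B) tuples. The function and dictionary names are hypothetical. One caveat follows from 8-bit clipping: a black input can only deviate upward and a white one only downward, while gray leaves room in both directions.

```python
from statistics import mean
from typing import Iterable, Sequence

# Nominal colors from the table above, as 8-bit (R, G, B).
REFERENCE_RGB = {
    "black": (0, 0, 0), "white": (255, 255, 255), "red": (255, 0, 0),
    "green": (0, 255, 0), "blue": (0, 0, 255), "gray": (128, 128, 128),
}


def channel_residuals(pixels: Iterable[Sequence[int]], color: str) -> dict:
    """Per-channel statistics of (output pixel - nominal color).

    pixels: iterable of (R, G, B) values in 0..255; must be non-empty.
    Returns {"R": {...}, "G": {...}, "B": {...}} with the signed mean,
    mean absolute deviation, and maximum absolute deviation.
    """
    ref = REFERENCE_RGB[color]
    diffs = ([], [], [])
    for px in pixels:
        for c in range(3):
            # int() also accepts numpy uint8 values without overflow
            diffs[c].append(int(px[c]) - ref[c])
    return {
        name: {
            "mean": mean(d),
            "mean_abs": mean(abs(v) for v in d),
            "max_abs": max(abs(v) for v in d),
        }
        for name, d in zip("RGB", diffs)
    }
```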

Diverse (100): Text-to-image with varied prompts.

Gemini 3.1 Flash SynthID (589 files)

Same approach as NBP but from a different Gemini model — allows comparing SynthID patterns across model variants.

Reference (489): 6 colors × 7 resolutions × ~10 each.

7 distinct output resolutions, each with different SynthID carrier frequency positions (counts are given for both NBP and Flash):

Output Resolution  Input Size  Count (NBP)  Count (Flash)
1024×1024          1024×1024            53             67
1195×896           640×480              49             71
1264×843           768×512              56             72
1365×768           1920×1080            53             69
1440×720           2048×1024            56             63
720×1440           1024×2048            54             72
768×1365           1080×1920            53             71
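
The table shows two regularities. Every output keeps the input's aspect ratio to within about 0.1%, and every output lands at roughly one megapixel (1.04 to 1.07 million pixels). Below is a minimal sketch that uses this observation to guess which reference bucket a new input size will fall into. The function name is hypothetical. The heuristic only covers the seven aspect ratios seen here and does not reproduce the model's actual resizing rule.

```python
import math

# Input size -> observed output resolution, copied from the table above.
OBSERVED_OUTPUT = {
    (1024, 1024): (1024, 1024),
    (640, 480): (1195, 896),
    (768, 512): (1264, 843),
    (1920, 1080): (1365, 768),
    (2048, 1024): (1440, 720),
    (1024, 2048): (720, 1440),
    (1080, 1920): (768, 1365),
}


def nearest_output(width: int, height: int) -> tuple[int, int]:
    """Guess the output bucket for an input size by closest aspect ratio.

    Distance is measured on log(width/height), so 2:1 and 1:2 are treated
    symmetrically. Only meaningful for the seven buckets observed here.
    """
    target = math.log(width / height)
    return min(OBSERVED_OUTPUT.values(),
               key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))
```

For example, an 800×600 input has the same 4:3 ratio as 640×480, so the sketch maps it to the 1195×896 bucket.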

Diverse (100): Text-to-image with varied prompts. Primarily 1408×768.

DALL-E 3 C2PA Baseline (280 files)

Control group — C2PA metadata watermark (trivially removable via exiftool -all= image.png).
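
Inspecting each PNG's chunk list checks two things: that the baseline really carries C2PA data, and that exiftool -all= image.png actually removes it. Below is a minimal standard-library sketch of that check. It assumes the manifest is stored in a caBX chunk, which is where the C2PA specification places the JUMBF manifest store in PNG files. The function names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunks(data: bytes) -> list[tuple[str, int]]:
    """List (chunk_type, length) for every chunk, verifying each CRC."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    chunks, pos = [], len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, kind = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", data[pos + 8 + length:pos + 12 + length])
        if zlib.crc32(kind + body) & 0xFFFFFFFF != crc:
            raise ValueError(f"bad CRC in {kind!r} chunk")
        chunks.append((kind.decode("latin-1"), length))
        pos += 12 + length  # 4-byte length + 4-byte type + data + 4-byte CRC
        if kind == b"IEND":
            break
    return chunks


def has_c2pa_manifest(data: bytes) -> bool:
    """True if a caBX chunk (assumed C2PA manifest container) is present."""
    return any(kind == "caBX" for kind, _ in png_chunks(data))
```

Call has_c2pa_manifest on a file's bytes before and after stripping to see whether the chunk survived.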

Reference (180): 6 colors × 3 sizes (1024×1024, 1024×1792, 1792×1024) × 10 each.
Diverse (100): Mixed sizes.


Audio — 217 files

Gemini TTS SynthID

Gemini TTS is the only audio model with a confirmed watermark. The principle is the same as for black images: minimal utterances maximize the watermark-to-content ratio.

Reference (187): drawn from a grid of 8 voices × 10 texts × 3 repeated generations (240 combinations), with the repeats intended for differential extraction.

  • Voices: Kore, Puck, Charon, Fenrir, Aoede, Leda, Orus, Zephyr
  • Texts: "A.", "One.", "Ah.", "Hmm.", "Okay.", "Yes.", "No.", "Hello.", "The.", "Aaaaaaaah."
  • Format: WAV, 24 kHz, 16-bit mono, 0.5–2.9s

Diverse (30): Natural sentences, 4.5–7.6s.
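
The stated format allows a quick conformance check using only Python's wave module. The expected values below come from the audio format line and the two duration ranges above. The function and constant names are hypothetical. Since the quoted ranges are rounded, clips right at the edges may need a small tolerance.

```python
import wave

# Format stated in this issue: mono, 16-bit (2-byte) samples, 24 kHz.
EXPECTED = {"nchannels": 1, "sampwidth": 2, "framerate": 24_000}
REFERENCE_SECONDS = (0.5, 2.9)  # rounded range quoted for reference clips
DIVERSE_SECONDS = (4.5, 7.6)    # rounded range quoted for diverse clips


def format_problems(nchannels: int, sampwidth: int, framerate: int,
                    nframes: int, seconds_range=REFERENCE_SECONDS) -> list[str]:
    """Describe every mismatch against the expected format; empty list means OK."""
    actual = {"nchannels": nchannels, "sampwidth": sampwidth, "framerate": framerate}
    problems = [f"{k}={actual[k]}, expected {v}"
                for k, v in EXPECTED.items() if actual[k] != v]
    seconds = nframes / framerate if framerate else 0.0
    lo, hi = seconds_range
    if not lo <= seconds <= hi:
        problems.append(f"duration {seconds:.2f}s outside {lo}-{hi}s")
    return problems


def check_wav(path, seconds_range=REFERENCE_SECONDS) -> list[str]:
    """Read a WAV header with the standard wave module and check it."""
    with wave.open(str(path), "rb") as w:
        return format_problems(w.getnchannels(), w.getsampwidth(),
                               w.getframerate(), w.getnframes(), seconds_range)
```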


Video — 57 files

Veo 3.1 SynthID (28 files)

Reference (18): Solid-color screens, 3 clips per color, 4s, 1280×720.
Diverse (10): Natural scenes, 8s.

Veo 3.0 Fast SynthID (6 files)

Reference (6): Black/white, 4s.

Sora 2 C2PA Baseline (23 files)

Reference (13): Solid colors (some blocked by moderation).
Diverse (10): 8s.


Directory Structure

dataset/
├── image/
│   ├── nano-banana-pro/
│   │   ├── reference/    # 390 PNGs — {color}_{WxH}_{idx}.png
│   │   └── diverse/      # 100 PNGs — diverse_{idx}.png
│   ├── gemini/
│   │   ├── reference/    # 489 PNGs — {color}_{WxH}_{idx}.png
│   │   └── diverse/      # 100 PNGs — diverse_{idx}.png
│   └── dall-e-3/
│       ├── reference/    # 180 PNGs — {color}_{size}_{idx}.png
│       └── diverse/      # 100 PNGs — diverse_{size}_{idx}.png
├── audio/
│   └── gemini-tts/
│       ├── reference/    # 187 WAVs — ref{NN}_{voice}_{idx}.wav
│       └── diverse/      #  30 WAVs — diverse_{voice}_{idx}.wav
└── video/
    ├── veo-3.1/
    │   ├── reference/    #  18 MP4s — {color}_{idx}.mp4
    │   └── diverse/      #  10 MP4s — diverse_{idx}.mp4
    ├── veo-3.0-fast/
    │   └── reference/    #   6 MP4s
    └── sora-2/
        ├── reference/    #  13 MP4s
        └── diverse/      #  10 MP4s
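
Two checks help compare a downloaded copy with the summary table. First, tally reference images by color and resolution from their filenames. Second, print per-directory file counts. Below is a small sketch that does both. It assumes the {color}_{WxH}_{idx}.png pattern uses lowercase color names and an ASCII "x" between width and height. The function names are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed reference-image pattern {color}_{WxH}_{idx}.png with lowercase
# color names and an ASCII "x" separator.
REF_IMAGE = re.compile(r"^(?P<color>[a-z]+)_(?P<w>\d+)x(?P<h>\d+)_(?P<idx>\d+)\.png$")


def tally_reference(names):
    """Count names per (color, "WxH"); also return names that do not match."""
    counts, unmatched = Counter(), []
    for name in names:
        m = REF_IMAGE.match(name)
        if m:
            counts[(m["color"], f'{m["w"]}x{m["h"]}')] += 1
        else:
            unmatched.append(name)
    return counts, unmatched


def per_resolution(counts):
    """Collapse (color, resolution) counts into per-resolution totals."""
    totals = Counter()
    for (_color, res), n in counts.items():
        totals[res] += n
    return totals


def print_directory_counts(root="dataset"):
    """Print the number of files in each modality/model/split directory."""
    root = Path(root)
    for d in sorted(p for p in root.glob("*/*/*") if p.is_dir()):
        n = sum(1 for p in d.iterdir() if p.is_file())
        print(f"{d.relative_to(root)}: {n}")
```

For instance, pass p.name for every file in dataset/image/gemini/reference to tally_reference, then pass the result to per_resolution. This gives per-resolution totals to set against the resolution table.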

Watermark Evidence

Model                                              System                                 Evidence
Nano Banana Pro / Gemini Flash / Veo / Gemini TTS  SynthID (invisible, frequency-domain)  deepmind.google/technologies/synthid
DALL-E 3 / Sora 2                                  C2PA metadata (easily stripped)        c2pa.org

Value

  1. Nano Banana Pro data — the specific model the project requests for contributions
  2. Two Gemini model variants (NBP + Flash) — compare SynthID patterns across model versions
  3. 7 new resolution profiles — existing dataset only has 1024×1024 and 2816×1536
  4. First audio watermark dataset — enables SynthID audio detection/removal research
  5. First video watermark dataset — enables SynthID multi-frame spectral analysis
  6. Per-channel R/G/B reference — direct per-channel carrier isolation
  7. C2PA baseline — control group for invisible vs. metadata watermarking
  8. Differential audio pairs — same text × voice × 3 gens for cross-correlation extraction
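
The extraction method itself is not described here. Still, any differential approach has to start by aligning repeated generations of the same text and voice. Below is a minimal sketch of that alignment step, assuming 16-bit mono WAV input as in the reference set. It finds the lag that maximizes normalized cross-correlation, then takes the sample-wise difference over the overlap. The function names are hypothetical. The pure-Python loop costs O(N × lags), so keep max_lag small or switch to numpy for full-length clips.

```python
import math
import sys
import wave
from array import array


def read_pcm16_mono(path) -> list[int]:
    """Read a 16-bit mono PCM WAV file into a list of sample values."""
    with wave.open(str(path), "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        samples = array("h", w.readframes(w.getnframes()))
    if sys.byteorder == "big":  # WAV samples are little-endian on disk
        samples.byteswap()
    return samples.tolist()


def _overlap(a, b, lag):
    """Pair a[i + lag] with b[i]; return the two overlapping slices."""
    x, y = (a[lag:], b) if lag >= 0 else (a, b[-lag:])
    n = min(len(x), len(y))
    return x[:n], y[:n]


def _pearson(x, y) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    if n < 2:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0


def best_lag(a, b, max_lag: int) -> tuple[int, float]:
    """Lag in [-max_lag, max_lag] maximizing correlation of a[i + lag] with b[i]."""
    best = (0, -math.inf)
    for lag in range(-max_lag, max_lag + 1):
        r = _pearson(*_overlap(a, b, lag))
        if r > best[1]:
            best = (lag, r)
    return best


def aligned_difference(a, b, lag: int) -> list[int]:
    """Sample-wise a[i + lag] - b[i] over the overlapping region."""
    x, y = _overlap(a, b, lag)
    return [p - q for p, q in zip(x, y)]
```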
