Dataset Explorer

A Flask + DuckDB app for browsing and searching across training datasets (~91M rows). It provides full-text search, faceted filtering by dataset and duration, inline video playback for local datasets, and YouTube thumbnail extraction.

Quick Start

conda activate sam
cd data/dataset_explorer
pip install -r requirements.txt
python app.py
# Open http://localhost:5555

First startup builds a DuckDB cache + FTS index (~2 min for 91M rows). Subsequent startups reuse the cache (~5s).

Directory Layout

data/dataset_explorer/
├── app.py                  # Flask web server
├── prepare_from_export.py  # (coming soon)
├── requirements.txt
├── static/
│   └── index.html          # Frontend (single-page)
├── scripts/                # One-time setup scripts
│   ├── setup_data.py       # Source CSVs/JSONs -> normalized parquets
│   └── reconstruct_epic_kitchens.py  # Frame tar archives -> MP4 videos
└── data/                   # All data (gitignored)
    ├── source/             # Raw source files you download
    ├── *.parquet           # Generated by scripts/setup_data.py
    ├── datasets.duckdb     # DuckDB cache (auto-generated on first run)
    ├── videos/             # Local video files
    │   ├── ssv2/           # ~220k webm files (~19 GB)
    │   └── epic_kitchens/  # ~631 mp4 files
    └── thumbnails/         # On-demand video thumbnails

Setting Up Data from Scratch

Everything below runs from the data/dataset_explorer/ directory with conda activate sam.

Step 1: Download Source Data

Place each dataset's source files under data/source/:

Action100M

data/source/action100m_actions.csv

Panda-70M (HuggingFace, ~17 parquet shards)

huggingface-cli download SUSTech/panda-70m --repo-type dataset \
  --local-dir data/source/panda70m

LVP (Large Video Planner) metadata

data/source/lvp/pandas/cleaned_metadata_with_youtube_key.csv
data/source/lvp/something_something_v2/cleaned_metadata.csv
data/source/lvp/epic_kitchens/cleaned_metadata.csv

SSv2 labels (official Something-Something-V2 label JSONs)

data/source/ssv2_meta/train.json
data/source/ssv2_meta/validation.json

Epic-Kitchens annotations + frame tars

data/source/epic_kitchens/EPIC_100_train.csv
data/source/epic_kitchens/EPIC_100_validation.csv
data/source/epic_kitchens/EPIC_100_video_info.csv
data/source/epic_kitchens/rgb_frames/*.tar       # ~700 frame tar archives

Step 2: Download Local Videos

SSv2 videos (~19 GB, ~220k webm files)

# Download from HuggingFace
huggingface-cli download morpheushoc/something-something-v2 \
  --repo-type dataset --local-dir data/videos/ssv2_download

# Extract (multi-part tar.gz)
mkdir -p data/videos/ssv2
cd data/videos/ssv2_download/videos
cat 20bn-something-something-v2-* | tar -xz --strip-components=1 -C ../../ssv2/
cd ../../../..

Epic-Kitchens videos (reconstruct MP4s from frame tars, takes a few hours)

python scripts/reconstruct_epic_kitchens.py
# Resumable -- skips existing non-empty MP4s
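The resumability comes down to a skip check on each output file before doing any work. A sketch of the idea, with hypothetical helper names; the real script may differ in frame naming, framerate, and tar layout:

```python
import os
import subprocess
import tarfile
import tempfile

def should_skip(out_path):
    """Resumable check: an existing, non-empty MP4 counts as already done."""
    return os.path.exists(out_path) and os.path.getsize(out_path) > 0

def reconstruct_one(tar_path, out_path, fps=50):
    """Extract one frame tar and encode its JPEGs into an MP4 via ffmpeg."""
    if should_skip(out_path):
        return
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(tar_path) as tf:
            tf.extractall(tmp)  # assumes frames sit at the tar root
        subprocess.run(
            ["ffmpeg", "-y", "-framerate", str(fps), "-pattern_type", "glob",
             "-i", os.path.join(tmp, "*.jpg"),
             "-c:v", "libx264", "-pix_fmt", "yuv420p", out_path],
            check=True,
        )
```

Checking for a non-empty file (rather than mere existence) matters because a killed ffmpeg run can leave a zero-byte MP4 behind, which should be redone on the next pass.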

Step 3: Generate Parquet Files

python scripts/setup_data.py

This reads all source data and produces 7 normalized parquet files:

| File | Dataset | Rows | Source |
| --- | --- | --- | --- |
| action100m.parquet | Action100M | ~20.3M | YouTube clips |
| panda70m.parquet | Panda-70M | ~70.7M | YouTube clips, per-segment captions |
| lvp_pandas.parquet | LVP: Pandas-70M | ~197k | YouTube clips, cross-referenced with Panda-70M |
| ssv2.parquet | SSv2 | ~194k | Local video (webm) |
| lvp_ssv2.parquet | LVP: SSv2 | ~93k | Local video (webm) |
| epic_kitchens.parquet | Epic-Kitchens | ~75k | Local video (mp4) |
| lvp_epic_kitchens.parquet | LVP: Epic-Kitchens | ~7k | Local video (mp4) |

All parquets share the same schema: video_id, caption, start_sec, end_sec, dataset, video_path. For YouTube-only datasets video_path is NULL; for local datasets it is a filename.

Step 4: Run the Server

python app.py
# Open http://localhost:5555

To force a DuckDB rebuild (e.g. after regenerating parquets):

rm -f data/datasets.duckdb data/datasets.duckdb.wal
python app.py

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| DATASET_EXPLORER_DEBUG | 0 | Set to 1 for Flask debug mode with auto-reload |
| DATASET_EXPLORER_DB_PATH | data/datasets.duckdb | Override the DuckDB cache path |
| DATASET_EXPLORER_YTDLP | (auto-detect) | Path to the yt-dlp binary used for YouTube thumbnails |
| DATASET_EXPLORER_THUMB_MAX_JOBS | 2 | Max concurrent ffmpeg thumbnail extractions |
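A sketch of how these variables might be consumed on the Python side; the `load_config` function and its dict keys are illustrative, not the app's actual API:

```python
import os
import shutil

def load_config():
    """Read the explorer's configuration from the environment, with defaults."""
    return {
        "debug": os.environ.get("DATASET_EXPLORER_DEBUG", "0") == "1",
        "db_path": os.environ.get("DATASET_EXPLORER_DB_PATH", "data/datasets.duckdb"),
        # Fall back to whatever yt-dlp is on PATH (None if not installed)
        "ytdlp": os.environ.get("DATASET_EXPLORER_YTDLP") or shutil.which("yt-dlp"),
        "thumb_max_jobs": int(os.environ.get("DATASET_EXPLORER_THUMB_MAX_JOBS", "2")),
    }
```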