Flask + DuckDB app for browsing and searching across training datasets (~91M rows). Full-text search, faceted filtering by dataset/duration, inline video playback for local datasets, and YouTube thumbnail extraction.
```bash
conda activate sam
cd data/dataset_explorer
pip install -r requirements.txt
python app.py
# Open http://localhost:5555
```

First startup builds a DuckDB cache + FTS index (~2 min for 91M rows). Subsequent startups reuse the cache (~5s).
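Under the hood, the cache build and search amount to roughly the following (a minimal sketch: the `clips` table, `id` column, and query shape are illustrative assumptions, not the actual `app.py` internals):

```python
import duckdb

con = duckdb.connect("data/datasets.duckdb")
con.execute("INSTALL fts")
con.execute("LOAD fts")

# One table over all normalized parquets, plus a synthetic id for the FTS index
con.execute("""
    CREATE TABLE IF NOT EXISTS clips AS
    SELECT row_number() OVER () AS id, *
    FROM read_parquet('data/*.parquet')
""")
con.execute("PRAGMA create_fts_index('clips', 'id', 'caption')")

# BM25-ranked caption search, faceted by dataset
rows = con.execute("""
    SELECT video_id, caption, score
    FROM (
        SELECT *, fts_main_clips.match_bm25(id, ?) AS score
        FROM clips
        WHERE dataset = ?
    ) sq
    WHERE score IS NOT NULL
    ORDER BY score DESC
    LIMIT 20
""", ["pouring water into a glass", "ssv2"]).fetchall()
```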
```
data/dataset_explorer/
├── app.py                            # Flask web server
├── prepare_from_export.py            # (coming soon)
├── requirements.txt
├── static/
│   └── index.html                    # Frontend (single-page)
├── scripts/                          # One-time setup scripts
│   ├── setup_data.py                 # Source CSVs/JSONs -> normalized parquets
│   └── reconstruct_epic_kitchens.py  # Frame tar archives -> MP4 videos
└── data/                             # All data (gitignored)
    ├── source/                       # Raw source files you download
    ├── *.parquet                     # Generated by scripts/setup_data.py
    ├── datasets.duckdb               # DuckDB cache (auto-generated on first run)
    ├── videos/                       # Local video files
    │   ├── ssv2/                     # ~220k webm files (~19 GB)
    │   └── epic_kitchens/            # ~631 mp4 files
    └── thumbnails/                   # On-demand video thumbnails
```
Everything below runs from the `data/dataset_explorer/` directory with `conda activate sam`.

Place each dataset's source files under `data/source/`:
**Action100M**

```
data/source/action100m_actions.csv
```
**Panda-70M** (HuggingFace, ~17 parquet shards)

```bash
huggingface-cli download SUSTech/panda-70m --repo-type dataset \
  --local-dir data/source/panda70m
```

**LVP (Large Video Planner) metadata**

```
data/source/lvp/pandas/cleaned_metadata_with_youtube_key.csv
data/source/lvp/something_something_v2/cleaned_metadata.csv
data/source/lvp/epic_kitchens/cleaned_metadata.csv
```
**SSv2 labels** (official Something-Something-V2 label JSONs)

```
data/source/ssv2_meta/train.json
data/source/ssv2_meta/validation.json
```
**Epic-Kitchens annotations + frame tars**

```
data/source/epic_kitchens/EPIC_100_train.csv
data/source/epic_kitchens/EPIC_100_validation.csv
data/source/epic_kitchens/EPIC_100_video_info.csv
data/source/epic_kitchens/rgb_frames/*.tar   # ~700 frame tar archives
```
**SSv2 videos** (~19 GB, ~220k webm files)

```bash
# Download from HuggingFace
huggingface-cli download morpheushoc/something-something-v2 \
  --repo-type dataset --local-dir data/videos/ssv2_download

# Extract (multi-part tar.gz)
mkdir -p data/videos/ssv2
cd data/videos/ssv2_download/videos
cat 20bn-something-something-v2-* | tar -xz --strip-components=1 -C ../../ssv2/
cd ../../../..
```

**Epic-Kitchens videos** (reconstruct MP4s from frame tars; takes a few hours)

```bash
python scripts/reconstruct_epic_kitchens.py
# Resumable -- skips existing non-empty MP4s
```
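Conceptually, the reconstruction does something like this (a simplified sketch, not the actual script; the 50 fps frame rate, JPEG frame layout, and ffmpeg settings are assumptions about the EPIC tar contents):

```python
import subprocess, tarfile, tempfile
from pathlib import Path

tars = Path("data/source/epic_kitchens/rgb_frames")
out_dir = Path("data/videos/epic_kitchens")
out_dir.mkdir(parents=True, exist_ok=True)

for tar_path in sorted(tars.glob("*.tar")):
    out_mp4 = out_dir / f"{tar_path.stem}.mp4"
    if out_mp4.exists() and out_mp4.stat().st_size > 0:
        continue  # resumable: skip already-reconstructed videos
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(tar_path) as tf:
            tf.extractall(tmp)
        # Frames may sit in a nested directory; locate them first
        frame_dir = next(p.parent for p in Path(tmp).rglob("*.jpg"))
        subprocess.run(
            ["ffmpeg", "-y", "-framerate", "50",
             "-pattern_type", "glob", "-i", str(frame_dir / "*.jpg"),
             "-c:v", "libx264", "-pix_fmt", "yuv420p", str(out_mp4)],
            check=True,
        )
```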
Once all sources are in place, run the normalization step:

```bash
python scripts/setup_data.py
```

This reads all source data and produces 7 normalized parquet files:

| File | Dataset | Rows | Source |
|---|---|---|---|
| `action100m.parquet` | Action100M | ~20.3M | YouTube clips |
| `panda70m.parquet` | Panda-70M | ~70.7M | YouTube clips, per-segment captions |
| `lvp_pandas.parquet` | LVP: Pandas-70M | ~197k | YouTube clips, cross-referenced with Panda-70M |
| `ssv2.parquet` | SSv2 | ~194k | Local video (webm) |
| `lvp_ssv2.parquet` | LVP: SSv2 | ~93k | Local video (webm) |
| `epic_kitchens.parquet` | Epic-Kitchens | ~75k | Local video (mp4) |
| `lvp_epic_kitchens.parquet` | LVP: Epic-Kitchens | ~7k | Local video (mp4) |
All parquets share the same schema: `video_id`, `caption`, `start_sec`, `end_sec`, `dataset`, `video_path`
(`video_path` is NULL for YouTube-only datasets and a filename for local datasets).
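For illustration, a parquet in this schema could be written like so (hypothetical rows; the real files come from `scripts/setup_data.py`):

```python
import pandas as pd

# Two hypothetical rows in the shared schema
df = pd.DataFrame([
    {"video_id": "abc123XYZ_0", "caption": "a person pours water into a glass",
     "start_sec": 4.2, "end_sec": 9.8, "dataset": "panda70m",
     "video_path": None},            # YouTube-only dataset: no local file
    {"video_id": "74225", "caption": "Pouring water into a glass",
     "start_sec": 0.0, "end_sec": 3.5, "dataset": "ssv2",
     "video_path": "74225.webm"},    # local dataset: filename under data/videos/
])
df.to_parquet("example.parquet", index=False)
```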
```bash
python app.py
# Open http://localhost:5555
```

To force a DuckDB rebuild (e.g. after regenerating parquets):

```bash
rm -f data/datasets.duckdb data/datasets.duckdb.wal
python app.py
```

Environment variables:

| Variable | Default | Description |
|---|---|---|
| `DATASET_EXPLORER_DEBUG` | `0` | Set to `1` for Flask debug mode with auto-reload |
| `DATASET_EXPLORER_DB_PATH` | `data/datasets.duckdb` | Override DuckDB path |
| `DATASET_EXPLORER_YTDLP` | (auto-detect) | Path to `yt-dlp` binary for YouTube thumbnails |
| `DATASET_EXPLORER_THUMB_MAX_JOBS` | `2` | Max concurrent ffmpeg thumbnail extractions |
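For context on the last variable, thumbnail extraction can be pictured like this (an illustrative sketch, not the actual `app.py` code; `extract_thumbnail` and its parameters are hypothetical):

```python
import os, subprocess, threading

# Cap concurrent ffmpeg thumbnail jobs (default 2, matching the table above)
MAX_JOBS = int(os.environ.get("DATASET_EXPLORER_THUMB_MAX_JOBS", "2"))
_sem = threading.BoundedSemaphore(MAX_JOBS)

def extract_thumbnail(video_path: str, thumb_path: str, at_sec: float = 1.0) -> None:
    """Grab a single frame from a local video as a JPEG thumbnail."""
    with _sem:  # at most MAX_JOBS ffmpeg processes at once
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(at_sec), "-i", video_path,
             "-frames:v", "1", "-vf", "scale=320:-2", thumb_path],
            check=True,
        )
```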