This repository contains a self-trained deepfake video detection system built using a Vision Transformer (ViT-B/14 with DINOv2 backbone).
The system analyzes videos temporally, identifies manipulated segments, and outputs timestamp-localized deepfake regions, with an emphasis on localization quality over raw classification accuracy.
- Self-trained deepfake model (not a prebuilt classifier)
- Vision Transformer (ViT-B/14, DINOv2)
- Timestamp localization of manipulated segments
- Median smoothing + temporal segment merging
- Video-level and segment-level confidence scores
- Interactive Streamlit web interface
- Efficient inference via 2 FPS frame sampling
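
As a rough illustration of the median smoothing and segment merging listed above, here is a minimal sketch in plain Python. The function names, the window size, the 0.5 threshold, the 1-second merge gap, and the use of the mean probability as segment confidence are all assumptions made for this example; the repository's actual parameters may differ.

```python
from statistics import median


def median_smooth(probs, window=3):
    """Replace each per-frame probability with the median of its neighbourhood.

    `window` is a hypothetical setting; the window shrinks at the sequence edges.
    """
    half = window // 2
    smoothed = []
    for i in range(len(probs)):
        lo = max(0, i - half)
        hi = min(len(probs), i + half + 1)
        smoothed.append(median(probs[lo:hi]))
    return smoothed


def find_segments(probs, fps=2.0, threshold=0.5, max_gap_s=1.0):
    """Group flagged frames into (start_s, end_s, confidence) segments.

    Frames are assumed to be sampled at `fps`, so frame i sits at i / fps seconds.
    Flagged frames separated by at most `max_gap_s` seconds are merged.
    Confidence is taken as the mean probability of the flagged frames (an assumption).
    """
    step = 1.0 / fps
    segments = []  # each entry: [start_s, end_s, [flagged probabilities]]
    for i, p in enumerate(probs):
        if p < threshold:
            continue
        t = i * step
        if segments and t - segments[-1][1] <= max_gap_s:
            # Close enough to the previous segment: extend it.
            segments[-1][1] = t + step
            segments[-1][2].append(p)
        else:
            segments.append([t, t + step, [p]])
    return [(start, end, sum(ps) / len(ps)) for start, end, ps in segments]


if __name__ == "__main__":
    raw = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.2, 0.9, 0.9, 0.1, 0.1]
    for start, end, conf in find_segments(median_smooth(raw)):
        print(f"{start:.1f}s - {end:.1f}s  confidence={conf:.2f}")
```

In this sketch the isolated spike at frame 2 is removed by the median filter, while the dip at frame 7 is filled in, so the later frames form one continuous segment.
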
| Component | Description |
|---|---|
| Backbone | vit_base_patch14_dinov2 |
| Framework | PyTorch + TIMM |
| Input Resolution | 518 × 518 |
| Output | Binary classification (Real / Fake) |
| Weights | Self-trained (df_detector_mvp.pth) |
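
The configuration in the table could be instantiated roughly as follows. This is a sketch, not the repository's loading code: `build_detector` is a hypothetical helper, and it assumes that `df_detector_mvp.pth` stores a plain `state_dict` whose keys match the TIMM model and that the classification head has a single output unit.

```python
BACKBONE = "vit_base_patch14_dinov2"
INPUT_SIZE = 518
WEIGHTS_PATH = "df_detector_mvp.pth"


def build_detector(weights_path=WEIGHTS_PATH, device="cpu"):
    """Create the ViT backbone with a one-logit head and load trained weights.

    Hypothetical helper: assumes the checkpoint is a plain state_dict
    compatible with the TIMM model created below.
    """
    # Imported lazily so the constants above can be used without PyTorch installed.
    import timm
    import torch

    model = timm.create_model(
        BACKBONE,
        pretrained=False,      # weights come from the local checkpoint instead
        num_classes=1,         # assumed: single logit for Real / Fake
        img_size=INPUT_SIZE,
    )
    state = torch.load(weights_path, map_location=device)
    model.load_state_dict(state)
    return model.to(device).eval()
```

If the checkpoint wraps the weights in a larger dictionary (for example alongside optimizer state), the `load_state_dict` call would need to be pointed at the right entry.
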
Each frame produces a single logit, converted to a probability using a sigmoid function.
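
The logit-to-probability step can be sketched in plain Python as below. The helper names are illustrative, and treating the positive class as "fake" is an assumption; timestamps follow from the 2 FPS sampling rate.

```python
import math


def sigmoid(logit):
    """Numerically stable sigmoid: maps a raw logit to a value in [0, 1]."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def frame_probabilities(logits, fps=2.0):
    """Pair each frame's probability with its timestamp in seconds.

    Assumes frame i was sampled at i / fps seconds and that a higher
    probability means "more likely fake".
    """
    return [(i / fps, sigmoid(x)) for i, x in enumerate(logits)]


if __name__ == "__main__":
    for t, p in frame_probabilities([-2.0, 0.0, 3.0]):
        print(f"t={t:.1f}s  p_fake={p:.3f}")
```
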