STF-Depth (Semantic and Temporal Fusion Depth Estimation) is a pipeline designed to mitigate the inaccuracies of single-image depth estimation. It leverages Temporal Fusion from the video domain and Semantic Fusion via segmentation to improve inter-frame consistency and produce more realistic depth information.
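The exact fusion rules are defined in the code; purely to illustrate the two ideas, the hypothetical sketch below smooths depth over consecutive frames (temporal) and pulls pixels of the same segment toward a common depth (semantic). It is a simplified stand-in, not the project's actual algorithm.

```python
import numpy as np

def temporal_fuse(prev_depth, cur_depth, alpha=0.8):
    """Hypothetical temporal fusion: blend the current depth map with the
    previous frame's result so depth does not flicker between frames."""
    if prev_depth is None:
        return cur_depth
    return alpha * cur_depth + (1.0 - alpha) * prev_depth

def semantic_fuse(depth, seg_labels):
    """Hypothetical semantic fusion: pull each pixel toward the median depth
    of its segmentation region so objects get internally consistent depth."""
    fused = depth.astype(np.float32).copy()
    for label in np.unique(seg_labels):
        mask = seg_labels == label
        fused[mask] = 0.5 * fused[mask] + 0.5 * np.median(fused[mask])
    return fused
```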
- Multi-Model Pipeline: Utilizes state-of-the-art deep learning models to generate depth and segmentation maps for each video frame (a minimal depth-inference sketch follows this list).
  - Depth Estimation: MiDaS (DPT-Large)
  - Semantic Segmentation: DeepLabV3
  - Panoptic Segmentation: OneFormer
- Automated Processing: Automatically handles the entire process from frame extraction to model inference and result saving for specified input video folders.
- Result Caching: Caches intermediate results (`.pkl`) for processed videos, enabling faster re-runs for visualization or further processing by skipping the inference step.
- Visualization: Saves output results from each model as image files for intuitive inspection.
- Evaluation: Includes tools for quantitative evaluation on standard datasets like NYU Depth V2 and KITTI.
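As a rough sketch of how a per-frame depth pass with `.pkl` caching could be wired up, assuming MiDaS (DPT-Large) is loaded via `torch.hub`; the cache layout and helper names here are illustrative, not the exact logic of `run.py`:

```python
import pickle
from pathlib import Path

import cv2
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

def depth_for_frame(frame_bgr):
    """Run MiDaS on one BGR frame and return a depth map at the frame's size."""
    img = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    batch = transform(img).to(device)
    with torch.no_grad():
        pred = midas(batch)
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze()
    return pred.cpu().numpy()

def depth_for_video(video_path, working_dir="data/working"):
    """Estimate depth for every frame, reusing a cached .pkl when present."""
    cache = Path(working_dir) / (Path(video_path).stem + "_depth.pkl")
    if cache.exists():                     # re-runs skip inference entirely
        return pickle.loads(cache.read_bytes())

    cap = cv2.VideoCapture(str(video_path))
    depths = []
    ok, frame = cap.read()
    while ok:
        depths.append(depth_for_frame(frame))
        ok, frame = cap.read()
    cap.release()

    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_bytes(pickle.dumps(depths))
    return depths
```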
All dependencies for this project are managed via a Conda virtual environment.
Create and activate a Conda environment named stfdepth.
```bash
# Create environment from the provided yaml file
conda env create -f conda.yaml

# Activate the environment
conda activate stfdepth
```

Note: A `conda.yaml` file is provided in the repository.
The main inference script, `run.py`, processes videos or images to estimate depth maps with semantic and temporal fusion.
- Prepare Input Data: Place your video files (`.mp4`, `.avi`, etc.) or image folders inside `data/input/<dataset_name>/`.
  - The default dataset name is `vp_test`, so place files in `data/input/vp_test/`.
- Run Script:

  ```bash
  # Activate Conda environment
  conda activate stfdepth

  # Run inference
  python run.py
  ```
  - `--input_dir`: Directory containing input datasets (default: `data/input`)
  - `--output_dir`: Directory to save final results (default: `data/output`)
  - `--working_dir`: Directory for intermediate files (frames, `.pkl`, visualizations) (default: `data/working`)
  - `--datasets`: List of dataset names to process (default: `["vp_test"]`)
  - `--visualize`: Flag to enable saving visualization results (default: `False`)
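For example (the extra dataset name below is a placeholder for any folder you create under `data/input/`):

```bash
# Process the default vp_test dataset and save visualization images
python run.py --visualize

# Process a custom dataset placed in data/input/my_scenes/
python run.py --datasets my_scenes --output_dir data/output
```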
The evaluation script, `test.py`, evaluates depth estimation performance against Ground Truth (GT).
- Prepare Data: Structure your data as follows:
  - Input Images: `test/data/input/<dataset_name>/input/`
  - Ground Truth: `test/data/input/<dataset_name>/gt/`

  See the Evaluation Datasets section below for details on preparing the NYU and KITTI datasets.
- Run Script:

  ```bash
  python test.py --datasets nyu kitti
  ```
  - `--input_dir`: Root directory for test data (default: `./test/data/input`)
  - `--output_dir`: Directory to save evaluation results (default: `./test/data/output`)
  - `--datasets`: List of datasets to evaluate (default: `["nyu", "kitti"]`)
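For example (`output_eval` is just a placeholder directory):

```bash
# Evaluate only NYU Depth V2 using the default directory layout
python test.py --datasets nyu

# Evaluate both datasets and write results to a separate directory
python test.py --datasets nyu kitti --output_dir ./test/data/output_eval
```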
This project uses standard datasets for quantitative evaluation. Helper scripts are provided in the `test/` directory to convert raw datasets into the required format.
The NYU Depth V2 dataset consists of video sequences from a variety of indoor scenes, recorded by both the RGB and depth cameras of the Microsoft Kinect.
- Download: You can download the raw dataset from the official website.
- Preparation:
  - Download the raw dataset (scene folders containing `INDEX.txt` and raw images).
  - Use the `test/convert_nyu.py` script to synchronize RGB and depth frames and convert them:

    ```bash
    # Edit 'original_dir' and 'converted_dir' in test/convert_nyu.py before running
    python test/convert_nyu.py
    ```

  - This script synchronizes frames based on timestamps, generates `.mp4` videos for input, and saves stacked depth maps (`.npy`) for ground truth.
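For reference, raw NYU frames carry their capture timestamps in the file names (RGB as `r-<timestamp>-<seq>.ppm`, depth as `d-<timestamp>-<seq>.pgm`), so pairing amounts to matching each depth frame with the RGB frame whose timestamp is closest. The sketch below illustrates that matching; it is not the exact code of `test/convert_nyu.py`.

```python
from pathlib import Path

def timestamp(path):
    # e.g. "d-1315166701.695128-2764011.pgm" -> 1315166701.695128
    return float(Path(path).name.split("-")[1])

def pair_rgb_depth(scene_dir):
    """Pair each depth frame with the RGB frame closest in time."""
    rgb = sorted(Path(scene_dir).glob("r-*.ppm"), key=timestamp)
    pairs = []
    for d in sorted(Path(scene_dir).glob("d-*.pgm"), key=timestamp):
        nearest = min(rgb, key=lambda r: abs(timestamp(r) - timestamp(d)))
        pairs.append((nearest, d))
    return pairs
```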
The KITTI dataset is a popular benchmark for autonomous driving, including depth prediction tasks.
- Download: Download the "depth completion" or "depth prediction" dataset from the KITTI Vision Benchmark Suite.
- Preparation:
  - Download the validation set (e.g., `val_selection_cropped`).
  - Use the `test/convert_kitti.py` script to format the data:

    ```bash
    # Edit 'original_dir' and 'converted_dir' in test/convert_kitti.py before running
    python test/convert_kitti.py
    ```

  - This script matches images with their corresponding ground truth depth maps and converts them to `.png` (input) and `.npy` (GT) formats.
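For context, KITTI ground-truth depth maps are 16-bit PNGs where depth in meters equals the pixel value divided by 256, and a value of 0 marks pixels without a measurement. A minimal decoding sketch (file names are placeholders, not the script's exact outputs):

```python
import numpy as np
from PIL import Image

def load_kitti_depth(png_path):
    """Decode a KITTI 16-bit depth PNG into meters, with NaN for missing pixels."""
    raw = np.asarray(Image.open(png_path), dtype=np.uint16)
    depth = raw.astype(np.float32) / 256.0
    depth[raw == 0] = np.nan
    return depth

# Example: convert one GT map to the .npy format used for evaluation
np.save("example_gt.npy", load_kitti_depth("example_gt.png"))
```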
```
.
├── data
│   ├── input/          # Input data directory
│   │   └── vp_test/    # Default dataset folder
│   ├── working/        # Intermediate results (frames, .pkl, visualizations)
│   └── output/         # Final results
├── test
│   └── data/           # Data for evaluation (input images and GT)
├── run.py              # Main inference script
├── test.py             # Evaluation script
├── conda.yaml          # Conda environment configuration
└── README.md           # Project documentation
```



