This repo contains the code for the Multi-Resolution Fusion (MuRF) paper, which explores multi-resolution training for vision transformers across various dense prediction tasks.
pip install torch torchvision torchmetrics transformers timm matplotlib scikit-image opencv-python pandas einops albumentationsx transforms

This folder contains the semantic segmentation subexperiment for the Multi-Resolution Fusion paper.
- train_mrf_ade20k.py: Core MuRF training logic for ADE20K.
- eval_mrf_ade20k.py: Evaluation logic for ADE20K.
- train_voc.py: Baseline DINOv2 training loop for downstream segmentation (used for pipeline validation).
- train_mrf_voc.py: Core MuRF training logic for Pascal VOC.
- test_voc.py: Evaluation logic for the VOC splits.
- train_cla_voc.py: CLI and hyperparameter definitions.
ADE20K is a large-scale semantic segmentation dataset with 150 classes, encompassing a wide variety of scenes and objects.
The ADE20K dataset can be downloaded from the ADE20K website or from third-party repositories.
The data structure should be as follows:
ADEChallengeData2016
├── images
│ ├── training
│ │ ├── ADE_train_00000001.jpg
│ │ ├── ...
│ ├── validation
│ │ ├── ADE_val_00000001.jpg
│ │ ├── ...
├── annotations
│ ├── training
│ │ ├── ADE_train_00000001.png
│ │ ├── ...
│ ├── validation
│ │ ├── ADE_val_00000001.png
│ │ ├── ...
├── objectInfo150.txt
└── sceneCategories.txt
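To quickly verify the layout before training, a small sanity check along these lines can help (the sketch below assumes ADEChallengeData2016 sits in the current working directory; adjust the root path as needed):

```python
from pathlib import Path

# Quick sanity check of the expected ADE20K layout (root path is an assumption).
root = Path("ADEChallengeData2016")

for split in ("training", "validation"):
    n_img = len(list((root / "images" / split).glob("*.jpg")))
    n_ann = len(list((root / "annotations" / split).glob("*.png")))
    print(f"{split}: {n_img} images, {n_ann} annotations")
    assert n_img == n_ann, f"image/annotation count mismatch in {split}"
```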
To recreate baseline results (where we only use a single scale), one can run:
python3 semantic_segmentation/train_large_ade20k.py

The resolution can be set inside the Python script train_large_ade20k.py via the crop_size variable.
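For reference, this is roughly how such a fixed-resolution setting feeds the data pipeline; the crop_size name comes from the script, but the transform code below is an illustrative assumption, not the repo's actual implementation:

```python
import albumentations as A

# Single-scale baseline: every image/mask pair is resized and cropped to one
# fixed resolution controlled by `crop_size` (example value, an assumption).
crop_size = 518

train_transform = A.Compose([
    A.SmallestMaxSize(max_size=crop_size),          # resize shorter side to crop_size
    A.RandomCrop(height=crop_size, width=crop_size),
    A.HorizontalFlip(p=0.5),
])
```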
To recreate MuRF results, one can run:
python3 semantic_segmentation/train_mrf_ade20k.py

Pascal VOC is a canonical benchmark dataset for semantic segmentation, featuring 20 foreground object classes and one background class.
Refer to the mmseg data preparation markdown file for dataset preparation.
The script at semantic_segmentation/dataset/generate_voc.py might be useful for downloading the VOC dataset and converting it into a format compatible with our dataloader.
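If you only need to fetch the raw VOC 2012 data, torchvision can also download it; a minimal sketch follows. Note this only obtains the data and does not perform the conversion into our dataloader's format, which is what generate_voc.py handles:

```python
from torchvision.datasets import VOCSegmentation

# Download raw Pascal VOC 2012 (images + segmentation masks) into ./data.
# Converting it to the format expected by our dataloader is handled separately
# by semantic_segmentation/dataset/generate_voc.py.
for image_set in ("train", "val"):
    VOCSegmentation(root="data", year="2012", image_set=image_set, download=True)
```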
To recreate baseline results (where we only use a single scale), one can run:
python -m semantic_segmentation.train_voc_cla --res 518

To recreate MuRF results, one can run:
python -m semantic_segmentation.train_voc_cla --res 140,266,518

This folder contains the depth estimation subexperiment for the Multi-Resolution Fusion paper. Baseline functionality and configurations reproduce DINOv2 depth estimation, anchored on the official DINOv2 ViT-B/14 NYU Linear Config.
- dino_model.py: Original DINOv2 baseline implementation.
- mrf_model.py: Multi-Resolution Fusion (MuRF) variant.
- train_mrf.py: Primary training loop for the MuRF architecture.
- train_cla.py: Command-line argument parsing and hyperparameter definitions.
- train.py: Baseline training loop (DINOv2 replication).
- val.py: Validation step execution and metric tracking.
- test.py: Final evaluation on designated test splits.
- Losses: Custom objective functions for depth regression (an illustrative sketch of such an objective follows after this list).
- Dataset: Dataloaders and preprocessing pipelines for NYU Depth V2 and SUN RGB-D.
- Callbacks: Training hooks, including early stopping heuristics and qualitative depth map visualizations.
- Extra (ops.py): Assorted tensor operations, including input unnormalization and formatting of NYU ground-truth depth maps for nicer visualization.
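As an illustration of what a depth-regression objective in Losses might look like, below is a sketch of the standard scale-invariant log (SILog) loss commonly used with DINOv2-style depth heads; it is illustrative only and may differ from the loss actually shipped in this repo:

```python
import torch

def silog_loss(pred, target, eps=1e-6, lam=0.85):
    """Scale-invariant log (SILog) loss, a common depth-regression objective.

    pred, target: depth maps of the same shape; invalid pixels (target <= 0)
    are masked out. `lam` trades off the variance vs. the mean of the log-error.
    """
    valid = target > 0
    diff = torch.log(pred[valid] + eps) - torch.log(target[valid] + eps)
    return torch.sqrt((diff ** 2).mean() - lam * diff.mean() ** 2)
```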
NYU Depth V2 is a widely used dataset for depth estimation, consisting of indoor scenes captured with RGB-D sensors. Following DINOv2's evaluation setup, we train on the NYU Depth V2 training set and evaluate on the NYU Depth V2 test set.
Download the dataset from the Google Drive link referenced in DINOv3's README. Unzip the downloaded file and ensure the data structure is as follows:
nyu
├── basement_0001a
├── basement_0001b
...
├── nyu_test.txt
└── nyu_train.txt
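A quick way to confirm that the split files are in place is to count their entries; the sketch below assumes each non-empty line of nyu_train.txt / nyu_test.txt refers to one sample, which is how such split files are typically laid out:

```python
from pathlib import Path

# Count samples listed in the NYU split files (root path is an assumption).
root = Path("nyu")
for split_file in ("nyu_train.txt", "nyu_test.txt"):
    lines = [l for l in (root / split_file).read_text().splitlines() if l.strip()]
    print(f"{split_file}: {len(lines)} samples")
```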
Please refer to the "SUN RGB-D" section below for training and evaluation instructions, as the results for both NYU Depth V2 and SUN RGB-D are reported there.
Following DINOv2's evaluation setup, we evaluate the model trained on NYU Depth V2 directly on the SUN RGB-D test set without any fine-tuning.
The result will be printed after you run the training script mentioned in the NYU section.
Download the dataset from the SUN RGB-D website. The dataset includes RGB images, depth maps, and corresponding annotations. After downloading, ensure the data structure is as follows:
SUNRGBD
├── kv1
├── kv2
├── realsense
└── xtion
Download the dataset split file from https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/splits/SUNRGBD_val_splits.txt and place it at depth_estimation/SUNRGBD_val_splits.txt
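Equivalently, the split file can be fetched programmatically; a small sketch using the raw-file URL derived from the GitHub link above (the derivation assumes GitHub's usual raw-content URL scheme):

```python
import urllib.request

# Raw-content URL corresponding to the GitHub link above.
url = ("https://raw.githubusercontent.com/zhyever/"
       "Monocular-Depth-Estimation-Toolbox/main/splits/SUNRGBD_val_splits.txt")
urllib.request.urlretrieve(url, "depth_estimation/SUNRGBD_val_splits.txt")
```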
To recreate baseline results (where we only use a single scale), one can run:
python -m depth_estimation.train_cla --scales 1.0 --model-size base --lin lin1 --max-iters 38400

To recreate MuRF results, one can run:
python -m depth_estimation.train_cla --scales 0.5,1.0,1.5 --model-size base --lin lin1 --max-iters 38400
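For reference, a comma-separated --scales value like 0.5,1.0,1.5 is typically turned into a list of floats at argument-parsing time. The snippet below is an illustrative sketch of such parsing, not the actual code in train_cla.py:

```python
import argparse

def parse_scales(value):
    # "0.5,1.0,1.5" -> [0.5, 1.0, 1.5]; a single value yields a one-element list.
    return [float(s) for s in value.split(",") if s]

parser = argparse.ArgumentParser()
parser.add_argument("--scales", type=parse_scales, default=[1.0],
                    help="comma-separated resolution scales, e.g. 0.5,1.0,1.5")

args = parser.parse_args(["--scales", "0.5,1.0,1.5"])
print(args.scales)  # [0.5, 1.0, 1.5]
```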