Here, we describe how to train EmbodiedSplat on the ScanNet dataset. Training on ScanNet++ can be done in a similar way.
Download the training split of the preprocessed ScanNet data from here and place it under the dataset/scannet folder. The folder structure should look like this:
scannet
├── test
│ └── ...
├── train
│ ├── scene0001_01
│ │ ├── color
│ │ │ ├── 0.jpg
│ │ │ ├── 1.jpg
│ │ │ └── ...
│ │ ├── depth
│ │ │ ├── 0.png
│ │ │ ├── 1.png
│ │ │ └── ...
│ │ ├── intrinsic
│ │ │ ├── extrinsic_color.txt
│ │ │ ├── extrinsic_depth.txt
│ │ │ ├── intrinsic_color.txt
│ │ │ └── intrinsic_depth.txt
│ │ ├── pose
│ │ │ ├── 0.txt
│ │ │ ├── 1.txt
│ │ │ └── ...
│ │ └── extrinsics.npy
│ └── ...
├── train_idx.txt
└── test_idx.txt
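Before moving on, it can help to confirm that the download matches the layout above. The following is a minimal sketch, not part of the repository: the helper names `check_scene` and `check_split` are hypothetical, and the expected entries are taken only from the tree shown above. It does not validate file contents or frame counts.

```python
from pathlib import Path

# Sub-folders and files listed in the tree above (names copied from that listing).
EXPECTED_DIRS = ("color", "depth", "intrinsic", "pose")
EXPECTED_INTRINSICS = (
    "extrinsic_color.txt",
    "extrinsic_depth.txt",
    "intrinsic_color.txt",
    "intrinsic_depth.txt",
)


def check_scene(scene_dir):
    """Return the entries that are missing from one scene folder."""
    scene_dir = Path(scene_dir)
    missing = []
    for name in EXPECTED_DIRS:
        if not (scene_dir / name).is_dir():
            missing.append(name)
    for name in EXPECTED_INTRINSICS:
        if not (scene_dir / "intrinsic" / name).is_file():
            missing.append(f"intrinsic/{name}")
    if not (scene_dir / "extrinsics.npy").is_file():
        missing.append("extrinsics.npy")
    return missing


def check_split(root, split="train"):
    """Map scene name -> missing entries for every incomplete scene in a split."""
    split_dir = Path(root) / split
    if not split_dir.is_dir():
        return {split: ["<split folder not found>"]}
    report = {}
    for scene in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        missing = check_scene(scene)
        if missing:
            report[scene.name] = missing
    return report


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for scene, missing in check_split("dataset/scannet").items():
        print(f"{scene}: missing {', '.join(missing)}")
```

An empty report means every scene under dataset/scannet/train contains the folders and files listed in the tree.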
For training, we cache the instance masks generated by FastSAM and the corresponding instance-level features extracted by OpenSeg for every multi-view image. To generate these cached files, run the following command:
python process_scannet.py
You should now see a cache folder under each scene in the training split of ScanNet.
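To check this across all scenes at once, a short script along these lines may be useful. It is only a sketch: `scenes_missing_cache` is a hypothetical helper, and it assumes the folder is literally named `cache`, as described above. It does not inspect what process_scannet.py writes inside that folder.

```python
from pathlib import Path


def scenes_missing_cache(train_dir, cache_name="cache"):
    """List scene folders under train_dir that have no cache sub-folder.

    cache_name is an assumption based on the description above.
    """
    train_dir = Path(train_dir)
    return sorted(
        scene.name
        for scene in train_dir.iterdir()
        if scene.is_dir() and not (scene / cache_name).is_dir()
    )
```

Calling `scenes_missing_cache("dataset/scannet/train")` from the repository root returns an empty list when every scene has been processed.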
As mentioned in the Implementation Details section of the main paper, we adopt a two-stage training pipeline following EmbodiedSAM. Stage 1 performs single-view training to warm up the model. In this stage, the memory adapter is excluded from training.
bash scripts/scannet/train_embodiedsplat_s1.sh --gpu 0
Stage 2 further fine-tunes the model, now with the memory adapter included, in a multi-view setting.
bash scripts/scannet/train_embodiedsplat_s2.sh --gpu 0
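If you prefer to launch both stages unattended, the two commands above can be chained so that Stage 2 starts only after Stage 1 exits successfully. The sketch below is a hypothetical wrapper, not part of the repository. It uses only the script paths and the `--gpu` flag shown above, and it assumes Stage 2 locates the Stage 1 checkpoint on its own; check the scripts if a checkpoint path has to be passed explicitly.

```python
import subprocess

# Script paths copied from the commands above.
STAGE_SCRIPTS = (
    "scripts/scannet/train_embodiedsplat_s1.sh",
    "scripts/scannet/train_embodiedsplat_s2.sh",
)


def build_stage_commands(gpu=0):
    """Build the Stage 1 and Stage 2 command lines for one GPU index."""
    return [["bash", script, "--gpu", str(gpu)] for script in STAGE_SCRIPTS]


def run_stages(commands, runner=subprocess.run):
    """Run the commands in order; check=True aborts on the first failure."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)
```

Calling `run_stages(build_stage_commands(gpu=0))` from the repository root runs both stages back to back. Because `check=True` raises `subprocess.CalledProcessError` on a non-zero exit code, Stage 2 is skipped if Stage 1 fails.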