This repository contains the code and experiments for a semantic segmentation task on the CamVid dataset. We implemented and compared several approaches, including:
- A custom UNet-inspired architecture from scratch
- A MobileNetV2-based model with transfer learning
- DeepLab using ResNet and ASPP (Atrous Spatial Pyramid Pooling)
- Total images: 701
- Split:
  - Train: 401
  - Validation: 150
  - Test: 150
- Labels: 32 semantic classes (including a "Void" class for unlabeled pixels)
- Shuffling of images and labels
- Resizing to 224×224:
  - Bicubic interpolation for images
  - Nearest-neighbor interpolation for labels, so that resizing never produces colors that do not correspond to a class
- Label Encoding: RGB to class index mapping + one-hot encoding
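The label-encoding step can be sketched with NumPy as follows. This is a minimal illustration, not the repository's actual code: the three-color `PALETTE` and the function names `rgb_to_index` and `one_hot` are assumptions made for the example, and the real CamVid palette has 32 entries.

```python
import numpy as np

# Illustrative three-colour palette (assumption); CamVid defines 32 colours.
PALETTE = np.array([
    [0, 0, 0],        # class 0
    [128, 64, 128],   # class 1
    [64, 0, 128],     # class 2
], dtype=np.uint8)


def rgb_to_index(mask_rgb, palette=PALETTE):
    """Map an (H, W, 3) RGB mask to an (H, W) array of class indices."""
    # Compare every pixel with every palette colour: result has shape (H, W, C).
    matches = (mask_rgb[:, :, None, :] == palette[None, None, :, :]).all(axis=-1)
    # argmax returns the first matching colour; unmatched pixels fall back to 0.
    return matches.argmax(axis=-1)


def one_hot(index_mask, num_classes):
    """Turn an (H, W) index mask into an (H, W, num_classes) one-hot array."""
    return np.eye(num_classes, dtype=np.float32)[index_mask]


# Tiny 2x2 mask used only to exercise the two functions.
mask = np.array([[[128, 64, 128], [0, 0, 0]],
                 [[64, 0, 128], [128, 64, 128]]], dtype=np.uint8)
indices = rgb_to_index(mask)
encoded = one_hot(indices, num_classes=len(PALETTE))
```

In this sketch, pixels whose color is not in the palette are assigned index 0. Whether that matches how the project treats unlabeled pixels is not stated here.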
- Fully convolutional with encoding and expansive paths
- Skip connections
- Categorical cross-entropy loss
- Optimizer: Adam
- Activation: ReLU
- Performance metric: Mean IoU
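For reference, Mean IoU averages the per-class ratio of intersection to union between predicted and ground-truth masks. The sketch below is one common way to compute it on class-index masks. The function name `mean_iou` and the choice to skip classes absent from both masks are assumptions, since the exact averaging convention used in the experiments is not stated here.

```python
import numpy as np


def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over classes that appear in the labels or the predictions."""
    ious = []
    for c in range(num_classes):
        true_c = y_true == c
        pred_c = y_pred == c
        union = np.logical_or(true_c, pred_c).sum()
        if union == 0:
            continue  # class absent from both masks: leave it out of the mean
        intersection = np.logical_and(true_c, pred_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else 0.0


# Toy 2x2 example with class-index masks (not one-hot).
y_true = np.array([[0, 0], [1, 1]])
y_pred = np.array([[0, 1], [1, 1]])
score = mean_iou(y_true, y_pred, num_classes=3)
```

With one-hot targets and softmax outputs, taking `argmax` over the class axis first yields the index masks this function expects.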
Experiments:
- Various data augmentations (rotation, cropping, Gaussian noise)
- Hyperparameter tuning:
  - Batch size (4, 8, 16)
  - Regularization (L1, L2, Dropout)
  - Activation functions (ReLU, Sigmoid, Tanh; trainable activations are left for future work)
  - Optimizers (Adam, SGD, RMSprop)
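When augmenting segmentation data, geometric transforms such as cropping must be applied identically to the image and its label mask, while photometric changes such as Gaussian noise apply to the image only. A minimal NumPy sketch of that idea follows. The function names, the crop size of 192, and the noise level `sigma=0.05` are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np


def random_crop(image, label, size, rng):
    """Crop the same square window from the image and its label mask."""
    h, w = label.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    window = (slice(top, top + size), slice(left, left + size))
    return image[window], label[window]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to a float image with values in [0, 1]."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))             # stand-in for a resized frame
label = rng.integers(0, 32, size=(224, 224))  # stand-in class-index mask

# Crop both arrays with one shared window, then perturb only the image.
crop_img, crop_lbl = random_crop(image, label, size=192, rng=rng)
noisy_img = add_gaussian_noise(crop_img, sigma=0.05, rng=rng)
```

Rotation follows the same pattern: rotate both arrays by the same angle, using nearest-neighbor interpolation for the label mask.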
- Pretrained on ImageNet
- Used as the encoder, with a segmentation head on top
- Efficient, but struggled, likely because of the domain mismatch between ImageNet's natural images and CamVid's street scenes
- Based on ResNet encoder with ASPP
- Pretrained on ImageNet
- Best performance across all metrics
| Model | Mean IoU |
|---|---|
| DeepLab | 0.5242 |
| UNet | 0.4507 |
| MobileNetV2 | 0.3836 |
- DeepLab performed best (Mean IoU 0.5242), which we attribute to its more powerful feature extractor.
- All models struggled with shadowed regions and fine-grained class distinctions.
- Trainable activation functions
- Advanced augmentation strategies (e.g., ClassMix)
- Hyperparameter optimization (grid/Bayesian search)
- Better handling of the Void class
- Exploration of additional architectures
- Cristian Longoni
- Robin Smith
- Sergio Verga