
Mask R-CNN for image-manipulation detection (current state)

This repository presents an experimental Mask R-CNN–based approach for detecting image manipulations, developed as part of the Kaggle competition Recod.ai / LUC – Scientific Image Forgery Detection. The project documents the stepwise fine-tuning, parameter exploration and the systematic debugging process of a complex instance segmentation pipeline.

⚙️ Motivation

Detecting manipulated regions in scientific images requires precise localization based on subtle image artifacts such as discontinuities in lighting and noise. Instance segmentation models such as Mask R-CNN focus only on the relevant regions of interest, but require a backbone/feature extractor that is specialized in such artifacts. We start from a pre-trained model with a generic ResNet backbone, which we fine-tune in different phases, yielding an acceptable result. Run:

python3 scoring.py

to obtain the main result of our final model pretrained_final.pth:

 MEAN oF1s (the main metric):
0.414 +- 0.428
 RECALL:
0.842 +- 0.337
 PRECISION:
0.859 +- 0.0

The main conclusion is that the bottleneck lies in the classifier head, which is not able to distinguish forged regions from authentic ones. This, in turn, is caused by a generic feature extractor that was never trained to detect forgery features.

🟢 Description of Mask R-CNN

Mask R-CNN is a region-based algorithm for object detection, classification, and instance segmentation.

Mask R-CNN is a multi-stage convolutional neural network that performs:

  • Region proposal (RPN)
  • Bounding box regression
  • Object classification
  • Instance mask segmentation

This multi-stage design also makes it hard to debug which parts are failing, although we managed to narrow the problem down to the classifier head.

Approach

  • Inspired by the Kaggle notebook
    https://www.kaggle.com/code/antonoof/eda-r-cnn-model
    which implements a Mask R-CNN–based pipeline but reports no quantitative results.

  • The notebook provides a solid starting point in terms of network architecture, data loading, and training loops. However, it lacks systematic numerical debugging and step-wise validation of individual model components (e.g. RPN, box regression, mask head).
    This project aims to fill that gap by introducing structured debugging steps and targeted experiments to isolate failure modes.

📈 Stepwise fine-tuning

We start from a COCO-pretrained Mask R-CNN but replace the original classifier head (80 COCO classes) with a binary classifier (forged / authentic). Then we follow:

  1. Overfit a single image

    • Image: 10017.png (shown below)
    • Goal: push the classifier + mask weights in roughly the right direction
    • Strategy: freeze parts of the network (e.g. backbone)
  2. Overfit a small subset (5 images)

    • Goal: Show classifier + mask different examples and shapes
    • Strategy: Alternate between freezing backbone and heads
  3. Train on the full dataset

    • Goal: Multi-stage generalization
    • Strategy: Unfreeze the last backbone layer and the classifier

Weights are reused between steps to enable incremental fine-tuning.
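The freeze/unfreeze alternation used in the steps above boils down to toggling `requires_grad` on parameter groups. A minimal sketch (the helper name `freeze` is ours):

```python
import torch.nn as nn


def freeze(module: nn.Module, frozen: bool = True):
    # Toggle gradient computation for all parameters of a submodule,
    # e.g. freeze(model.backbone) or freeze(model.roi_heads, frozen=False)
    for p in module.parameters():
        p.requires_grad = not frozen
```

When building the optimizer, only pass the parameters that are still trainable, e.g. `torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3)`, so frozen weights are untouched between steps.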

💡 Quick EDA:

  • There are 5K training images and 50 test images
  • The problem is pixel-imbalanced: around 5% of the pixels are forged and are therefore 1s in their corresponding masks.
  • The signal is very weak: the algorithm has to learn to detect discontinuities in noise along copy-pasted edges, contrasts in brightness, etc.
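The ~5% figure can be reproduced directly from the masks. A minimal sketch, assuming masks are loaded as binary NumPy arrays (the helper name `forged_pixel_fraction` is ours):

```python
import numpy as np


def forged_pixel_fraction(masks):
    # masks: iterable of binary arrays where 1 marks a forged pixel
    total = sum(m.size for m in masks)
    forged = sum(int(m.sum()) for m in masks)
    return forged / total
```

This kind of imbalance statistic is also useful for weighting the mask loss if one wanted to counteract the 95/5 pixel split.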

Code:

Run

pip install -r requirements.txt
python3 edarnn.py

for training and

python3 encode_submission.py

for evaluation (DICE) over a test_dataset and visualization.
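Judging from the log further below (Intersection / Denominator / DICE), the reported DICE appears to be the standard soft Dice coefficient, 2·|A∩B| / (|A| + |B|). A minimal sketch of that computation (the function name `dice_score` is ours):

```python
import torch


def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> float:
    # Soft Dice: 2 * |A ∩ B| / (|A| + |B|)
    pred, target = pred.float(), target.float()
    intersection = (pred * target).sum()
    denominator = pred.sum() + target.sum()
    return (2 * intersection / (denominator + eps)).item()
```

With the logged values (intersection 107.33, denominator 24306.75), this gives 2·107.33 / 24306.75 ≈ 0.0088, matching the reported DICE.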

The dataset is composed of both authentic and forged/manipulated images, which are accompanied by a mask. The overfit image (10017.png) used throughout this report contains two forgery regions.

🔑 Training

First we suspected that the bounding box regression was failing, so we combined two strategies, giving four models:

  1. Freezing vs. not freezing the mask head (responsible for segmenting the image)
  2. Painting vs. not painting bounding boxes around the forged regions, to rule out regression errors as the cause.
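The "painting" variant amounts to drawing the ground-truth boxes directly onto the input image before feeding it to the network, so regression errors cannot be the failure mode. A minimal sketch with Pillow (the helper name `paint_boxes` and its defaults are ours):

```python
from PIL import Image, ImageDraw


def paint_boxes(img, boxes, color=(255, 0, 0), width=3):
    # img: PIL RGB image; boxes: list of (x0, y0, x1, y1) tuples
    out = img.copy()
    draw = ImageDraw.Draw(out)
    for box in boxes:
        draw.rectangle(box, outline=color, width=width)
    return out
```

Training on painted images gives the classifier an unmissable visual cue for where the forged regions are; if it still fails, the problem lies in the features, not the localization.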

One of the four models outperforms the others, so we train it for 600 epochs,

then run:

python3 encode_submission.py

and obtain:

Model weights: 
<All keys matched successfully>
../recodai-luc-scientific-image-forgery-detection/train_images/forged/47.png
 Combining 2 masks and resizing to original
 Combining 100 masks and resizing to original
Box 0: score = 0.0985
Box 1: score = 0.0740
Box 2: score = 0.0720
Box 3: score = 0.0677
Box 4: score = 0.0668
Box 5: score = 0.0593
Box 6: score = 0.0505
Box 7: score = 0.0500
Box 8: score = 0.0459
Box 9: score = 0.0450
Target masks shape: torch.Size([1, 256, 320]), sum per mask: 1177.0
Pred mask stats -> sum: 23129.7480
Full true mask stats -> sum: 1177.0000
Intersection: 107.3293, Denominator: 24306.7480, DICE: 0.008831

Idx: 0 DICE: 0.0088

🧠 Results

The image on the left shows the manipulated region masked in yellow, together with the GT/pred boxes marking the regions of interest. We can see that the network has learnt to find the correct box size and regress it towards potentially interesting regions.

  • The mask and box regression look good ✅

But the log above shows that the classifier struggles to distinguish between authentic and forged regions. ❌ It is a fully connected layer fed by the feature vector of that region, which in turn comes from the generic feature extractor: ResNet. Therefore we might ask whether the extracted features actually contain any cues about the forged nature of that region, or whether we should have fine-tuned the whole feature extractor on the full dataset. We propose this as the next step in continuing this work.

In any case, our model over-segments the image, a problem that we cannot solve via thresholding. As evidence, we enforced the original GT scores and observed a significant increase in the final oF1 score.

🚀 Conclusion / further work

As discussed, we propose that the feature extractor/backbone has to be trained on the whole dataset. It's impossible for the classifier to distinguish manipulated regions if the features do not quantify the correct properties (discontinuities in noise, lighting, image and compression artifacts, rough edges, etc.).

We can actually check whether the feature vector contains forgery cues:

  • Collect N feature vectors of forged regions: forged_features.shape is [N, 256]
  • Collect N feature vectors of authentic regions: authentic_features.shape is [N, 256]
  • Average both over the samples and subtract them: forgery_quantification = f_mean - a_mean = tensor([-4.0986e-02, 4.9509e-01, -4.8352e-01, 4.5262e-01, -8.6340e-02, ...])

Any large element (> 0.9) marks a quantity that consistently differs between forged and authentic regions. We actually found three such elements: tensor([0.9468, 1.0817, 1.0582], device='cuda:0'). We could also sum the elements of forgery_quantification into a metric that quantifies how well the encoder concentrates on our actual task: the detection of image manipulation artifacts. This metric could then be tracked and optimized during fine-tuning of the encoder.
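The check above can be sketched in a few lines. This assumes the ROI feature vectors have already been collected as [N, 256] tensors; we use the absolute difference when thresholding, which is our choice (the function name is taken from the text):

```python
import torch


def forgery_quantification(forged_features, authentic_features, threshold=0.9):
    # forged_features, authentic_features: [N, D] ROI feature vectors
    diff = forged_features.mean(dim=0) - authentic_features.mean(dim=0)
    strong = diff[diff.abs() > threshold]  # dims that consistently differ
    return diff, strong
```

A scalar summary such as `diff.abs().sum().item()` could serve as the tracked metric: the larger it grows during encoder fine-tuning, the more the features separate forged from authentic regions.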

**Plotting the bounding box regression for different epochs:**