-
Notifications
You must be signed in to change notification settings - Fork 5
Home
The original C architecture is the result of the research I've conducted during my MSc. Starting from EDSR, I miniaturised and thoroughly simplified the model to maximise quality within a very constrained performance budget. The subsequent R architecture was born from various experiments trying to scale ArtCNN up while still attempting to balance quality and performance.
My goal is to keep ArtCNN as vanilla as possible, so don't expect bleeding-edge ideas to be adopted until they've stood the test of time.
The L1 loss (mean absolute error) is still the standard loss used to train state-of-the-art distortion-based SISR models. Researchers have attempted to design more sophisticated losses to better match human perception of quality, but ultimately the best results are often obtained by combining the L1 loss with something else, just to help it get out of local minima. This is very well detailed in Loss Functions for Image Restoration with Neural Networks.
The depth to space operation is generally better than using transposed convolutions and it's the standard on SISR models in general. ArtCNN is also designed to have all of its convolution layers operating with LR feature maps, only upsampling them as the very last step. This is mostly done for speed, but it also provides great memory footprint benefits.
I'm not particularly against attention mechanisms, it's just that they add a small throughput penalty for even smaller quality gains at the sizes ArtCNN is offered.
CNNs have a stronger inductive bias to help solve image processing tasks, this means you don't need as much data or as big of a model to get good results.
Papers like A ConvNet for the 2020s and ConvNets Match Vision Transformers at Scale have also showed that CNNs are still competitive with transformers even at the scales in which they were designed to excel.