The first image is your Table 1, and the second is DINOv2's Table 4.
The gap between the 82.0% reported in your paper and the 83.5% in DINOv2's is quite large. It is also apparent that DINOv2-L is a patch-size-14 model, not 16. Are you accounting for this difference by resizing DINO's patch projector? Or just letting the two models have different numbers of patches? Or are you feeding them different image sizes?
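For concreteness, "resizing the patch projector" would mean something like the following rough sketch, assuming a timm-style ViT where `patch_embed.proj` is a `Conv2d` with kernel size equal to the patch size (the function name and shapes here are illustrative, not from either repository):

```python
# Illustrative sketch: interpolating 14x14 patch-embedding kernels to 16x16
# so a patch-14 checkpoint can be run as a patch-16 model. Plain bicubic
# interpolation loses some fidelity; FlexiViT-style pseudo-inverse resizing
# would preserve the projection more faithfully.
import torch
import torch.nn.functional as F

def resize_patch_proj(weight: torch.Tensor, new_patch: int = 16) -> torch.Tensor:
    """weight: (embed_dim, in_chans, old_patch, old_patch) conv kernels."""
    return F.interpolate(weight, size=(new_patch, new_patch),
                         mode="bicubic", align_corners=False)

# e.g. hypothetical DINOv2-L/14 kernels -> patch-16 kernels
w14 = torch.randn(1024, 3, 14, 14)   # placeholder for real checkpoint weights
w16 = resize_patch_proj(w14, 16)
print(w16.shape)                      # torch.Size([1024, 3, 16, 16])
```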
There are two primary reasons for the observed performance difference:
1. Difference in model configuration: we set the patch size to 16 for a fair comparison with most self-supervised learning methods. However, a smaller patch size such as 14 generally delivers better performance than 16 (see the token-count sketch below).
2. Difference in pre-training methods and data: the original DINOv2 paper used an unreleased curated dataset (LVD-142M), whereas we used the standard ImageNet-1k dataset for our reproduction. Furthermore, the ViT-L model in the original DINOv2 paper was distilled from a larger pre-trained teacher, while we trained our ViT-L from scratch.
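To make the configuration point concrete, here is the token-count arithmetic at the standard 224x224 input (a minimal illustration, not code from either repository):

```python
# Token counts at a 224x224 input for the two patch sizes discussed above.
# Smaller patches -> more tokens -> finer spatial granularity (and more
# compute), which is one reason patch-14 models tend to outperform patch-16.
for patch in (14, 16):
    side = 224 // patch        # patches per side
    tokens = side * side       # patch tokens, excluding the [CLS] token
    print(f"patch {patch}: {side}x{side} = {tokens} tokens")
# patch 14: 16x16 = 256 tokens
# patch 16: 14x14 = 196 tokens
```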