
Dear Authors,
Thank you for your impressive work!
I have a question regarding the implementation details of the RAE mentioned in the "Practical Notes" section. You mentioned that:
The input images are interpolated to 224 × 224 with an encoder patch size pe = 14.
The decoder uses a patch size pd = 16 to reconstruct 256 × 256 images.
Both configurations result in 256 tokens (16 × 16 grid).
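To make sure I am reading the arithmetic correctly, here is a minimal sketch of how I understand the token counts. The helper name `num_tokens` is my own and is not taken from the paper or the codebase; it assumes square images and non-overlapping square patches.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Count non-overlapping square patches (tokens) for a square image.

    Hypothetical helper for illustration only; not part of the RAE code.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size  # patches per side
    return grid * grid


# Encoder side: 224 x 224 input, patch size 14 -> 16 x 16 grid
encoder_tokens = num_tokens(224, 14)
# Decoder side: 256 x 256 output, patch size 16 -> 16 x 16 grid
decoder_tokens = num_tokens(256, 16)

print(encoder_tokens, decoder_tokens)  # prints "256 256"
```

If this matches your setup, the encoder and decoder share the same 16 × 16 token grid even though the pixel resolutions differ.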
My questions are:
- Motivation for Asymmetry: What was the primary motivation for keeping the token count fixed at 256 while letting the spatial resolution differ between input (224) and output (256)? Is this primarily to accommodate the fixed pe = 14 constraint of the pre-trained DINOv2 weights?
- Symmetric Setup: Did you experiment with a symmetric setup (e.g., both input and output at 224 × 224 or 256 × 256)? Newer variants (like DINOv3) and other backbones support a patch size of 16, which would allow a symmetric 256 × 256 pipeline. Did you observe any significant impact on the quality of the learned representations or on reconstruction fidelity when using the asymmetric 224-to-256 approach?
Thank you for your time and for sharing your insights!