The paper states that the backbone uses the same ViT-Base as UniHCP, but in the code, generate_random_masks is called in the ViT to shuffle tokens. What is the purpose of this, and is there any literature on this approach?
During the project, we attempted to unify supervised training with masked image modeling. However, we found that introducing the masked-image-modeling strategy decreased performance, so we removed it. Note that all tokens have positional embeddings added first, are then shuffled before the encoder, and are restored to their original order after the encoder. This is effectively the same as not shuffling at all, because the positional information is carried by the positional embeddings rather than by the tokens' relative order inside the self-attention module.
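To make the argument concrete, here is a minimal sketch (not the actual repository code) showing that shuffling tokens after positional embeddings are added, then unshuffling after the encoder, reproduces the unshuffled output. A single nn.MultiheadAttention layer stands in for the ViT encoder, and all tensor names and shapes are hypothetical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the ViT encoder: one self-attention layer.
# Self-attention is permutation-equivariant over tokens, which is why the
# shuffle/unshuffle has no effect once positions are encoded in the embeddings.
encoder = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
encoder.eval()

B, N, D = 2, 16, 64
tokens = torch.randn(B, N, D)
pos_embed = torch.randn(1, N, D)

x = tokens + pos_embed  # positional info is baked into each token

# Plain forward pass, no shuffling.
with torch.no_grad():
    out_plain, _ = encoder(x, x, x)

# Shuffle token order, run the encoder, then restore the original order.
perm = torch.randperm(N)
inv_perm = torch.argsort(perm)
with torch.no_grad():
    out_shuffled, _ = encoder(x[:, perm], x[:, perm], x[:, perm])
out_restored = out_shuffled[:, inv_perm]

# The two outputs match up to numerical noise.
print(torch.allclose(out_plain, out_restored, atol=1e-5))  # True
```

Because each token carries its own positional embedding, permuting the sequence and inverting the permutation afterward leaves the self-attention result unchanged, so removing the masking strategy while keeping the shuffle code does not alter the backbone's behavior.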