Open
Description
Hi @PaulScotti, thanks for your great work! I'd like to know whether the model used for CLIP-L is the pre-trained ViT-L/14 model or the pre-trained image encoder from the GIT model.
As far as I know, both produce an output of shape 257 x 1024, but the GIT encoder has been fine-tuned for image captioning and gives better results on that task. In contrast, image features from the original ViT-L/14 image encoder are difficult to decode directly into image descriptions with the GIT model.
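For reference, here is a minimal sketch of where the 257 x 1024 shape comes from. It assumes a 224 x 224 input, a patch size of 14, one class token, and a hidden width of 1024, which is the standard ViT-L/14 configuration as I understand it. `vit_token_shape` is a hypothetical helper written for this illustration, not a function from either codebase. It only shows why the shape alone cannot tell the two encoders apart:

```python
from typing import Tuple


def vit_token_shape(image_size: int = 224, patch_size: int = 14, width: int = 1024) -> Tuple[int, int]:
    """Return (num_tokens, width) for a ViT that prepends one class token.

    The defaults are assumptions matching the usual ViT-L/14 setup.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus one class token.
    num_tokens = patches_per_side ** 2 + 1
    return num_tokens, width


print(vit_token_shape())  # (257, 1024)
```

Any two encoders with this architecture return the same shape whatever their weights are, so the shape does not show which checkpoint was used.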
Looking forward to your reply, thank you.