Some questions about 257 x 1024 CLIP L

Hi @PaulScotti, thanks for your great work! I'd like to know whether the model used for CLIP L is the pre-trained ViT-L/14 model or the pre-trained image encoder of the GIT model.

As far as I know, both produce an output of shape 257 x 1024, but the GIT encoder is fine-tuned for image captioning and gives better results on that task. In contrast, image features from the ViT-L/14 image encoder are difficult to turn into image descriptions directly through the GIT model.
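For reference, here is a minimal sketch of where the 257 x 1024 shape comes from. It assumes a 224 x 224 input, a patch size of 14, one class token, and a hidden width of 1024, which are the usual ViT-L/14 settings; the function name `vit_token_shape` is just illustrative. Any encoder with the same layout gives the same shape, so the shape alone cannot tell the two sets of weights apart.

```python
def vit_token_shape(image_size: int = 224, patch_size: int = 14, hidden_dim: int = 1024):
    """Return (num_tokens, hidden_dim) for a ViT encoder with one class token.

    The defaults are assumed ViT-L/14 settings, not values read from any checkpoint.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size   # 224 // 14 = 16
    num_patches = patches_per_side ** 2           # 16 * 16 = 256
    num_tokens = num_patches + 1                  # plus one class token
    return num_tokens, hidden_dim


print(vit_token_shape())  # (257, 1024)
```

So a matching shape only says the two encoders share the same architecture layout, not that their features are interchangeable.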

Looking forward to your reply, thank you.
