Question about encoder choice for downstream tasks #86

@gritYCDA

Description

In the video classification evaluation code, I noticed that the target encoder (y-encoder) is being used for downstream tasks instead of the context encoder (x-encoder). This seems different from other self-supervised learning approaches:

  1. Most SSL methods such as MoCo, SimCLR, and BYOL use their main/query encoder for downstream tasks rather than the momentum/target encoder.

  2. In V-JEPA, the y-encoder has stop_gradient applied during training, which intuitively suggests the x-encoder might be better suited to downstream tasks, since it is the one trained to predict comprehensive features from partial information.
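To make point 2 concrete, here is a minimal sketch of the momentum (EMA) update typically used for a target encoder in this family of methods (MoCo/BYOL-style); V-JEPA updates its y-encoder in a similar spirit. The momentum value 0.999 and the dict-of-floats representation are illustrative only, not V-JEPA's actual schedule or code:

```python
import copy

def ema_update(target_params, context_params, m=0.999):
    """In-place EMA: target <- m * target + (1 - m) * context.

    No gradients ever flow into the target parameters; they are updated
    by this copy step alone, which is what stop_gradient amounts to.
    """
    for k in target_params:
        target_params[k] = m * target_params[k] + (1 - m) * context_params[k]

# Toy "parameters": one scalar weight per encoder.
context = {'w': 1.0}
target = copy.deepcopy(context)

context['w'] = 2.0        # pretend one optimizer step changed the context encoder
ema_update(target, context)
print(target['w'])        # close to 1.001: the target lags the context encoder
```

The target encoder is thus a smoothed trailing average of the context encoder, which is part of why the question of which one to evaluate downstream is non-trivial.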

Looking at the implementation, I noticed the following code for loading checkpoints:

checkpoint = torch.load(pretrained, map_location='cpu')
try:
    # Default path: load the weights under checkpoint_key (the target encoder).
    pretrained_dict = checkpoint[checkpoint_key]
except Exception:
    # Fallback: load the context encoder's weights instead.
    pretrained_dict = checkpoint['encoder']

While the code primarily loads the target_encoder (via checkpoint_key), there is a fallback to the 'encoder' key. This suggests that using the context encoder is still possible, just not the default.
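The selection logic above can be factored into a small helper, sketched below. This is my own hypothetical refactoring, not code from the repository; the key names 'target_encoder' and 'encoder' come from the snippet above, and the 'module.' prefix stripping is an assumption about how DistributedDataParallel checkpoints are commonly cleaned:

```python
def select_encoder_state(checkpoint, checkpoint_key='target_encoder'):
    """Return the requested encoder's state dict from a loaded checkpoint,
    falling back to the context encoder ('encoder') if the key is absent."""
    try:
        state = checkpoint[checkpoint_key]
    except KeyError:
        state = checkpoint['encoder']
    # Strip any 'module.' prefix left over from DistributedDataParallel.
    return {k.replace('module.', ''): v for k, v in state.items()}

# Toy checkpoint standing in for the output of torch.load(...).
ckpt = {
    'encoder': {'module.blocks.0.weight': [0.0]},          # context encoder
    'target_encoder': {'module.blocks.0.weight': [1.0]},   # EMA target encoder
}

print(select_encoder_state(ckpt, 'target_encoder')['blocks.0.weight'])  # [1.0]
print(select_encoder_state(ckpt, 'encoder')['blocks.0.weight'])         # [0.0]
```

Under this reading, evaluating the context encoder would be a matter of passing checkpoint_key='encoder' (or hitting the fallback), with no other code changes needed.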

I'm curious about:

  1. Was there experimental evidence showing that the target encoder consistently performs better than the context encoder for downstream tasks?
  2. If so, was this the reason for prioritizing the target encoder in the implementation?
  3. Are there specific characteristics of V-JEPA that make the target encoder more suitable for downstream tasks, unlike other SSL approaches?

Would appreciate any insights into this design choice. Thanks!
