Question about encoder choice for downstream tasks
In the video classification evaluation code, I noticed that the target encoder (y-encoder) is being used for downstream tasks instead of the context encoder (x-encoder). This seems different from other self-supervised learning approaches:
- Most SSL methods, such as MoCo, SimCLR, and BYOL, use their main/query encoder for downstream tasks rather than the momentum/target encoder.
- In V-JEPA, the y-encoder has stop_gradient applied during training, which intuitively suggests the x-encoder might be better suited to downstream tasks, since it learns to predict comprehensive features from partial information.
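To make the asymmetry concrete, here is a minimal sketch of the standard momentum/target-encoder setup (as in BYOL/MoCo and, to my understanding, V-JEPA's y-encoder): the target gets no gradients and is only updated as an exponential moving average of the context encoder. The `nn.Linear` modules here are toy stand-ins for the real ViT encoders.

```python
import copy

import torch
import torch.nn as nn

# Toy encoders standing in for V-JEPA's x-encoder (context) and
# y-encoder (EMA target); the real models are Vision Transformers.
x_encoder = nn.Linear(8, 8)
y_encoder = copy.deepcopy(x_encoder)

# The target encoder receives no gradients (the stop_gradient part).
for p in y_encoder.parameters():
    p.requires_grad = False

@torch.no_grad()
def ema_update(target: nn.Module, source: nn.Module, m: float) -> None:
    # theta_target <- m * theta_target + (1 - m) * theta_source
    for pt, ps in zip(target.parameters(), source.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

# Called once per training step, after the optimizer updates x_encoder.
ema_update(y_encoder, x_encoder, m=0.999)
```

The EMA averages the context encoder's weights over many steps, which is sometimes argued to yield a smoother, more stable representation; whether that is why the target encoder was preferred here is exactly the question.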
Looking at the implementation, I noticed the following checkpoint-loading code:

```python
checkpoint = torch.load(pretrained, map_location='cpu')
try:
    pretrained_dict = checkpoint[checkpoint_key]
except Exception:
    pretrained_dict = checkpoint['encoder']
```

While the code primarily loads the target_encoder (via checkpoint_key), there is a fallback to the 'encoder' key. This suggests that loading the context encoder is still possible, though it is not the prioritized path.
I'm curious about:
- Was there experimental evidence that the target encoder consistently outperforms the context encoder on downstream tasks?
- If so, is that why the target encoder is prioritized in the implementation?
- Are there specific characteristics of V-JEPA that make the target encoder more suitable for downstream tasks, unlike in other SSL approaches?
Would appreciate any insights into this design choice. Thanks!