The MSR-VTT dataset has 10,000 videos with 20 captions for each video, but this implementation seems to use only one video-caption pair per video in the training phase. So in total there are at most 10,000 training examples.
Has anyone else seen the same thing?
Has anyone changed the code?
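To make the difference concrete, here is a minimal sketch of the two ways of building training pairs. It is not taken from this repository: the function name `build_pairs`, the `one_per_video` flag, and the annotation layout (a list of dicts with `video_id` and `caption` keys) are assumptions for illustration, and the data is a toy example.

```python
from collections import defaultdict


def build_pairs(sentences, one_per_video=False):
    """Return a list of (video_id, caption) training pairs.

    `sentences` is assumed to be a list of dicts with "video_id" and
    "caption" keys (hypothetical layout, not confirmed from the repo).
    """
    # Group all captions by the video they describe.
    captions = defaultdict(list)
    for s in sentences:
        captions[s["video_id"]].append(s["caption"])

    pairs = []
    for video_id, caps in captions.items():
        # Either keep a single caption per video or keep all of them.
        chosen = caps[:1] if one_per_video else caps
        pairs.extend((video_id, c) for c in chosen)
    return pairs


# Toy data: 3 videos with 20 captions each.
toy = [
    {"video_id": f"video{v}", "caption": f"caption {c} of video {v}"}
    for v in range(3)
    for c in range(20)
]

single = build_pairs(toy, one_per_video=True)  # one pair per video
full = build_pairs(toy)                        # every video-caption pair
print(len(single), len(full))  # 3 60
```

With one pair per video the number of training examples equals the number of videos, while keeping every caption gives 20 times as many. Sampling a random caption per video at each epoch would be another option, but that is a different design and not shown here.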