Number of training captions is < 10000

The MSR-VTT dataset has 10000 videos with 20 captions for each video, but this implementation seems to use only one video-caption pair per video in the training phase. So in total there are <= 10000 training examples.
Has anyone else seen the same thing?
Has anyone changed the code to use all the captions?
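
To make the difference concrete, here is a rough sketch of what I mean, not the repo's actual code. It assumes the annotations can be read as a list of (video_id, caption) tuples; `build_training_pairs` and the `all_captions` flag are names I made up for illustration. With one caption per video the number of training examples equals the number of videos, while using every caption gives 20 times as many.

```python
from collections import defaultdict


def build_training_pairs(annotations, all_captions=True):
    """Build (video_id, caption) training pairs.

    annotations: iterable of (video_id, caption) tuples (assumed layout,
    not necessarily how the repo stores them).
    all_captions: if False, keep only the first caption of each video,
    which is the behaviour described in this post.
    """
    # Group captions by video, preserving the order in which they appear.
    captions_by_video = defaultdict(list)
    for video_id, caption in annotations:
        captions_by_video[video_id].append(caption)

    pairs = []
    for video_id, captions in captions_by_video.items():
        selected = captions if all_captions else captions[:1]
        pairs.extend((video_id, caption) for caption in selected)
    return pairs


# Toy data standing in for the real annotations: 3 videos, 20 captions each.
toy_annotations = [
    (f"video{v}", f"caption {c} for video{v}")
    for v in range(3)
    for c in range(20)
]

one_per_video = build_training_pairs(toy_annotations, all_captions=False)
all_pairs = build_training_pairs(toy_annotations, all_captions=True)
print(len(one_per_video), len(all_pairs))
```

If the training loop samples a random caption per video at each epoch instead, all captions would still be seen over time, so it may be worth checking whether the code does that before changing it.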