Thanks for the paper and for open-sourcing the code base.
I would like to know how evaluation is performed on the MSR-VTT dataset for zero-shot text-to-video retrieval.
Are the metrics reported for MSR-VTT computed on the entire test split (~2,990 videos, 59,800 captions) or on the 1kA subset (~1,000 videos, 20,000 captions)?
Section C in the Appendix mentions the use of the 1kA subset for MSR-VTT. Is this split used to report the results elsewhere?
Is each of the captions (20 per video) used as a query to perform retrieval and compute the recall metrics?
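For concreteness, here is a minimal sketch of the protocol I am assuming (each caption treated as an independent text query over all test videos, with only its paired video counted as correct). Please correct me if your evaluation differs; the similarity matrix here is just a random stand-in for whatever the text/video encoders produce:

```python
import numpy as np

def recall_at_k(sim, gt_video_idx, ks=(1, 5, 10)):
    """Text-to-video Recall@K.

    sim: [num_captions, num_videos] caption-to-video similarity matrix
         (stand-in for the model's actual scores).
    gt_video_idx: [num_captions] index of the single ground-truth video
                  for each caption; any other retrieved video counts as
                  an error, even if it is semantically plausible.
    """
    # Rank of the ground-truth video for each caption (0 = retrieved first).
    order = np.argsort(-sim, axis=1)
    ranks = np.argmax(order == gt_video_idx[:, None], axis=1)
    return {f"R@{k}": float(np.mean(ranks < k)) for k in ks}

# Using the 1kA sizes mentioned above: 1,000 videos x 20 captions = 20,000 queries.
num_videos, caps_per_video = 1000, 20
sim = np.random.randn(num_videos * caps_per_video, num_videos)  # placeholder scores
gt = np.repeat(np.arange(num_videos), caps_per_video)           # caption i*20+j -> video i
print(recall_at_k(sim, gt))
```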
Are such errors being accounted for?
Since the captions are not very descriptive and many similar videos/captions exist, how are these errors adjusted for? For example, one of the captions for video7960 is "a band performing in a small club", but video8978 fits the same profile. Another caption for the same video7960 is "a group of boys and girls are dancing", but video9957 could also be considered correct if retrieved. I would be happy to provide more such examples.
Looking forward to your clarification. Thanks!