Thanks for the paper and for open-sourcing the code base.
I would like to know how evaluation is performed on the MSR-VTT dataset for zero-shot text-to-video retrieval.
Are the metrics reported for MSR-VTT computed on the entire test split (~2,990 videos, 59,800 captions) or on the 1kA subset (~1,000 videos, 20,000 captions)?
Section C in the Appendix mentions the use of the 1kA subset for MSR-VTT. Is this split used to report the results elsewhere?
Is each of the captions (20 per video) used as a query to perform retrieval and compute the recall metrics?
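For concreteness, here is a minimal sketch of the protocol I am assuming (each caption treated as an independent text query over all test videos, with only its paired video counted as correct). Please correct me if your evaluation differs; the similarity matrix here is just a random stand-in for whatever the text/video encoders produce:

```python
import numpy as np

def recall_at_k(sim, gt_video_idx, ks=(1, 5, 10)):
    """Text-to-video Recall@K.

    sim: [num_captions, num_videos] caption-to-video similarity matrix
         (stand-in for the model's actual scores).
    gt_video_idx: [num_captions] index of the single ground-truth video
                  for each caption; any other retrieved video counts as
                  an error, even if it is semantically plausible.
    """
    # Rank of the ground-truth video for each caption (0 = retrieved first).
    order = np.argsort(-sim, axis=1)
    ranks = np.argmax(order == gt_video_idx[:, None], axis=1)
    return {f"R@{k}": float(np.mean(ranks < k)) for k in ks}

# Using the 1kA sizes mentioned above: 1,000 videos x 20 captions = 20,000 queries.
num_videos, caps_per_video = 1000, 20
sim = np.random.randn(num_videos * caps_per_video, num_videos)  # placeholder scores
gt = np.repeat(np.arange(num_videos), caps_per_video)           # caption i*20+j -> video i
print(recall_at_k(sim, gt))
```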
Are such errors being accounted for?
Since the captions are not very descriptive and many similar videos/captions exist, how are these errors adjusted for? For example, one of the captions for video7960 is "a band performing in a small club", but video8978 fits the same profile. Another caption for the same video7960 is "a group of boys and girls are dancing", but video9957 could also be considered correct if retrieved. I would be happy to provide more such examples.
Looking forward to your clarification. Thanks!