
# [CVPR 2023] Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Ziyun Zeng*, Yuying Ge*, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge

This repo is the official implementation of the paper *Learning Transferable Spatiotemporal Representations from Natural Script Knowledge*.

*(Figure 2 of the paper)*

## Main Results

### Transferability Evaluation

*(Table 4 of the paper)*

### Action Recognition

*(Tables 5-7 of the paper)*

### Text-to-Video Retrieval

*(Table 6 of the paper)*

## Instructions

### Environment Setup

Before you start, run the following command to set up your Python environment:

```bash
pip install -r requirement.txt
```
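If you want to keep the dependencies isolated, a minimal sketch using a virtual environment might look like the following (the environment directory name `.venv` is arbitrary, not part of the repo):

```bash
# Optional: create and activate an isolated environment first
python3 -m venv .venv
source .venv/bin/activate

# Then install the pinned dependencies
pip install -r requirement.txt
```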

### Dataset Preparation

#### Pre-training Datasets

  1. Download YT-Temporal from here, and put the dataset under the folder data/YTTemporal.
  2. Download WebVid-2M from here, and put the dataset under the folder data/WebVid.
  3. Download CC3M from here, and put the dataset under the folder data/CC3M.
  4. Download the split file from here, and unzip it in the root directory.

#### Downstream Datasets

  1. Download SSV2 from here, and put the dataset under the folder data/SSV2.
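After downloading, the dataset roots should match the paths referenced above. A quick sketch for creating the top-level folders (the internal layout of each dataset is whatever its official release provides):

```bash
# Create the expected top-level dataset folders
mkdir -p data/YTTemporal   # YT-Temporal
mkdir -p data/WebVid       # WebVid-2M
mkdir -p data/CC3M         # CC3M
mkdir -p data/SSV2         # Something-Something V2 (downstream)
```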

### Training and Evaluation

We use 32 NVIDIA V100 GPUs for both pre-training and downstream evaluation. The detailed hyper-parameters can be found in the Appendix of the paper.

#### Pre-training

1. Run the following script to pre-train the model on the YT-Temporal dataset. You need to download the official ImageMAE-Base weights for initialization first.

   ```bash
   bash scripts/train_yt.sh
   ```

2. Run the following script to jointly post-pretrain the model on the CC3M and WebVid-2M datasets. Note that you need to set the variable `load_checkpoint` in `configs/dist-cc-web-pt.json` to the checkpoint path of the YT-Temporal pre-trained model, as sketched below.

   ```bash
   bash scripts/train_cc_web.sh
   ```
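A concrete sketch of the two steps above, assuming "ImageMAE-Base" refers to the official MAE ViT-Base pre-trained checkpoint; the local checkpoint path is a placeholder for wherever your stage-1 run saved its weights:

```bash
# Step 1: fetch the official MAE ViT-Base weights for initialization
wget https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth

# Step 2: point "load_checkpoint" at the stage-1 checkpoint, then launch
# stage 2 (edit configs/dist-cc-web-pt.json by hand, or in place as below)
sed -i 's|"load_checkpoint": *"[^"]*"|"load_checkpoint": "checkpoints/yt_temporal.pth"|' \
    configs/dist-cc-web-pt.json
bash scripts/train_cc_web.sh
```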

#### Downstream Evaluation

We have released our pre-trained models on Google Drive at the following links so that you can quickly reproduce the results reported in our paper.

  1. YT-Temporal: https://drive.google.com/file/d/1JthEHg1ETHp5phHzjuhR1H8SfBlousYD/view?usp=sharing
  2. YT-Temporal + CC3M + WebVid-2M: https://drive.google.com/file/d/19WOHhJZfDtqLvzK_g_Kwr6om1hoMowMe/view?usp=sharing
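If you prefer the command line, the checkpoints can also be fetched by their Google Drive file IDs with `gdown` (assuming it is installed; the output filenames below are arbitrary):

```bash
pip install gdown  # one-time setup

# YT-Temporal checkpoint
gdown 1JthEHg1ETHp5phHzjuhR1H8SfBlousYD -O yt_temporal.pth
# YT-Temporal + CC3M + WebVid-2M checkpoint
gdown 19WOHhJZfDtqLvzK_g_Kwr6om1hoMowMe -O yt_cc_web.pth
```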

Run the following scripts to evaluate the model on different SSV2 tasks.

1. Zero-shot video retrieval (currently supports single-GPU evaluation only; see the sketch after this list):

   ```bash
   bash scripts/zero_ssv2.sh
   ```

2. Linear probe (about 7-8 hours on 32 NVIDIA V100 GPUs):

   ```bash
   bash scripts/linear_ssv2.sh
   ```

3. Fine-tuning (about 7-8 hours on 32 NVIDIA V100 GPUs):

   ```bash
   bash scripts/ft_ssv2.sh
   ```
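Since the zero-shot retrieval script is single-GPU only, it may help to pin the device explicitly (assuming the underlying launcher honors `CUDA_VISIBLE_DEVICES`, as standard PyTorch entry points do):

```bash
# Restrict the run to a single GPU
CUDA_VISIBLE_DEVICES=0 bash scripts/zero_ssv2.sh
```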

## Acknowledgement

## Citation

If you find our work helpful, please cite our paper.

```bibtex
@InProceedings{Zeng_2023_CVPR,
    author    = {Zeng, Ziyun and Ge, Yuying and Liu, Xihui and Chen, Bin and Luo, Ping and Xia, Shu-Tao and Ge, Yixiao},
    title     = {Learning Transferable Spatiotemporal Representations From Natural Script Knowledge},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {23079-23089}
}
```

## License

This project builds on several open-source projects, which are credited accordingly. See License.txt for details.