
# [CVPR 2023] Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Ziyun Zeng*, Yuying Ge*, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge

This repo is the official implementation of the paper *Learning Transferable Spatiotemporal Representations from Natural Script Knowledge*.

*(Figure 2 of the paper)*

## Main Results

### Transferability Evaluation

*(Table 4 of the paper)*

### Action Recognition

*(Tables 5-7 of the paper)*

### Text-to-Video Retrieval

*(Table 6 of the paper)*

## Instructions

### Environment Setup

Before you start, run the following command to set up your Python environment:

```bash
pip install -r requirement.txt
```
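If you want to keep the dependencies isolated, a minimal sketch using a virtual environment might look like the following (the environment directory name `.venv` is arbitrary, not part of the repo):

```bash
# Optional: create and activate an isolated environment first
python3 -m venv .venv
source .venv/bin/activate

# Then install the pinned dependencies
pip install -r requirement.txt
```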

### Dataset Preparation

#### Pre-training Datasets

  1. Download YT-Temporal from here, and put the dataset under the folder data/YTTemporal.
  2. Download WebVid-2M from here, and put the dataset under the folder data/WebVid.
  3. Download CC3M from here, and put the dataset under the folder data/CC3M.
  4. Download the split file from here, and unzip it in the root directory.

#### Downstream Datasets

  1. Download SSV2 from here, and put the dataset under the folder data/SSV2.
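After downloading, the dataset roots should match the paths referenced above. A quick sketch for creating the top-level folders (the internal layout of each dataset is whatever its official release provides):

```bash
# Create the expected top-level dataset folders
mkdir -p data/YTTemporal   # YT-Temporal
mkdir -p data/WebVid       # WebVid-2M
mkdir -p data/CC3M         # CC3M
mkdir -p data/SSV2         # Something-Something V2 (downstream)
```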

### Training and Evaluation

We use 32 NVIDIA V100 GPUs for both pre-training and downstream evaluation. The detailed hyper-parameters can be found in the Appendix of the paper.

#### Pre-training

1. Run the following script to pre-train the model on the YT-Temporal dataset. You need to download the official ImageMAE-Base weights for initialization first.

   ```bash
   bash scripts/train_yt.sh
   ```

2. Run the following script to jointly post-pretrain the model on the CC3M and WebVid-2M datasets. Note that you need to set the variable `load_checkpoint` in `configs/dist-cc-web-pt.json` to the checkpoint path of the YT-Temporal pre-trained model, as sketched below.

   ```bash
   bash scripts/train_cc_web.sh
   ```
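A concrete sketch of the two steps above, assuming "ImageMAE-Base" refers to the official MAE ViT-Base pre-trained checkpoint; the local checkpoint path is a placeholder for wherever your stage-1 run saved its weights:

```bash
# Step 1: fetch the official MAE ViT-Base weights for initialization
wget https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth

# Step 2: point "load_checkpoint" at the stage-1 checkpoint, then launch
# stage 2 (edit configs/dist-cc-web-pt.json by hand, or in place as below)
sed -i 's|"load_checkpoint": *"[^"]*"|"load_checkpoint": "checkpoints/yt_temporal.pth"|' \
    configs/dist-cc-web-pt.json
bash scripts/train_cc_web.sh
```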

#### Downstream Evaluation

We have released our pre-trained models on Google Drive at the following links so that you can quickly reproduce the results reported in our paper.

  1. YT-Temporal: https://drive.google.com/file/d/1JthEHg1ETHp5phHzjuhR1H8SfBlousYD/view?usp=sharing
  2. YT-Temporal + CC3M + WebVid-2M: https://drive.google.com/file/d/19WOHhJZfDtqLvzK_g_Kwr6om1hoMowMe/view?usp=sharing
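If you prefer the command line, the checkpoints can also be fetched by their Google Drive file IDs with `gdown` (assuming it is installed; the output filenames below are arbitrary):

```bash
pip install gdown  # one-time setup

# YT-Temporal checkpoint
gdown 1JthEHg1ETHp5phHzjuhR1H8SfBlousYD -O yt_temporal.pth
# YT-Temporal + CC3M + WebVid-2M checkpoint
gdown 19WOHhJZfDtqLvzK_g_Kwr6om1hoMowMe -O yt_cc_web.pth
```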

Run the following scripts to evaluate the model on different SSV2 tasks.

1. Zero-shot video retrieval (currently supports single-GPU evaluation only; see the sketch after this list):

   ```bash
   bash scripts/zero_ssv2.sh
   ```

2. Linear probe (about 7-8 hours on 32 NVIDIA V100 GPUs):

   ```bash
   bash scripts/linear_ssv2.sh
   ```

3. Fine-tuning (about 7-8 hours on 32 NVIDIA V100 GPUs):

   ```bash
   bash scripts/ft_ssv2.sh
   ```
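Since the zero-shot retrieval script is single-GPU only, it may help to pin the device explicitly (assuming the underlying launcher honors `CUDA_VISIBLE_DEVICES`, as standard PyTorch entry points do):

```bash
# Restrict the run to a single GPU
CUDA_VISIBLE_DEVICES=0 bash scripts/zero_ssv2.sh
```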

## Acknowledgement

## Citation

If you find our work helpful, please cite our paper.

```bibtex
@InProceedings{Zeng_2023_CVPR,
    author    = {Zeng, Ziyun and Ge, Yuying and Liu, Xihui and Chen, Bin and Luo, Ping and Xia, Shu-Tao and Ge, Yixiao},
    title     = {Learning Transferable Spatiotemporal Representations From Natural Script Knowledge},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {23079-23089}
}
```

## License

This project builds on several open-source projects, which are credited accordingly. See License.txt for details.