Request for Download Link for VideoMAEv2 Pretraining Model Checkpoint #8

Closed · oooolga opened this issue Sep 8, 2024 · 9 comments

oooolga commented Sep 8, 2024

Hi,

Can you confirm that the model provided in the code is the VideoMAE v2 model fine-tuned on the SSv2 dataset? Additionally, is a pre-trained (not fine-tuned) VideoMAE model available, and if so, can you provide the link?

Thank you for your help!

songweige (Owner) commented:

Hi Olga,

Yes! The provided code uses the checkpoint fine-tuned on SSv2 by default, but it should be able to load any VideoMAE-v2 checkpoint. They do have a pre-trained VideoMAE model, which you can find here. Hope this helps!


oooolga commented Sep 8, 2024

Thanks for the swift reply!

oooolga closed this as completed Sep 8, 2024

oooolga commented Sep 10, 2024

Hi Songwei,

I've downloaded the vit_g_hybrid_pt_1200e.pth model from here. However, when I try to load it with your model loader:

```python
import torch
from cdfvd.third_party.VideoMAEv2.utils import load_videomae_model

self.model = load_videomae_model(torch.device(device), 'vit_g_hybrid_pt_1200e.pth')
```

I get the following error:

```
Error(s) in loading state_dict for VisionTransformer: Missing key(s) in state_dict: "patch_embed.proj.weight", "patch_embed.proj.bias", "blocks.0.norm1.weight", "blocks.0.norm1.bias", "blocks.0.attn.q_bias", "blocks.0.attn.v_bias", "blocks.0.attn.qkv.weight", "blocks.0.attn.proj.weight", "blocks.0.attn.proj.bias", "blocks.0.norm2.weight", "blocks.0.norm2.bias", "blocks.0.mlp.fc1.weight", "blocks.0.mlp.fc1.bias", "blocks.0.mlp.fc2.weight", "blocks.0.mlp.fc2.bias", ...
```

Can I use the load_videomae_model function to load the vit_g_hybrid_pt_1200e.pth model, or is it only compatible with SSv2 models?

oooolga reopened this Sep 10, 2024

oooolga commented Sep 10, 2024

Hi Songwei,

I've managed to resolve the loading issue with the pretrained model by modifying the load_videomae_model function to import the pretrain_videomae_giant_patch14_224 model from cdfvd.third_party.VideoMAEv2.videomaev2_pretrain; a rough sketch of the change is below.
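In case it helps others, the change looks roughly like this (a sketch rather than the exact diff; it assumes the checkpoint nests its weights under a 'model' key, which is the usual VideoMAEv2 release format, and that the default constructor arguments match the released checkpoint):

```python
import torch
from cdfvd.third_party.VideoMAEv2.videomaev2_pretrain import pretrain_videomae_giant_patch14_224

def load_videomae_pretrain_model(device, ckpt_path):
    # Build the pretrain (encoder-decoder) architecture instead of the
    # fine-tuned VisionTransformer that load_videomae_model constructs,
    # which is why the original loader reported missing keys.
    model = pretrain_videomae_giant_patch14_224()
    ckpt = torch.load(ckpt_path, map_location='cpu')
    # Unwrap the nested state dict if the checkpoint stores one.
    state_dict = ckpt.get('model', ckpt)
    model.load_state_dict(state_dict)
    return model.to(device).eval()
```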

However, I'm now curious about the feature extraction process using the pretrained model, as described in your paper. Specifically, I'd like to know if the feature extraction was performed similarly to the following snippet:

```python
self.model.encoder.forward_features(videos * 255, mask=...)
```

If so, could you please clarify what value was used for the mask parameter in this context? Was it a tensor of ones (unmasking all patches)?

Thank you for your time and assistance!

Olga

@oooolga
Copy link
Author

oooolga commented Sep 10, 2024

Updated question:
In your paper, you mentioned that features are extracted from the pretrained VideoMAE encoder-decoder architecture by taking the output of the prelogit layer in the encoder and averaging across all patches.

Based on this description, I'm wondering if the feature extraction code for the pretrained model is similar to the following:

```python
self.model.encoder.forward_features(videos * 255, torch.zeros(videos.shape[0], 2048, 1408).to(torch.bool)).mean(dim=1)
```

Could you confirm whether this matches the feature extraction process described in your paper? If there's any discrepancy, could you provide the correct code snippet for feature extraction with the pretrained VideoMAE model? Thank you!

songweige (Owner) commented:

Hi Olga, this is what I did before:

```python
mask = torch.zeros([16, 2048, 1408]).to(torch.bool).cuda() if 'vit_g_hybrid_pt_1200e.pth' in ckpt_path else None
features = model.encoder.forward_features(input_data, mask=mask).mean(1)
stats.append_torch(features, num_gpus=1, rank=0)
```

It seems the only difference is the input range, which should be [0, 1] for model.encoder.forward_features?
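Putting the pieces together, a minimal end-to-end sketch might look like the following (assuming the hypothetical load_videomae_pretrain_model helper sketched earlier in this thread; the mask shape mirrors my snippet above, with 2048 = (16/2) * (224/14)^2 patch tokens and 1408 being the ViT-g embedding width):

```python
import torch

device = torch.device('cuda')
model = load_videomae_pretrain_model(device, 'vit_g_hybrid_pt_1200e.pth')

# videos: (B, C, T, H, W) with 16 frames at 224x224, pixel values in [0, 1]
videos = torch.rand(4, 3, 16, 224, 224, device=device)

# An all-False mask leaves every patch visible, so the mean pools over all
# 2048 patch tokens of the 1408-dim ViT-g features.
mask = torch.zeros(videos.shape[0], 2048, 1408, dtype=torch.bool, device=device)

with torch.no_grad():
    features = model.encoder.forward_features(videos, mask=mask).mean(dim=1)

print(features.shape)  # expected: torch.Size([4, 1408])
```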


oooolga commented Sep 10, 2024

Thanks for the clarification. Super helpful and will definitely check my input range. 😀

oooolga closed this as completed Sep 11, 2024

oooolga commented Sep 12, 2024

@songweige
Thanks a ton, Songwei! You've saved us from a major bug in our code. I was under the impression that the input to the VideoMAE network was 0-255 when I saw lines 133 and 155. However, your comment made me realize that you had rescaled it to 0-1 here - that was a huge catch!
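In other words, if the raw clips arrive as uint8 tensors, they should be rescaled before being passed to the encoder, along these lines:

```python
import torch

# Hypothetical uint8 clips of shape (B, C, T, H, W)
videos_uint8 = torch.randint(0, 256, (4, 3, 16, 224, 224), dtype=torch.uint8)
videos = videos_uint8.float() / 255.0  # now in [0, 1], as the encoder expects
```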

I have a follow-up question regarding preprocessing, related to this issue: issue link. It appears that you didn't normalize the inputs using the ImageNet mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. Am I correct?
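For concreteness, the normalization I mean is the standard ImageNet one, applied per channel after scaling to [0, 1], e.g.:

```python
import torch

videos = torch.rand(4, 3, 16, 224, 224)  # (B, C, T, H, W), already in [0, 1]

# ImageNet channel statistics, reshaped to broadcast over the channel dim.
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1, 1)

videos_normalized = (videos - mean) / std
```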

Thanks again for all your help. We appreciate it and will definitely acknowledge your help in our project!

songweige (Owner) commented:

Hi Olga, thank you for your kind words, and I think you are correct. I mainly followed this function to extract features from the VideoMAE models and didn't check their training code before.

It looks like they applied normalization as part of the augmentation during both training and fine-tuning. It would be good to hear from the authors what the proper way to do preprocessing at inference time is!
