Request for Download Link for VideoMAEv2 Pretraining Model Checkpoint #8

Closed · oooolga opened this issue Sep 8, 2024 · 9 comments

oooolga commented Sep 8, 2024

Hi,

Can you confirm that the model provided in the code is the VideoMAE v2 model fine-tuned on the SSv2 dataset? Additionally, is a pre-trained (not fine-tuned) VideoMAE model available, and if so, can you provide the link?

Thank you for your help!

songweige (Owner) commented:

Hi Olga,

Yes! The provided code uses the checkpoint fine-tuned on SSv2 by default, but it should be able to load any VideoMAE-v2 checkpoint. They do have a pre-trained VideoMAE model, which you can find here. Hope this helps!


oooolga commented Sep 8, 2024

Thanks for the swift reply!

oooolga closed this as completed Sep 8, 2024

oooolga commented Sep 10, 2024

Hi Songwei,

I've downloaded the vit_g_hybrid_pt_1200e.pth model from here. However, when I try to load it with your model loader:

```python
import torch
from cdfvd.third_party.VideoMAEv2.utils import load_videomae_model

self.model = load_videomae_model(torch.device(device), 'vit_g_hybrid_pt_1200e.pth')
```

I get the following error:

```
Error(s) in loading state_dict for VisionTransformer: Missing key(s) in state_dict: "patch_embed.proj.weight", "patch_embed.proj.bias", "blocks.0.norm1.weight", "blocks.0.norm1.bias", "blocks.0.attn.q_bias", "blocks.0.attn.v_bias", "blocks.0.attn.qkv.weight", "blocks.0.attn.proj.weight", "blocks.0.attn.proj.bias", "blocks.0.norm2.weight", "blocks.0.norm2.bias", "blocks.0.mlp.fc1.weight", "blocks.0.mlp.fc1.bias", "blocks.0.mlp.fc2.weight", "blocks.0.mlp.fc2.bias", ...
```

Can I use the load_videomae_model function to load the vit_g_hybrid_pt_1200e.pth model, or is it only compatible with SSv2 models?

oooolga reopened this Sep 10, 2024

oooolga commented Sep 10, 2024

Hi Songwei,

I've managed to resolve the loading issue with the pretrained model by modifying the load_videomae_model function to import the pretrain_videomae_giant_patch14_224 model from cdfvd.third_party.VideoMAEv2.videomaev2_pretrain; a rough sketch of the change is below.
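In case it helps others, the change looks roughly like this (a sketch rather than the exact diff; it assumes the checkpoint nests its weights under a 'model' key, which is the usual VideoMAEv2 release format, and that the default constructor arguments match the released checkpoint):

```python
import torch
from cdfvd.third_party.VideoMAEv2.videomaev2_pretrain import pretrain_videomae_giant_patch14_224

def load_videomae_pretrain_model(device, ckpt_path):
    # Build the pretrain (encoder-decoder) architecture instead of the
    # fine-tuned VisionTransformer that load_videomae_model constructs,
    # which is why the original loader reported missing keys.
    model = pretrain_videomae_giant_patch14_224()
    ckpt = torch.load(ckpt_path, map_location='cpu')
    # Unwrap the nested state dict if the checkpoint stores one.
    state_dict = ckpt.get('model', ckpt)
    model.load_state_dict(state_dict)
    return model.to(device).eval()
```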

However, I'm now curious about the feature extraction process using the pretrained model, as described in your paper. Specifically, I'd like to know if the feature extraction was performed similarly to the following snippet:

```python
self.model.encoder.forward_features(videos * 255, mask=...)
```

If so, could you please clarify what value was used for the mask parameter in this context? Was it a tensor of ones (unmasking all patches)?

Thank you for your time and assistance!

Olga

@oooolga
Copy link
Author

oooolga commented Sep 10, 2024

Updated question:
In your paper, you mentioned that features are extracted from the pretrained VideoMAE encoder-decoder architecture by taking the output of the prelogit layer in the encoder and averaging across all patches.

Based on this description, I'm wondering if the feature extraction code for the pretrained model is similar to the following:

```python
self.model.encoder.forward_features(videos * 255, torch.zeros(videos.shape[0], 2048, 1408).to(torch.bool)).mean(dim=1)
```

Could you confirm whether this matches the feature extraction process described in your paper? If there's any discrepancy, could you provide the correct code snippet for feature extraction with the pretrained VideoMAE model? Thank you!

songweige (Owner) commented:

Hi Olga, this is what I did before:

```python
mask = torch.zeros([16, 2048, 1408]).to(torch.bool).cuda() if 'vit_g_hybrid_pt_1200e.pth' in ckpt_path else None
features = model.encoder.forward_features(input_data, mask=mask).mean(1)
stats.append_torch(features, num_gpus=1, rank=0)
```

It seems the only difference is the input range, which should be [0, 1] for model.encoder.forward_features?
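Putting the pieces together, a minimal end-to-end sketch might look like the following (assuming the hypothetical load_videomae_pretrain_model helper sketched earlier in this thread; the mask shape mirrors my snippet above, with 2048 = (16/2) * (224/14)^2 patch tokens and 1408 being the ViT-g embedding width):

```python
import torch

device = torch.device('cuda')
model = load_videomae_pretrain_model(device, 'vit_g_hybrid_pt_1200e.pth')

# videos: (B, C, T, H, W) with 16 frames at 224x224, pixel values in [0, 1]
videos = torch.rand(4, 3, 16, 224, 224, device=device)

# An all-False mask leaves every patch visible, so the mean pools over all
# 2048 patch tokens of the 1408-dim ViT-g features.
mask = torch.zeros(videos.shape[0], 2048, 1408, dtype=torch.bool, device=device)

with torch.no_grad():
    features = model.encoder.forward_features(videos, mask=mask).mean(dim=1)

print(features.shape)  # expected: torch.Size([4, 1408])
```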


oooolga commented Sep 10, 2024

Thanks for the clarification. Super helpful and will definitely check my input range. 😀

oooolga closed this as completed Sep 11, 2024

oooolga commented Sep 12, 2024

@songweige
Thanks a ton, Songwei! You've saved us from a major bug in our code. I was under the impression that the input to the VideoMAE network was 0-255 when I saw lines 133 and 155. However, your comment made me realize that you had rescaled it to 0-1 here - that was a huge catch!
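In other words, if the raw clips arrive as uint8 tensors, they should be rescaled before being passed to the encoder, along these lines:

```python
import torch

# Hypothetical uint8 clips of shape (B, C, T, H, W)
videos_uint8 = torch.randint(0, 256, (4, 3, 16, 224, 224), dtype=torch.uint8)
videos = videos_uint8.float() / 255.0  # now in [0, 1], as the encoder expects
```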

I have a follow-up question regarding preprocessing, related to this issue: issue link. It appears that you didn't normalize the inputs using the ImageNet mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. Am I correct?
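For concreteness, the normalization I mean is the standard ImageNet one, applied per channel after scaling to [0, 1], e.g.:

```python
import torch

videos = torch.rand(4, 3, 16, 224, 224)  # (B, C, T, H, W), already in [0, 1]

# ImageNet channel statistics, reshaped to broadcast over the channel dim.
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1, 1)

videos_normalized = (videos - mean) / std
```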

Thanks again for all your help. We appreciate it and will definitely acknowledge your help in our project!

songweige (Owner) commented:

Hi Olga, thank you for your kind words, and I think you are correct. I mainly followed this function to extract features from the VideoMAE models and didn't check their training code before.

It looks like they applied normalization as part of the augmentation during both training and fine-tuning. It would be good to hear from the authors what the proper way to do preprocessing at inference time is!
