Performance of VIT-B/32 is worse than RN50 on CC3M #56

JACKHAHA363 · 2021-09-08T16:40:25Z

Here are my curves. RN50 roughly matches the one shown in the repo, but ViT-B/32 is worse. I am using the hyperparameters from the README. Could you also share the performance curves of ViT-B/32 on CC?

carlini (Maintainer) · 2021-09-08T17:42:57Z

ViT-B performed worse for us on CC than an RN50. I suspect (but cannot prove) this is because there's not enough data, and vision transformers appear more data hungry than ResNets. I don't have the accuracy offhand, but this looks comparable to what we were seeing.


rwightman (Maintainer) · 2022-04-06T00:36:06Z

This is expected, and your numbers appear reasonable. In training quite a few models at the lower end recently, I've found that the ViT-B models (even the smaller ones) underperform similarly sized ResNet models on smaller datasets. This holds up to at least the 12-15M sample range, as I was unable to push ViT-B-32 past RN50 on cc12m or yfcc15m. I feel the crossover point is probably in the 40-100M sample range, but I have not verified that.

One could possibly work around this by using a pretrained backbone for the vision tower. There is partial (preliminary) support for this right now via timm models...

One could modify a model config such as this one to enable the timm_model_pretrained flag: https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/timm-vit_base_patch32_224.json

You'd then be starting with a vision tower pretrained on ImageNet. It significantly speeds up reaching decent zero-shot and eval results, BUT I'd caution against using an ImageNet-pretrained backbone and doing zero-shot eval on ImageNet. You'd probably want an alternate zero-shot test dataset.
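
The config change described above can be sketched roughly as follows. This is only an illustration: the `vision_cfg` nesting and every key other than `timm_model_pretrained` are assumptions about the config layout (the text tower settings are omitted), and `enable_pretrained_vision_tower` is a hypothetical helper, not part of open_clip. Check the linked JSON file for the real fields.

```python
import copy
import json

# Assumed layout of a timm-backed model config. Only the
# timm_model_pretrained flag is named in this thread; the other keys
# are illustrative and should be checked against the real JSON file.
BASE_CONFIG = {
    "embed_dim": 512,
    "vision_cfg": {
        "timm_model_name": "vit_base_patch32_224",
        "timm_model_pretrained": False,
        "image_size": 224,
    },
}


def enable_pretrained_vision_tower(config: dict) -> dict:
    """Return a copy of the config with timm_model_pretrained switched on."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    updated.setdefault("vision_cfg", {})["timm_model_pretrained"] = True
    return updated


new_config = enable_pretrained_vision_tower(BASE_CONFIG)
print(json.dumps(new_config, indent=4))
```

In practice one would make the same one-line change directly in a copy of the JSON file and point training at that model config.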

1 reply

kyleliang919 Sep 6, 2022

@rwightman Do you mind sharing some of the numbers for ViT-B-32 on yfcc15m? Somehow I never got it over 15% with 32 epochs, a 32K batch size, and the default params from the paper. I am just wondering if there is something wrong with my setup.

rwightman (Maintainer) · 2022-04-06T00:37:31Z

Moving to discussion for future reference


kyleliang919 · 2023-01-05T19:04:56Z

@rwightman Not sure if it's still relevant. For ViT-B/32, if you train for more epochs, there is a chance of pushing it beyond convnets on smaller datasets such as YFCC 15M. I am seeing close to 30% top-1 ImageNet-1k accuracy by training for 512 epochs instead of 32. However, because of the learning rate scheduler, it plateaued in the end.
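
To illustrate why a decaying schedule can produce a plateau late in training, here is a minimal sketch of linear warmup followed by cosine decay to zero. The thread does not say which scheduler was used, so the cosine shape is an assumption, and `cosine_lr` and all numeric values are hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup followed by cosine decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical values for illustration only, not taken from the thread.
total_steps, warmup_steps, base_lr = 10_000, 500, 5e-4
for frac in (0.5, 0.9, 0.99):
    step = int(frac * total_steps)
    lr = cosine_lr(step, total_steps, base_lr, warmup_steps)
    print(f"{frac:.0%} of training: lr = {lr:.2e}")
```

Under such a schedule the learning rate in the last few percent of training is a tiny fraction of its peak, so the weights barely move no matter how many epochs the run covers in total.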

