Performance of ViT-B/32 is worse than RN50 on CC3M #56
-
ViT-B performed worse for us on CC than an RN50. I suspect (but cannot prove) this is because there isn't enough data, and vision transformers appear to be more data hungry than ResNets. I don't have the accuracy on hand, but this looks comparable to what we were seeing.
-
This is expected and your numbers appear reasonable. Having trained quite a few models at the lower end recently, I've found that the ViT-B models (even the smaller ones) underperform similarly sized ResNet models on smaller datasets. This holds up to at least the 12-15M sample range, as I was unable to push ViT-B-32 past RN50 on cc12m or yfcc15m. I feel the crossover point is probably in the 40-100M sample range, but I have not verified that.

One could possibly work around this by using a pretrained backbone for the vision tower. There is some partial (preliminary) support for this right now: one could modify a model config to enable a pretrained vision tower. You'd then be starting with a vision tower pretrained on ImageNet. It significantly speeds up reaching decent zero-shot and eval results, BUT I'd caution against using an ImageNet-pretrained backbone and doing zero-shot eval on ImageNet; you'd probably want an alternate zero-shot test dataset.
-
Moving to discussion for future reference.
-
@rwightman Not sure if it's still relevant. For ViT-B/32, if you train it for more epochs, there might be a chance to push it beyond convnets in the smaller data regime, such as YFCC 15M. I am seeing close to 30% top-1 ImageNet-1k accuracy by training for 512 epochs instead of 32. However, because of the learning rate scheduler, it plateaued at the end.
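To illustrate why a long run can flatten out near the end, here is a minimal sketch of a warmup-plus-cosine learning rate schedule. This is an assumption for illustration only: the comment above does not say which scheduler was used, and the function name `cosine_lr`, the base learning rate, and the one-step-per-epoch simplification are all hypothetical. With a schedule of this shape the learning rate is already below 1% of its peak over the last 32 of 512 epochs, so little further progress would be expected there.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup followed by cosine decay towards zero (illustrative only)."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total_epochs = 512   # treat one scheduler step as one epoch for simplicity
    base_lr = 1e-3       # hypothetical peak learning rate
    for epoch in (0, 128, 256, 384, 480, 511):
        lr = cosine_lr(epoch, total_epochs, base_lr)
        print(f"epoch {epoch:3d}: lr = {lr:.3e}")
```

If the actual scheduler behaves like this, the plateau is a property of the schedule rather than of the model.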
-
Here are my curves. RN50 roughly matches the one shown in the repo, but ViT-B/32 is worse. I am using the hyperparameters from the README. Could you also share the performance curves of ViT-B/32 on CC?