EVA

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Abstract

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset with over a thousand categories and the COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.
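The pretext task described in the abstract (predict target features at masked patch positions, given only the visible patches) can be illustrated with a toy sketch. This is not the EVA implementation: the 40% mask ratio, the negative cosine similarity objective, and the random vectors standing in for the image-text aligned target features are all assumptions made for illustration, and no actual encoder is involved.

```python
import math
import random


def random_mask(num_patches, mask_ratio, rng):
    """Return a sorted list of patch indices hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_feature_loss(predicted, target, masked_indices):
    """Mean negative cosine similarity, computed on masked positions only."""
    sims = [cosine_similarity(predicted[i], target[i]) for i in masked_indices]
    return -sum(sims) / len(sims)


rng = random.Random(0)
num_patches, dim = 16, 8

# Hypothetical stand-ins for the image-text aligned target features.
target = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
masked = random_mask(num_patches, mask_ratio=0.4, rng=rng)

# A perfect prediction reaches the minimum loss of -1; random guesses do worse.
perfect_loss = masked_feature_loss(target, target, masked)
noise = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
random_loss = masked_feature_loss(noise, target, masked)
print(len(masked), round(perfect_loss, 6), random_loss > perfect_loss)
```

Only the masked positions contribute to the loss, so the model gets no credit for features at patches it was allowed to see.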

How to use it?

Predict image

from mmpretrain import inference_model

predict = inference_model('vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])

Use the model

import torch
from mmpretrain import get_model

model = get_model('eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))

Train/Test Command

Prepare your dataset according to the docs.

Train:

python tools/train.py configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py

Test:

python tools/test.py configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth
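The config names used in the commands above encode the training schedule. As a rough sketch, assuming the usual OpenMMLab naming pattern in which `{gpus}xb{batch}` means the number of GPUs times the per-GPU batch size and `{epochs}e` means the number of epochs, a small helper (the name `parse_schedule` is hypothetical and not part of mmpretrain) can recover these values:

```python
import re


def parse_schedule(config_name):
    """Extract GPU count, per-GPU batch size and epochs from a config name.

    Assumes the `{gpus}xb{batch}` and `{epochs}e` tokens follow the usual
    OpenMMLab naming pattern; fields that are absent are returned as None.
    """
    batch = re.search(r"(\d+)xb(\d+)", config_name)
    epochs = re.search(r"-(\d+)e(?:_|$)", config_name)
    gpus, per_gpu = (int(batch.group(1)), int(batch.group(2))) if batch else (None, None)
    return {
        "gpus": gpus,
        "batch_per_gpu": per_gpu,
        "total_batch": gpus * per_gpu if batch else None,
        "epochs": int(epochs.group(1)) if epochs else None,
    }


info = parse_schedule("eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k")
print(info)
```

Under that assumption, the pre-training config corresponds to 16 GPUs with 256 samples each (a total batch of 4096) for 400 epochs. If you train on fewer GPUs, the effective batch size changes and the learning rate may need to be adjusted accordingly.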

Models and results

Pretrained models

| Model | Params (M) | Flops (G) | Config | Download |
| --- | --- | --- | --- | --- |
| `eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k` | 111.78 | 17.58 | config | model \| log |
| `beit-l-p14_3rdparty-eva_in21k`* | 303.18 | 81.08 | config | model |
| `beit-l-p14_eva-pre_3rdparty_in21k`* | 303.18 | 81.08 | config | model |
| `beit-g-p16_3rdparty-eva_30m`* | 1011.32 | 203.52 | config | model |
| `beit-g-p14_3rdparty-eva_30m`* | 1011.60 | 267.17 | config | model |
| `beit-g-p14_eva-30m-pre_3rdparty_in21k`* | 1011.60 | 267.17 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.

Image Classification on ImageNet-1k

| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k` | EVA MAE STYLE | 86.57 | 17.58 | 83.70 | N/A | config | model \| log |
| `vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k` | EVA MAE STYLE | 86.57 | 17.58 | 69.00 | N/A | config | model \| log |
| `beit-l-p14_eva-pre_3rdparty_in1k-196px`* | EVA | 304.14 | 61.57 | 87.94 | 98.5 | config | model |
| `beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px`* | EVA ImageNet-21k | 304.14 | 61.57 | 88.58 | 98.65 | config | model |
| `beit-l-p14_eva-pre_3rdparty_in1k-336px`* | EVA | 304.53 | 191.10 | 88.66 | 98.75 | config | model |
| `beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px`* | EVA ImageNet-21k | 304.53 | 191.10 | 89.17 | 98.86 | config | model |
| `beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px`* | EVA 30M ImageNet-21k | 1013.01 | 620.64 | 89.61 | 98.93 | config | model |
| `beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px`* | EVA 30M ImageNet-21k | 1014.45 | 1906.76 | 89.71 | 98.96 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.
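The FLOPs column in the classification table grows quickly with input resolution while the parameter count barely changes. A short calculation helps explain why, assuming the `p14` and `NNNpx` parts of the model names denote a 14-pixel patch size and a square input of that side length, and ignoring the class token:

```python
def num_patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    side = image_size // patch_size
    return side * side


# Input resolutions taken from the model names in the table above.
for image_size in (196, 336, 560):
    print(image_size, num_patch_tokens(image_size, 14))
```

This gives 196, 576 and 1600 patch tokens respectively. Going from 196px to 336px multiplies the token count by roughly 2.94, while the reported FLOPs for the `beit-l-p14` models rise from 61.57 G to 191.10 G, a factor of roughly 3.10. The slightly larger FLOPs ratio is consistent with self-attention cost growing faster than linearly in the number of tokens.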

Citation

@article{EVA,
  title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
  author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
  journal={arXiv preprint arXiv:2211.07636},
  year={2022}
}
