[Feature] Implementation of RAM with a Gradio interface. (#1802)
* [CodeCamp2023-584] Support DINO self-supervised learning in projects (#1756)
  * feat: implement DINO
  * chore: delete debug code
  * chore: implement pre-commit
  * fix: fix imported package
  * chore: pre-commit check
* [CodeCamp2023-340] New version of config adapting the MobileNet algorithm (#1774)
  * add new configs adapting MobileNetV2 and MobileNetV3
  * add a base model config for MobileNetV3; all MobileNetV3 training configs now inherit from it
  * remove the directory _base_/models/mobilenet_v3
* [Feature] Implement a zero-shot CLIP classifier (#1737)
  * zero-shot CLIP
  * modify the zero-shot CLIP config
  * add in1k_sub_prompt (8 prompts) for improvement
  * add some annotation docs
  * CLIP base class and clip_zs subclass
  * some modifications of details after review
  * convert into and use mmpretrain-vit
  * modify names of some files and directories
* RAM initial commit
* [Fix] Fix pipeline bug in the image retrieval inferencer
* [CodeCamp2023-341] Supplement the multimodal dataset documentation - COCO Retrieval
* Update OFA to be compatible with the latest huggingface.
* Update train.py to be compatible with the new config.
* Bump version to v1.1.0
* Update __init__.py

---------

Co-authored-by: LALBJ <[email protected]>
Co-authored-by: DE009 <[email protected]>
Co-authored-by: mzr1996 <[email protected]>
Co-authored-by: 飞飞 <[email protected]>
1 parent c076651 · commit ed5924b
Showing 69 changed files with 4,618 additions and 26 deletions.
@@ -0,0 +1,68 @@
_base_ = '../_base_/default_runtime.py'

# data settings
data_preprocessor = dict(
    type='MultiModalDataPreprocessor',
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    to_rgb=False,
)

test_pipeline = [
    dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
    dict(
        type='PackInputs',
        algorithm_keys=['text'],
        meta_keys=['image_id', 'scale_factor'],
    ),
]

train_dataloader = None
test_dataloader = dict(
    batch_size=32,
    num_workers=8,
    dataset=dict(
        type='CIFAR100',
        data_root='data/cifar100',
        split='test',
        pipeline=test_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=False),
)
test_evaluator = dict(type='Accuracy', topk=(1, 5))

# schedule settings
train_cfg = None
val_cfg = None
test_cfg = dict()

# model settings
model = dict(
    type='CLIPZeroShot',
    vision_backbone=dict(
        type='VisionTransformer',
        arch='base',
        img_size=224,
        patch_size=16,
        drop_rate=0.,
        layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
        pre_norm=True,
    ),
    projection=dict(type='CLIPProjection', in_channels=768, out_channels=512),
    text_backbone=dict(
        type='CLIPTransformer',
        width=512,
        layers=12,
        heads=8,
        attn_mask=True,
    ),
    tokenizer=dict(
        type='AutoTokenizer',
        name_or_path='openai/clip-vit-base-patch16',
        use_fast=False),
    vocab_size=49408,
    transformer_width=512,
    proj_dim=512,
    text_prototype='cifar100',
    text_prompt='openai_cifar100',
    context_length=77,
)
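The mean and std values in the data_preprocessor above are the standard OpenAI CLIP normalization constants, scaled from the 0-1 range to the 0-255 pixel range the preprocessor works in. A quick sanity check of that arithmetic (not part of the commit, just an illustration):

# Quick check: the preprocessor constants above are the familiar OpenAI CLIP
# normalization values expressed on a 0-255 scale.
clip_mean = [0.48145466, 0.4578275, 0.40821073]
clip_std = [0.26862954, 0.26130258, 0.27577711]
print([round(m * 255, 4) for m in clip_mean])  # [122.7709, 116.746, 104.0937]
print([round(s * 255, 4) for s in clip_std])   # [68.5005, 66.6322, 70.3232]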
@@ -0,0 +1,69 @@
_base_ = '../_base_/default_runtime.py'

# data settings
data_preprocessor = dict(
    type='MultiModalDataPreprocessor',
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    to_rgb=True,
)

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
    dict(
        type='PackInputs',
        algorithm_keys=['text'],
        meta_keys=['image_id', 'scale_factor'],
    ),
]

train_dataloader = None
test_dataloader = dict(
    batch_size=32,
    num_workers=8,
    dataset=dict(
        type='ImageNet',
        data_root='data/imagenet',
        split='val',
        pipeline=test_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=False),
)
test_evaluator = dict(type='Accuracy', topk=(1, 5))

# schedule settings
train_cfg = None
val_cfg = None
test_cfg = dict()

# model settings
model = dict(
    type='CLIPZeroShot',
    vision_backbone=dict(
        type='VisionTransformer',
        arch='base',
        img_size=224,
        patch_size=16,
        drop_rate=0.,
        layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
        pre_norm=True,
    ),
    projection=dict(type='CLIPProjection', in_channels=768, out_channels=512),
    text_backbone=dict(
        type='CLIPTransformer',
        width=512,
        layers=12,
        heads=8,
        attn_mask=True,
    ),
    tokenizer=dict(
        type='AutoTokenizer',
        name_or_path='openai/clip-vit-base-patch16',
        use_fast=False),
    vocab_size=49408,
    transformer_width=512,
    proj_dim=512,
    text_prototype='imagenet',
    text_prompt='openai_imagenet_sub',  # openai_imagenet, openai_imagenet_sub
    context_length=77,
)
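As a rough sketch of how a config like the one above could be exercised, MMEngine's Runner can drive the test loop directly. The config and checkpoint paths below are hypothetical placeholders, since the diff does not show the file names or any checkpoint:

# Minimal sketch, assuming the config above is saved locally and a compatible
# converted CLIP checkpoint is available; both paths are placeholders.
from mmengine.config import Config
from mmengine.runner import Runner

cfg = Config.fromfile('configs/clip/clip-vit-base-p16_zeroshot-cls_in1k.py')  # placeholder path
cfg.load_from = 'checkpoints/clip-vit-base-p16_converted.pth'                 # placeholder checkpoint
cfg.work_dir = 'work_dirs/clip_zeroshot_in1k'

runner = Runner.from_cfg(cfg)
metrics = runner.test()  # the Accuracy evaluator reports top-1 / top-5
print(metrics)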
@@ -0,0 +1,68 @@
_base_ = '../_base_/default_runtime.py'

# data settings
data_preprocessor = dict(
    type='MultiModalDataPreprocessor',
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    to_rgb=False,
)

test_pipeline = [
    dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
    dict(
        type='PackInputs',
        algorithm_keys=['text'],
        meta_keys=['image_id', 'scale_factor'],
    ),
]

train_dataloader = None
test_dataloader = dict(
    batch_size=32,
    num_workers=8,
    dataset=dict(
        type='CIFAR100',
        data_root='data/cifar100',
        split='test',
        pipeline=test_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=False),
)
test_evaluator = dict(type='Accuracy', topk=(1, 5))

# schedule settings
train_cfg = None
val_cfg = None
test_cfg = dict()

# model settings
model = dict(
    type='CLIPZeroShot',
    vision_backbone=dict(
        type='VisionTransformer',
        arch='large',
        img_size=224,
        patch_size=14,
        drop_rate=0.,
        layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
        pre_norm=True,
    ),
    projection=dict(type='CLIPProjection', in_channels=1024, out_channels=768),
    text_backbone=dict(
        type='CLIPTransformer',
        width=768,
        layers=12,
        heads=12,
        attn_mask=True,
    ),
    tokenizer=dict(
        type='AutoTokenizer',
        name_or_path='openai/clip-vit-large-patch14',
        use_fast=False),
    vocab_size=49408,
    transformer_width=768,
    proj_dim=768,
    text_prototype='cifar100',
    text_prompt='openai_cifar100',
    context_length=77,
)
@@ -0,0 +1,69 @@
_base_ = '../_base_/default_runtime.py'

# data settings
data_preprocessor = dict(
    type='MultiModalDataPreprocessor',
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    to_rgb=True,
)

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
    dict(
        type='PackInputs',
        algorithm_keys=['text'],
        meta_keys=['image_id', 'scale_factor'],
    ),
]

train_dataloader = None
test_dataloader = dict(
    batch_size=32,
    num_workers=8,
    dataset=dict(
        type='ImageNet',
        data_root='data/imagenet',
        split='val',
        pipeline=test_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=False),
)
test_evaluator = dict(type='Accuracy', topk=(1, 5))

# schedule settings
train_cfg = None
val_cfg = None
test_cfg = dict()

# model settings
model = dict(
    type='CLIPZeroShot',
    vision_backbone=dict(
        type='VisionTransformer',
        arch='large',
        img_size=224,
        patch_size=14,
        drop_rate=0.,
        layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
        pre_norm=True,
    ),
    projection=dict(type='CLIPProjection', in_channels=1024, out_channels=768),
    text_backbone=dict(
        type='CLIPTransformer',
        width=768,
        layers=12,
        heads=12,
        attn_mask=True,
    ),
    tokenizer=dict(
        type='AutoTokenizer',
        name_or_path='openai/clip-vit-large-patch14',
        use_fast=False),
    vocab_size=49408,
    transformer_width=768,
    proj_dim=768,
    text_prototype='imagenet',
    text_prompt='openai_imagenet_sub',  # openai_imagenet, openai_imagenet_sub
    context_length=77,
)
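The tokenizer settings are shared across all four configs. A small sketch (assuming the transformers package is installed and the Hugging Face hub is reachable, neither of which is part of this diff) illustrates where the vocab_size and context_length values come from:

# Sketch only: confirm that the Hugging Face CLIP tokenizer matches the
# vocab_size=49408 and context_length=77 used in the configs above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('openai/clip-vit-large-patch14', use_fast=False)
encoded = tokenizer('a photo of a dog', padding='max_length', max_length=77, truncation=True)
print(tokenizer.vocab_size)       # 49408
print(len(encoded['input_ids']))  # 77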