This repository contains the source code for our paper:
Convolutional Networks with Oriented 1D Kernels
ICCV 2023
Alexandre Kirchmeyer, Jia Deng
We provide results, installation instructions, dataset preparation, and training / evaluation scripts in the following sections:
If you find this repository helpful, please consider citing:
@article{kirchmeyer2023convolutional,
title={Convolutional Networks with Oriented 1D Kernels},
author={Kirchmeyer, Alexandre and Deng, Jia},
journal={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2023}
}
The link for all our pre-trained models can be found at: https://drive.google.com/drive/folders/160d6dPKXIgdmBRPTBv7-nk2jh8cGsaa2?usp=sharing
Download them in the pretrained
folder.
name | resolution | EMA acc@1 | acc@1 | #params | FLOPs | model |
---|---|---|---|---|---|---|
ConvNeXt-T (rep.) | 224x224 | 82.0 | 81.8 | 28.6M | 4.5G | model |
ConvNeXt-T-1D | 224x224 | 82.2 | 82.2 | 28.5M | 4.4G | model |
ConvNeXt-T-1D++ | 224x224 | 82.7 | 82.4 | 29.2M | 4.7G | model |
ConvNeXt-T-2D++ | 224x224 | 82.5 | 82.3 | 29.2M | 4.8G | model |
ConvNeXt-B (rep.) | 224x224 | 83.7 | 83.6 | 89M | 15.4G | model |
ConvNeXt-B-1D | 224x224 | 83.8 | 83.6 | 88M | 15.3G | model |
ConvNeXt-B-1D++ | 224x224 | 83.8 | 83.5 | 90M | 15.8G | model |
ConvNeXt-B-2D++ | 224x224 | 84.0 | 83.6 | 91M | 15.9G | model |
To run our models, please create a new conda environment and install the following dependencies.
conda create -n convnext python=3.8
Install Pytorch and torchvision following official instructions. For example:
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
Then install the following dependencies:
pip install ninja # for cuda kernel compilation
pip install timm tensorboardX six # for convnext
And install our cuda kernels:
# Install cuda kernels
cd models/dwoconv1d
pip install -e .
cd ../../models/dwoconv1d_specialized
pip install -e .
cd ../../models/dwoconv1d_reference
pip install -e .
cd ../..
The installation instructions were adapted from: https://github.com/facebookresearch/ConvNeXt/blob/main/INSTALL.md
Download the ImageNet-1K classification dataset and add a reference of the dataset in the data
folder:
mkdir -p data
ln -s path/to/imagenet/dataset data/imagenet
To train a ConvNeXt-1D/2D/1D++/2D++ image classifier, run:
# single-gpu
SPECIALIZED=<1/0> NGPUS=1 VARIANT=<1d/2d/1dpp/2dpp> BATCH_SIZE=<batch_size> MODEL=<convnext_tiny/convnext_base> EXPERIMENT_NAME=<exp name> ./scripts/train_image_classification.sh
# multi-gpu
SPECIALIZED=<1/0> NGPUS=8 VARIANT=<1d/2d/1dpp/2dpp> BATCH_SIZE=<batch_size> MODEL=<convnext_tiny/convnext_base> EXPERIMENT_NAME=<exp name> ./scripts/train_image_classification.sh
For example, to train a ConvNeXt-1D-T image classifier on 8 GPUs with a batch size of 64, run:
SPECIALIZED=0 NGPUS=8 VARIANT=1d BATCH_SIZE=64 MODEL=convnext_tiny EXPERIMENT_NAME=convnext_tiny_1d ./scripts/train_image_classification.sh
To evaluate a pre-trained ConvNeXt-1D/2D/1D++/2D++ image classifier, run:
# non-EMA single-gpu
SPECIALIZED=<1/0> NGPUS=1 USE_EMA=0 VARIANT=<1d/2d/1dpp/2dpp> BATCH_SIZE=<batch_size> MODEL=<convnext_tiny/convnext_base> CHECKPOINT=<CHECKPOINT> ./scripts/eval_image_classification.sh
# EMA multi-gpu
SPECIALIZED=<1/0> NGPUS=8 USE_EMA=1 VARIANT=<1d/2d/1dpp/2dpp> BATCH_SIZE=<batch_size> MODEL=<convnext_tiny/convnext_base> CHECKPOINT=<CHECKPOINT> ./scripts/eval_image_classification.sh
For example, to evaluate the non-EMA and EMA accuracy of a pre-trained ConvNeXt-1D-T image classifier on 1 GPU, run:
# non-EMA
SPECIALIZED=0 NGPUS=1 USE_EMA=0 VARIANT=1d BATCH_SIZE=64 MODEL=convnext_tiny CHECKPOINT=$PWD/pretrained/image_classification/convnext_tiny_1d_best.pth ./scripts/eval_image_classification.sh
# EMA
SPECIALIZED=0 NGPUS=1 USE_EMA=1 VARIANT=1d BATCH_SIZE=64 MODEL=convnext_tiny CHECKPOINT=$PWD/pretrained/image_classification/convnext_tiny_1d_best_ema.pth ./scripts/eval_image_classification.sh
name | Method | AP_box | AP_mask | #params | FLOPs | model |
---|---|---|---|---|---|---|
ConvNeXt-T (rep.) | Cascade Mask R-CNN | 50.2 | 43.6 | 86M | 741G | model |
ConvNeXt-T-1D | Cascade Mask R-CNN | 50.3 | 43.6 | 86M | 739G | model |
ConvNeXt-T-1D++ | Cascade Mask R-CNN | 51.3 | 44.5 | 86M | 739G | model |
ConvNeXt-T-2D++ | Cascade Mask R-CNN | 51.2 | 44.4 | 87M | 741G | model |
ConvNeXt-B (rep.) | Cascade Mask R-CNN | 52.4 | 45.2 | 146M | 964G | model |
ConvNeXt-B-1D | Cascade Mask R-CNN | 52.8 | 45.7 | 145M | 960G | model |
ConvNeXt-B-1D++ | Cascade Mask R-CNN | 52.9 | 46.0 | 147M | 960G | model |
ConvNeXt-B-2D++ | Cascade Mask R-CNN | 52.9 | 45.8 | 148M | 964G | model |
In order to use our pre-trained classifiers as backbones for downstream tasks, we first need to convert the weights to backbone weights as our backbones use slightly smaller kernel sizes. This is because image size is fixed during ImageNet training but they can be arbitrarily big in semantic segmentation / object detection. Because we train with kernel sizes of up to 1x31, which can represent more than 2x the size of the input, this means that some of these weights will be uninitialized. This essentially makes predictions random for larger image sizes. As a result, we "crop" kernel sizes for downstream tasks and provide a conversion script to convert from image classifier weights to object detection weights and use those for our object detection and semantic segmentation backbones.
To convert the weights, run:
python -m scripts.convert_to_backbone_weights --model convnext_tiny --variant 1d --checkpoint pretrained/image_classification/convnext_tiny_1d_best.pth --output_dir pretrained/backbones
To run our models on object detection and semantic segmentation, create the following conda environment.
Note that you can use the same environment for semantic segmentation.
conda create -n mmlab python=3.7 -y
conda activate mmlab
conda install pytorch=1.7.0 torchvision=0.8.0 torchaudio=0.7.0 cudatoolkit=11.0 -c pytorch -y
pip install cython==0.29.33
pip install mmcv-full==1.3.0 -f https://download.openmmlab.com/mmcv/dist/cu110/torch1.7.0/index.html
pip install mmdet==2.20.0
pip install scipy mmsegmentation==0.11.0 timm==0.9.7 fvcore
# Install kernels
cd models/dwoconv1d
pip install -e .
cd ../../models/dwoconv1d_specialized
pip install -e .
cd ../../models/dwoconv1d_reference
pip install -e .
cd ../..
Then clone the repository Swin Detection (commit 6a979e2
) and copy our files from object_detection
into the folder.
Note: This requires a CUDA runtime version < 12 to run (our codebase which is based on ConvNeXt uses MMCV version <= 1.3.0 which only supports CUDA <= 11)
Note that we have only tested object detection on an NVidia RTX 3090. It may not work on older generation GPUs.
Note: When running our models, if you run into the CUDA error invalid argument
, this means that your GPU does not have enough shared memory. It can help to increase Sb
in dwoconv1d.py
to Sb = 8
for example.
To run training and inference, download the COCO dataset and add a reference to the dataset in the object_detection/data
folder:
mkdir -p object_detection/data
ln -s path/to/coco/dataset object_detection/data/coco
For more details, check: https://github.com/open-mmlab/mmdetection/blob/v2.11.0/docs/get_started.md
To train an object detector with a ConvNeXt-1D/2D/1D++/2D++ backbone, run:
# single-gpu training
NGPUS=1 CONFIG=<CONFIG_FILE> PRETRAINED=<PRETRAIN MODEL> EXPERIMENT_NAME=<RUN NAME> ./scripts/train_object_detection.sh
# multi-gpu training
NGPUS=8 CONFIG=<CONFIG_FILE> PRETRAINED=<PRETRAIN MODEL> EXPERIMENT_NAME=<RUN NAME> ./scripts/train_object_detection.sh
For example, to train a Cascade Mask R-CNN model with a ConvNeXt-1D-T backbone and 8 gpus, run:
NGPUS=8 CONFIG=cascade_mask_rcnn_convnext_tiny_1d_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco_in1k PRETRAINED=$PWD/pretrained/backbones/convnext_tiny_1d_best.pth EXPERIMENT_NAME=cascade_mask_rcnn_convnext_tiny_1d ./scripts/train_object_detection.sh
To evaluate an object detector with a ConvNeXt-1D/2D/1D++/2D++ backbone, run:
# single-gpu evaluation
NGPUS=1 CONFIG=<CONFIG_FILE> CHECKPOINT=<CHECKPOINT MODEL> ./scripts/eval_semantic_segmentation.sh
# multi-gpu evaluation
NGPUS=8 CONFIG=<CONFIG_FILE> CHECKPOINT=<CHECKPOINT MODEL> ./scripts/eval_semantic_segmentation.sh
For example, to evaluate a Cascade Mask R-CNN model with a ConvNeXt-1D-T backbone and 8 gpus, run:
NGPUS=8 CONFIG=cascade_mask_rcnn_convnext_tiny_1d_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco_in1k CHECKPOINT=$PWD/pretrained/object_detection/cascade_mask_rcnn_convnext_tiny_1d_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco_in1k_seed0_epoch_36.pth ./scripts/eval_object_detection.sh
name | Method | mIoU (ms) | #params | FLOPs | model |
---|---|---|---|---|---|
ConvNeXt-T (rep.) | UPerNet | 46.6 | 60M | 939G | model |
ConvNeXt-T-1D | UPerNet | 45.2 | 60M | 927G | model |
ConvNeXt-T-1D++ | UPerNet | 47.4 | 60M | 927G | model |
ConvNeXt-T-2D++ | UPerNet | 48.1 | 61M | 939G | model |
ConvNeXt-B (rep.) | UPerNet | 49.4 | 146M | 964G | model |
ConvNeXt-B-1D | UPerNet | 49.4 | 145M | 960G | model |
ConvNeXt-B-1D++ | UPerNet | 50.7 | 147M | 960G | model |
ConvNeXt-B-2D++ | UPerNet | 50.2 | 148M | 964G | model |
Follow the conda environment instructions in Object Detection section.
Then clone the repository Beit (commit 8b57ed1
) and copy our files from semantic_segmentation
into the folder.
To prepare dataset,
cd <root>/semantic_segmentation
mkdir -p data/ade
cd data/ade
wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip
For more details, check: https://github.com/open-mmlab/mmsegmentation/blob/v0.11.0/docs/dataset_prepare.md
To train a segmentation model with a ConvNeXt-1D/2D/1D++/2D++ backbone, run:
# single-gpu training
NGPUS=1 CONFIG=<CONFIG_FILE> PRETRAINED=<PRETRAIN MODEL> EXPERIMENT_NAME=<RUN NAME> ./scripts/train_semantic_segmentation.sh
# multi-gpu training
NGPUS=8 CONFIG=<CONFIG_FILE> PRETRAINED=<PRETRAIN MODEL> EXPERIMENT_NAME=<RUN NAME> ./scripts/train_semantic_segmentation.sh
For example, to train a UperNet model with a ConvNeXt-1D-T backbone and 8 gpus, run:
NGPUS=8 CONFIG=upernet_convnext_tiny_1d_512_160k_ade20k_ms PRETRAINED=$PWD/pretrained/backbones/convnext_tiny_1d_best.pth EXPERIMENT_NAME=upernet_convnext_tiny_1d ./scripts/train_semantic_segmentation.sh
Note: Change data=dict(samples_per_gpu=8)
in the configuration file to reflect the number of gpus you are using. By default, we used 4 GPUS for our training.
To evaluate a segmentation model with a ConvNeXt-1D/2D/1D++/2D++ backbone, run:
# multi-gpu evaluation
NGPUS=4 CONFIG=<CONFIG_FILE> CHECKPOINT=<CHECKPOINT MODEL> ./scripts/eval_semantic_segmentation.sh
For example, to evaluate a UperNet model with a ConvNeXt-1D-T backbone and 4 gpus, run:
NGPUS=4 CONFIG=upernet_convnext_tiny_1d_512_160k_ade20k_ms CHECKPOINT=$PWD/pretrained/semantic_segmentation/upernet_convnext_tiny_1d_512_160k_ade20k_ms_seed0_iter_160000.pth ./scripts/eval_semantic_segmentation.sh
We provide PyTorch-level benchmarks of our oriented 1D kernels.
To run the benchmarks, run:
# Faster specialized implementation
python -m models.dwoconv1d_specialized.dwoconv1d_specialized
# Slower general implementation
python -m models.dwoconv1d.dwoconv1d
To reproduce table 5 of our paper, run:
python -m analysis.benchmark
Note: Our oriented 1D kernels are written in CUDA and will be compiled automatically on the first run of any model. They are cached in $HOME/.cache/torch_extensions
by default. The installation may take a few minutes to complete.
On 1 NVIDIA RTX3090, you should get the following runtimes on the analysis.benchmark
task:
Method | |||
---|---|---|---|
PyTorch | 0.4 | 1.1 | |
Ours | 0.3 | 0.8 | |
PyTorch | 1.4 | 3.8 | |
Ours | 0.9 | 2.6 | |
PyTorch | 5.5 | 15.0 | |
Ours | 3.2 | 9.9 |
To compute the ERF on ConvNeXt-B-1D, run the following command:
python -m analysis.visualize_erf --model convnext_base --variant 1d --data_path data/imagenet --save_path erf_convnext_base_1d.npy --num_images 1000
To visualize the ERF, run the following command:
python -m analysis.analyze_erf --source erf_convnext_base_1d.npy
The code was adapted from RepLKNet. We thank the authors for their work.
We provide two implementations of oriented 1D kernels: specialized vs non-specialized kernels.
We use the specialized implementation to benchmark our kernels and use the non-specialized implementation for network-level inference.
The non-specialized implementation is a bit slower than the specialized implementation, but can take arbitrary sizes as input (provided there is enough GPU shared memory capacity).
The specialized implementation compiles one kernel for every input size, which is not feasible for object detection or arbitrary images.
Note also that our implementations do not support inputs which are not divisible by the stride (e.g. 65x31 for stride=2) which is why we use a slower forward pass forward_safe
for object detection and semantic segmentation. forward_safe
also handles the alignment issues we encountered during development.
Note that our kernels were optimized for specific input sizes (56x56 mainly). Tweaking kernel-level constants may be needed to achieve optimal performance for other input sizes.
Strided convolutions are currently a bit slower than PyTorch strided convolutions, which explains part of the slowdown encountered by ConvNeXt-1D/1D++ over ConvNeXt-2D/2D++.
This repository is built on top of the ConvNeXt codebase, RepLKNet, and NVIDIA Cutlass. We thank the authors for their work.
This project is released under the MIT license. Please see the LICENSE file for more information.