(back to README.md and README_multimodal.md for other categories)

Overview

Citation
Other High-level Vision Tasks
Transfer / X-Supervised / X-Shot / Continual Learning
Low-level Vision Tasks
Reinforcement Learning
- Navigation
- Other RL Tasks
Medical
Other Tasks
Attention Mechanisms in Vision/NLP
- Attention for Vision
- NLP
- Both
- Others

Citation

If you find this repository useful, please consider citing this list:

@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}

Other High-level Vision Tasks

Point Cloud / 3D

PCT: "PCT: Point Cloud Transformer", arXiv, 2020 (Tsinghua). [Paper][Jittor][PyTorch (uyzhang)]
Point-Transformer: "Point Transformer", arXiv, 2020 (Ulm University). [Paper]
NDT-Transformer: "NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation", ICRA, 2021 (University of Sheffield). [Paper][PyTorch]
P4Transformer: "Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos", CVPR, 2021 (NUS). [Paper]
SnowflakeNet: "SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer", ICCV, 2021 (Tsinghua). [Paper][PyTorch]
PoinTr: "PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers", ICCV, 2021 (Tsinghua). [Paper][PyTorch]
Point-Transformer: "Point Transformer", ICCV, 2021 (Oxford + CUHK). [Paper][PyTorch (lucidrains)]
CT: "Cloud Transformers: A Universal Approach To Point Cloud Processing Tasks", ICCV, 2021 (Samsung). [Paper]
3DVG-Transformer: "3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds", ICCV, 2021 (Beihang University). [Paper]
PPT-Net: "Pyramid Point Cloud Transformer for Large-Scale Place Recognition", ICCV, 2021 (Nanjing University of Science and Technology). [Paper]
?: "Shape registration in the time of transformers", NeurIPS, 2021 (Sapienza University of Rome). [Paper]
YOGO: "You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module", arXiv, 2021 (Berkeley). [Paper][PyTorch]
DTNet: "Dual Transformer for Point Cloud Analysis", arXiv, 2021 (Southwest University). [Paper]
MLMSPT: "Point Cloud Learning with Transformer", arXiv, 2021 (Southwest University). [Paper]
PQ-Transformer: "PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds", arXiv, 2021 (Tsinghua). [Paper][PyTorch]
PST²: "Spatial-Temporal Transformer for 3D Point Cloud Sequences", WACV, 2022 (Sun Yat-sen University). [Paper]
SCTN: "SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation", AAAI, 2022 (KAUST). [Paper]
AWT-Net: "Adaptive Wavelet Transformer Network for 3D Shape Representation Learning", ICLR, 2022 (NYU). [Paper]
?: "Deep Point Cloud Reconstruction", ICLR, 2022 (KAIST). [Paper]
PointMLP: "Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework", ICLR, 2022 (Northeastern). [Paper][PyTorch]
HiTPR: "HiTPR: Hierarchical Transformer for Place Recognition in Point Cloud", ICRA, 2022 (Nanjing University of Science and Technology). [Paper]
FastPointTransformer: "Fast Point Transformer", CVPR, 2022 (POSTECH). [Paper]
REGTR: "REGTR: End-to-end Point Cloud Correspondences with Transformers", CVPR, 2022 (NUS, Singapore). [Paper][PyTorch]
ShapeFormer: "ShapeFormer: Transformer-based Shape Completion via Sparse Representation", CVPR, 2022 (Shenzhen University). [Paper][Website]
PatchFormer: "PatchFormer: An Efficient Point Transformer with Patch Attention", CVPR, 2022 (Hangzhou Dianzi University). [Paper]
?: "An MIL-Derived Transformer for Weakly Supervised Point Cloud Segmentation", CVPR, 2022 (NTU + NYCU). [Paper][Code (in construction)]
Point-BERT: "Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling", CVPR, 2022 (Tsinghua). [Paper][PyTorch][Website]
GeoTransformer: "Geometric Transformer for Fast and Robust Point Cloud Registration", CVPR, 2022 (National University of Defense Technology, China). [Paper][PyTorch]
PointCLIP: "PointCLIP: Point Cloud Understanding by CLIP", CVPR, 2022 (Shanghai AI Lab). [Paper][PyTorch]
?: "3D Part Assembly Generation with Instance Encoded Transformer", IROS, 2022 (Tongji University). [Paper]
SeedFormer: "SeedFormer: Patch Seeds based Point Cloud Completion with Upsample Transformer", ECCV, 2022 (Tencent). [Paper][PyTorch]
MeshMAE: "MeshMAE: Masked Autoencoders for 3D Mesh Data Analysis", ECCV, 2022 (JD). [Paper]
PPTr: "Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding", ECCV, 2022 (Tsinghua University). [Paper]
Geodesic-Former: "Geodesic-Former: a Geodesic-Guided Few-shot 3D Point Cloud Instance Segmenter", ECCV, 2022 (VinAI Research, Vietnam). [Paper]
LaplacianMesh-Transformer: "Laplacian Mesh Transformer: Dual Attention and Topology Aware Network for 3D Mesh Classification and Segmentation", ECCV, 2022 (CAS). [Paper]
Point-MixSwap: "Point MixSwap: Attentional Point Cloud Mixing via Swapping Matched Structural Divisions", ECCV, 2022 (NYCU + NTU). [Paper][PyTorch]
PointMixer: "PointMixer: MLP-Mixer for Point Cloud Understanding", ECCV, 2022 (KAIST). [Paper]
Point-Transformer-V2: "Point Transformer V2: Grouped Vector Attention and Partition-based Pooling", NeurIPS, 2022 (HKU). [Paper][PyTorch (in construction)]
SPoVT: "SPoVT: Semantic-Prototype Variational Transformer for Dense Point Cloud Semantic Completion", NeurIPS, 2022 (NTU). [Paper][PyTorch][Website]
GSA: "Geodesic Self-Attention for 3D Point Clouds", NeurIPS, 2022 (East China Normal University). [Paper]
P2P: "P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch][Website]
3DTRL: "Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space", NeurIPS, 2022 (Stony Brook). [Paper][PyTorch][Website]
ShapeCrafter: "ShapeCrafter: A Recursive Text-Conditioned 3D Shape Generation Model", NeurIPS, 2022 (Brown). [Paper]
XMFnet: "Cross-modal Learning for Image-Guided Point Cloud Shape Completion", NeurIPS, 2022 (Politecnico di Torino, Italy). [Paper]
Point-M2AE: "Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training", NeurIPS, 2022 (CUHK). [Paper][PyTorch]
LighTN: "LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling", arXiv, 2022 (Beijing Jiaotong University). [Paper]
PMP-Net++: "PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths", arXiv, 2022 (Tsinghua). [Paper]
SnowflakeNet: "Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
3DCTN: "3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification", arXiv, 2022 (University of Waterloo, Canada). [Paper]
VNT-Net: "VNT-Net: Rotational Invariant Vector Neuron Transformers", arXiv, 2022 (Ben-Gurion University of the Negev, Israel). [Paper]
CompleteDT: "CompleteDT: Point Cloud Completion with Dense Augment Inference Transformers", arXiv, 2022 (Beijing Institute of Technology). [Paper]
VN-Transformer: "VN-Transformer: Rotation-Equivariant Attention for Vector Neurons", arXiv, 2022 (Waymo). [Paper]
Voxel-MAE: "Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds", arXiv, 2022 (Chalmers University of Technology, Sweden). [Paper]
MAE3D: "Masked Autoencoders in 3D Point Cloud Representation Learning", arXiv, 2022 (Northwest A&F University, China). [Paper]
Pix4Point: "Pix4Point: Image Pretrained Transformers for 3D Point Cloud Understanding", arXiv, 2022 (KAUST). [Paper][Code (in construction)]
MVP: "Multiple View Performers for Shape Completion", arXiv, 2022 (Columbia University). [Paper]
Simple3D-Former: "Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?", arXiv, 2022 (UT Austin). [Paper][PyTorch]
3DPCT: "3DPCT: 3D Point Cloud Transformer with Dual Self-attention", arXiv, 2022 (University of Waterloo, Canada). [Paper]
PS-Former: "Point Cloud Recognition with Position-to-Structure Attention Transformers", arXiv, 2022 (UCSD). [Paper]
LCPFormer: "LCPFormer: Towards Effective 3D Point Cloud Analysis via Local Context Propagation in Transformers", arXiv, 2022 (Aberystwyth University, UK). [Paper]
R²-MLP: "R²-MLP: Round-Roll MLP for Multi-View 3D Object Recognition", arXiv, 2022 (Baidu). [Paper]
PVT3D: "PVT3D: Point Voxel Transformers for Place Recognition from Sparse Lidar Scans", arXiv, 2022 (TUM). [Paper]
EPCL: "Frozen CLIP Model is Efficient Point Cloud Backbone", arXiv, 2022 (Shanghai AI Lab). [Paper]
CAT: "Context-Aware Transformer for 3D Point Cloud Automatic Annotation", AAAI, 2023 (HKU). [Paper]
ACT: "Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?", ICLR, 2023 (Megvii). [Paper][PyTorch]
AnalogicalNets: "Analogy-Forming Transformers for Few-Shot 3D Parsing", ICLR, 2023 (CMU). [Paper][Website]
ViPFormer: "ViPFormer: Efficient Vision-and-Pointcloud Transformer for Unsupervised Pointcloud Understanding", ICRA, 2023 (Renmin University of China). [Paper][PyTorch]
ProxyFormer: "ProxyFormer: Proxy Alignment Assisted Point Cloud Completion with Missing Part Sensitive Transformer", CVPR, 2023 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
I2P-MAE: "Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
RoITr: "Rotation-Invariant Transformer for Point Cloud Matching", CVPR, 2023 (TUM). [Paper]
SphereFormer: "Spherical Transformer for LiDAR-based 3D Recognition", CVPR, 2023 (CUHK). [Paper][PyTorch]
SPoTr: "Self-positioning Point-based Transformer for Point Cloud Understanding", CVPR, 2023 (Korea University). [Paper][PyTorch (in construction)]
PointCMP: "PointCMP: Contrastive Mask Prediction for Self-supervised Learning on Point Cloud Videos", CVPR, 2023 (Shanghai Jiao Tong). [Paper]
GeoMAE: "GeoMAE: Masked Geometric Target Prediction for Self-supervised Point Cloud Pre-Training", CVPR, 2023 (Tsinghua). [Paper][Code (in construction)]
ULIP: "ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding", CVPR, 2023 (Salesforce). [Paper][PyTorch][Website]
PointConvFormer: "PointConvFormer: Revenge of the Point-based Convolution", CVPR, 2023 (Apple). [Paper]
AnchorFormer: "AnchorFormer: Point Cloud Completion from Discriminative Nodes", CVPR, 2023 (USTC). [Paper][PyTorch]
FlatFormer: "FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer", CVPR, 2023 (MIT). [Paper][Website]
PEAL: "PEAL: Prior-Embedded Explicit Attention Learning for Low-Overlap Point Cloud Registration", CVPR, 2023 (Hangzhou Dianzi University). [Paper]
APES: "Attention-based Point Cloud Edge Sampling", CVPR, 2023 (Karlsruhe Institute of Technology, Germany). [Paper]
GD-MAE: "GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
ShapeClipper: "ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency", CVPR, 2023 (Georgia Tech). [Paper][Code (in construction)][Website]
MSC: "Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning", CVPR, 2023 (HKU). [Paper][PyTorch]
MSP: "Self-supervised Pre-training with Masked Shape Prediction for 3D Scene Understanding", CVPR, 2023 (MPI). [Paper]
MM-3DScene: "MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency", CVPR, 2023 (CAS). [Paper][PyTorch][Website]
?: "Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction", ICML, 2023 (UT Austin). [Paper]
ReCon: "Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining", ICML, 2023 (Megvii). [Paper][PyTorch]
OctFormer: "OctFormer: Octree-based Transformers for 3D Point Clouds", SIGGRAPH, 2023 (Peking University). [Paper][Code (in construction)][Website]
SVDFormer: "SVDFormer: Complementing Point Cloud via Self-view Augmentation and Self-structure Dual-generator", ICCV, 2023 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
TAP: "Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
MATE: "MATE: Masked Autoencoders are Online 3D Test-Time Learners", ICCV, 2023 (Graz University of Technology, Austria). [Paper][PyTorch]
DeFormer: "DeFormer: Integrating Transformers with Deformable Models for 3D Shape Abstraction from a Single Image", ICCV, 2023 (Rutgers). [Paper]
RegFormer: "RegFormer: An Efficient Projection-Aware Transformer Network for Large-Scale Point Cloud Registration", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
PointCLIP-V2: "PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning", ICCV, 2023 (CUHK). [Paper][PyTorch]
CLIP2Point: "CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training", ICCV, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
IDPT: "Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
JM3D: "Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation", ACMMM, 2023 (NetEase, China). [Paper][PyTorch]
Bridge3D: "Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models", NeurIPS, 2023 (Clemson). [Paper][Code (in construction)]
ConDaFormer: "ConDaFormer: Disassembled Transformer with Local Structure Enhancement for 3D Point Cloud Understanding", NeurIPS, 2023 (JD). [Paper][PyTorch]
DiT-3D: "DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation", NeurIPS, 2023 (Huawei). [Paper][PyTorch][Website]
OpenShape: "OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding", NeurIPS, 2023 (UCSD). [Paper][PyTorch][Website]
PointGPT: "PointGPT: Auto-regressively Generative Pre-training from Point Clouds", NeurIPS, 2023 (Beijing Institute of Technology). [Paper]
PIC: "Explore In-Context Learning for 3D Point Cloud Understanding", NeurIPS, 2023 (Sun Yat-sen University). [Paper][PyTorch]
GeoTransformer: "GeoTransformer: Fast and Robust Point Cloud Registration with Geometric Transformer", TPAMI, 2023 (National University of Defense Technology, China). [Paper][PyTorch]
Text4Point: "Joint Representation Learning for Text and 3D Point Cloud", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
FullFormer: "FullFormer: Generating Shapes Inside Shapes", arXiv, 2023 (University of Siegen, Germany). [Paper]
Joint-MAE: "Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training", arXiv, 2023 (CUHK). [Paper]
PointCAT: "PointCAT: Cross-Attention Transformer for point cloud", arXiv, 2023 (Nanjing University of Science and Technology). [Paper][PyTorch]
MGT: "Multi-scale Geometry-aware Transformer for 3D Point Cloud Classification", arXiv, 2023 (TUM). [Paper]
Swin3D: "Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
ViewFormer: "ViewFormer: View Set Attention for Multi-view 3D Shape Understanding", arXiv, 2023 (Renmin University of China). [Paper]
ULIP-2: "ULIP-2: Towards Scalable Multimodal Pre-training For 3D Understanding", arXiv, 2023 (Salesforce). [Paper]
CDFormer: "Collect-and-Distribute Transformer for 3D Point Cloud Analysis", arXiv, 2023 (The University of Sydney). [Paper][PyTorch]
PointCAM: "Self-supervised adversarial masking for 3D point cloud representation learning", arXiv, 2023 (Wrocław University of Science and Technology, Poland). [Paper][PyTorch]
PPT: "Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training", arXiv, 2023 (HKU). [Paper][PyTorch]
Uni3D: "Uni3D: Exploring Unified 3D Representation at Scale", arXiv, 2023 (BAAI). [Paper][PyTorch]
JM3D: "JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues", arXiv, 2023 (Xiamen University). [Paper][PyTorch]
PonderV2: "PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
MeshGPT: "MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers", arXiv, 2023 (TUM). [Paper][Website]
PTv3: "Point Transformer V3: Simpler, Faster, Stronger", arXiv, 2023 (HKU). [Paper][Code (in construction)]
3D-LFM: "3D-LFM: Lifting Foundation Model", arXiv, 2023 (CMU). [Paper][Code][Website]
LAST-PCL: "Language-Assisted 3D Scene Understanding", AAAI, 2024 (Peking). [Paper][Code (in construction)]
MM-Point: "MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding", AAAI, 2024 (Southeast University, China). [Paper]
Point-PEFT: "Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models", AAAI, 2024 (Shanghai AI Lab). [Paper][PyTorch]
DAPT: "Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis", CVPR, 2024 (Huazhong University of Science and Technology (HUST)). [Paper][PyTorch]
UniPVU-Human: "A Unified Framework for Human-centric Point Cloud Video Understanding", CVPR, 2024 (ShanghaiTech). [Paper]
PointMamba: "PointMamba: A Simple State Space Model for Point Cloud Analysis", arXiv, 2024 (Huazhong University of Science and Technology). [Paper][PyTorch]
Swin3D++: "Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding", arXiv, 2024 (Microsoft). [Paper]
PCM: "Point Cloud Mamba: Point Cloud Learning via State Space Model", arXiv, 2024 (Skywork AI, China). [Paper][Code (in construction)]
Point-Mamba: "Point Mamba: A Novel Point Cloud Backbone Based on State Space Model with Octree-Based Ordering Strategy", arXiv, 2024 (Shanghai Jiao Tong). [Paper][PyTorch]
PIC-S: "Point-In-Context: Understanding Point Cloud via In-Context Learning", arXiv, 2024 (Peking). [Paper][Website][PyTorch]
?: "Pose Priors from Language Models", arXiv, 2024 (Berkeley). [Paper]

[Back to Overview]

Pose Estimation

Human-body:
- HOT-Net: "HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation", ACMMM, 2020 (Kwai). [Paper]
- TransPose: "TransPose: Towards Explainable Human Pose Estimation by Transformer", arXiv, 2020 (Southeast University). [Paper][PyTorch]
- PTF: "Locally Aware Piecewise Transformation Fields for 3D Human Mesh Registration", CVPR, 2021 (ETHZ). [Paper][Code (in construction)][Website]
- METRO: "End-to-End Human Pose and Mesh Reconstruction with Transformers", CVPR, 2021 (Microsoft). [Paper][PyTorch]
- PRTR: "Pose Recognition with Cascade Transformers", CVPR, 2021 (UCSD). [Paper][PyTorch]
- Mesh-Graphormer: "Mesh Graphormer", ICCV, 2021 (Microsoft). [Paper][PyTorch]
- THUNDR: "THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers", ICCV, 2021 (Google). [Paper]
- PoseFormer: "3D Human Pose Estimation with Spatial and Temporal Transformers", ICCV, 2021 (UCF). [Paper][PyTorch]
- TransPose: "TransPose: Keypoint Localization via Transformer", ICCV, 2021 (Southeast University, China). [Paper][PyTorch]
- POTR: "Pose Transformers (POTR): Human Motion Prediction With Non-Autoregressive Transformers", ICCVW, 2021 (Idiap). [Paper]
- TransFusion: "TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation", BMVC, 2021 (UC Irvine). [Paper][PyTorch]
- HRT: "HRFormer: High-Resolution Transformer for Dense Prediction", NeurIPS, 2021 (CAS). [Paper][PyTorch]
- POET: "End-to-End Trainable Multi-Instance Pose Estimation with Transformers", arXiv, 2021 (EPFL). [Paper]
- Lifting-Transformer: "Lifting Transformer for 3D Human Pose Estimation in Video", arXiv, 2021 (Peking). [Paper]
- TFPose: "TFPose: Direct Human Pose Estimation with Transformers", arXiv, 2021 (The University of Adelaide). [Paper][PyTorch]
- Skeletor: "Skeletor: Skeletal Transformers for Robust Body-Pose Estimation", arXiv, 2021 (University of Surrey). [Paper]
- HandsFormer: "HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction", arXiv, 2021 (Graz University of Technology). [Paper]
- TTP: "Test-Time Personalization with a Transformer for Human Pose Estimation", NeurIPS, 2021 (UCSD). [Paper][PyTorch][Website]
- GraFormer: "GraFormer: Graph Convolution Transformer for 3D Pose Estimation", arXiv, 2021 (CAS). [Paper]
- GCT: "Geometry-Contrastive Transformer for Generalized 3D Pose Transfer", AAAI, 2022 (University of Oulu). [Paper][PyTorch]
- MHFormer: "MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation", CVPR, 2022 (Peking). [Paper][PyTorch]
- PAHMT: "Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation", CVPR, 2022 (NetEase). [Paper]
- TCFormer: "Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer", CVPR, 2022 (CUHK). [Paper][PyTorch]
- PETR: "End-to-End Multi-Person Pose Estimation With Transformers", CVPR, 2022 (Hikvision). [Paper][PyTorch]
- GraFormer: "GraFormer: Graph-Oriented Transformer for 3D Pose Estimation", CVPR, 2022 (CAS). [Paper]
- Keypoint-Transformer: "Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation", CVPR, 2022 (Graz University of Technology, Austria). [Paper][PyTorch][Website]
- MPS-Net: "Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video", CVPR, 2022 (Academia Sinica). [Paper][Website]
- Ego-STAN: "Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation", CVPRW, 2022 (University of Waterloo, Canada). [Paper]
- AggPose: "AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation", IJCAI, 2022 (Shenzhen Baoan Women’s and Children’s Hospital). [Paper][Code (in construction)]
- MotionMixer: "MotionMixer: MLP-based 3D Human Body Pose Forecasting", IJCAI, 2022 (Ulm University, Germany). [Paper][Code (in construction)]
- Jointformer: "Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation", ICPR, 2022 (Trinity College Dublin, Ireland). [Paper]
- IVT: "IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation", ACMMM, 2022 (Baidu). [Paper]
- FastMETRO: "Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers", ECCV, 2022 (POSTECH). [Paper][PyTorch][Website]
- PPT: "PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation", ECCV, 2022 (UC Irvine). [Paper][PyTorch]
- Poseur: "Poseur: Direct Human Pose Regression with Transformers", ECCV, 2022 (The University of Adelaide, Australia). [Paper]
- ViTPose: "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation", NeurIPS, 2022 (The University of Sydney). [Paper][PyTorch]
- Swin-Pose: "Swin-Pose: Swin Transformer Based Human Pose Estimation", arXiv, 2022 (UMass Lowell). [Paper]
- HeadPosr: "HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders", arXiv, 2022 (ETHZ). [Paper]
- CrossFormer: "CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation", arXiv, 2022 (Canberra University, Australia). [Paper]
- VTP: "VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
- FeatER: "FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER", CVPR, 2023 (UCF). [Paper][Code (in construction)][Website]
- GraphMLP: "GraphMLP: A Graph MLP-Like Architecture for 3D Human Pose Estimation", arXiv, 2022 (Peking University). [Paper]
- siMLPe: "Back to MLP: A Simple Baseline for Human Motion Prediction", arXiv, 2022 (INRIA). [Paper][PyTorch]
- Snipper: "Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet", arXiv, 2022 (University of Alberta, Canada). [Paper][PyTorch]
- OTPose: "OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos", arXiv, 2022 (Korea University). [Paper]
- PoseBERT: "PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling", arXiv, 2022 (NAVER). [Paper][PyTorch]
- KOG-Transformer: "K-Order Graph-oriented Transformer with GraAttention for 3D Pose and Shape Estimation", arXiv, 2022 (CAS). [Paper]
- SoMoFormer: "SoMoFormer: Multi-Person Pose Forecasting with Transformers", arXiv, 2022 (Stanford). [Paper]
- DPIT: "DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation", arXiv, 2022 (Shanghai University). [Paper]
- Uplift-Upsample: "Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers", WACV, 2023 (University of Augsburg, Germany). [Paper][TensorFlow]
- TORE: "TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer", ICCV, 2023 (HKU). [Paper][Code (in construction)][Website]
- MPT: "MPT: Mesh Pre-Training with Transformers for Human Pose and Mesh Reconstruction", arXiv, 2022 (Microsoft). [Paper]
- ViTPose+: "ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation", arXiv, 2022 (The University of Sydney). [Paper][PyTorch]
- POT: "Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation", AAAI, 2023 (Shanghai Jiao Tong). [Paper]
- INT: "Capturing the Motion of Every Joint: 3D Human Pose and Shape Estimation with Independent Tokens", ICLR, 2023 (Southeast University). [Paper]
- TBIFormer: "Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting", CVPR, 2023 (Hangzhou Dianzi University). [Paper]
- PSVT: "PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers", CVPR, 2023 (Baidu). [Paper]
- PCT: "Human Pose as Compositional Tokens", CVPR, 2023 (Microsoft). [Paper][PyTorch][Website]
- OSX: "One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer", CVPR, 2023 (IDEA). [Paper][PyTorch][Website]
- PoseFormerV2: "PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation", CVPR, 2023 (UCF). [Paper][PyTorch][Website]
- SA-HMR: "Learning Human Mesh Recovery in 3D Scenes", CVPR, 2023 (Zhejiang University). [Paper][Code (in construction)][Website]
- DeFormer: "Deformable Mesh Transformer for 3D Human Mesh Recovery", CVPR, 2023 (National Institute of Advanced Industrial Science and Technology (AIST), Japan). [Paper]
- STCFormer: "3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention", CVPR, 2023 (Hefei University of Technology). [Paper]
- DistilPose: "DistilPose: Tokenized Pose Regression With Heatmap Distillation", CVPR, 2023 (Tencent). [Paper][PyTorch]
- LPFormer: "LPFormer: LiDAR Pose Estimation Transformer with Multi-Task Network", CVPRW, 2023 (UCF). [Paper]
- LAMP: "LAMP: Leveraging Language Prompts for Multi-person Pose Estimation", IROS, 2023 (UCF). [Paper][PyTorch]
- DiffPose: "DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation", ICCV, 2023 (Jilin University). [Paper]
- JOTR: "JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery", ICCV, 2023 (Alibaba). [Paper]
- GroupPose: "Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation", ICCV, 2023 (Baidu). [Paper][Paddle][PyTorch]
- CoordFormer: "Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos", ICCV, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
- PoseFix: "PoseFix: Correcting 3D Human Poses with Natural Language", ICCV, 2023 (NAVER). [Paper][Website]
- 4D-Humans: "Humans in 4D: Reconstructing and Tracking Humans with Transformers", ICCV, 2023 (Berkeley). [Paper][PyTorch][Website]
- HopFIR: "HopFIR: Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human Pose Estimation", ICCV, 2023 (Hefei University of Technology). [Paper]
- HumanMAC: "HumanMAC: Masked Motion Completion for Human Motion Prediction", ICCV, 2023 (Tsinghua). [Paper][PyTorch][Website]
- XFormer: "XFormer: Fast and Accurate Monocular 3D Body Capture", arXiv, 2023 (Huya Inc, China). [Paper]
- PGformer: "PGformer: Proxy-Bridged Game Transformer for Multi-Person Extremely Interactive Motion Prediction", arXiv, 2023 (Alibaba). [Paper]
- ?: "Scene-aware Human Pose Generation using Transformer", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
- HoT: "Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation", arXiv, 2023 (Peking). [Paper]
- Pose-Anything: "Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation", arXiv, 2023 (Tel Aviv). [Paper][Code (in construction)][Website]
- PoseGPT: "PoseGPT: Chatting about 3D Human Pose", arXiv, 2023 (MPI). [Paper][Code (in construction)][Website]
- TEMP3D: "TEMP3D: Temporally Continuous 3D Human Pose Estimation Under Occlusions", arXiv, 2023 (UC Riverside). [Paper][Website]
- FinePOSE: "FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models", CVPR, 2024 (University of Science and Technology Beijing). [Paper][PyTorch]
- VLPose: "VLPose: Bridging the Domain Gap in Pose Estimation with Language-Vision Tuning", arXiv, 2024 (CUHK). [Paper]
- ?: "Multi-Human Mesh Recovery with Transformers", arXiv, 2024 (Stanford). [Paper]
- WHAC: "WHAC: World-grounded Humans and Cameras", arXiv, 2024 (SenseTime). [Paper][Code (in construction)][Website]
- AiOS: "AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation", arXiv, 2024 (SenseTime). [Paper][Code (in construction)][Website]
- EgoPoseFormer: "EgoPoseFormer: A Simple Baseline for Egocentric 3D Human Pose Estimation", arXiv, 2024 (Meta). [Paper]
Hands:
- Hand-Transformer: "Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation", ECCV, 2020 (Kwai). [Paper]
- SCAT: "SCAT: Stride Consistency With Auto-Regressive Regressor and Transformer for Hand Pose Estimation", ICCVW, 2021 (Alibaba). [Paper]
- SeTHPose: "Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation", arXiv, 2022 (Queen's University, Canada). [Paper]
- HTT: "Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos", CVPR, 2023 (HKU). [Paper][PyTorch][Website]
- ?: "Image-free Domain Generalization via CLIP for 3D Hand Pose Estimation", arXiv, 2022 (UNIST, Korea). [Paper]
- A2J-Transformer: "A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation from a Single RGB Image", CVPR, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
- H2OTR: "Transformer-Based Unified Recognition of Two Hands Manipulating Objects", CVPR, 2023 (Ulsan National Institute of Science & Technology (UNIST), Korea). [Paper]
- Deformer: "Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation", ICCV, 2023 (CMU). [Paper][Code (in construction)][Website]
- CLIP-Hand3D: "CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting", ACMMM, 2023 (Ocean University of China). [Paper]
Others:
- TAPE: "Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry", arXiv, 2020 (Tianjin University). [Paper]
- T6D-Direct: "T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression", GCPR, 2021 (University of Bonn). [Paper]
- 6D-ViT: "6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning", arXiv, 2021 (University of Science and Technology of China). [Paper]
- RayTran: "RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers", ECCV, 2022 (Google). [Paper]
- DProST: "DProST: Dynamic Projective Spatial Transformer Network for 6D Pose Estimation", ECCV, 2022 (Seoul National University). [Paper][PyTorch]
- AFT-VO: "AFT-VO: Asynchronous Fusion Transformers for Multi-View Visual Odometry Estimation", arXiv, 2022 (University of Surrey, UK). [Paper]
- DPT-VO: "Dense Prediction Transformer for Scale Estimation in Monocular Visual Odometry", arXiv, 2022 (Aeronautics Institute of Technology, Brazil). [Paper]
- ?: "Video based Object 6D Pose Estimation using Transformers", arXiv, 2022 (Georgia Tech). [Paper][PyTorch]
- PoET: "PoET: Pose Estimation Transformer for Single-View, Multi-Object 6D Pose Estimation", arXiv, 2022 (Infineon Technologies Austria AG). [Paper][PyTorch]
- CRT-6D: "CRT-6D: Fast 6D Object Pose Estimation with Cascaded Refinement Transformers", WACV, 2023 (ICL, UK). [Paper][Code (in construction)]
- TokenHPE: "TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers", CVPR, 2023 (Central China Normal University). [Paper][Code (in construction)]
- CLAMP: "CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose", CVPR, 2023 (The University of Sydney). [Paper][Code (in construction)]
- DFTr: "Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation", ICCV, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
- c2f-MS-Trans: "Coarse-to-Fine Multi-Scene Pose Regression with Transformers", TPAMI, 2023 (Bar-Ilan University (BIU), Israel). [Paper]
- TransPoser: "TransPoser: Transformer as an Optimizer for Joint Object Shape and Pose Estimation", arXiv, 2023 (Kyoto University). [Paper]
- RelPose++: "RelPose++: Recovering 6D Poses from Sparse-view Observations", arXiv, 2023 (CMU). [Paper][PyTorch][Website]
- KDSM: "Language-driven Open-Vocabulary Keypoint Detection for Animal Body and Face", arXiv, 2023 (Shanghai AI Lab). [Paper]
- UniPose: "UniPose: Detecting Any Keypoints", arXiv, 2023 (IDEA). [Paper][Code (in construction)][Website]
- SAM-6D: "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation", arXiv, 2023 (CUHK). [Paper][PyTorch (in construction)]
- ?: "Open-vocabulary object 6D pose estimation", arXiv, 2023 (Fondazione Bruno Kessler (FBK), Italy). [Paper]
- FoundationPose: "FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects", arXiv, 2023 (NVIDIA). [Paper][Code (in construction)][Website]

[Back to Overview]

Tracking

General:
- TransTrack: "TransTrack: Multiple-Object Tracking with Transformer", arXiv, 2020 (HKU + ByteDance). [Paper][PyTorch]
- TransformerTrack: "Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking", CVPR, 2021 (USTC). [Paper][PyTorch]
- TransT: "Transformer Tracking", CVPR, 2021 (Dalian University of Technology). [Paper][PyTorch]
- STARK: "Learning Spatio-Temporal Transformer for Visual Tracking", ICCV, 2021 (Microsoft). [Paper][PyTorch]
- HiFT: "HiFT: Hierarchical Feature Transformer for Aerial Tracking", ICCV, 2021 (Tongji University). [Paper][PyTorch]
- DTT: "High-Performance Discriminative Tracking With Transformers", ICCV, 2021 (CAS). [Paper]
- DualTFR: "Learning Tracking Representations via Dual-Branch Fully Transformer Networks", ICCVW, 2021 (Microsoft). [Paper][PyTorch (in construction)]
- TransCenter: "TransCenter: Transformers with Dense Queries for Multiple-Object Tracking", arXiv, 2021 (INRIA + MIT). [Paper]
- TransMOT: "TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking", arXiv, 2021 (Microsoft). [Paper]
- TREG: "Target Transformed Regression for Accurate Tracking", arXiv, 2021 (Nanjing University). [Paper][Code (in construction)]
- TrTr: "TrTr: Visual Tracking with Transformer", arXiv, 2021 (University of Tokyo). [Paper][PyTorch]
- RelationTrack: "RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation", arXiv, 2021 (Huazhong University of Science and Technology). [Paper]
- SiamTPN: "Siamese Transformer Pyramid Networks for Real-Time UAV Tracking", WACV, 2022 (New York University). [Paper]
- MixFormer: "MixFormer: End-to-End Tracking with Iterative Mixed Attention", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
- ToMP: "Transforming Model Prediction for Tracking", CVPR, 2022 (ETHZ). [Paper][PyTorch]
- GTR: "Global Tracking Transformers", CVPR, 2022 (UT Austin). [Paper][PyTorch]
- UTT: "Unified Transformer Tracker for Object Tracking", CVPR, 2022 (Meta). [Paper][Code (in construction)]
- MeMOT: "MeMOT: Multi-Object Tracking with Memory", CVPR, 2022 (Amazon). [Paper]
- CSwinTT: "Transformer Tracking with Cyclic Shifting Window Attention", CVPR, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
- STNet: "Spiking Transformers for Event-Based Single Object Tracking", CVPR, 2022 (Dalian University of Technology). [Paper]
- TrackFormer: "TrackFormer: Multi-Object Tracking with Transformers", CVPR, 2022 (Facebook). [Paper][PyTorch]
- SBT: "Correlation-Aware Deep Tracking", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- SparseTT: "SparseTT: Visual Tracking with Sparse Transformers", IJCAI, 2022 (Beihang University). [Paper][Code (in construction)]
- AiATrack: "AiATrack: Attention in Attention for Transformer Visual Tracking", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
- MOTR: "MOTR: End-to-End Multiple-Object Tracking with TRansformer", ECCV, 2022 (Megvii). [Paper][PyTorch]
- SwinTrack: "SwinTrack: A Simple and Strong Baseline for Transformer Tracking", NeurIPS, 2022 (South China University of Technology). [Paper][PyTorch]
- ModaMixer: "Divert More Attention to Vision-Language Tracking", NeurIPS, 2022 (Beijing Jiaotong University). [Paper][PyTorch]
- TransMOT: "Transformers for Multi-Object Tracking on Point Clouds", IV, 2022 (Bosch). [Paper]
- TransT-M: "High-Performance Transformer Tracking", arXiv, 2022 (Dalian University of Technology). [Paper]
- HCAT: "Efficient Visual Tracking via Hierarchical Cross-Attention Transformer", arXiv, 2022 (Dalian University of Technology). [Paper]
- ?: "Keypoints Tracking via Transformer Networks", arXiv, 2022 (KAIST). [Paper][PyTorch]
- TranSTAM: "Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
- TransFiner: "TransFiner: A Full-Scale Refinement Approach for Multiple Object Tracking", arXiv, 2022 (China University of Geosciences). [Paper]
- LPAT: "Local Perception-Aware Transformer for Aerial Tracking", arXiv, 2022 (Tongji University). [Paper][PyTorch]
- TADN: "Transformer-based assignment decision network for multiple object tracking", arXiv, 2022 (National Technical University of Athens, Greece). [Paper][Code (in construction)]
- Strong-TransCenter: "Strong-TransCenter: Improved Multi-Object Tracking based on Transformers with Dense Representations", arXiv, 2022 (Tel-Aviv University). [Paper][PyTorch]
- MQT: "End-to-end Tracking with a Multi-query Transformer", arXiv, 2022 (Oxford). [Paper]
- ProContEXT: "ProContEXT: Exploring Progressive Context Transformer for Tracking", arXiv, 2022 (Alibaba). [Paper]
- ?: "Efficient Joint Detection and Multiple Object Tracking with Spatially Aware Transformer", arXiv, 2022 (Sony). [Paper]
- MOTRv2: "MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors", CVPR, 2023 (Megvii). [Paper][PyTorch]
- ViPT: "Visual Prompt Multi-Modal Tracking", CVPR, 2023 (Dalian University of Technology). [Paper][PyTorch]
- GRM: "Generalized Relation Modeling for Transformer Tracking", CVPR, 2023 (HKUST). [Paper][PyTorch]
- DropMAE: "DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks", CVPR, 2023 (CUHK). [Paper][PyTorch]
- OVTrack: "OVTrack: Open-Vocabulary Multiple Object Tracking", CVPR, 2023 (ETHZ). [Paper][Website]
- SeqTrack: "SeqTrack: Sequence to Sequence Learning for Visual Object Tracking", CVPR, 2023 (Dalian University of Technology). [Paper][PyTorch]
- TCOW: "Tracking through Containers and Occluders in the Wild", CVPR, 2023 (Columbia). [Paper][Code (in construction)][Website]
- VideoTrack: "VideoTrack: Learning to Track Objects via Video Transformer", CVPR, 2023 (Microsoft). [Paper]
- MAT: "Representation Learning for Visual Object Tracking by Masked Appearance Transfer", CVPR, 2023 (Dalian University of Technology). [Paper][PyTorch]
- MeMOTR: "MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking", ICCV, 2023 (Nanjing University). [Paper][PyTorch]
- ROMTrack: "Robust Object Modeling for Visual Tracking", ICCV, 2023 (Nanjing University). [Paper][PyTorch]
- HiT: "Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking", ICCV, 2023 (Dalian University of Technology). [Paper][PyTorch]
- OC-MOT: "Object-Centric Multiple Object Tracking", ICCV, 2023 (Amazon). [Paper]
- ColTrack: "Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking", ICCV, 2023 (ByteDance). [Paper][PyTorch]
- MVT: "Mobile Vision Transformer-based Visual Object Tracking", BMVC, 2023 (Concordia University, Canada). [Paper][PyTorch]
- MixFormerV2: "MixFormerV2: Efficient Fully Transformer Tracking", NeurIPS, 2023 (Nanjing University). [Paper][PyTorch]
- MENDER: "Type-to-Track: Retrieve Any Object via Prompt-based Tracking", NeurIPS, 2023 (University of Arkansas). [Paper][Code][Website]
- MOTRv3: "MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking", arXiv, 2023 (Megvii). [Paper]
- OmniMotion: "Tracking Everything Everywhere All at Once", arXiv, 2023 (Cornell). [Paper][PyTorch][Website]
- ?: "A Dual-Source Attention Transformer for Multi-Person Pose Tracking", arXiv, 2023 (University of Bonn, Germany). [Paper]
- TAM: "Track Anything: Segment Anything Meets Videos", arXiv, 2023 (SUSTech). [Paper][PyTorch]
- SAM-Track: "Segment and Track Anything", arXiv, 2023 (Zhejiang University). [Paper][PyTorch]
- CoTracker: "CoTracker: It is Better to Track Together", arXiv, 2023 (Meta). [Paper][PyTorch]
- OVTracktor: "Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models", arXiv, 2023 (CMU). [Paper][Website]
- Un-Track: "Single-Model and Any-Modality for Video Object Tracking", arXiv, 2023 (University of Wurzburg (JMU), Germany). [Paper][Code (in construction)]
- TAO-Amodal: "Tracking Any Object Amodally", arXiv, 2023 (CMU). [Paper][Code (in construction)][Website]
- ARTrackV2: "ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe", arXiv, 2023 (Xi'an Jiaotong University). [Paper][Website]
- TrackGPT: "Tracking with Human-Intent Reasoning", arXiv, 2024 (Alibaba). [Paper][Code (in construction)]
- SMAT: "Separable Self and Mixed Attention Transformers for Efficient Object Tracking", WACV, 2024 (Concordia University, Canada). [Paper][PyTorch]
- ContrasTR: "Contrastive Learning for Multi-Object Tracking with Transformers", WACV, 2024 (KU Leuven). [Paper]
- M3SOT: "M3SOT: Multi-frame, Multi-field, Multi-space 3D Single Object Tracking", AAAI, 2024 (Xidian University). [Paper]
- EVPTrack: "Explicit Visual Prompts for Visual Object Tracking", AAAI, 2024 (Guangxi Normal University). [Paper]
- OneTracker: "OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning", CVPR, 2024 (Fudan). [Paper]
- SBT: "Correlation-Embedded Transformer Tracking: A Single-Branch Framework", arXiv, 2024 (SJTU). [Paper][PyTorch]
- TAPTR: "TAPTR: Tracking Any Point with Transformers as Detection", arXiv, 2024 (IDEA). [Paper][Website]
- DINO-Tracker: "DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video", arXiv, 2024 (Weizmann Institute of Science, Israel). [Paper][Website]
3D:
- PTT: "PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds", IROS, 2021 (Northeastern University). [Paper][PyTorch (in construction)]
- LTTR: "3D Object Tracking with Transformer", BMVC, 2021 (Northeastern University, China). [Paper][Code (in construction)]
- PTTR: "PTTR: Relational 3D Point Cloud Object Tracking with Transformer", CVPR, 2022 (Sensetime). [Paper][PyTorch]
- STNet: "3D Siamese Transformer Network for Single Object Tracking on Point Clouds", ECCV, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
- CMT: "CMT: Context-Matching-Guided Transformer for 3D Tracking in Point Clouds", ECCV, 2022 (USTC). [Paper]
- PTT: "Real-time 3D Single Object Tracking with Transformer", TMM, 2022 (Northeastern University, China). [Paper][PyTorch]
- InterTrack: "InterTrack: Interaction Transformer for 3D Multi-Object Tracking", arXiv, 2022 (University of Toronto). [Paper]
- PTTR++: "Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
- GLT-T: "GLT-T: Global-Local Transformer Voting for 3D Single Object Tracking in Point Clouds", AAAI, 2023 (Hangzhou Dianzi University). [Paper][PyTorch]
- 3DMOTFormer: "3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking", ICCV, 2023 (University of Bonn, Germany). [Paper][PyTorch]
- CiteTracker: "CiteTracker: Correlating Image and Text for Visual Tracking", ICCV, 2023 (Peng Cheng Lab). [Paper][PyTorch]
- MoMA-M3T: "Delving into Motion-Aware Matching for Monocular 3D Object Tracking", ICCV, 2023 (UC Merced). [Paper][Code (in construction)]
- TrajectoryFormer: "TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses", ICCV, 2023 (CUHK). [Paper][Code (in construction)]
- SyncTrack: "Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking", ICCV, 2023 (Zhejiang University). [Paper]
- MBPTrack: "MBPTrack: Improving 3D Point Cloud Tracking with Memory Networks and Box Priors", ICCV, 2023 (Tsinghua). [Paper]
- DQTrack: "End-to-end 3D Tracking with Decoupled Queries", ICCV, 2023 (NVIDIA). [Paper][PyTorch]
- GLT-T++: "GLT-T++: Global-Local Transformer for 3D Siamese Tracking with Ranking Loss", arXiv, 2023 (Hangzhou Dianzi University). [Paper][PyTorch]
- BOTT: "BOTT: Box Only Transformer Tracker for 3D Object Tracking", arXiv, 2023 (Motional). [Paper]
- ADA-Track: "ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association", CVPR, 2024 (Mercedes-Benz). [Paper][PyTorch]

[Back to Overview]

Re-ID

PAT: "Diverse Part Discovery: Occluded Person Re-Identification With Part-Aware Transformer", CVPR, 2021 (University of Science and Technology of China). [Paper]
HAT: "HAT: Hierarchical Aggregation Transformers for Person Re-identification", ACMMM, 2021 (Dalian University of Technology). [Paper]
TransReID: "TransReID: Transformer-based Object Re-Identification", ICCV, 2021 (Alibaba). [Paper][PyTorch]
APD: "Transformer Meets Part Model: Adaptive Part Division for Person Re-Identification", ICCVW, 2021 (Meituan). [Paper]
Pirt: "Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification", ACMMM, 2021 (Beihang University). [Paper]
TransMatcher: "Transformer-Based Deep Image Matching for Generalizable Person Re-identification", NeurIPS, 2021 (IIAI). [Paper][PyTorch]
STT: "Spatiotemporal Transformer for Video-based Person Re-identification", arXiv, 2021 (Beihang University). [Paper]
AAformer: "AAformer: Auto-Aligned Transformer for Person Re-Identification", arXiv, 2021 (CAS). [Paper]
TMT: "A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification", arXiv, 2021 (Dalian University of Technology). [Paper]
LA-Transformer: "Person Re-Identification with a Locally Aware Transformer", arXiv, 2021 (University of Maryland Baltimore County). [Paper]
DRL-Net: "Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification", arXiv, 2021 (Peking University). [Paper]
GiT: "GiT: Graph Interactive Transformer for Vehicle Re-identification", arXiv, 2021 (Huaqiao University). [Paper]
OH-Former: "OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification", arXiv, 2021 (Shanghaitech University). [Paper]
CMTR: "CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification", arXiv, 2021 (Beijing Jiaotong University). [Paper]
PFD: "Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer", AAAI, 2022 (Peking). [Paper][PyTorch]
NFormer: "NFormer: Robust Person Re-identification with Neighbor Transformer", CVPR, 2022 (University of Amsterdam, Netherlands). [Paper][Code (in construction)]
DCAL: "Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification", CVPR, 2022 (Advanced Micro Devices, China). [Paper]
CMT: "Cross-Modality Transformer for Visible-Infrared Person Re-identification", ECCV, 2022 (USTC). [Paper]
CAViT: "CAViT: Contextual Alignment Vision Transformer for Video Object Re-identification", ECCV, 2022 (CAS). [Paper][PyTorch]
PiT: "Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval", IEEE Transactions on Industrial Informatics, 2022 (Peking). [Paper]
?: "Motion-Aware Transformer For Occluded Person Re-identification", arXiv, 2022 (NetEase, China). [Paper]
PFT: "Short Range Correlation Transformer for Occluded Person Re-Identification", arXiv, 2022 (Nanjing University of Posts and Telecommunications). [Paper]
?: "CLIP-Driven Fine-grained Text-Image Person Re-identification", arXiv, 2022 (Nanjing University of Science and Technology). [Paper]
SeqTR: "Sequential Transformer for End-to-End Person Search", arXiv, 2022 (East China Normal University). [Paper]
CLIP-ReID: "CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels", arXiv, 2022 (East China Normal University). [Paper]
TMGF: "Transformer Based Multi-Grained Features for Unsupervised Person Re-Identification", WACVW, 2023 (Zhejiang University). [Paper][Code (in construction)]
PMT: "Learning Progressive Modality-shared Transformers for Effective Visible-Infrared Person Re-identification", AAAI, 2023 (Jiangsu University). [Paper][Code (in construction)]
DC-Former: "DC-Former: Diverse and Compact Transformer for Person Re-Identification", AAAI, 2023 (Ant Group). [Paper][PyTorch]
PHA: "PHA: Patch-Wise High-Frequency Augmentation for Transformer-Based Person Re-Identification", CVPR, 2023 (Beihang University). [Paper]
TranSG: "TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch]
UNIReID: "Towards Modality-Agnostic Person Re-Identification With Descriptive Query", CVPR, 2023 (Wuhan University). [Paper][PyTorch]
UniPT: "Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification", ICCV, 2023 (Baidu). [Paper][PyTorch]
PAT: "Part-Aware Transformer for Generalizable Person Re-identification", ICCV, 2023 (UESTC). [Paper][PyTorch]
HAP: "HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception", NeurIPS, 2023 (Baidu). [Paper][PyTorch][Website]
TP-TPS: "Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search", arXiv, 2023 (Tencent). [Paper]
PLIP: "PLIP: Language-Image Pre-training for Person Representation Learning", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][Code (in construction)]
SSCP: "Selecting Learnable Training Samples is All DETRs Need in Crowded Pedestrian Detection", arXiv, 2023 (Chongqing University of Posts and Telecommunications). [Paper]
PI-VL: "Exploring Part-Informed Visual-Language Learning for Person Re-Identification", arXiv, 2023 (iFLYTEK, China). [Paper]
TBPS-CLIP: "An Empirical Study of CLIP for Text-based Person Search", arXiv, 2023 (Soochow University, China). [Paper][PyTorch]
PersonMAE: "PersonMAE: Person Re-Identification Pre-Training with Masked AutoEncoders", arXiv, 2023 (Microsoft). [Paper]
TF-CLIP: "TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification", AAAI, 2024 (Dalian University of Technology). [Paper]
TOP-ReID: "TOP-ReID: Multi-spectral Object Re-Identification with Token Permutation", AAAI, 2024 (Dalian University of Technology). [Paper][PyTorch]
MP-ReID: "Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification", AAAI, 2024 (Eastern Institute of Technology, China). [Paper]
EDITOR: "Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification", CVPR, 2024 (Dalian University of Technology). [Paper][PyTorch]
VDT: "View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network", CVPR, 2024 (Sun Yat-Sen University). [Paper][PyTorch]
MLLM4Text-ReID: "Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID", CVPR, 2024 (South China University of Technology). [Paper][PyTorch]
AIO: "All in One Framework for Multimodal Re-identification in the Wild", CVPR, 2024 (Wuhan University). [Paper]

[Back to Overview]

Face

General:
- FAU-Transformer: "Facial Action Unit Detection With Transformers", CVPR, 2021 (Rakuten Institute of Technology). [Paper]
- TADeT: "Mitigating Bias in Visual Transformers via Targeted Alignment", BMVC, 2021 (Georgia Tech). [Paper]
- ViT-Face: "Face Transformer for Recognition", arXiv, 2021 (Beijing University of Posts and Telecommunications). [Paper]
- FaceT: "Learning to Cluster Faces via Transformer", arXiv, 2021 (Alibaba). [Paper]
- VidFace: "VidFace: A Full-Transformer Solver for Video Face Hallucination with Unaligned Tiny Snapshots", arXiv, 2021 (Zhejiang University). [Paper]
- FAA: "Shuffle Transformer with Feature Alignment for Video Face Parsing", arXiv, 2021 (Tencent). [Paper]
- FaRL: "General Facial Representation Learning in a Visual-Linguistic Manner", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- FaceFormer: "FaceFormer: Speech-Driven 3D Facial Animation with Transformers", CVPR, 2022 (HKU). [Paper][PyTorch][Website]
- PhysFormer: "PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer", CVPR, 2022 (University of Oulu, Finland). [Paper][PyTorch]
- VTP: "Sub-word Level Lip Reading With Visual Attention", CVPR, 2022 (Oxford). [Paper]
- Label2Label: "Label2Label: A Language Modeling Framework for Multi-Attribute Learning", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
- FPVT: "Face Pyramid Vision Transformer", BMVC, 2022 (FloppyDisk.AI, Pakistan). [Paper][PyTorch][Website]
- fViT: "Part-based Face Recognition with Vision Transformers", BMVC, 2022 (Queen Mary University of London). [Paper]
- EventFormer: "EventFormer: AU Event Transformer for Facial Action Unit Event Detection", arXiv, 2022 (Peking). [Paper]
- MFT: "Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers", arXiv, 2022 (SUNY Binghamton). [Paper]
- VC-TRSF: "Self-supervised Video-centralised Transformer for Video Face Clustering", arXiv, 2022 (ICL). [Paper]
- MARLIN: "MARLIN: Masked Autoencoder for facial video Representation LearnINg", CVPR, 2023 (Monash University, Australia). [Paper][PyTorch]
- TransFace: "TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective", ICCV, 2023 (Alibaba). [Paper][PyTorch]
- FaceXFormer: "FaceXFormer: A Unified Transformer for Facial Analysis", arXiv, 2024 (JHU). [Paper][PyTorch][Website]
- Arc2Face: "Arc2Face: A Foundation Model of Human Faces", arXiv, 2024 (ICL). [Paper][PyTorch][Website]
Facial Landmark:
- Clusformer: "Clusformer: A Transformer Based Clustering Approach to Unsupervised Large-Scale Face and Visual Landmark Recognition", CVPR, 2021 (VinAI Research, Vietnam). [Paper]
- LOTR: "LOTR: Face Landmark Localization Using Localization Transformer", arXiv, 2021 (Sertis, Thailand). [Paper]
- SLPT: "Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning", CVPR, 2022 (University of Technology Sydney). [Paper][PyTorch]
- DTLD: "Towards Accurate Facial Landmark Detection via Cascaded Transformers", CVPR, 2022 (Samsung). [Paper]
- RePFormer: "RePFormer: Refinement Pyramid Transformer for Robust Facial Landmark Detection", arXiv, 2022 (CUHK). [Paper]
Face Low-Level Vision:
- Latent-Transformer: "A Latent Transformer for Disentangled Face Editing in Images and Videos", ICCV, 2021 (Institut Polytechnique de Paris). [Paper][PyTorch]
- TANet: "TANet: A new Paradigm for Global Face Super-resolution via Transformer-CNN Aggregation Network", arXiv, 2021 (Wuhan Institute of Technology). [Paper]
- FAT: "Facial Attribute Transformers for Precise and Robust Makeup Transfer", WACV, 2022 (University of Rochester). [Paper]
- SSAT: "SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal", AAAI, 2022 (Wuhan University). [Paper][PyTorch]
- TransEditor: "TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing", CVPR, 2022 (Shanghai AI Lab). [Paper][PyTorch][Website]
- RestoreFormer: "RestoreFormer: High-Quality Blind Face Restoration From Undegraded Key-Value Pairs", CVPR, 2022 (HKU). [Paper]
- HairCLIP: "HairCLIP: Design Your Hair by Text and Reference Image", CVPR, 2022 (USTC). [Paper][PyTorch]
- AnyFace: "AnyFace: Free-style Text-to-Face Synthesis and Manipulation", CVPR, 2022 (CAS). [Paper]
- CodeFormer: "Towards Robust Blind Face Restoration with Codebook Lookup Transformer", NeurIPS, 2022 (NTU, Singapore). [Paper][PyTorch (in construction)][Website]
- Cycle-Text2Face: "Cycle Text2Face: Cycle Text-to-face GAN via Transformers", arXiv, 2022 (Shahed University, Iran). [Paper]
- FaceFormer: "FaceFormer: Scale-aware Blind Face Restoration with Transformers", arXiv, 2022 (Tencent). [Paper]
- text2StyleGAN: "Text-Free Learning of a Natural Language Interface for Pretrained Face Generators", arXiv, 2022 (Toyota Technological Institute, Chicago). [Paper][PyTorch]
- ManiCLIP: "ManiCLIP: Multi-Attribute Face Manipulation from Text", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
- FEAT: "FEAT: Face Editing with Attention", arXiv, 2022 (Shenzhen University). [Paper]
- CoralStyleCLIP: "CoralStyleCLIP: Co-optimized Region and Layer Selection for Image Editing", CVPR, 2023 (Adobe). [Paper]
- CLIP2Protect: "CLIP2Protect: Protecting Facial Privacy Using Text-Guided Makeup via Adversarial Latent Search", CVPR, 2023 (MBZUAI). [Paper][Code (in construction)]
- PATMAT: "PATMAT: Person Aware Tuning of Mask-Aware Transformer for Face Inpainting", ICCV, 2023 (CMU). [Paper]
- HairCLIPv2: "HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending", ICCV, 2023 (USTC). [Paper][Code (in construction)]
- RestoreFormer++: "RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs", arXiv, 2023 (HKU). [Paper]
Facial Expression:
- TransFER: "TransFER: Learning Relation-aware Facial Expression Representations with Transformers", ICCV, 2021 (CAS). [Paper]
- CVT-Face: "Robust Facial Expression Recognition with Convolutional Visual Transformers", arXiv, 2021 (Hunan University). [Paper]
- MViT: "MViT: Mask Vision Transformer for Facial Expression Recognition in the wild", arXiv, 2021 (University of Science and Technology of China). [Paper]
- ViT-SE: "Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition", arXiv, 2021 (CentraleSupélec, France). [Paper]
- EST: "Expression Snippet Transformer for Robust Video-based Facial Expression Recognition", arXiv, 2021 (China University of Geosciences). [Paper][PyTorch]
- MFEViT: "MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition", arXiv, 2021 (University of Science and Technology of China). [Paper]
- F-PDLS: "Vision Transformer Equipped with Neural Resizer on Facial Expression Recognition Task", ICASSP, 2022 (KAIST). [Paper]
- ?: "Transformer-based Multimodal Information Fusion for Facial Expression Analysis", arXiv, 2022 (NetEase, China). [Paper]
- ?: "Facial Expression Recognition with Swin Transformer", arXiv, 2022 (Dongguk University, Korea). [Paper]
- POSTER: "POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition", arXiv, 2022 (UCF). [Paper]
- STT: "Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild", arXiv, 2022 (Hunan University). [Paper]
- FaceMAE: "FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders", arXiv, 2022 (NUS). [Paper][Code (in construction)]
- TransFA: "TransFA: Transformer-based Representation for Face Attribute Evaluation", arXiv, 2022 (Xidian University). [Paper]
- AU-CVT: "AU-Supervised Convolutional Vision Transformers for Synthetic Facial Expression Recognition", arXiv, 2022 (Shenzhen Technology University). [Paper][PyTorch]
- ?: "Multi-Task Transformer with uncertainty modelling for Face Based Affective Computing", arXiv, 2022 (Datakalab, France). [Paper]
- APViT: "Vision Transformer with Attentive Pooling for Robust Facial Expression Recognition", arXiv, 2022 (Baidu). [Paper]
- Micron-BERT: "Micron-BERT: BERT-based Facial Micro-Expression Recognition", CVPR, 2023 (University of Arkansas). [Paper][PyTorch (in construction)]
- FRL-DGT: "Feature Representation Learning with Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition", CVPR, 2023 (Wuhan University). [Paper]
- Text2Listen: "Can Language Models Learn to Listen?", ICCV, 2023 (Berkeley). [Paper][Website]
- CLEF: "Weakly-Supervised Text-driven Contrastive Learning for Facial Behavior Understanding", ICCV, 2023 (Binghamton University). [Paper]
- EmoCLIP: "EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition", arXiv, 2023 (Queen Mary University of London). [Paper][PyTorch]
Attack-related:
- ?: "Video Transformer for Deepfake Detection with Incremental Learning", ACMMM, 2021 (MBZUAI). [Paper]
- ViTranZFAS: "On the Effectiveness of Vision Transformers for Zero-shot Face Anti-Spoofing", International Joint Conference on Biometrics (IJCB), 2021 (Idiap). [Paper]
- MTSS: "Multi-Teacher Single-Student Visual Transformer with Multi-Level Attention for Face Spoofing Detection", BMVC, 2021 (National Taiwan Ocean University). [Paper]
- TransRPPG: "TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection", arXiv, 2021 (University of Oulu). [Paper]
- CViT: "Deepfake Video Detection Using Convolutional Vision Transformer", arXiv, 2021 (Jimma University). [Paper]
- ViT-Distill: "Deepfake Detection Scheme Based on Vision Transformer and Distillation", arXiv, 2021 (Sookmyung Women’s University). [Paper]
- M2TR: "M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection", arXiv, 2021 (Fudan University). [Paper]
- Cross-ViT: "Combining EfficientNet and Vision Transformers for Video Deepfake Detection", arXiv, 2021 (University of Pisa). [Paper][PyTorch]
- ICT: "Protecting Celebrities from DeepFake with Identity Consistency Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- GGViT: "GGViT: Multistream Vision Transformer Network in Face2Face Facial Reenactment Detection", ICPR, 2022 (CAS). [Paper]
- ?: "Hybrid Transformer Network for Deepfake Detection", International Conference on Content-Based Multimedia Indexing (CBMI), 2022 (MediaFutures, Norway). [Paper]
- ViTAF: "Adaptive Transformers for Robust Few-shot Cross-domain Face Anti-spoofing", ECCV, 2022 (Google). [Paper]
- UIA-ViT: "UIA-ViT: Unsupervised Inconsistency-Aware Method Based on Vision Transformer for Face Forgery Detection", ECCV, 2022 (USTC). [Paper]
- ?: "Multi-Scale Wavelet Transformer for Face Forgery Detection", ACCV, 2022 (Hikvision). [Paper]
- ?: "Self-supervised Transformer for Deepfake Detection", arXiv, 2022 (USTC, China). [Paper]
- ViTransPAD: "ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection", arXiv, 2022 (University of La Rochelle, France). [Paper]
- ?: "Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection", arXiv, 2022 (National Research Council, Italy). [Paper]
- STDT: "Deepfake Video Detection with Spatiotemporal Dropout Transformer", arXiv, 2022 (CAS). [Paper]
- ?: "Deep Convolutional Pooling Transformer for Deepfake Detection", arXiv, 2022 (HKU). [Paper]
- DGM⁴: "Detecting and Grounding Multi-Modal Media Manipulation", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
- FLIP: "FLIP: Cross-domain Face Anti-spoofing with Language Guidance", ICCV, 2023 (MBZUAI). [Paper][PyTorch][Website]
- Face-Transformer: "Face Transformer: Towards High Fidelity and Accurate Face Swapping", arXiv, 2023 (NTU, Singapore). [Paper]
- DGM⁴: "Detecting and Grounding Multi-Modal Media Manipulation and Beyond", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
- AntifakePrompt: "AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors", arXiv, 2023 (NYCU). [Paper]
- MMDG: "Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing", CVPR, 2024 (Beihang University). [Paper][Code (in construction)]
Fairness:
- TADeT: "Mitigating Bias in Visual Transformers via Targeted Alignment", BMVC, 2021 (Georgia Tech). [Paper]
Generation:
- Describe3D: "High-Fidelity 3D Face Generation from Natural Language Descriptions", CVPR, 2023 (Nanjing University). [Paper][PyTorch]
- LipFormer: "LipFormer: High-Fidelity and Generalizable Talking Face Generation With a Pre-Learned Facial Codebook", CVPR, 2023 (Alibaba). [Paper]
- ?: "High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning", CVPR, 2023 (Zhejiang University). [Paper]
3D:
- CodeTalker: "CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior", CVPR, 2023 (CUHK). [Paper][PyTorch][Website]
Age:
- DAA: "DAA: A Delta Age AdaIN operation for age estimation via binary code transformer", CVPR, 2023 (Jiayu Intelligent Technology, China). [Paper][PyTorch]

[Back to Overview]

Neural Architecture Search

HR-NAS: "HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers", CVPR, 2021 (HKU). [Paper][PyTorch]
CATE: "CATE: Computation-aware Neural Architecture Encoding with Transformers", ICML, 2021 (Michigan State). [Paper]
AutoFormer: "AutoFormer: Searching Transformers for Visual Recognition", ICCV, 2021 (Microsoft). [Paper][PyTorch]
GLiT: "GLiT: Neural Architecture Search for Global and Local Image Transformer", ICCV, 2021 (The University of Sydney + SenseTime). [Paper]
BossNAS: "BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search", ICCV, 2021 (Monash University). [Paper][PyTorch]
ViT-ResNAS: "Searching for Efficient Multi-Stage Vision Transformers", ICCVW, 2021 (MIT). [Paper][PyTorch]
AutoformerV2: "Searching the Search Space of Vision Transformer", NeurIPS, 2021 (Microsoft). [Paper][PyTorch]
TNASP: "TNASP: A Transformer-based NAS Predictor with a Self-evolution Framework", NeurIPS, 2021 (CAS + Kuaishou). [Paper]
PSViT: "PSViT: Better Vision Transformer via Token Pooling and Attention Sharing", arXiv, 2021 (The University of Sydney + SenseTime). [Paper]
As-ViT: "Auto-scaling Vision Transformers without Training", ICLR, 2022 (UT Austin). [Paper][PyTorch]
NASViT: "NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training", ICLR, 2022 (Facebook). [Paper]
TF-TAS: "Training-free Transformer Architecture Search", CVPR, 2022 (Tencent). [Paper]
ViT-Slim: "Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space", CVPR, 2022 (MBZUAI). [Paper][PyTorch]
BurgerFormer: "Searching for BurgerFormer with Micro-Meso-Macro Space Design", ICML, 2022 (CAS). [Paper][Code (in construction)]
UniNet: "UniNet: Unified Architecture Search with Convolution, Transformer, and MLP", ECCV, 2022 (CUHK + SenseTime). [Paper]
ViTAS: "Vision Transformer Architecture Search", ECCV, 2022 (The University of Sydney + SenseTime). [Paper]
VTCAS: "Vision Transformer with Convolutions Architecture Search", arXiv, 2022 (Donghua University). [Paper]
NOAH: "Neural Prompt Search", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
FocusFormer: "FocusFormer: Focusing on What We Need via Architecture Sampler", arXiv, 2022 (Monash University, Australia). [Paper]
NAR-Former: "NAR-Former: Neural Architecture Representation Learning towards Holistic Attributes Prediction", CVPR, 2023 (Xidian University, China). [Paper][PyTorch]
MDL-NAS: "MDL-NAS: A Joint Multi-Domain Learning Framework for Vision Transformer", CVPR, 2023 (SenseTime). [Paper]
AutoTaskFormer: "AutoTaskFormer: Searching Vision Transformers for Multi-task Learning", arXiv, 2023 (Microsoft). [Paper]
GPT-NAS: "GPT-NAS: Neural Architecture Search with the Generative Pre-Trained Model", arXiv, 2023 (Sichuan University). [Paper]
NAR-Former-V2: "NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning", arXiv, 2023 (Intellifusion, China). [Paper]
AutoST: "AutoST: Training-free Neural Architecture Search for Spiking Transformers", arXiv, 2023 (NC State). [Paper]
TurboViT: "TurboViT: Generating Fast Vision Transformers via Generative Architecture Search", arXiv, 2023 (University of Waterloo). [Paper]
FLORA: "FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer", WACV, 2024 (NYCU). [Paper][PyTorch]
Auto-Prox: "Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery", AAAI, 2024 (National University of Defense Technology, China). [Paper][Code (in construction)]

[Back to Overview]

Scene Graph

BGT-Net: "BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation", CVPRW, 2021 (ETHZ). [Paper]
STTran: "Spatial-Temporal Transformer for Dynamic Scene Graph Generation", ICCV, 2021 (Leibniz University Hannover, Germany). [Paper][PyTorch]
SGG-NLS: "Learning to Generate Scene Graph from Natural Language Supervision", ICCV, 2021 (University of Wisconsin-Madison). [Paper][PyTorch]
SGG-Seq2Seq: "Context-Aware Scene Graph Generation With Seq2Seq Transformers", ICCV, 2021 (Layer 6 AI, Canada). [Paper][PyTorch]
RELAX: "Image-Text Alignment using Adaptive Cross-attention with Transformer Encoder for Scene Graphs", BMVC, 2021 (Samsung). [Paper]
Relation-Transformer: "Scenes and Surroundings: Scene Graph Generation using Relation Transformer", arXiv, 2021 (LMU Munich). [Paper]
SGTR: "SGTR: End-to-end Scene Graph Generation with Transformer", CVPR, 2022 (ShanghaiTech). [Paper][Code (in construction)]
GCL: "Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation", CVPR, 2022 (Shandong University). [Paper][PyTorch]
Relationformer: "Relationformer: A Unified Framework for Image-to-Graph Generation", ECCV, 2022 (TUM). [Paper][Code (in construction)]
SVRP: "Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning", ECCV, 2022 (Monash University). [Paper]
RelTR: "RelTR: Relation Transformer for Scene Graph Generation", arXiv, 2022 (Leibniz University Hannover, Germany). [Paper][PyTorch]
SG-Shuffle: "SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation", arXiv, 2022 (The University of Sydney). [Paper]
IS-GGT: "Iterative Scene Graph Generation with Generative Transformers", CVPR, 2023 (Oklahoma State University). [Paper]
SQUAT: "Devil's on the Edges: Selective Quad Attention for Scene Graph Generation", CVPR, 2023 (POSTECH). [Paper][PyTorch][Website]
VS³: "Learning to Generate Language-supervised and Open-vocabulary Scene Graph using Pre-trained Visual-Semantic Space", CVPR, 2023 (CUHK). [Paper]
PVSG: "Panoptic Video Scene Graph Generation", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
VETO-MEET: "Vision Relation Transformer for Unbiased Scene Graph Generation", ICCV, 2023 (Technical University of Darmstadt, Germany). [Paper][PyTorch]
TextPSG: "TextPSG: Panoptic Scene Graph Generation from Textual Descriptions", ICCV, 2023 (IBM). [Paper][Code (in construction)][Website]
HiLo: "HiLo: Exploiting High Low Frequency Relations for Unbiased Panoptic Scene Graph Generation", ICCV, 2023 (King's College London). [Paper][PyTorch]
PSG4DFormer: "4D Panoptic Scene Graph Generation", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch]
SGTR+: "SGTR+: End-to-end Scene Graph Generation with Transformer", TPAMI, 2023 (ShanghaiTech). [Paper][PyTorch]
SGT: "Revisiting Transformer for Point Cloud-based 3D Scene Graph Generation", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper]
EGTR: "EGTR: Extracting Graph from Transformer for Scene Graph Generation", CVPR, 2024 (NAVER). [Paper][Code (in construction)]

[Back to Overview]

Transfer / X-Supervised / X-Shot / Continual Learning

Transfer Learning/Adapter:
- AdaptFormer: "AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition", NeurIPS, 2022 (HKU). [Paper][PyTorch][Website]
- Convpass: "Convolutional Bypasses Are Better Vision Transformer Adapters", arXiv, 2022 (Peking University). [Paper][PyTorch]
- FacT: "FacT: Factor-Tuning for Lightweight Adaptation on Vision Transformer", AAAI, 2023 (Peking). [Paper][PyTorch]
- Consolidator: "Consolidator: Mergable Adapter with Group Connections for Vision Transformer", ICLR, 2023 (Tsinghua). [Paper]
- REACT: "Learning Customized Visual Models with Retrieval-Augmented Knowledge", CVPR, 2023 (Microsoft). [Paper][Code (in construction)][Website]
- MP: "Tuning Pre-trained Model via Moment Probing", ICCV, 2023 (Tianjin University). [Paper][PyTorch]
- ARC: "Efficient Adaptation of Large Vision Transformer via Adapter Re-Composing", NeurIPS, 2023 (Xi'an University of Architecture and Technology). [Paper][PyTorch]
- Res-Tuning: "Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone", NeurIPS, 2023 (Alibaba). [Paper][Website]
- E³VA: "Parameter-efficient is not sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions", arXiv, 2023 (Alibaba + Microsoft). [Paper]
- Minimax: "Task-Robust Pre-Training for Worst-Case Downstream Adaptation", arXiv, 2023 (Peking). [Paper]
- HST: "Hierarchical Side-Tuning for Vision Transformers", arXiv, 2023 (Alibaba). [Paper][Code (in construction)]
- PELA: "PELA: Learning Parameter-Efficient Models with Low-Rank Approximation", arXiv, 2023 (NUS). [Paper][PyTorch]
- Mona: "Adapter is All You Need for Tuning Visual Tasks", arXiv, 2023 (CAS). [Paper][PyTorch]
- ?: "Label-efficient Training of Small Task-specific Models by Leveraging Vision Foundation Models", arXiv, 2023 (Apple). [Paper]
- GIFT: "GIFT: Generative Interpretable Fine-Tuning Transformers", arXiv, 2023 (NC State). [Paper][Code (in construction)]
- FAPFT: "Partial Fine-Tuning: A Successor to Full Fine-Tuning for Vision Transformers", arXiv, 2023 (Fudan). [Paper]
- VMT-Adapter: "VMT-Adapter: Parameter-Efficient Transfer Learning for Multi-Task Dense Scene Understanding", AAAI, 2024 (Tencent). [Paper]
- VPTSP: "Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning", ICLR, 2024 (UCF). [Paper][Code (in construction)]
- LORS: "LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking", CVPR, 2024 (Tencent). [Paper]
- Dr²Net: "Dr²Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning", arXiv, 2024 (KAUST). [Paper]
- ViSFT: "Supervised Fine-tuning in turn Improves Visual Foundation Models", arXiv, 2024 (Tencent). [Paper][PyTorch]
- LAST: "Low-rank Attention Side-Tuning for Parameter-Efficient Fine-Tuning", arXiv, 2024 (Nanjing University). [Paper]
- LoSA: "Time-, Memory- and Parameter-Efficient Visual Adaptation", arXiv, 2024 (Google). [Paper]
Domain Adaptation/Domain Generalization/Federated Learning:
- TransDA: "Transformer-Based Source-Free Domain Adaptation", arXiv, 2021 (Harbin Institute of Technology). [Paper][PyTorch]
- ResTran: "Discovering Spatial Relationships by Transformers for Domain Generalization", arXiv, 2021 (MBZUAI). [Paper]
- WinTR: "Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation", arXiv, 2021 (Beijing Institute of Technology). [Paper]
- CDTrans: "CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation", ICLR, 2022 (Alibaba). [Paper][PyTorch]
- SSRT: "Safe Self-Refinement for Transformer-based Domain Adaptation", CVPR, 2022 (Stony Brook). [Paper]
- DOT: "Making the Best of Both Worlds: A Domain-Oriented Transformer for Unsupervised Domain Adaptation", ACMMM, 2022 (Beijing Institute of Technology). [Paper]
- GVRT: "Grounding Visual Representations with Texts for Domain Generalization", ECCV, 2022 (LG). [Paper][PyTorch]
- PACMAC: "Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency", NeurIPS, 2022 (Georgia Tech). [Paper][PyTorch]
- ERM-ViT: "Self-Distilled Vision Transformer for Domain Generalization", ACCV, 2022 (MBZUAI). [Paper][PyTorch]
- BCAT: "Domain Adaptation via Bidirectional Cross-Attention Transformer", arXiv, 2022 (Southern University of Science and Technology). [Paper]
- DoTNet: "Towards Unsupervised Domain Adaptation via Domain-Transformer", arXiv, 2022 (Sun Yat-Sen University). [Paper]
- TransDA: "Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
- FAMLP: "FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization", arXiv, 2022 (University of Science and Technology of China). [Paper]
- DePT: "Visual Prompt Tuning for Test-time Domain Adaptation", arXiv, 2022 (Amazon). [Paper]
- LADS: "Using Language to Extend to Unseen Domains", arXiv, 2022 (Berkeley). [Paper]
- MetaPrompt: "Learning Domain Invariant Prompt for Vision-Language Models", arXiv, 2022 (Tongji University + Microsoft). [Paper]
- TVT: "TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation", WACV, 2023 (UT Arlington + Kuaishou). [Paper][PyTorch]
- GMoE: "Sparse Mixture-of-Experts are Domain Generalizable Learners", ICLR, 2023 (NTU, Singapore). [Paper]
- PMTrans: "Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective", CVPR, 2023 (HKUST). [Paper][Website (in construction)]
- ALOFT: "ALOFT: A Lightweight MLP-like Architecture with Dynamic Low-frequency Transform for Domain Generalization", CVPR, 2023 (Nanjing University). [Paper][PyTorch]
- PromptStyler: "PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization", ICCV, 2023 (Agency for Defense Development, Korea). [Paper][Website]
- DSiT: "Domain-Specificity Inducing Transformers for Source-Free Domain Adaptation", ICCV, 2023 (Indian Institute of Science). [Paper][Website]
- pFedPG: "Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation", ICCV, 2023 (NVIDIA). [Paper]
- FedPerfix: "FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning", ICCV, 2023 (UCF). [Paper][Code (in construction)]
- RISE: "A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance", ICCV, 2023 (UW-Madison). [Paper][PyTorch]
- PØDA: "PØDA: Prompt-driven Zero-shot Domain Adaptation", ICCV, 2023 (INRIA). [Paper][PyTorch][Website]
- AD-CLIP: "AD-CLIP: Adapting Domains in Prompt Space Using CLIP", ICCVW, 2023 (IIT Bombay). [Paper]
- MPA: "Multi-Prompt Alignment for Multi-source Unsupervised Domain Adaptation", NeurIPS, 2023 (Fudan University). [Paper]
- FedCLIP: "FedCLIP: Fast Generalization and Personalization for CLIP in Federated Learning", arXiv, 2023 (CAS). [Paper][PyTorch]
- UniOOD: "Universal Domain Adaptation from Foundation Models", arXiv, 2023 (South China University of Technology). [Paper][Code (in construction)]
- PEST: "Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation", arXiv, 2023 (NTU, Singapore). [Paper]
- ?: "Open-Set Domain Adaptation with Visual-Language Foundation Models", arXiv, 2023 (The University of Tokyo). [Paper]
- VPA: "VPA: Fully Test-Time Visual Prompt Adaptation", arXiv, 2023 (Meta). [Paper]
- FedTPG: "Text-driven Prompt Generation for Vision-Language Models in Federated Learning", arXiv, 2023 (Bosch). [Paper]
- StyLIP: "StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization", WACV, 2024 (TUM). [Paper]
- ReCLIP: "ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation", WACV, 2024 (Amazon). [Paper][PyTorch]
- FedLGT: "Language-Guided Transformer for Federated Multi-Label Classification", AAAI, 2024 (NTU). [Paper][Code (in construction)]
- FedAPT: "Cross-domain Federated Adaptive Prompt Tuning for CLIP", AAAI, 2024 (Fudan University). [Paper][PyTorch]
- VDPG: "Adapting to Distribution Shift by Visual Domain Prompt Generation", ICLR, 2024 (University of Toronto). [Paper][PyTorch][Website]
- UniMoS: "Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation", CVPR, 2024 (University of Electronic Science and Technology of China). [Paper][PyTorch]
- DiPrompT: "DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning", CVPR, 2024 (SenseTime). [Paper]
- ULDA: "Unified Language-driven Zero-shot Domain Adaptation", CVPR, 2024 (CUHK). [Paper]
- LLaVO: "Large Language Models as Visual Cross-Domain Learners", arXiv, 2024 (Southern University of Science and Technology). [Paper][Website][PyTorch]
- LaGTrAn: "Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos", arXiv, 2024 (UCSD). [Paper][Website]
- DGMamba: "DGMamba: Domain Generalization via Generalized State Space Model", arXiv, 2024 (SJTU). [Paper][Code (in construction)]
X-Supervised:
- Semiformer: "Semi-Supervised Vision Transformers", ECCV, 2022 (Fudan University). [Paper][PyTorch]
- SVL-Adapter: "SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models", BMVC, 2022 (UCL). [Paper][Code (in construction)]
- Semi-ViT: "Semi-supervised Vision Transformers at Scale", NeurIPS, 2022 (Amazon). [Paper]
- DPT: "Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels", NeurIPS, 2023 (Renmin University of China). [Paper][PyTorch]
- DiversitySSL: "On Pretraining Data Diversity for Self-Supervised Learning", arXiv, 2024 (KAUST). [Paper][Code (in construction)]
Zero-Shot:
- ViT-ZSL: "Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning", IMVIP, 2021 (University of Exeter, UK). [Paper]
- TransZero: "TransZero: Attribute-guided Transformer for Zero-Shot Learning", AAAI, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
- ?: "Zero-shot Visual Commonsense Immorality Prediction", BMVC, 2022 (Korea University). [Paper][PyTorch]
- I2DFormer: "I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification", NeurIPS, 2022 (ETHZ). [Paper]
- HRT: "Hybrid Routing Transformer for Zero-Shot Learning", arXiv, 2022 (Xidian University). [Paper]
- CuPL: "What does a platypus look like? Generating customized prompts for zero-shot image classification", arXiv, 2022 (University of Washington). [Paper][PyTorch]
- VL-Taboo: "VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models", arXiv, 2022 (Goethe University Frankfurt, Germany). [Paper][Code (in construction)]
- CALIP: "CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention", arXiv, 2022 (Peking University). [Paper]
- PromptCompVL: "Prompting Large Pre-trained Vision-Language Models For Compositional Concept Learning", arXiv, 2022 (Michigan State). [Paper]
- MUST: "Masked Unsupervised Self-training for Zero-shot Image Classification", ICLR, 2023 (Salesforce). [Paper][PyTorch]
- I2MVFormer: "I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification", CVPR, 2023 (ETHZ). [Paper]
- ADE: "Learning Attention as Disentangler for Compositional Zero-shot Learning", CVPR, 2023 (HKU). [Paper][PyTorch][Website]
- CHiLS: "CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets", ICML, 2023 (UCSD). [Paper][PyTorch]
- CoT: "Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning", ICCV, 2023 (Yonsei). [Paper][PyTorch]
- diffusion-classifier: "Your Diffusion Model is Secretly a Zero-Shot Classifier", ICCV, 2023 (CMU). [Paper][PyTorch][Website]
- SuS-X: "SuS-X: Training-Free Name-Only Transfer of Vision-Language Models", ICCV, 2023 (Cambridge). [Paper][PyTorch]
- ?: "Text-to-Image Diffusion Models are Zero-Shot Classifiers", NeurIPS, 2023 (DeepMind). [Paper]
- AutoCLIP: "AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models", arXiv, 2023 (Bosch). [Paper]
- MMPT: "Prompt Tuning for Zero-shot Compositional Learning", arXiv, 2023 (Samsung). [Paper]
- ZSLViT: "Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning", CVPR, 2024 (MBZUAI). [Paper]
X-Shot:
- CrossTransformer: "CrossTransformers: spatially-aware few-shot transfer", NeurIPS, 2020 (DeepMind). [Paper][Tensorflow]
- URT: "A Universal Representation Transformer Layer for Few-Shot Image Classification", ICLR, 2021 (Mila). [Paper][PyTorch]
- TRX: "Temporal-Relational CrossTransformers for Few-Shot Action Recognition", CVPR, 2021 (University of Bristol). [Paper][PyTorch]
- Few-shot-Transformer: "Few-Shot Transformation of Common Actions into Time and Space", arXiv, 2021 (University of Amsterdam). [Paper]
- HCTransformers: "Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning", CVPR, 2022 (Fudan University). [Paper][PyTorch]
- HyperTransformer: "HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning", CVPR, 2022 (Google). [Paper][PyTorch][Website]
- STRM: "Spatio-temporal Relation Modeling for Few-shot Action Recognition", CVPR, 2022 (MBZUAI). [Paper][PyTorch][Website]
- HyperTransformer: "HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning", ICML, 2022 (Google). [Paper]
- CPM: "Compound Prototype Matching for Few-shot Action Recognition", ECCV, 2022 (The University of Tokyo). [Paper]
- SUN: "Self-Promoted Supervision for Few-Shot Transformer", ECCV, 2022 (Harbin Institute of Technology + NUS). [Paper][PyTorch]
- tSF: "tSF: Transformer-Based Semantic Filter for Few-Shot Learning", ECCV, 2022 (Tencent). [Paper]
- TransVLAD: "TransVLAD: Focusing on Locally Aggregated Descriptors for Few-Shot Learning", ECCV, 2022 (Southern University of Science and Technology, China). [Paper]
- BaseTransformers: "BaseTransformers: Attention over base data-points for One Shot Learning", BMVC, 2022 (Dublin City University, Ireland). [Paper][PyTorch]
- FPTrans: "Feature-Proxy Transformer for Few-Shot Segmentation", NeurIPS, 2022 (Baidu). [Paper][Code (in construction)]
- MM-Former: "Mask Matching Transformer for Few-Shot Segmentation", NeurIPS, 2022 (Picsart). [Paper][PyTorch]
- MG-ViT: "Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
- QSFormer: "Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification", arXiv, 2022 (Anhui University). [Paper]
- FS-CT: "Enhancing Few-shot Image Classification with Cosine Transformer", arXiv, 2022 (VinUniversity, Vietnam). [Paper][PyTorch]
- CoCa-CNI: "Exploiting Category Names for Few-Shot Classification with Vision-Language Models", arXiv, 2022 (Google). [Paper]
- SP: "Semantic Prompt for Few-Shot Image Recognition", CVPR, 2023 (USTC). [Paper]
- SMKD: "Supervised Masked Knowledge Distillation for Few-Shot Transformers", CVPR, 2023 (Columbia). [Paper][PyTorch]
- CST: "Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation", CVPR, 2023 (Meta). [Paper]
- Hint-Aug: "Hint-Aug: Drawing Hints from Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning", CVPR, 2023 (Georgia Tech). [Paper]
- ProD: "ProD: Prompting-To-Disentangle Domain Knowledge for Cross-Domain Few-Shot Image Classification", CVPR, 2023 (University of Technology Sydney). [Paper]
- PVP: "PVP: Pre-trained Visual Parameter-Efficient Tuning", arXiv, 2023 (Defense Innovation Institute, China). [Paper]
- AMU-Tuning: "AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning", CVPR, 2024 (Tianjin University). [Paper]
Continual Learning:
- MEAT: "Meta-attention for ViT-backed Continual Learning", CVPR, 2022 (Zhejiang University). [Paper][Code (in construction)]
- DyTox: "DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion", CVPR, 2022 (Sorbonne Universite, France). [Paper][PyTorch]
- LVT: "Continual Learning With Lifelong Vision Transformer", CVPR, 2022 (The University of Sydney). [Paper]
- L2P: "Learning to Prompt for Continual Learning", CVPR, 2022 (Google). [Paper][Tensorflow]
- ?: "Simpler is Better: off-the-shelf Continual Learning Through Pretrained Backbones", CVPRW, 2022 (Ca' Foscari University, Italy). [Paper][PyTorch]
- ADA: "Continual Learning with Transformers for Image Classification", CVPRW, 2022 (Amazon). [Paper]
- ?: "Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization", CVPRW, 2022 (Ca' Foscari University, Italy). [Paper]
- DualPrompt: "DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning", ECCV, 2022 (Google). [Paper][Tensorflow]
- CVT: "Online Continual Learning with Contrastive Vision Transformer", ECCV, 2022 (The University of Sydney). [Paper]
- IncCLIP: "Generative Negative Text Replay for Continual Vision-Language Pretraining", ECCV, 2022 (ShanghaiTech). [Paper]
- S-Prompts: "S-Prompts Learning with Pre-trained Transformers: An Occam's Razor for Domain Incremental Learning", NeurIPS, 2022 (Singapore Management University). [Paper]
- ADA: "Memory Efficient Continual Learning with Transformers", NeurIPS, 2022 (Amazon). [Paper]
- BMU-MoCo: "BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling", NeurIPS, 2022 (Renmin University of China). [Paper]
- CLiMB: "CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks", NeurIPS (Datasets and Benchmarks), 2022 (USC). [Paper][PyTorch]
- COLT: "Transformers Are Better Continual Learners", arXiv, 2022 (Hikvision). [Paper]
- D³Former: "D³Former: Debiased Dual Distilled Transformer for Incremental Learning", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
- Continual-CLIP: "CLIP model is an Efficient Continual Learner", arXiv, 2022 (MBZUAI). [Paper][Code (in construction)]
- GCAB-CFDC: "Gated Class-Attention with Cascaded Feature Drift Compensation for Exemplar-free Continual Learning of Vision Transformers", arXiv, 2022 (University of Pavia, Italy). [Paper][Code (in construction)]
- PIVOT: "PIVOT: Prompting for Video Continual Learning", arXiv, 2022 (KAUST). [Paper]
- AttriCLIP: "AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning", CVPR, 2023 (Beihang University). [Paper][PyTorch]
- DKT: "DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning", CVPR, 2023 (Xi'an Jiaotong). [Paper][Code (in construction)]
- CODA-Prompt: "CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning", CVPR, 2023 (IBM). [Paper][PyTorch]
- BiRT: "BiRT: Bio-inspired Replay in Vision Transformers for Continual Learning", ICML, 2023 (NavInfo, Netherlands). [Paper]
- CLR: "CLR: Channel-wise Lightweight Reprogramming for Continual Learning", ICCV, 2023 (USC). [Paper][Code (in construction)]
- CTP: "CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation", ICCV, 2023 (Peng Cheng Lab). [Paper][PyTorch]
- APG: "When Prompt-based Incremental Learning Does Not Meet Strong Pretraining", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
- MAE-CIL: "Masked Autoencoders are Efficient Class Incremental Learners", ICCV, 2023 (Nankai University). [Paper][PyTorch (in construction)]
- ConTraCon: "Exemplar-Free Continual Transformer with Convolutions", ICCV, 2023 (IIT Kharagpur). [Paper][PyTorch][Website]
- LGCL: "Introducing Language Guidance in Prompt-based Continual Learning", ICCV, 2023 (RPTU Kaiserslautern, Germany). [Paper]
- MVP: "Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning", ICCV, 2023 (Kyung Hee University, Korea). [Paper][PyTorch]
- ZSCL: "Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models", ICCV, 2023 (NUS). [Paper][PyTorch]
- C-LN: "On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers", ICCVW, 2023 (University of Trento). [Paper][PyTorch]
- PromptFusion: "PromptFusion: Decoupling Stability and Plasticity for Continual Learning", arXiv, 2023 (Fudan). [Paper]
- MSc-iNCD: "Large-scale Pre-trained Models are Surprisingly Strong in Incremental Novel Class Discovery", arXiv, 2023 (University of Trento, Italy). [Paper][PyTorch (in construction)]
- PROOF: "Learning without Forgetting for Vision-Language Models", arXiv, 2023 (Nanjing University). [Paper]
- HePCo: "HePCo: Data-Free Heterogeneous Prompt Consolidation for Continual Federated Learning", arXiv, 2023 (Georgia Tech). [Paper]
- ?: "Continual Learning in Open-vocabulary Classification with Complementary Memory Systems", arXiv, 2023 (UIUC). [Paper]
- MoP-CLIP: "MoP-CLIP: A Mixture of Prompt-Tuned CLIP Models for Domain Incremental Learning", arXiv, 2023 (ETS Montreal, Canada). [Paper]
- TiC-CLIP: "TiC-CLIP: Continual Training of CLIP Models", arXiv, 2023 (Apple). [Paper]
- TIER: "Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning", AAAI, 2024 (Peking). [Paper][Code (in construction)]
- OVOR: "OVOR: OnePrompt with Virtual Outlier Regularization for Rehearsal-Free Class-Incremental Learning", ICLR, 2024 (JPMorgan Chase). [Paper]
- CPrompt: "Consistent Prompting for Rehearsal-Free Continual Learning", CVPR, 2024 (Sun Yat-sen University). [Paper][PyTorch]
- GMM: "Generative Multi-modal Models are Good Class-Incremental Learners", CVPR, 2024 (Nankai University). [Paper][PyTorch]
- GS-LoRA: "Continual Forgetting for Pre-trained Vision Models", CVPR, 2024 (CAS). [Paper][PyTorch]
- MoE-Adapters: "Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters", CVPR, 2024 (Dalian University of Technology). [Paper][PyTorch]
- ConvPrompt: "Convolutional Prompting meets Language Models for Continual Learning", CVPR, 2024 (IIT Kharagpur). [Paper][Website]
- PriViLege: "Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners", CVPR, 2024 (Kyung Hee University). [Paper][Code (in construction)]
- Multi-LANE: "Less is more: Summarizing Patch Tokens for efficient Multi-Label Class-Incremental Learning", Conference on Lifelong Learning Agents (CoLLAs), 2024 (University of Trento). [Paper][PyTorch]
Long-tail/Imbalanced:
- BatchFormer: "BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning", CVPR, 2022 (The University of Sydney). [Paper][PyTorch]
- BatchFormerV2: "BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning", arXiv, 2022 (The University of Sydney). [Paper]
- LPT: "LPT: Long-tailed Prompt Tuning for Image Classification", ICLR, 2023 (Harbin Institute of Technology). [Paper]
- PDC: "Rethink Long-tailed Recognition with Vision Transforms", ICASSP, 2023 (Tsinghua University). [Paper]
- ?: "Exploring Vision-Language Models for Imbalanced Learning", arXiv, 2023 (Peking University). [Paper][PyTorch]
- LMPT: "LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition", arXiv, 2023 (Monash University, Australia). [Paper][PyTorch]
- LTGC: "LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content", CVPR, 2024 (Beijing University of Chemical Technology). [Paper]
- DeiT-LT: "DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets", CVPR, 2024 (Indian Institute of Science, India). [Paper][PyTorch][Website]
Knowledge Distillation:
- ?: "Knowledge Distillation via the Target-aware Transformer", CVPR, 2022 (Alibaba). [Paper]
- DearKD: "DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers", CVPR, 2022 (JD). [Paper]
- AttnDistill: "Attention Distillation: self-supervised vision transformer students need more guidance", BMVC, 2022 (UAB, Spain). [Paper][PyTorch]
- ViTKD: "ViTKD: Practical Guidelines for ViT feature knowledge distillation", arXiv, 2022 (IDEA). [Paper][PyTorch (in construction)]
- ?: "Adaptive Attention Link-based Regularization for Vision Transformers", arXiv, 2022 (Chung-Ang University, Korea). [Paper]
- LiVT: "Learning Imbalanced Data with Vision Transformers", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
- G2SD: "Generic-to-Specific Distillation of Masked Autoencoders", CVPR, 2023 (Microsoft). [Paper][PyTorch]
- SLaK: "Are Large Kernels Better Teachers than Transformers for ConvNets?", ICML, 2023 (Eindhoven University of Technology, Netherlands). [[Paper](https://arxiv.org/abs/2305.19412)][PyTorch]
- CSKD: "Cumulative Spatial Knowledge Distillation for Vision Transformers", ICCV, 2023 (Megvii). [Paper]
- TinyCLIP: "TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance", ICCV, 2023 (Microsoft). [Paper][PyTorch]
- DIME-FM: "DIME-FM: DIstilling Multimodal and Efficient Foundation Models", ICCV, 2023 (Meta). [Paper]
- MaskedKD: "MaskedKD: Efficient Distillation of Vision Transformers with Masked Images", arXiv, 2023 (POSTECH). [Paper]
- AM-RADIO: "AM-RADIO: Agglomerative Model -- Reduce All Domains Into One", arXiv, 2023 (NVIDIA). [Paper]
Clustering:
- VTCC: "Vision Transformer for Contrastive Clustering", arXiv, 2022 (Sun Yat-sen University, China). [Paper]
Novel Category Discovery:
- PromptCAL: "PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
- CLIP-GCD: "CLIP-GCD: Simple Language Guided Generalized Category Discovery", arXiv, 2023 (Georgia Tech). [Paper]

[Back to Overview]

Low-level Vision Tasks

Image Restoration

General:
- NLRN: "Non-Local Recurrent Network for Image Restoration", NeurIPS, 2018 (UIUC). [Paper][Tensorflow]
- RNAN: "Residual Non-local Attention Networks for Image Restoration", ICLR, 2019 (Northeastern University). [Paper][PyTorch]
- PANet: "Pyramid Attention Networks for Image Restoration", arXiv, 2020 (UIUC). [Paper][PyTorch]
- IPT: "Pre-Trained Image Processing Transformer", CVPR, 2021 (Huawei). [Paper][PyTorch (in construction)]
- SwinIR: "SwinIR: Image Restoration Using Swin Transformer", ICCVW, 2021 (ETHZ). [Paper][PyTorch]
- SiamTrans: "SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers", AAAI, 2022 (Huawei). [Paper]
- Uformer: "Uformer: A General U-Shaped Transformer for Image Restoration", CVPR, 2022 (University of Science and Technology of China). [Paper][PyTorch]
- MAXIM: "MAXIM: Multi-Axis MLP for Image Processing", CVPR, 2022 (Google). [Paper][Tensorflow]
- Restormer: "Restormer: Efficient Transformer for High-Resolution Image Restoration", CVPR, 2022 (IIAI, UAE). [Paper][PyTorch]
- TransWeather: "TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions", CVPR, 2022 (JHU). [Paper][PyTorch][Website]
- KiT: "KNN Local Attention for Image Restoration", CVPR, 2022 (Yonsei University). [Paper]
- ELMformer: "ELMformer: Efficient Raw Image Restoration with a Locally Multiplicative Transformer", ACMMM, 2022 (Horizon Robotics). [Paper][Code (in construction)]
- EDT: "On Efficient Transformer-Based Image Pre-training for Low-Level Vision", arXiv, 2022 (CUHK). [Paper][PyTorch]
- ?: "Transform your Smartphone into a DSLR Camera: Learning the ISP in the Wild", arXiv, 2022 (ETHZ). [Paper]
- TMT: "Imaging through the Atmosphere using Turbulence Mitigation Transformer", arXiv, 2022 (Purdue). [Paper][Code (in construction)][Website]
- LRT: "LRT: An Efficient Low-Light Restoration Transformer for Dark Light Field Images", arXiv, 2022 (HKU). [Paper]
- ART: "Accurate Image Restoration with Attention Retractable Transformer", ICLR, 2023 (Shanghai Jiao Tong University). [Paper][PyTorch]
- Burstormer: "Burstormer: Burst Image Restoration and Enhancement Transformer", CVPR, 2023 (MBZUAI). [Paper][Code (in construction)]
- ?: "Comprehensive and Delicate: An Efficient Transformer for Image Restoration", CVPR, 2023 (Sichuan University). [Paper]
- ShuffleFormer: "Random Shuffle Transformer for Image Restoration", ICML, 2023 (USTC). [Paper][PyTorch (in construction)]
- PromptIR: "PromptIR: Prompting for All-in-One Blind Image Restoration", NeurIPS, 2023 (MBZUAI). [Paper][PyTorch]
- UCDIR: "A Unified Conditional Framework for Diffusion-based Image Restoration", NeurIPS, 2023 (CUHK). [Paper][Code (in construction)][Website]
- MAEIP: "Masked Autoencoders as Image Processors", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
- RAMiT: "RAMiT: Reciprocal Attention Mixing Transformer for Lightweight Image Restoration", arXiv, 2023 (Sogang University, Korea). [Paper]
- RAP: "Restore Anything Pipeline: Segment Anything Meets Image Restoration", arXiv, 2023 (ETHZ). [Paper][Code (in construction)]
- ProRes: "ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration", arXiv, 2023 (Horizon Robotics). [Paper][PyTorch (in construction)]
- C2F-DFT: "Learning A Coarse-to-Fine Diffusion Transformer for Image Restoration", arXiv, 2023 (Dalian University of Technology). [Paper][PyTorch (in construction)]
- AutoDIR: "AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion", arXiv, 2023 (CUHK). [Paper]
- MPerceiver: "Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration", arXiv, 2023 (CAS). [Paper]
- TIP: "TIP: Text-Driven Image Processing with Semantic and Restoration Instructions", arXiv, 2023 (Google). [Paper][Website]
- DA-CLIP: "Controlling Vision-Language Models for Universal Image Restoration", ICLR, 2024 (Uppsala University, Sweden). [Paper][PyTorch][Website]
- VmambaIR: "VmambaIR: Visual State Space Model for Image Restoration", arXiv, 2024 (ByteDance). [Paper][Code (in construction)]
- DyNet: "Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration", arXiv, 2024 (MBZUAI). [Paper][Code (in construction)]
- LIPT: "LIPT: Latency-aware Image Processing Transformer", arXiv, 2024 (Huawei). [Paper]
Super-Resolution:
- SAN: "Second-Order Attention Network for Single Image Super-Resolution", CVPR, 2019 (Tsinghua). [Paper][PyTorch]
- CS-NL: "Image Super-Resolution with Cross-Scale Non-Local Attention and Exhaustive Self-Exemplars Mining", CVPR, 2020 (UIUC). [Paper][PyTorch]
- TTSR: "Learning Texture Transformer Network for Image Super-Resolution", CVPR, 2020 (Microsoft). [Paper][PyTorch]
- HAN: "Single Image Super-Resolution via a Holistic Attention Network", ECCV, 2020 (Northeastern University). [Paper][PyTorch]
- NLSN: "Image Super-Resolution With Non-Local Sparse Attention", CVPR, 2021 (UIUC). [Paper]
- ITSRN: "Implicit Transformer Network for Screen Content Image Continuous Super-Resolution", NeurIPS, 2021 (Tianjin University). [Paper][PyTorch]
- FPAN: "Feedback Pyramid Attention Networks for Single Image Super-Resolution", arXiv, 2021 (Nanjing University of Science and Technology). [Paper]
- ESRT: "Efficient Transformer for Single Image Super-Resolution", arXiv, 2021 (Peking University). [Paper]
- Fusformer: "Fusformer: A Transformer-based Fusion Approach for Hyperspectral Image Super-resolution", arXiv, 2021 (University of Electronic Science and Technology of China). [Paper]
- DPT: "Detail-Preserving Transformer for Light Field Image Super-Resolution", AAAI, 2022 (Beijing Institute of Technology). [Paper][PyTorch]
- BSRT: "BSRT: Improving Burst Super-Resolution with Swin Transformer and Flow-Guided Deformable Alignment", CVPRW, 2022 (Megvii). [Paper][PyTorch]
- TATT: "A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution", CVPR, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
- LBNet: "Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer", IJCAI, 2022 (Nanjing University of Posts and Telecommunications). [Paper][PyTorch (in construction)]
- DATSR: "Reference-based Image Super-Resolution with Deformable Attention Transformer", ECCV, 2022 (ETHZ). [Paper][Code (in construction)]
- ELAN: "Efficient Long-Range Attention Network for Image Super-resolution", ECCV, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
- Swin2SR: "Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration", ECCVW, 2022 (University of Wurzburg, Germany). [Paper]
- CAT: "Cross Aggregation Transformer for Image Restoration", NeurIPS, 2022 (Shanghai Jiao Tong). [Paper][PyTorch]
- Stoformer: "Stochastic Window Transformer for Image Restoration", NeurIPS, 2022 (USTC). [Paper][PyTorch]
- LFT: "Light Field Image Super-Resolution with Transformers", IEEE Signal Processing Letters, 2022 (National University of Defense Technology, China). [Paper][PyTorch]
- ACT: "Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution", arXiv, 2022 (LG). [Paper]
- HIPA: "HIPA: Hierarchical Patch Transformer for Single Image Super Resolution", arXiv, 2022 (CUHK). [Paper]
- CTCNet: "CTCNet: A CNN-Transformer Cooperation Network for Face Image Super-Resolution", arXiv, 2022 (Nanjing University of Posts and Telecommunications). [Paper]
- ShuffleMixer: "ShuffleMixer: An Efficient ConvNet for Image Super-Resolution", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
- HST: "HST: Hierarchical Swin Transformer for Compressed Image Super-resolution", ECCVW, 2022 (USTC). [Paper]
- SwinFIR: "SwinFIR: Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution", arXiv, 2022 (Samsung). [Paper]
- ITSRN++: "ITSRN++: Stronger and Better Implicit Transformer Network for Continuous Screen Content Image Super-Resolution", arXiv, 2022 (Tianjin University). [Paper]
- NGswin: "N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution", CVPR, 2023 (Sogang University, Korea). [Paper][PyTorch]
- OSRT: "OSRT: Omnidirectional Image Super-Resolution with Distortion-aware Transformer", CVPR, 2023 (CAS). [Paper]
- HAT: "Activating More Pixels in Image Super-Resolution Transformer", CVPR, 2023 (University of Macau). [Paper][PyTorch]
- CLIT: "Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution", CVPR, 2023 (MediaTek). [Paper]
- CiaoSR: "CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution", CVPR, 2023 (ETHZ). [Paper][PyTorch]
- HTCAN: "Hybrid Transformer and CNN Attention Network for Stereo Image Super-resolution", CVPRW, 2023 (ByteDance). [Paper]
- DAT: "Dual Aggregation Transformer for Image Super-Resolution", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
- CRAFT: "Feature Modulation Transformer: Cross-Refinement of Global Representation via High-Frequency Prior for Image Super-Resolution", ICCV, 2023 (UESTC). [Paper][Code (in construction)]
- ESSAformer: "ESSAformer: Efficient Transformer for Hyperspectral Image Super-resolution", ICCV, 2023 (Xidian University). [Paper][PyTorch]
- SRFormer: "SRFormer: Permuted Self-Attention for Single Image Super-Resolution", ICCV, 2023 (Nankai University). [Paper]
- ResShift: "ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
- RGT: "Recursive Generalization Transformer for Image Super-Resolution", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
- SOSR: "SOSR: Source-Free Image Super-Resolution with Wavelet Augmentation Transformer", arXiv, 2023 (CAS). [Paper]
- HAT: "HAT: Hybrid Attention Transformer for Image Restoration", arXiv, 2023 (University of Macau). [Paper][PyTorch]
- PromptSR: "Image Super-Resolution with Text Prompt Diffusion", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
- Inf-DiT: "Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer", arXiv, 2024 (Zhipu AI). [Paper][Code (in construction)]
Denoise:
- CharFormer: "CharFormer: A Glyph Fusion based Attentive Framework for High-precision Character Image Denoising", ACMMM, 2022 (Jilin University). [Paper][PyTorch (in construction)]
- DenSformer: "Dense residual Transformer for image denoising", arXiv, 2022 (University of Science and Technology Beijing). [Paper]
- PoCoformer: "Polarized Color Image Denoising using Pocoformer", arXiv, 2022 (The University of Tokyo). [Paper]
- DnSwin: "DnSwin: Toward Real-World Denoising via Continuous Wavelet Sliding-Transformer", arXiv, 2022 (Guangdong University of Technology). [Paper]
- SST: "Spatial-Spectral Transformer for Hyperspectral Image Denoising", arXiv, 2022 (Beijing Institute of Technology). [Paper][PyTorch]
- MaskedDenoising: "Masked Image Training for Generalizable Deep Image Denoising", CVPR, 2023 (HKUST). [Paper][Code (in construction)]
- SERT: "Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising", CVPR, 2023 (Beijing Institute of Technology). [Paper][PyTorch]
- HSDT: "Hybrid Spectral Denoising Transformer with Guided Attention", ICCV, 2023 (Beijing Institute of Technology). [Paper][PyTorch]
- Xformer: "Xformer: Hybrid X-Shaped Transformer for Image Denoising", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
- CLIPDenoising: "Transfer CLIP for Generalizable Image Denoising", CVPR, 2024 (Huazhong University of Science and Technology (HUST)). [Paper]
Others:
- SDNet: "SDNet: multi-branch for single image deraining using swin", arXiv, 2021 (Xinjiang University). [Paper][Code (in construction)]
- ATTSF: "Attention! Stay Focus!", arXiv, 2021 (BridgeAI, Seoul). [Paper][Tensorflow]
- HyLoG-ViT: "Hybrid Local-Global Transformer for Image Dehazing", arXiv, 2021 (Beihang University). [Paper]
- HyperTransformer: "HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening", CVPR, 2022 (JHU). [Paper][PyTorch]
- DeHamer: "Image Dehazing Transformer With Transmission-Aware 3D Position Embedding", CVPR, 2022 (Nankai University). [Paper][Website]
- PTNet: "Learning Parallax Transformer Network for Stereo Image JPEG Artifacts Removal", ACMMM, 2022 (Fudan University). [Paper]
- TurbNet: "Single Frame Atmospheric Turbulence Mitigation: A Benchmark Study and A New Physics-Inspired Transformer Model", ECCV, 2022 (Purdue + UT Austin). [Paper][PyTorch]
- Stripformer: "Stripformer: Strip Transformer for Fast Image Deblurring", ECCV, 2022 (NTHU). [Paper]
- DehazeFormer: "Vision Transformers for Single Image Dehazing", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
- RSTCANet: "Residual Swin Transformer Channel Attention Network for Image Demosaicing", arXiv, 2022 (Tampere University, Finland). [Paper]
- DRT: "DRT: A Lightweight Single Image Deraining Recursive Transformer", arXiv, 2022 (ANU, Australia). [Paper][PyTorch (in construction)]
- Cubic-Mixer: "UHD Image Deblurring via Multi-scale Cubic-Mixer", arXiv, 2022 (Nanjing University of Science and Technology). [Paper]
- MSP-Former: "MSP-Former: Multi-Scale Projection Transformer for Single Image Desnowing", arXiv, 2022 (Jimei University). [Paper]
- ELF: "Magic ELF: Image Deraining Meets Association Learning and Transformer", arXiv, 2022 (Wuhan University). [Paper][PyTorch (in construction)]
- SnowFormer: "SnowFormer: Scale-aware Transformer via Context Interaction for Single Image Desnowing", arXiv, 2022 (Jimei University, China). [Paper]
- DMTNet: "DMTNet: Dynamic Multi-scale Network for Dual-pixel Images Defocus Deblurring with Transformer", arXiv, 2022 (Samsung). [Paper]
- LMQFormer: "LMQFormer: A Laplace-Prior-Guided Mask Query Transformer for Lightweight Snow Removal", arXiv, 2022 (Fuzhou University). [Paper]
- Semi-UFormer: "Semi-UFormer: Semi-supervised Uncertainty-aware Transformer for Image Dehazing", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper]
- WITT: "WITT: A Wireless Image Transmission Transformer for Semantic Communications", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][Code (in construction)]
- BiT: "Blur Interpolation Transformer for Real-World Motion from Blur", CVPR, 2023 (The University of Tokyo). [Paper][PyTorch][Website]
- DRSformer: "Learning A Sparse Transformer Network for Effective Image Deraining", CVPR, 2023 (Nanjing University of Science and Technology). [Paper][PyTorch]
- FFTformer: "Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring", CVPR, 2023 (Nanjing University of Science and Technology). [Paper][PyTorch]
- MB-TaylorFormer: "MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
- UDR-S²Former: "Sparse Sampling Transformer with Uncertainty-Driven Ranking for Unified Removal of Raindrops and Rain Streaks", ICCV, 2023 (HKUST). [Paper][PyTorch][Website]
- HI-Diff: "Hierarchical Integration Diffusion Model for Realistic Image Deblurring", NeurIPS, 2023 (SJTU). [Paper][PyTorch]
- SelfPromer: "SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper]
- ?: "A Data-Centric Solution to NonHomogeneous Dehazing via Vision Transformer", arXiv, 2023 (McMaster University, Canada). [Paper][PyTorch]

[Back to Overview]

Video Restoration

VSR-Transformer: "Video Super-Resolution Transformer", arXiv, 2021 (ETHZ). [Paper][PyTorch]
MANA: "Memory-Augmented Non-Local Attention for Video Super-Resolution", CVPR, 2022 (JD). [Paper]
?: "Bringing Old Films Back to Life", CVPR, 2022 (Microsoft). [Paper][Code (in construction)]
TTVSR: "Learning Trajectory-Aware Transformer for Video Super-Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
Trans-SVSR: "A New Dataset and Transformer for Stereoscopic Video Super-Resolution", CVPR, 2022 (Bahcesehir University, Turkey). [Paper][PyTorch]
STDAN: "STDAN: Deformable Attention Network for Space-Time Video Super-Resolution", CVPRW, 2022 (Tsinghua). [Paper]
VRT: "VRT: A Video Restoration Transformer", arXiv, 2022 (ETHZ). [Paper][PyTorch]
FGST: "Flow-Guided Sparse Transformer for Video Deblurring", ICML, 2022 (Tsinghua). [Paper][Code (in construction)]
RSTT: "RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
FTVSR: "Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution", ECCV, 2022 (Microsoft). [Paper][PyTorch]
EFNet: "Event-Based Fusion for Motion Deblurring with Cross-modal Attention", ECCV, 2022 (ETHZ). [Paper]
TempFormer: "TempFormer: Temporally Consistent Transformer for Video Denoising", ECCV, 2022 (Disney). [Paper]
RVRT: "Recurrent Video Restoration Transformer with Guided Deformable Attention", NeurIPS, 2022 (ETHZ). [Paper][PyTorch]
?: "Rethinking Alignment in Video Super-Resolution Transformers", NeurIPS, 2022 (Shanghai AI Lab). [Paper][PyTorch]
VDTR: "VDTR: Video Deblurring with Transformer", arXiv, 2022 (Tsinghua). [Paper][Code (in construction)]
DSCT: "Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
Group-ShiftNet: "No Attention is Needed: Grouped Spatial-temporal Shift for Simple and Efficient Video Restorers", arXiv, 2022 (CUHK). [Paper][Code (in construction)][Website]

[Back to Overview]

Inpainting / Completion / Outpainting

Contextual-Attention: "Generative Image Inpainting with Contextual Attention", CVPR, 2018 (UIUC). [Paper][Tensorflow]
PEN-Net: "Learning Pyramid-Context Encoder Network for High-Quality Image Inpainting", CVPR, 2019 (Microsoft). [Paper][PyTorch]
Copy-Paste: "Copy-and-Paste Networks for Deep Video Inpainting", ICCV, 2019 (Yonsei University). [Paper][PyTorch]
Onion-Peel: "Onion-Peel Networks for Deep Video Completion", ICCV, 2019 (Yonsei University). [Paper][PyTorch]
STTN: "Learning Joint Spatial-Temporal Transformations for Video Inpainting", ECCV, 2020 (Microsoft). [Paper][PyTorch]
FuseFormer: "FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting", ICCV, 2021 (CUHK + SenseTime). [Paper][PyTorch]
ICT: "High-Fidelity Pluralistic Image Completion with Transformers", ICCV, 2021 (CUHK). [Paper][PyTorch][Website]
DSTT: "Decoupled Spatial-Temporal Transformer for Video Inpainting", arXiv, 2021 (CUHK + SenseTime). [Paper][Code (in construction)]
TFill: "TFill: Image Completion via a Transformer-Based Architecture", arXiv, 2021 (NTU Singapore). [Paper][Code (in construction)]
BAT-Fill: "Diverse Image Inpainting with Bidirectional and Autoregressive Transformers", arXiv, 2021 (NTU Singapore). [Paper]
?: "Image-Adaptive Hint Generation via Vision Transformer for Outpainting", WACV, 2022 (Sogang University, Korea). [Paper]
ZITS: "Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding", CVPR, 2022 (Fudan). [Paper][PyTorch][Website]
MAT: "MAT: Mask-Aware Transformer for Large Hole Image Inpainting", CVPR, 2022 (CUHK). [Paper][PyTorch]
PUT: "Reduce Information Loss in Transformers for Pluralistic Image Inpainting", CVPR, 2022 (Microsoft). [Paper][PyTorch]
DLFormer: "DLFormer: Discrete Latent Transformer for Video Inpainting", CVPR, 2022 (Tencent). [Paper][Code (in construction)]
T-former: "T-former: An Efficient Transformer for Image Inpainting", ACMMM, 2022 (Xi'an Jiaotong). [Paper][PyTorch]
QueryOTR: "Outpainting by Queries", ECCV, 2022 (University of Liverpool, UK). [Paper][PyTorch (in construction)]
FGT: "Flow-Guided Transformer for Video Inpainting", ECCV, 2022 (USTC). [Paper][PyTorch]
MAE-FAR: "Learning Prior Feature and Attention Enhanced Image Inpainting", ECCV, 2022 (Fudan University). [Paper][PyTorch (in construction)][Website]
?: "Visual Prompting via Image Inpainting", NeurIPS, 2022 (Berkeley). [Paper][PyTorch][Website]
U-Transformer: "Generalised Image Outpainting with U-Transformer", arXiv, 2022 (Xi'an Jiaotong-Liverpool University). [Paper]
SpA-Former: "SpA-Former: Transformer image shadow detection and removal via spatial attention", arXiv, 2022 (Shanghai Jiao Tong University). [Paper][PyTorch]
CRFormer: "CRFormer: A Cross-Region Transformer for Shadow Removal", arXiv, 2022 (Beijing Jiaotong University). [Paper]
DeViT: "DeViT: Deformed Vision Transformers in Video Inpainting", arXiv, 2022 (Kuaishou). [Paper]
ZITS++: "ZITS++: Image Inpainting by Improving the Incremental Transformer on Structural Priors", arXiv, 2022 (Fudan). [Paper]
TPFNet: "TPFNet: A Novel Text In-painting Transformer for Text Removal", arXiv, 2022 (?). [Paper][Code (in construction)]
FlowLens: "FlowLens: Seeing Beyond the FoV via Flow-guided Clip-Recurrent Transformer", arXiv, 2022 (Zhejiang University). [Paper][Code (in construction)]
?: "Putting People in Their Place: Affordance-Aware Human Insertion into Scenes", CVPR, 2023 (Stanford). [Paper][PyTorch (in construction)][Website]
Imagen-Editor: "Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting", CVPR, 2023 (Google). [Paper][Website]
SmartBrush: "SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model", CVPR, 2023 (Adobe). [Paper]
NÜWA-LIP: "NÜWA-LIP: Language Guided Image Inpainting with Defect-free VQGAN", CVPR, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
ProPainter: "ProPainter: Improving Propagation and Transformer for Video Inpainting", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
Inst-Inpaint: "Inst-Inpaint: Instructing to Remove Objects with Diffusion Models", arXiv, 2023 (Bilkent University, Turkey). [Paper]
Inpaint-Anything: "Inpaint Anything: Segment Anything Meets Image Inpainting", arXiv, 2023 (USTC). [Paper][PyTorch]
TransRef: "TransRef: Multi-Scale Reference Embedding Transformer for Reference-Guided Image Inpainting", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
DMT: "Deficiency-Aware Masked Transformer for Video Inpainting", arXiv, 2023 (CAS). [Paper][Code (in construction)]
Magicremover: "Magicremover: Tuning-free Text-guided Image inpainting with Diffusion Models", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
LGVI: "Towards Language-Driven Video Inpainting via Multimodal Large Language Models", arXiv, 2024 (Shanghai AI Lab). [Paper][Code (in construction)][Website]

[Back to Overview]

Image Generation

IT: "Image Transformer", ICML, 2018 (Google). [Paper][Tensorflow]
PixelSNAIL: "PixelSNAIL: An Improved Autoregressive Generative Model", ICML, 2018 (Berkeley). [Paper][Tensorflow]
BigGAN: "Large Scale GAN Training for High Fidelity Natural Image Synthesis", ICLR, 2019 (DeepMind). [Paper][PyTorch]
SAGAN: "Self-Attention Generative Adversarial Networks", ICML, 2019 (Google). [Paper][Tensorflow]
VQGAN: "Taming Transformers for High-Resolution Image Synthesis", CVPR, 2021 (Heidelberg University). [Paper][PyTorch][Website]
?: "High-Resolution Complex Scene Synthesis with Transformers", CVPRW, 2021 (Heidelberg University). [Paper]
GANsformer: "Generative Adversarial Transformers", ICML, 2021 (Stanford + Facebook). [Paper][Tensorflow]
PixelTransformer: "PixelTransformer: Sample Conditioned Signal Generation", ICML, 2021 (Facebook). [Paper][Website]
HWT: "Handwriting Transformers", ICCV, 2021 (MBZUAI). [Paper][Code (in construction)]
Paint-Transformer: "Paint Transformer: Feed Forward Neural Painting with Stroke Prediction", ICCV, 2021 (Baidu). [Paper][Paddle][PyTorch]
Geometry-Free: "Geometry-Free View Synthesis: Transformers and no 3D Priors", ICCV, 2021 (Heidelberg University). [Paper][PyTorch]
VTGAN: "VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers", ICCVW, 2021 (University of Nevada, Reno). [Paper]
ATISS: "ATISS: Autoregressive Transformers for Indoor Scene Synthesis", NeurIPS, 2021 (NVIDIA). [Paper][Website]
GANsformer2: "Compositional Transformers for Scene Generation", NeurIPS, 2021 (Stanford + Facebook). [Paper][Tensorflow]
TransGAN: "TransGAN: Two Transformers Can Make One Strong GAN", NeurIPS, 2021 (UT Austin). [Paper][PyTorch]
HiT: "Improved Transformer for High-Resolution GANs", NeurIPS, 2021 (Google). [Paper][Tensorflow]
iLAT: "The Image Local Autoregressive Transformer", NeurIPS, 2021 (Fudan). [Paper]
TokenGAN: "Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers", NeurIPS, 2021 (Microsoft). [Paper]
SceneFormer: "SceneFormer: Indoor Scene Generation with Transformers", arXiv, 2021 (TUM). [Paper]
SNGAN: "Combining Transformer Generators with Convolutional Discriminators", arXiv, 2021 (Fraunhofer ITWM). [Paper]
Invertible-Attention: "Invertible Attention", arXiv, 2021 (ANU). [Paper]
GPA: "Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation", arXiv, 2021 (Zalando Research, Germany). [Paper][PyTorch (in construction)]
ViTGAN: "ViTGAN: Training GANs with Vision Transformers", ICLR, 2022 (Google). [Paper][PyTorch][PyTorch (wilile26811249)]
ViT-VQGAN: "Vector-quantized Image Modeling with Improved VQGAN", ICLR, 2022 (Google). [Paper]
Style-Transformer: "Style Transformer for Image Inversion and Editing", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
StyleSwin: "StyleSwin: Transformer-based GAN for High-resolution Image Generation", CVPR, 2022 (Microsoft). [Paper][PyTorch]
Styleformer: "Styleformer: Transformer based Generative Adversarial Networks with Style Vector", CVPR, 2022 (Seoul National University). [Paper][PyTorch]
?: "User-Controllable Latent Transformer for StyleGAN Image Layout Editing", Pacific Graphics, 2022 (University of Tsukuba). [Paper][Website]
DynaST: "DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation", ECCV, 2022 (NUS). [Paper][PyTorch]
DoodleFormer: "DoodleFormer: Creative Sketch Drawing with Transformers", ECCV, 2022 (MBZUAI). [Paper][PyTorch][Website]
U-Attention: "Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis", arXiv, 2022 (Adobe). [Paper]
MaskGIT: "MaskGIT: Masked Generative Image Transformer", CVPR, 2022 (Google). [Paper][PyTorch (dome272)]
AttnFlow: "Generative Flows with Invertible Attentions", CVPR, 2022 (ETHZ). [Paper]
NÜWA: "NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion", ECCV, 2022 (Microsoft). [Paper][GitHub]
Trans-INR: "Transformers as Meta-Learners for Implicit Neural Representations", ECCV, 2022 (UCSD). [Paper][PyTorch][Website]
ViewFormer: "ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers", ECCV, 2022 (Czech Technical University in Prague). [Paper][Tensorflow]
Unleashing-Transformer: "Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes", ECCV, 2022 (Durham University, UK). [Paper][PyTorch]
CASD: "Cross Attention Based Style Distribution for Controllable Person Image Synthesis", ECCV, 2022 (East China Normal University). [Paper]
VQGAN-CLIP: "VQGAN-CLIP: Open Domain Image Generation and Manipulation Using Natural Language", ECCV, 2022 (EleutherAI). [Paper][PyTorch]
Token-Critic: "Improved Masked Image Generation with Token-Critic", ECCV, 2022 (Google). [Paper]
PromptGen: "Generative Visual Prompt: Unifying Distributional Control of Pre-Trained Generative Models", NeurIPS, 2022 (CMU). [Paper][PyTorch]
Contextual-RQ-Transformer: "Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer", NeurIPS, 2022 (POSTECH + Kakao). [Paper]
ViT-Patch: "A Robust Framework of Chromosome Straightening with ViT-Patch GAN", arXiv, 2022 (Xi'an Jiaotong-Liverpool University). [Paper]
?: "Transforming Image Generation from Scene Graphs", arXiv, 2022 (University of Catania, Italy). [Paper]
VisionNeRF: "Vision Transformer for NeRF-Based View Synthesis from a Single Input Image", arXiv, 2022 (Google). [Paper][Website]
NUWA-Infinity: "NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis", arXiv, 2022 (Microsoft). [Paper][GitHub][Website]
Diffusion-ViT: "Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model", arXiv, 2022 (Etsy, NY). [Paper]
?: "Visual Prompt Tuning for Generative Transfer Learning", CVPR, 2023 (Google). [Paper][JAX]
SeQ-GAN: "Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis", arXiv, 2022 (Tencent). [Paper][Code (in construction)]
?: "Style-Guided Inference of Transformer for High-resolution Image Synthesis", WACV, 2023 (NCSOFT, Korea). [Paper]
Frido: "Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis", AAAI, 2023 (Microsoft). [Paper][PyTorch]
GNT: "Is Attention All That NeRF Needs?", ICLR, 2023 (UT Austin). [Paper][PyTorch][Website]
DPC: "Discrete Predictor-Corrector Diffusion Models for Image Synthesis", ICLR, 2023 (Google). [Paper]
LayoutDM: "LayoutDM: Discrete Diffusion Model for Controllable Layout Generation", CVPR, 2023 (CyberAgent, Japan). [Paper][PyTorch][Website]
GTGAN: "Graph Transformer GANs for Graph-Constrained House Generation", CVPR, 2023 (ETHZ). [Paper]
U-ViT: "All are Worth Words: A ViT Backbone for Diffusion Models", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
MQ-VAE: "Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation", CVPR, 2023 (USTC). [Paper][PyTorch]
MaskSketch: "MaskSketch: Unpaired Structure-guided Masked Image Generation", CVPR, 2023 (Google). [Paper][JAX][Website]
GAN-MAE: "Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond", CVPR, 2023 (Meituan). [Paper]
Reg-VQ: "Regularized Vector Quantization for Tokenized Image Synthesis", CVPR, 2023 (NTU, Singapore). [Paper]
LCP-GAN: "Exploring Intra-Class Variation Factors With Learnable Cluster Prompts for Semi-Supervised Image Synthesis", CVPR, 2023 (South China University of Technology). [Paper]
Slot-VAE: "Slot-VAE: Object-Centric Scene Generation with Slot Attention", ICML, 2023 (Delft University of Technology, Netherlands). [Paper]
Efficient-VQGAN: "Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers", ICCV, 2023 (Alibaba). [Paper]
MDT: "Masked Diffusion Transformer is a Strong Image Synthesizer", ICCV, 2023 (Sea AI Lab). [Paper][PyTorch]
LayoutPrompter: "LayoutPrompter: Awaken the Design Ability of Large Language Models", NeurIPS, 2023 (Microsoft). [Paper]
LayoutGPT: "LayoutGPT: Compositional Visual Planning and Generation with Large Language Models", NeurIPS, 2023 (UCSB). [Paper][PyTorch][Website]
Diff-Instruct: "Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models", NeurIPS, 2023 (Huawei). [Paper]
VQ3D: "VQ3D: Learning a 3D-Aware Generative Model on ImageNet", arXiv, 2023 (Stanford). [Paper][Website]
LayoutDiffuse: "LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation", arXiv, 2023 (Amazon). [Paper]
StraIT: "StraIT: Non-autoregressive Generation with Stratified Image Transformer", arXiv, 2023 (Google). [Paper]
MMoT: "MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis", arXiv, 2023 (South China University of Technology). [Paper][PyTorch (in construction)][Website]
MaskDiT: "Fast Training of Diffusion Models with Masked Transformers", arXiv, 2023 (NVIDIA). [Paper]
Dolfin: "Dolfin: Diffusion Layout Transformers without Autoencoder", arXiv, 2023 (UCSD). [Paper]
RALF: "Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation", arXiv, 2023 (The University of Tokyo). [Paper][Website]
GIVT: "GIVT: Generative Infinite-Vocabulary Transformers", arXiv, 2023 (Google). [Paper]
DiffiT: "DiffiT: Diffusion Vision Transformers for Image Generation", arXiv, 2023 (NVIDIA). [Paper][Code (in construction)]
RCG: "Self-conditioned Image Generation via Generating Representations", arXiv, 2023 (MIT). [Paper][PyTorch]
GSN: "GSN: Generalisable Segmentation in Neural Radiance Field", AAAI, 2024 (IIIT Hyderabad). [Paper][PyTorch][Website]
HDiT: "Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers", arXiv, 2024 (Stability AI). [Paper][Website]
ZigMa: "ZigMa: Zigzag Mamba Diffusion Model", arXiv, 2024 (LMU Munich). [Paper][Code (in construction)][Website]
VAR: "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction", arXiv, 2024 (ByteDance). [Paper][PyTorch (in construction)][Website]

[Back to Overview]

Video Generation

Subscale: "Scaling Autoregressive Video Models", ICLR, 2020 (Google). [Paper][Website]
ConvTransformer: "ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis", arXiv, 2020 (Southeast University). [Paper]
OCVT: "Generative Video Transformer: Can Objects be the Words?", ICML, 2021 (Rutgers University). [Paper]
AIST++: "Learn to Dance with AIST++: Music Conditioned 3D Dance Generation", arXiv, 2021 (Google). [Paper][Code][Website]
VideoGPT: "VideoGPT: Video Generation using VQ-VAE and Transformers", arXiv, 2021 (Berkeley). [Paper][PyTorch][Website]
DanceFormer: "DanceFormer: Music Conditioned 3D Dance Generation with Parametric Motion Transformer", AAAI, 2022 (Huiye Technology, China). [Paper]
VFIformer: "Video Frame Interpolation with Transformer", CVPR, 2022 (CUHK). [Paper][PyTorch]
VFIT: "Video Frame Interpolation Transformer", CVPR, 2022 (McMaster University, Canada). [Paper][PyTorch]
MoTrans: "Motion Transformer for Unsupervised Image Animation", ECCV, 2022 (Alibaba). [Paper][PyTorch]
Transframer: "Transframer: Arbitrary Frame Prediction with Generative Models", arXiv, 2022 (DeepMind). [Paper]
TATS: "Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer", ECCV, 2022 (Maryland). [Paper][Website]
POVT: "Patch-based Object-centric Transformers for Efficient Video Generation", arXiv, 2022 (Berkeley). [Paper][PyTorch][Website]
TAIN: "Cross-Attention Transformer for Video Interpolation", arXiv, 2022 (Duke). [Paper]
TTVFI: "TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation", arXiv, 2022 (Microsoft). [Paper]
SlotFormer: "SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models", arXiv, 2022 (University of Toronto). [Paper][Website]
Human-MotionFormer: "Human MotionFormer: Transferring Human Motions with Vision Transformers", ICLR, 2023 (HKUST + Huya). [Paper][Code (in construction)]
MAGVIT: "MAGVIT: Masked Generative Video Transformer", CVPR, 2023 (Google). [Paper][Code (in construction)][Website]
MeBT: "Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers", CVPR, 2023 (Kakao). [Paper][PyTorch][Website]
BiFormer: "BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation", CVPR, 2023 (Korea University). [Paper][PyTorch (in construction)]
AMT: "AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation", CVPR, 2023 (Nankai University). [Paper][PyTorch][Website]
?: "Frame Interpolation Transformer and Uncertainty Guidance", CVPR, 2023 (Disney). [Paper]
EMA-VFI: "Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation", CVPR, 2023 (Nanjing University). [Paper][PyTorch]
EIF-BiOFNet: "Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields", CVPR, 2023 (KAIST). [Paper]
TECO: "Temporally Consistent Video Transformer for Long-Term Video Prediction", ICML, 2023 (Berkeley). [Paper][JAX][Website]
VFIFT: "Video Frame Interpolation with Flow Transformer", ACMMM, 2023 (Nanjing University of Aeronautics and Astronautics). [Paper]
ConvSSM: "Convolutional State Space Models for Long-Range Spatiotemporal Modeling", NeurIPS, 2023 (NVIDIA). [Paper][JAX]
NUWA-XL: "NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation", arXiv, 2023 (Microsoft). [Paper][Website (in construction)]
CAT-NeRF: "CAT-NeRF: Constancy-Aware Tx²Former for Dynamic Body Modeling", arXiv, 2023 (USC). [Paper]
IconShop: "IconShop: Text-Based Vector Icon Synthesis with Autoregressive Transformers", arXiv, 2023 (CUHK). [Paper][Code (in construction)][Website]
VDT: "VDT: An Empirical Study on Video Diffusion with Transformers", arXiv, 2023 (Renmin University of China). [Paper][PyTorch]
MAGVIT-v2: "Language Model Beats Diffusion - Tokenizer is Key to Visual Generation", arXiv, 2023 (Google). [Paper][Website]
UVDv1: "Sequential Modeling Enables Scalable Learning for Large Vision Models", arXiv, 2023 (Berkeley). [Paper][Code (in construction)][Website]
W.A.L.T: "Photorealistic Video Generation with Diffusion Models", arXiv, 2023 (Google). [Paper][Website]
?: "SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces", ICLRW, 2024 (University of Tokyo). [Paper][PyTorch]
?: "Video as the New Language for Real-World Decision Making", arXiv, 2024 (DeepMind). [Paper]
Exo2Ego: "Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos", arXiv, 2024 (Meta). [Paper]

[Back to Overview]

Transfer / Translation / Manipulation

AdaAttN: "AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer", ICCV, 2021 (Baidu). [Paper][Paddle][PyTorch]
StyleCLIP: "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery", ICCV, 2021 (Hebrew University of Jerusalem). [Paper][PyTorch]
StyTr2: "StyTr^2: Unbiased Image Style Transfer with Transformers", CVPR, 2022 (CAS). [Paper][PyTorch]
InstaFormer: "InstaFormer: Instance-Aware Image-to-Image Translation with Transformer", CVPR, 2022 (Korea University). [Paper]
ManiTrans: "ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation", CVPR, 2022 (Huawei). [Paper][Website]
QS-Attn: "QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation", CVPR, 2022 (Shanghai Key Laboratory). [Paper][PyTorch]
Splice: "Splicing ViT Features for Semantic Appearance Transfer", CVPR, 2022 (Weizmann Institute of Science, Israel). [Paper][PyTorch][Website]
ASSET: "ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions", SIGGRAPH, 2022 (Adobe). [Paper][PyTorch][Website]
SCAM: "SCAM! Transferring humans between images with Semantic Cross Attention Modulation", ECCV, 2022 (Univ Gustave Eiffel, France). [Paper][PyTorch][Website]
TargetCLIP: "Image-Based CLIP-Guided Essence Transfer", ECCV, 2022 (Tel Aviv). [Paper][PyTorch]
FFCLIP: "One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations", NeurIPS, 2022 (Tencent). [Paper][Code (in construction)]
STTR: "Fine-Grained Image Style Transfer with Visual Transformers", ACCV, 2022 (The University of Tokyo). [Paper][PyTorch (in construction)]
UVCGAN: "UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation", arXiv, 2022 (Brookhaven National Laboratory, NY). [Paper]
ITTR: "ITTR: Unpaired Image-to-Image Translation with Transformers", arXiv, 2022 (Kuaishou). [Paper]
CLIPasso: "CLIPasso: Semantically-Aware Object Sketching", arXiv, 2022 (EPFL). [Paper][PyTorch][Website]
CTrGAN: "CTrGAN: Cycle Transformers GAN for Gait Transfer", arXiv, 2022 (Ariel University, Israel). [Paper]
PI-Trans: "PI-Trans: Parallel-ConvMLP and Implicit-Transformation Based GAN for Cross-View Image Translation", arXiv, 2022 (University of Trento, Italy). [Paper][PyTorch (in construction)]
CSLA: "Bridging CLIP and StyleGAN through Latent Alignment for Image Editing", arXiv, 2022 (Kuaishou). [Paper]
CLIP-PAE: "CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Image Manipulation", arXiv, 2022 (University of Cambridge). [Paper]
S2WAT: "S2WAT: Image Style Transfer via Hierarchical Vision Transformer using Strips Window Attention", arXiv, 2022 (Sichuan Normal University). [Paper]
DiffuseIT: "Diffusion-based Image Translation using Disentangled Style and Content Representation", ICLR, 2023 (KAIST). [Paper]
MATEBIT: "Masked and Adaptive Transformer for Exemplar Based Image Translation", CVPR, 2023 (Hangzhou Dianzi University). [Paper][PyTorch]
IPL: "Zero-shot Generative Model Adaptation via Image-specific Prompt Learning", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
Master: "Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer", CVPR, 2023 (NUS). [Paper]
LENeRF: "Local 3D Editing via 3D Distillation of CLIP Knowledge", CVPR, 2023 (Kakao). [Paper]
SINE: "SINE: SINgle Image Editing with Text-to-Image Diffusion Models", CVPR, 2023 (Rutgers). [Paper][PyTorch]
Imagic: "Imagic: Text-Based Real Image Editing with Diffusion Models", CVPR, 2023 (Google). [Paper][Website]
DATID-3D: "DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model", CVPR, 2023 (SNU). [Paper][PyTorch][Website]
Null-text-Inversion: "Null-text Inversion for Editing Real Images using Guided Diffusion Models", CVPR, 2023 (Google). [Paper]
LANIT: "LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data", CVPR, 2023 (Korea University). [Paper][PyTorch]
StylerDALLE: "StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model", ICCV, 2023 (University of Trento, Italy). [Paper][PyTorch]
?: "Disentangling Structure and Appearance in ViT Feature Space", ACM ToG, 2023 (Weizmann Institute of Science (WIS), Israel). [Paper][PyTorch][Website]
pix2pix-zero: "Zero-shot Image-to-Image Translation", arXiv, 2023 (Adobe). [Paper][Code (in construction)][Website]
SpectralCLIP: "SpectralCLIP: Preventing Artifacts in Text-Guided Style Transfer from a Spectral Perspective", arXiv, 2023 (University of Trento, Italy). [Paper][Code (in construction)]
PGIC: "A Unified Prompt-Guided In-Context Inpainting Framework for Reference-based Image Manipulations", arXiv, 2023 (Fudan). [Paper][Code (in construction)]

[Back to Overview]

Other Low-Level Tasks

Colorization:
- ColTran: "Colorization Transformer", ICLR, 2021 (Google). [Paper][Tensorflow]
- ViT-I-GAN: "ViT-Inception-GAN for Image Colourising", arXiv, 2021 (D.Y Patil College of Engineering, India). [Paper]
- CT²: "CT²: Colorization Transformer via Color Tokens", ECCV, 2022 (Peking University). [Paper][PyTorch]
- L-CoDer: "L-CoDer: Language-based Colorization with Color-object Decoupling Transformer", ECCV, 2022 (Beijing University of Posts and Telecommunications). [Paper]
- ColorFormer: "ColorFormer: Image Colorization via Color Memory assisted Hybrid-attention Transformer", ECCV, 2022 (Tencent). [Paper]
- UniColor: "UniColor: A Unified Framework for Multi-Modal Colorization with Transformer", SIGGRAPH Asia, 2022 (CUHK). [Paper][Website]
- iColoriT: "iColoriT: Towards Propagating Local Hint to the Right Region in Interactive Colorization by Leveraging Vision Transformer", arXiv, 2022 (KAIST). [Paper]
- L-CoIns: "L-CoIns: Language-based Colorization with Instance Awareness", CVPR, 2023 (Beijing University of Posts and Telecommunications). [Paper]
- L-CAD: "L-CAD: Language-based Colorization with Any-level Descriptions using Diffusion Priors", NeurIPS, 2023 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
Enhancement:
- PanFormer: "PanFormer: a Transformer Based Model for Pan-sharpening", ICME, 2022 (Beihang University). [Paper][PyTorch]
- URSCT-UIE: "Reinforced Swin-Convs Transformer for Underwater Image Enhancement", arXiv, 2022 (Ningbo University). [Paper]
- IAT: "Illumination Adaptive Transformer", arXiv, 2022 (The University of Tokyo). [Paper][PyTorch]
- SPGAT: "Structural Prior Guided Generative Adversarial Transformers for Low-Light Image Enhancement", arXiv, 2022 (The Hong Kong Polytechnic University). [Paper]
- SSTF: "End-to-end Transformer for Compressed Video Quality Enhancement", arXiv, 2022 (Nanjing University of Information Science and Technology). [Paper]
- CLIP-LiT: "Iterative Prompt Learning for Unsupervised Backlit Image Enhancement", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
- Retinexformer: "Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
High Dynamic Range (HDR):
- CA-ViT: "Ghost-free High Dynamic Range Imaging with Context-aware Transformer", ECCV, 2022 (Megvii). [Paper][PyTorch]
- Selective-TransHDR: "Selective TransHDR: Transformer-Based Selective HDR Imaging Using Ghost Region Mask", ECCV, 2022 (Sogang University, Korea). [Paper]
- Text2Light: "Text2Light: Zero-Shot Text-Driven HDR Panorama Generation", SIGGRAPH Asia, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
- SMAE: "SMAE: Few-shot Learning for HDR Deghosting with Saturation-Aware Masked Autoencoders", CVPR, 2023 (Northwestern Polytechnical University). [Paper]
- SCTNet: "Alignment-free HDR Deghosting with Semantics Consistent Transformer", ICCV, 2023 (University of Bourgogne, France). [Paper][Website]
- ?: "Online Overexposed Pixels Hallucination in Videos with Adaptive Reference Frame Selection", arXiv, 2023 (NVIDIA). [Paper]
- IFT: "IFT: Image Fusion Transformer for Ghost-free High Dynamic Range Imaging", arXiv, 2023 (Huawei). [Paper]
Harmonization:
- HT: "Image Harmonization With Transformer", ICCV, 2021 (Ocean University of China). [Paper]
- LEMaRT: "LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization", CVPR, 2023 (Amazon). [Paper]
Compression:
- ?: "Towards End-to-End Image Compression and Analysis with Transformers", AAAI, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
- Entroformer: "Entroformer: A Transformer-based Entropy Model for Learned Image Compression", ICLR, 2022 (Alibaba). [Paper]
- STF: "The Devil Is in the Details: Window-based Attention for Image Compression", CVPR, 2022 (CAS). [Paper][PyTorch]
- Contextformer: "Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression", ECCV, 2022 (TUM). [Paper]
- VCT: "VCT: A Video Compression Transformer", NeurIPS, 2022 (Google). [Paper]
- MIMT: "MIMT: Masked Image Modeling Transformer for Video Compression", ICLR, 2023 (Tencent). [Paper]
- TCM: "Learned Image Compression with Mixed Transformer-CNN Architectures", CVPR, 2023 (Waseda University). [Paper][PyTorch]
- TransTIC: "TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception", ICCV, 2023 (NYCU). [Paper]
- Prompt-ICM: "Prompt-ICM: A Unified Framework towards Image Coding for Machines with Task-driven Prompts", arXiv, 2023 (USTC). [Paper]
- FAT-LIC: "Frequency-Aware Transformer for Learned Image Compression", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
Matting:
- MatteFormer: "MatteFormer: Transformer-Based Image Matting via Prior-Tokens", CVPR, 2022 (SNU + NAVER). [Paper][PyTorch]
- TransMatting: "TransMatting: Enhancing Transparent Objects Matting with Transformers", ECCV, 2022 (CAS). [Paper][Code (in construction)]
- VMFormer: "VMFormer: End-to-End Video Matting with Transformer", arXiv, 2022 (PicsArt). [Paper][PyTorch][Website]
- CLIPMat: "Referring Image Matting", CVPR, 2023 (The University of Sydney). [Paper][Code (in construction)]
- ViTMatte: "ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers", arXiv, 2023 (Xiaobing.AI). [Paper]
- MAM: "Matting Anything", arXiv, 2023 (UIUC). [Paper][PyTorch][Website]
- MaGGIe: "MaGGIe: Masked Guided Gradual Human Instance Matting", CVPR, 2024 (Adobe). [Paper][Website]
Reconstruction:
- ET-Net: "Event-Based Video Reconstruction Using Transformer", ICCV, 2021 (University of Science and Technology of China). [Paper][PyTorch]
- GradViT: "GradViT: Gradient Inversion of Vision Transformers", CVPR, 2022 (NVIDIA). [Paper][Website]
- MST: "Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction", CVPR, 2022 (Tsinghua). [Paper][PyTorch]
- MST++: "MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction", CVPRW, 2022 (Tsinghua). [Paper][PyTorch]
- CST: "Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
- DAUHST: "Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch]
- S²-Transformer: "S²-Transformer for Mask-Aware Hyperspectral Image Reconstruction", arXiv, 2022 (Rochester Institute of Technology). [Paper]
- NLOST: "NLOST: Non-Line-of-Sight Imaging with Transformer", CVPR, 2023 (USTC). [Paper][Code (in construction)]
- MinD-Vis: "Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding", CVPR, 2023 (NUS). [Paper][PyTorch][Website]
- PADUT: "Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction", ICCV, 2023 (Beijing Institute of Technology). [Paper][PyTorch]
- GTA: "Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction", NeurIPS, 2023 (Zhejiang). [Paper]
Radiance Fields:
- NeXT: "NeXT: Towards High Quality Neural Radiance Fields via Multi-Skip Transformer", ECCV, 2022 (Tsinghua University). [Paper][JAX]
- TransNeRF: "Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer", arXiv, 2022 (UBC). [Paper]
- ABLE-NeRF: "ABLE-NeRF: Attention-Based Rendering with Learnable Embeddings for Neural Radiance Field", CVPR, 2023 (NTU, Singapore). [Paper]
- TransHuman: "TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering", ICCV, 2023 (Alibaba). [Paper][PyTorch][Website]
- GNT-MOVE: "Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts", ICCV, 2023 (UT Austin). [Paper][PyTorch]
- ReTR: "ReTR: Modeling Rendering Via Transformer for Generalizable Neural Surface Reconstruction", NeurIPS, 2023 (HKUST). [Paper][PyTorch]
3D:
- MNSRNet: "MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution", CVPR, 2022 (Shenzhen University). [Paper]
Others:
- TransMEF: "TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning", AAAI, 2022 (Fudan). [Paper]
- MS-Unet: "Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer", CVPR, 2022 (Megvii). [Paper][Code (in construction)]
- TransCL: "TransCL: Transformer Makes Strong and Flexible Compressive Learning", TPAMI, 2022 (Peking University). [Paper][Code (in construction)]
- GAP-CSCoT: "Spectral Compressive Imaging Reconstruction Using Convolution and Spectral Contextual Transformer", arXiv, 2022 (CAS). [Paper]
- MatFormer: "MatFormer: A Generative Model for Procedural Materials", arXiv, 2022 (Adobe). [Paper]
- FishFormer: "FishFormer: Annulus Slicing-based Transformer for Fisheye Rectification with Efficacy Domain Exploration", arXiv, 2022 (Beijing Jiaotong University). [Paper]
- STFormer: "Spatial-Temporal Transformer for Video Snapshot Compressive Imaging", arXiv, 2022 (CAS). [Paper][PyTorch]
- OCTUF: "Optimization-Inspired Cross-Attention Transformer for Compressive Sensing", CVPR, 2023 (Peking University). [Paper][PyTorch]
- TopNet: "TopNet: Transformer-based Object Placement Network for Image Compositing", CVPR, 2023 (Adobe). [Paper]
- RHWF: "Recurrent Homography Estimation Using Homography-Guided Image Warping and Focus Transformer", CVPR, 2023 (Zhejiang University). [Paper][Code (in construction)]
- M2T: "M2T: Masking Transformers Twice for Faster Decoding", ICCV, 2023 (Google). [Paper]
- CTM: "Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging", ICCV, 2023 (CAS). [Paper]
- PromptGIP: "Unifying Image Processing as Visual Prompting Question Answering", arXiv, 2023 (Shanghai AI Lab). [Paper]
- FILM: "Image Fusion via Vision-Language Model", arXiv, 2024 (Xi'an Jiaotong University). [Paper]

[Back to Overview]

Reinforcement Learning

Navigation

VTNet: "VTNet: Visual Transformer Network for Object Goal Navigation", ICLR, 2021 (ANU). [Paper]
MaAST: "MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation", ICRA, 2021 (SRI). [Paper]
TransFuser: "Multi-Modal Fusion Transformer for End-to-End Autonomous Driving", CVPR, 2021 (MPI). [Paper][PyTorch]
CMTP: "Topological Planning With Transformers for Vision-and-Language Navigation", CVPR, 2021 (Stanford). [Paper]
VLN-BERT: "VLN-BERT: A Recurrent Vision-and-Language BERT for Navigation", CVPR, 2021 (ANU). [Paper][PyTorch]
E.T.: "Episodic Transformer for Vision-and-Language Navigation", ICCV, 2021 (Google). [Paper][PyTorch]
HAMT: "History Aware Multimodal Transformer for Vision-and-Language Navigation", NeurIPS, 2021 (INRIA). [Paper][PyTorch][Website]
SOAT: "SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation", NeurIPS, 2021 (Georgia Tech). [Paper]
OMT: "Object Memory Transformer for Object Goal Navigation", ICRA, 2022 (AIST, Japan). [Paper]
ADAPT: "ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts", CVPR, 2022 (Huawei). [Paper]
DUET: "Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation", CVPR, 2022 (INRIA). [Paper][Website]
LSA: "Local Slot Attention for Vision-and-Language Navigation", ICMR, 2022 (Fudan). [Paper]
?: "Learning from Unlabeled 3D Environments for Vision-and-Language Navigation", ECCV, 2022 (INRIA). [Paper][Website]
MTVM: "Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation", ECCV, 2022 (ByteDance). [Paper][PyTorch]
DDL: "Learning Disentanglement with Decoupled Labels for Vision-Language Navigation", ECCV, 2022 (Beijing Institute of Technology). [Paper][PyTorch]
Sim2Sim: "Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments", ECCV, 2022 (Oregon State University). [Paper][PyTorch][Website]
AVLEN: "AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments", NeurIPS, 2022 (UC Riverside). [Paper]
ZSON: "ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings", NeurIPS, 2022 (Georgia Tech). [Paper]
WS-MGMap: "Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation", NeurIPS, 2022 (South China University of Technology). [Paper][PyTorch (in construction)]
CLIP-Nav: "CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation", CoRLW, 2022 (Amazon). [Paper]
TransFuser: "TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving", arXiv, 2022 (MPI). [Paper]
TD-STP: "Target-Driven Structured Transformer Planner for Vision-Language Navigation", arXiv, 2022 (Beihang University). [Paper][Code (in construction)]
DAVIS: "Anticipating the Unseen Discrepancy for Vision and Language Navigation", arXiv, 2022 (UCSB). [Paper]
LOViS: "LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation", arXiv, 2022 (Michigan State). [Paper]
BEVBert: "BEVBert: Topo-Metric Map Pre-training for Language-guided Navigation", arXiv, 2022 (CAS). [Paper]
Meta-Explore: "Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding", CVPR, 2023 (Seoul National University). [Paper][Website]
LANA: "Lana: A Language-Capable Navigator for Instruction Following and Generation", CVPR, 2023 (Zhejiang University). [Paper][PyTorch (in construction)]
KERM: "KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation", CVPR, 2023 (CAS). [Paper][PyTorch]
VLN-SIG: "Improving Vision-and-Language Navigation by Generating Future-View Image Semantics", CVPR, 2023 (UNC). [Paper][PyTorch][Website]
GeoVLN: "GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation", CVPR, 2023 (Fudan). [Paper]
IVLN: "Iterative Vision-and-Language Navigation", CVPR, 2023 (Oregon State University). [Paper]
AZHP: "Adaptive Zone-aware Hierarchical Planner for Vision-Language Navigation", CVPR, 2023 (Beihang University). [Paper][Code (in construction)]
MARVAL: "A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning", CVPR, 2023 (Google). [Paper]
VO-Transformer: "Modality-invariant Visual Odometry for Embodied Vision", CVPR, 2023 (EPFL). [Paper][Website]
VLN-Behave: "Behavioral Analysis of Vision-and-Language Navigation Agents", CVPR, 2023 (Oregon State). [Paper][Code]
Lily: "Learning Vision-and-Language Navigation from YouTube Videos", ICCV, 2023 (South China University of Technology). [Paper][PyTorch]
ScaleVLN: "Scaling Data Generation in Vision-and-Language Navigation", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
BSG: "Bird's-Eye-View Scene Graph for Vision-Language Navigation", ICCV, 2023 (Zhejiang University). [Paper][Code (in construction)]
AerialVLN: "AerialVLN: Vision-and-Language Navigation for UAVs", ICCV, 2023 (Northwestern Polytechnical University). [Paper][PyTorch]
DREAMWALKER: "DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation", ICCV, 2023 (Beijing Institute of Technology). [Paper][Code (in construction)]
VLN-PETL: "VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation", ICCV, 2023 (The University of Adelaide, Australia). [Paper][Code (in construction)]
MiC: "March in Chat: Interactive Prompting for Remote Embodied Referring Expression", ICCV, 2023 (The University of Adelaide, Australia). [Paper][Code (in construction)]
GELA: "Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation", ICCV, 2023 (Chinese Academy of Military Science). [Paper][PyTorch]
GridMM: "GridMM: Grid Memory Map for Vision-and-Language Navigation", ICCV, 2023 (CAS). [Paper][PyTorch]
LLM-Planner: "LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models", ICCV, 2023 (OSU). [Paper][Code (in construction)][Website]
Le-RNR-Map: "Language-enhanced RNR-Map: Querying Renderable Neural Radiance Field maps with natural language", ICCVW, 2023 (University of Verona, Italy). [Paper][Code (in construction)][Website]
LACMA: "LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following", EMNLP, 2023 (Microsoft). [Paper][PyTorch]
FGPrompt: "FGPrompt: Fine-grained Goal Prompting for Image-goal Navigation", NeurIPS, 2023 (South China University of Technology). [Paper][PyTorch][Website]
PanoGen: "PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation", NeurIPS, 2023 (UNC). [Paper][PyTorch][Website]
MLANet: "MLANet: Multi-Level Attention Network with Sub-instruction for Continuous Vision-and-Language Navigation", arXiv, 2023 (Tongji University). [Paper][PyTorch]
ENTL: "ENTL: Embodied Navigation Trajectory Learner", arXiv, 2023 (AI2). [Paper]
MPM: "Masked Path Modeling for Vision-and-Language Navigation", arXiv, 2023 (UCLA). [Paper]
NavGPT: "NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models", arXiv, 2023 (The University of Adelaide, Australia). [Paper]
MO-VLN: "MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation", arXiv, 2023 (Sun Yat-Sen University). [Paper][Code (in construction)][Website]
ViNT: "ViNT: A Foundation Model for Visual Navigation", arXiv, 2023 (Berkeley). [Paper][Code (in construction)][Website]
A²Nav: "A²Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models", arXiv, 2023 (South China University of Technology). [Paper]
LangNav: "LangNav: Language as a Perceptual Representation for Navigation", arXiv, 2023 (MIT). [Paper]
?: "Multimodal Large Language Model for Visual Navigation", arXiv, 2023 (Apple). [Paper]
VLN-Video: "VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation", AAAI, 2024 (Amazon). [Paper]
MemoNav: "MemoNav: Working Memory Model for Visual Navigation", CVPR, 2024 (CAS). [Paper]
OVER-NAV: "OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation", CVPR, 2024 (HKU). [Paper]
HNR: "Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation", CVPR, 2024 (CAS). [Paper][Code (in construction)]
GOAT: "Vision-and-Language Navigation via Causal Learning", CVPR, 2024 (Tongji University). [Paper][PyTorch]
MapGPT: "MapGPT: Map-Guided Prompting for Unified Vision-and-Language Navigation", arXiv, 2024 (HKU). [Paper]
V-IRL: "V-IRL: Grounding Virtual Intelligence in Real Life", arXiv, 2024 (NYU). [Paper][PyTorch (in construction)][Website]

[Back to Overview]

Other RL Tasks

SVEA: "Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation", arXiv, 2021 (UCSD). [Paper][GitHub][Website]
LocoTransformer: "Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers", ICLR, 2022 (UCSD). [Paper][Website]
STAM: "Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes", CVPR, 2022 (McGill University, Canada). [Paper][PyTorch]
CtrlFormer: "CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer", ICML, 2022 (HKU). [Paper][PyTorch][Website]
PromptDT: "Prompting Decision Transformer for Few-Shot Policy Generalization", ICML, 2022 (CMU). [Paper][Website]
StARformer: "StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning", ECCV, 2022 (Stony Brook). [Paper][PyTorch]
RAD: "Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels", arXiv, 2022 (UBC, Canada). [Paper]
MWM: "Masked World Models for Visual Control", arXiv, 2022 (Berkeley). [Paper][Tensorflow][Website]
IRIS: "Transformers are Sample Efficient World Models", arXiv, 2022 (University of Geneva, Switzerland). [Paper][PyTorch]
InstructRL: "Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models", arXiv, 2022 (Google). [Paper]
STG-Transformer: "Learning from Visual Observation via Offline Pretrained State-to-Go Transformer", NeurIPS, 2023 (BAAI). [Paper][Code (in construction)][Website]
RL4VLM: "Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning", arXiv, 2024 (Berkeley + NYU). [Paper][PyTorch][Website]

[Back to Overview]

Medical

Medical Segmentation

Cross-Transformer: "The entire network structure of Crossmodal Transformer", ICBSIP, 2021 (Capital Medical University). [Paper]
Segtran: "Medical Image Segmentation using Squeeze-and-Expansion Transformers", IJCAI, 2021 (A*STAR). [Paper]
i-ViT: "Instance-based Vision Transformer for Subtyping of Papillary Renal Cell Carcinoma in Histopathological Image", MICCAI, 2021 (Xi'an Jiaotong University). [Paper][PyTorch][Website]
UTNet: "UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation", MICCAI, 2021 (Rutgers). [Paper]
MCTrans: "Multi-Compound Transformer for Accurate Biomedical Image Segmentation", MICCAI, 2021 (HKU + CUHK). [Paper][Code (in construction)]
Polyformer: "Few-Shot Domain Adaptation with Polymorphic Transformers", MICCAI, 2021 (A*STAR). [Paper][PyTorch]
BA-Transformer: "Boundary-aware Transformers for Skin Lesion Segmentation", MICCAI, 2021 (Xiamen University). [Paper][PyTorch]
GT-U-Net: "GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation", MICCAIW, 2021 (Hangzhou Dianzi University). [Paper][PyTorch]
STN: "Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation", ISBI, 2021 (Institut Polytechnique de Paris). [Paper]
T-AutoML: "T-AutoML: Automated Machine Learning for Lesion Segmentation Using Transformers in 3D Medical Imaging", ICCV, 2021 (NVIDIA). [Paper]
MedT: "Medical Transformer: Gated Axial-Attention for Medical Image Segmentation", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
Convolution-Free: "Convolution-Free Medical Image Segmentation using Transformers", arXiv, 2021 (Harvard). [Paper]
CoTr: "CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation", arXiv, 2021 (Northwestern Polytechnical University). [Paper][PyTorch]
TransBTS: "TransBTS: Multimodal Brain Tumor Segmentation Using Transformer", arXiv, 2021 (University of Science and Technology Beijing). [Paper][PyTorch]
SpecTr: "SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation", arXiv, 2021 (East China Normal University). [Paper][Code (in construction)]
U-Transformer: "U-Net Transformer: Self and Cross Attention for Medical Image Segmentation", arXiv, 2021 (CEDRIC). [Paper]
TransUNet: "TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
PMTrans: "Pyramid Medical Transformer for Medical Image Segmentation", arXiv, 2021 (Washington University in St. Louis). [Paper]
PBT-Net: "Anatomy-Guided Parallel Bottleneck Transformer Network for Automated Evaluation of Root Canal Therapy", arXiv, 2021 (Hangzhou Dianzi University). [Paper]
Swin-Unet: "Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation", arXiv, 2021 (Huawei). [Paper][Code (in construction)]
MBT-Net: "A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation", arXiv, 2021 (Southern University of Science and Technology). [Paper]
WAD: "More than Encoder: Introducing Transformer Decoder to Upsample", arXiv, 2021 (South China University of Technology). [Paper]
LeViT-UNet: "LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation", arXiv, 2021 (Wuhan Institute of Technology). [Paper]
?: "Evaluating Transformer based Semantic Segmentation Networks for Pathological Image Segmentation", arXiv, 2021 (Vanderbilt University). [Paper]
nnFormer: "nnFormer: Interleaved Transformer for Volumetric Segmentation", arXiv, 2021 (HKU + Xiamen University). [Paper][PyTorch]
MISSFormer: "MISSFormer: An Effective Medical Image Segmentation Transformer", arXiv, 2021 (Beijing University of Posts and Telecommunications). [Paper]
TUnet: "Transformer-Unet: Raw Image Processing with Unet", arXiv, 2021 (Beijing Zoezen Robot + Beihang University). [Paper]
BiTr-Unet: "BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation", arXiv, 2021 (New York University). [Paper]
?: "Transformer Assisted Convolutional Network for Cell Instance Segmentation", arXiv, 2021 (IIT Dhanbad). [Paper]
?: "Combining CNNs With Transformer for Multimodal 3D MRI Brain Tumor Segmentation With Self-Supervised Pretraining", arXiv, 2021 (Ukrainian Catholic University). [Paper]
UNETR: "UNETR: Transformers for 3D Medical Image Segmentation", WACV, 2022 (NVIDIA). [Paper][PyTorch]
AFTer-UNet: "AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation", WACV, 2022 (UC Irvine). [Paper]
UCTransNet: "UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer", AAAI, 2022 (Northeastern University, China). [Paper][PyTorch]
Swin-UNETR: "Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis", CVPR, 2022 (NVIDIA). [Paper][PyTorch]
?: "Transformer-based out-of-distribution detection for clinically safe segmentation", Medical Imaging with Deep Learning (MIDL), 2022 (King’s College London). [Paper]
ScaleFormer: "ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation", IJCAI, 2022 (Zhejiang University). [Paper][Code (in construction)]
FCBFormer: "FCN-Transformer Feature Fusion for Polyp Segmentation", Annual Conference on Medical Image Understanding and Analysis (MIUA), 2022 (University of Central Lancashire, UK). [Paper][PyTorch]
UAMT-ViT: "An uncertainty-aware transformer for MRI cardiac semantic segmentation via mean teachers", Medical Image Understanding and Analysis (MIUA), 2022 (Oxford). [Paper][PyTorch]
VDFormer: "View-Disentangled Transformer for Brain Lesion Detection", ISBI, 2022 (CUHK). [Paper][PyTorch]
TFCNs: "TFCNs: A CNN-Transformer Hybrid Network for Medical Image Segmentation", International Conference on Artificial Neural Networks (ICANN), 2022 (Xiamen University). [Paper][PyTorch (in construction)]
MIL: "Transformer based multiple instance learning for weakly supervised histopathology image segmentation", MICCAI, 2022 (Beihang University). [Paper]
mmFormer: "mmFormer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation", MICCAI, 2022 (CAS). [Paper][PyTorch]
Patcher: "Patcher: Patch Transformers with Mixture of Experts for Precise Medical Image Segmentation", MICCAI, 2022 (Pennsylvania State University). [Paper]
NestedFormer: "NestedFormer: Nested Modality-Aware Transformer for Brain Tumor Segmentation", MICCAI, 2022 (Tianjin University). [Paper][Code (in construction)]
TransDeepLab: "TransDeepLab: Convolution-Free Transformer-based DeepLab v3+ for Medical Image Segmentation", MICCAIW, 2022 (RWTH Aachen University, Germany). [Paper][PyTorch]
CESSViT: "Computationally-Efficient Vision Transformer for Medical Image Semantic Segmentation via Dual Pseudo-Label Supervision", ICIP, 2022 (Oxford). [Paper][PyTorch]
S4CVNet: "When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class Medical Image Semantic Segmentation", ECCVW, 2022 (Oxford). [Paper][PyTorch]
Video-TransUNet: "Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation", International Conference on Machine Vision (ICMV), 2022 (University of Bristol, UK). [Paper]
TransResNet: "TransResNet: Integrating the Strengths of ViTs and CNNs for High Resolution Medical Image Segmentation via Feature Grafting", BMVC, 2022 (MBZUAI). [Paper]
CAAViT: "Adversarial Vision Transformer for Medical Image Semantic Segmentation with Limited Annotations", BMVC, 2022 (Oxford). [Paper][PyTorch][Supp]
CASTformer: "Class-Aware Adversarial Transformers for Medical Image Segmentation", NeurIPS, 2022 (Yale). [Paper]
TransNorm: "TransNorm: Transformer Provides a Strong Spatial Normalization Mechanism for a Deep Segmentation Model", IEEE Access, 2022 (Aachen University, Germany). [Paper][PyTorch]
Tempera: "Tempera: Spatial Transformer Feature Pyramid Network for Cardiac MRI Segmentation", arXiv, 2022 (ICL). [Paper]
UTNetV2: "A Multi-scale Transformer for Medical Image Segmentation: Architectures, Model Efficiency, and Benchmarks", arXiv, 2022 (Rutgers). [Paper]
UNesT: "Characterizing Renal Structures with 3D Block Aggregate Transformers", arXiv, 2022 (Vanderbilt University, Tennessee). [Paper]
PHTrans: "PHTrans: Parallelly Aggregating Global and Local Representations for Medical Image Segmentation", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
UNeXt: "UNeXt: MLP-based Rapid Medical Image Segmentation Network", arXiv, 2022 (JHU). [Paper][PyTorch]
TransFusion: "TransFusion: Multi-view Divergent Fusion for Medical Image Segmentation with Transformers", arXiv, 2022 (Rutgers). [Paper]
UNetFormer: "UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation", arXiv, 2022 (NVIDIA). [Paper][GitHub]
3D-Shuffle-Mixer: "3D Shuffle-Mixer: An Efficient Context-Aware Vision Learner of Transformer-MLP Paradigm for Dense Prediction in Medical Volume", arXiv, 2022 (Xi'an Jiaotong University). [Paper]
?: "Continual Hippocampus Segmentation with Transformers", arXiv, 2022 (Technical University of Darmstadt, Germany). [Paper]
TranSiam: "TranSiam: Fusing Multimodal Visual Features Using Transformer for Medical Image Segmentation", arXiv, 2022 (Tianjin University). [Paper]
ColonFormer: "ColonFormer: An Efficient Transformer based Method for Colon Polyp Segmentation", arXiv, 2022 (Hanoi University of Science and Technology). [Paper]
?: "Transformer based Generative Adversarial Network for Liver Segmentation", arXiv, 2022 (Northwestern University). [Paper]
FCT: "The Fully Convolutional Transformer for Medical Image Segmentation", arXiv, 2022 (University of Glasgow, UK). [Paper]
XBound-Former: "XBound-Former: Toward Cross-scale Boundary Modeling in Transformers", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
Polyp-PVT: "Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers", arXiv, 2022 (IIAI). [Paper][PyTorch]
SeATrans: "SeATrans: Learning Segmentation-Assisted diagnosis model via Transformer", arXiv, 2022 (Baidu). [Paper]
TransResU-Net: "TransResU-Net: Transformer based ResU-Net for Real-Time Colonoscopy Polyp Segmentation", arXiv, 2022 (Indira Gandhi National Open University). [Paper][Code (in construction)]
LViT: "LViT: Language meets Vision Transformer in Medical Image Segmentation", arXiv, 2022 (Alibaba). [Paper][Code (in construction)]
APFormer: "The Lighter The Better: Rethinking Transformers in Medical Image Segmentation Through Adaptive Pruning", arXiv, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
?: "Transformer based Models for Unsupervised Anomaly Segmentation in Brain MR Images", arXiv, 2022 (University of Rennes, France). [Paper][Tensorflow]
CKD-TransBTS: "CKD-TransBTS: Clinical Knowledge-Driven Hybrid Transformer with Modality-Correlated Cross-Attention for Brain Tumor Segmentation", arXiv, 2022 (South China University of Technology). [Paper]
?: "Contextual Attention Network: Transformer Meets U-Net", arXiv, 2022 (RWTH Aachen University). [Paper][PyTorch]
HRSTNet: "High-Resolution Swin Transformer for Automatic Medical Image Segmentation", arXiv, 2022 (Xi'an University of Posts and Telecommunications). [Paper][Code (in construction)]
CM-MLP: "CM-MLP: Cascade Multi-scale MLP with Axial Context Relation Encoder for Edge Segmentation of Medical Image", arXiv, 2022 (Zhengzhou University). [Paper]
CATS: "CATS: Complementary CNN and Transformer Encoders for Segmentation", arXiv, 2022 (Vanderbilt University, Nashville). [Paper]
TFusion: "TFusion: Transformer based N-to-One Multimodal Fusion Block", arXiv, 2022 (South China University of Technology). [Paper]
AutoPET: "AutoPET Challenge: Combining nn-Unet with Swin UNETR Augmented by Maximum Intensity Projection Classifier", arXiv, 2022 (University Hospital Essen, Germany). [Paper]
SPAN: "Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers", arXiv, 2022 (Berkeley). [Paper]
TMSS: "TMSS: An End-to-End Transformer-based Multimodal Network for Segmentation and Survival Prediction", arXiv, 2022 (MBZUAI). [Paper]
CR-Swin2-VT: "Hybrid Window Attention Based Transformer Architecture for Brain Tumor Segmentation", arXiv, 2022 (Monash University). [Paper][PyTorch]
FocalUNETR: "FocalUNETR: A Focal Transformer for Boundary-aware Segmentation of CT Images", arXiv, 2022 (Wayne State University, Detroit). [Paper]
LAPFormer: "LAPFormer: A Light and Accurate Polyp Segmentation Transformer", arXiv, 2022 (Sun* Inc, Hanoi). [Paper]
FINE: "Memory transformers for full context and high-resolution 3D Medical Segmentation", arXiv, 2022 (National Conservatory of Arts and Crafts, France). [Paper]
ConvTransSeg: "ConvTransSeg: A Multi-resolution Convolution-Transformer Network for Medical Image Segmentation", arXiv, 2022 (University of Nottingham, UK). [Paper]
CS-Unet: "Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation", arXiv, 2022 (University of Glasgow, UK). [Paper]
UNETR++: "UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
HiFormer: "HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation", WACV, 2023 (Iran University of Science and Technology). [Paper][PyTorch]
Att-SwinU-Net: "Attention Swin U-Net: Cross-Contextual Attention Mechanism for Skin Lesion Segmentation", IEEE ISBI, 2023 (Shahid Beheshti University, Iran). [Paper][PyTorch]
3DUX-Net: "3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation", ICLR, 2023 (Vanderbilt University). [Paper][PyTorch]
?: "Devil is in the Queries: Advancing Mask Transformers for Real-world Medical Image Segmentation and Out-of-Distribution Localization", CVPR, 2023 (Alibaba). [Paper]
CVM: "Weakly supervised segmentation with point annotations for histopathology images via contrast-based variational model", CVPR, 2023 (University of Liverpool, UK). [Paper]
MAESTER: "MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition", CVPR, 2023 (University of Toronto). [Paper]
Universal-Model: "CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection", ICCV, 2023 (JHU). [Paper][PyTorch]
MDViT: "MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets", MICCAI, 2023 (UBC). [Paper][PyTorch]
ConvFormer: "ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation", MICCAI, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
TP-SIS: "Text Promptable Surgical Instrument Segmentation with Vision-Language Models", NeurIPS, 2023 (King's College London). [Paper][PyTorch]
UniSeg: "UniSeg: A Prompt-driven Universal Segmentation Model as well as A Strong Representation Learner", arXiv, 2023 (Northwestern Polytechnical University, China). [Paper][PyTorch (in construction)]
UniverSeg: "UniverSeg: Universal Medical Image Segmentation", arXiv, 2023 (MIT). [Paper][PyTorch][Website]
3DSAM-adapter: "3DSAM-adapter: Holistic Adaptation of SAM from 2D to 3D for Promptable Medical Image Segmentation", arXiv, 2023 (CUHK). [Paper]
CMCL: "Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training", arXiv, 2023 (NVIDIA). [Paper]
AdaptiveSAM: "AdaptiveSAM: Towards Efficient Tuning of SAM for Surgical Scene Segmentation", arXiv, 2023 (JHU). [Paper][PyTorch]
SAM-Med2D: "SAM-Med2D", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
SAM-Med3D: "SAM-Med3D", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
H-SAM: "Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding", CVPR, 2024 (East China Normal University). [Paper][PyTorch]

[Back to Overview]

Medical Classification

COVID19T: "A Transformer-Based Framework for Automatic COVID19 Diagnosis in Chest CTs", ICCVW, 2021 (?). [Paper][PyTorch]
TransMIL: "TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification", NeurIPS, 2021 (Tsinghua University). [Paper][PyTorch]
TransMed: "TransMed: Transformers Advance Multi-modal Medical Image Classification", arXiv, 2021 (Northeastern University). [Paper]
CXR-ViT: "Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification", arXiv, 2021 (KAIST). [Paper]
ViT-TSA: "Shoulder Implant X-Ray Manufacturer Classification: Exploring with Vision Transformer", arXiv, 2021 (Queen’s University). [Paper]
GasHis-Transformer: "GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification", arXiv, 2021 (Northeastern University). [Paper]
POCFormer: "POCFormer: A Lightweight Transformer Architecture for Detection of COVID-19 Using Point of Care Ultrasound", arXiv, 2021 (The Ohio State University). [Paper]
COVID-ViT: "COVID-VIT: Classification of COVID-19 from CT chest images based on vision transformer models", arXiv, 2021 (Middlesex University, UK). [Paper][PyTorch]
EEG-ConvTransformer: "EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification", arXiv, 2021 (IIT Ropar). [Paper]
CCAT: "Visual Transformer with Statistical Test for COVID-19 Classification", arXiv, 2021 (NCKU). [Paper]
M3T: "M3T: Three-Dimensional Medical Image Classifier Using Multi-Plane and Multi-Slice Transformer", CVPR, 2022 (Yonsei University). [Paper]
?: "A comparative study between vision transformers and CNNs in digital pathology", CVPRW, 2022 (Roche, Switzerland). [Paper]
SCT: "Context-Aware Transformers For Spinal Cancer Detection and Radiological Grading", MICCAI, 2022 (Oxford). [Paper]
KAT: "Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification", MICCAI, 2022 (Beihang University). [Paper][PyTorch]
SEViT: "Self-Ensembling Vision Transformer (SEViT) for Robust Medical Image Classification", MICCAI, 2022 (MBZUAI). [Paper][PyTorch]
MF-ViT: "Multi-Feature Vision Transformer via Self-Supervised Representation Learning for Improvement of COVID-19 Diagnosis", MICCAIW, 2022 (Rutgers University). [Paper][PyTorch]
SB-SSL: "SB-SSL: Slice-Based Self-Supervised Transformers for Knee Abnormality Classification from MRI", MICCAIW, 2022 (University of Surrey, UK). [Paper]
RadioTransformer: "RadioTransformer: A Cascaded Global-Focal Transformer for Visual Attention-guided Disease Classification", ECCV, 2022 (Stony Brook). [Paper][Tensorflow (in construction)]
ScoreNet: "ScoreNet: Learning Non-Uniform Attention and Augmentation for Transformer-Based Histopathological Image Classification", arXiv, 2022 (EPFL). [Paper]
LA-MIL: "Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction", arXiv, 2022 (TUM). [Paper]
HoVer-Trans: "HoVer-Trans: Anatomy-aware HoVer-Transformer for ROI-free Breast Cancer Diagnosis in Ultrasound Images", arXiv, 2022 (South China University of Technology). [Paper]
GTP: "A graph-transformer for whole slide image classification", IEEE Transactions on Medical Imaging (TMI), 2022 (Boston University). [Paper][PyTorch]
?: "Zero-Shot and Few-Shot Learning for Lung Cancer Multi-Label Classification using Vision Transformer", arXiv, 2022 (Harvard). [Paper]
SwinCheX: "SwinCheX: Multi-label classification on chest X-ray images with transformers", arXiv, 2022 (Sharif University of Technology, Iran). [Paper]
SGT: "Rectify ViT Shortcut Learning by Visual Saliency", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
IPMN-ViT: "Neural Transformers for Intraductal Papillary Mucosal Neoplasms (IPMN) Classification in MRI images", arXiv, 2022 (University of Catania, Italy). [Paper]
?: "Multi-Label Retinal Disease Classification using Transformers", arXiv, 2022 (Khalifa University, UAE). [Paper][PyTorch]
TractoFormer: "TractoFormer: A Novel Fiber-level Whole Brain Tractography Analysis Framework Using Spectral Embedding and Vision Transformers", arXiv, 2022 (Harvard). [Paper]
BrainFormer: "BrainFormer: A Hybrid CNN-Transformer Model for Brain fMRI Data Classification", arXiv, 2022 (Chinese PLA General Hospital). [Paper]
SI-ViT: "Shuffle Instances-based Vision Transformer for Pancreatic Cancer ROSE Image Classification", arXiv, 2022 (Beihang University). [Paper][PyTorch]
IPS: "Iterative Patch Selection for High-Resolution Image Recognition", ICLR, 2023 (Hasso Plattner Institute, Germany). [Paper]
ILRA-MIL: "Exploring Low-Rank Property in Multiple Instance Learning for Whole Slide Image Classification", ICLR, 2023 (Tencent). [Paper]
BolT: "BolT: Fused window transformers for fMRI time series analysis", Medical Image Analysis, 2023 (Bilkent University). [Paper][PyTorch]
TOP: "The Rise of AI Language Pathologists: Exploring Two-level Prompt Learning for Few-shot Weakly-supervised Whole Slide Image Classification", NeurIPS, 2023 (Fudan). [Paper][Code (in construction)]
DreaMR: "DreaMR: Diffusion-driven Counterfactual Explanation for Functional MRI", arXiv, 2023 (Bilkent University). [Paper][PyTorch]
LongViT: "When an Image is Worth 1,024 x 1,024 Words: A Case Study in Computational Pathology", arXiv, 2023 (Microsoft). [Paper][PyTorch]
FiVE: "Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction", CVPR, 2024 (Xiamen University). [Paper]
FocusMAE: "FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders", CVPR, 2024 (IIT Delhi). [Paper][Code (in construction)]

[Back to Overview]

Medical Detection

COTR: "COTR: Convolution in Transformer Network for End to End Polyp Detection", arXiv, 2021 (Fuzhou University). [Paper]
TR-Net: "Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries", arXiv, 2021 (Harbin Institute of Technology). [Paper]
CAE-Transformer: "CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans", arXiv, 2021 (Concordia University, Canada). [Paper]
SwinFPN: "SwinFPN: Leveraging Vision Transformers for 3D Organs-At-Risk Detection", MIDL, 2022 (TUM). [Paper][PyTorch]
DATR: "DATR: Domain-adaptive transformer for multi-domain landmark detection", arXiv, 2022 (CAS). [Paper]
SATr: "SATr: Slice Attention with Transformer for Universal Lesion Detection", arXiv, 2022 (CAS). [Paper]
AC-Former: "Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
PGT: "Prompt-based Grouping Transformer for Nucleus Detection and Classification", MICCAI, 2023 (Sun Yat-sen University). [Paper][PyTorch]
Focused-Decoder: "Focused Decoding Enables 3D Anatomical Detection by Transformers", MELBA, 2023 (University of Zurich). [Paper][PyTorch][Website]

[Back to Overview]

Medical Reconstruction

T²Net: "Task Transformer Network for Joint MRI Reconstruction and Super-Resolution", MICCAI, 2021 (Harbin Institute of Technology). [Paper][PyTorch]
FIT: "Fourier Image Transformer", arXiv, 2021 (MPI). [Paper][PyTorch]
SLATER: "Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers", arXiv, 2021 (Bilkent University). [Paper]
MTrans: "MTrans: Multi-Modal Transformer for Accelerated MR Imaging", arXiv, 2021 (Harbin Institute of Technology). [Paper][PyTorch]
SDAUT: "Swin Deformable Attention U-Net Transformer (SDAUT) for Explainable Fast MRI", MICCAI, 2022 (ICL). [Paper]
?: "Adaptively Re-weighting Multi-Loss Untrained Transformer for Sparse-View Cone-Beam CT Reconstruction", arXiv, 2022 (Zhejiang Lab). [Paper]
K-Space-Transformer: "K-Space Transformer for Fast MRI Reconstruction with Implicit Representation", arXiv, 2022 (Shanghai Jiao Tong University). [Paper][Code (in construction)][Website]
McSTRA: "Multi-head Cascaded Swin Transformers with Attention to k-space Sampling Pattern for Accelerated MRI Reconstruction", arXiv, 2022 (Monash University, Australia). [Paper]
?: "Colonoscopy Landmark Detection using Vision Transformers", arXiv, 2022 (Intuitive Surgical, CA). [Paper]
FedPR: "Learning Federated Visual Prompt in Null Space for MRI Reconstruction", CVPR, 2023 (A*STAR). [Paper][PyTorch]
?: "Contrast, Attend and Diffuse to Decode High-Resolution Images from Brain Activities", NeurIPS, 2023 (KU Leuven). [Paper][PyTorch]
?: "Brain encoding models based on multimodal transformers can transfer across language and vision", NeurIPS, 2023 (UT Austin). [Paper]
MinD-Video: "Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity", NeurIPS, 2023 (NUS). [Paper][PyTorch][Website]
SAX-NeRF: "Structure-Aware Sparse-View X-ray 3D Reconstruction", CVPR, 2024 (JHU). [Paper][PyTorch]

[Back to Overview]

Medical Low-Level Vision

Eformer: "Eformer: Edge Enhancement based Transformer for Medical Image Denoising", ICCV, 2021 (BITS Pilani, India). [Paper]
PTNet: "PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer", arXiv, 2021 (Columbia). [Paper]
ResViT: "ResViT: Residual vision transformers for multi-modal medical image synthesis", arXiv, 2021 (Bilkent University, Turkey). [Paper]
CyTran: "CyTran: Cycle-Consistent Transformers for Non-Contrast to Contrast CT Translation", arXiv, 2021 (University Politehnica of Bucharest, Romania). [Paper][PyTorch]
McMRSR: "Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution", CVPR, 2022 (Yantai University, China). [Paper][PyTorch]
RPLHR-CT: "RPLHR-CT Dataset and Transformer Baseline for Volumetric Super-Resolution from CT Scans", MICCAI, 2022 (Infervision Medical Technology, China). [Paper][Code (in construction)]
W-G2L-ART: "Wide Range MRI Artifact Removal with Transformers", BMVC, 2022 (KTH). [Paper]
RFormer: "RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark", arXiv, 2022 (Tsinghua). [Paper]
CTformer: "CTformer: Convolution-free Token2Token Dilated Vision Transformer for Low-dose CT Denoising", arXiv, 2022 (UMass Lowell). [Paper][PyTorch]
Cohf-T: "Cross-Modality High-Frequency Transformer for MR Image Super-Resolution", arXiv, 2022 (Xidian University). [Paper]
SIST: "Low-Dose CT Denoising via Sinogram Inner-Structure Transformer", arXiv, 2022 (?). [Paper]
Spach-Transformer: "Spach Transformer: Spatial and Channel-wise Transformer Based on Local and Global Self-attentions for PET Image Denoising", arXiv, 2022 (Harvard). [Paper]
ConvFormer: "ConvFormer: Combining CNN and Transformer for Medical Image Segmentation", arXiv, 2022 (University of Notre Dame). [Paper]
?: "Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion using Transformers", ICCV, 2023 (Durham University, UK). [Paper]

[Back to Overview]

Medical Vision-Language

CGT: "Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation", CVPR, 2022 (University of Technology Sydney). [Paper]
MCGN: "A Medical Semantic-Assisted Transformer for Radiographic Report Generation", MICCAI, 2022 (University of Sydney). [Paper]
M3AE: "Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training", MICCAI, 2022 (CUHK). [Paper][PyTorch]
BioViL: "Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing", ECCV, 2022 (Microsoft). [Paper][Code]
MGCA: "Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning", NeurIPS, 2022 (HKU). [Paper]
MedCLIP: "MedCLIP: Contrastive Learning from Unpaired Medical Images and Text", EMNLP, 2022 (UIUC). [Paper][PyTorch]
MDBERT: "Hierarchical BERT for Medical Document Understanding", arXiv, 2022 (IQVIA, NC). [Paper]
Surgical-VQA: "Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer", arXiv, 2022 (NUS). [Paper][PyTorch (in construction)]
SwinMLP-TranCAP: "Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches", arXiv, 2022 (CUHK). [Paper][PyTorch]
SAT: "Medical Image Captioning via Generative Pretrained Transformers", arXiv, 2022 (Philips Innovation Labs Rus, Russia). [Paper]
RepsNet: "RepsNet: Combining Vision with Language for Automated Medical Reports", arXiv, 2022 (Google). [Paper][Website]
MF²-MVQA: "MF²-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering", arXiv, 2022 (University of Science and Technology Beijing). [Paper]
RoentGen: "RoentGen: Vision-Language Foundation Model for Chest X-ray Generation", arXiv, 2022 (Stanford). [Paper]
?: "Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study", ICLR, 2023 (Sichuan University). [Paper]
METransformer: "METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens", CVPR, 2023 (University of Sydney). [Paper]
MI-Zero: "Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images", CVPR, 2023 (Harvard). [Paper]
KiUT: "KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation", CVPR, 2023 (Shanghai AI Lab). [Paper]
BioViL-T: "Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing", CVPR, 2023 (Microsoft). [Paper]
?: "Evidential Interactive Learning for Medical Image Captioning", ICML, 2023 (Rochester Institute of Technology, NY). [Paper]
PRIOR: "PRIOR: Prototype Representation Joint Learning from Medical Images and Reports", ICCV, 2023 (Southern University of Science and Technology). [Paper][Code (in construction)]
MedKLIP: "MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
PTUnifier: "Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts", ICCV, 2023 (CUHK). [Paper][PyTorch]
?: "Localized Questions in Medical Visual Question Answering", MICCAI, 2023 (University of Bern, Switzerland). [Paper]
CXR-CLIP: "CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training", MICCAI, 2023 (Kakao). [Paper]
LLaVA-Med: "LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day", NeurIPS (Datasets and Benchmarks), 2023 (Microsoft). [Paper][PyTorch]
Med-UniC: "Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias", NeurIPS, 2023 (OSU). [Paper][PyTorch]
EHRXQA: "EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images", NeurIPS (Datasets and Benchmarks), 2023 (KAIST). [Paper][Code]
Quilt: "Quilt-1M: One Million Image-Text Pairs for Histopathology", NeurIPS, 2023 (UW). [Paper]
RAMM: "RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training", arXiv, 2023 (Alibaba). [Paper]
PT: "Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models", arXiv, 2023 (University of Amsterdam). [Paper]
PMC-CLIP: "PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
Q2ATransformer: "Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder", arXiv, 2023 (The University of Sydney). [Paper]
PMC-VQA: "PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)][Website]
MedBLIP: "MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
GTGM: "Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation", arXiv, 2023 (USTC). [Paper]
XrayGPT: "XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models", arXiv, 2023 (MBZUAI). [Paper][PyTorch]
CONCH: "Towards a Visual-Language Foundation Model for Computational Pathology", arXiv, 2023 (Harvard). [Paper]
Med-Flamingo: "Med-Flamingo: a Multimodal Medical Few-shot Learner", arXiv, 2023 (Stanford). [Paper][PyTorch]
?: "Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis", arXiv, 2023 (Shanghai AI Lab). [Paper][GitHub]
CLIP-MUSED: "CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding", ICLR, 2024 (CAS). [Paper][PyTorch]
MAVL: "Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Matching Framework", CVPR, 2024 (University of Adelaide). [Paper][PyTorch]
FairCLIP: "FairCLIP: Harnessing Fairness in Vision-Language Learning", CVPR, 2024 (Harvard). [Paper]
RAD-DINO: "RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision", arXiv, 2024 (Microsoft). [Paper]
Med-Gemini: "Advancing Multimodal Medical Capabilities of Gemini", arXiv, 2024 (Google). [Paper]
EVA-X: "EVA-X: A Foundation Model for General Chest X-ray Analysis with Self-supervised Learning", arXiv, 2024 (Huazhong University of Science and Technology). [Paper][PyTorch]

[Back to Overview]

Medical Others

LAT: "Lesion-Aware Transformers for Diabetic Retinopathy Grading", CVPR, 2021 (USTC). [Paper]
UVT: "Ultrasound Video Transformers for Cardiac Ejection Fraction Estimation", MICCAI, 2021 (ICL). [Paper][PyTorch]
?: "Surgical Instruction Generation with Transformers", MICCAI, 2021 (Bournemouth University, UK). [Paper]
AlignTransformer: "AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation", MICCAI, 2021 (Peking University). [Paper]
MCAT: "Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images", ICCV, 2021 (Harvard). [Paper][PyTorch]
?: "Is it Time to Replace CNNs with Transformers for Medical Images?", ICCVW, 2021 (KTH, Sweden). [Paper]
HAT-Net: "HAT-Net: A Hierarchical Transformer Graph Neural Network for Grading of Colorectal Cancer Histology Images", BMVC, 2021 (Beijing University of Posts and Telecommunications). [Paper]
?: "Federated Split Vision Transformer for COVID-19 CXR Diagnosis using Task-Agnostic Training", NeurIPS, 2021 (KAIST). [Paper]
ViT-Path: "Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology", NeurIPSW, 2021 (Microsoft). [Paper]
Global-Local-Transformer: "Global-Local Transformer for Brain Age Estimation", IEEE Transactions on Medical Imaging, 2021 (Harvard). [Paper][PyTorch]
CE-TFE: "Deep Transformers for Fast Small Intestine Grounding in Capsule Endoscope Video", arXiv, 2021 (Sun Yat-Sen University). [Paper]
DeepProg: "DeepProg: A Transformer-based Framework for Predicting Disease Prognosis", arXiv, 2021 (University of Oulu). [Paper]
Medical-Transformer: "Medical Transformer: Universal Brain Encoder for 3D MRI Analysis", arXiv, 2021 (Korea University). [Paper]
RATCHET: "RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting", arXiv, 2021 (ICL). [Paper]
C2FViT: "Affine Medical Image Registration with Coarse-to-Fine Vision Transformer", CVPR, 2022 (HKUST). [Paper][Code (in construction)]
HIPT: "Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning", CVPR, 2022 (Harvard). [Paper]
SiT: "Surface Analysis with Vision Transformers", CVPRW, 2022 (King’s College London, UK). [Paper][PyTorch]
SiT: "Surface Vision Transformers: Attention-Based Modelling applied to Cortical Analysis", Medical Imaging with Deep Learning (MIDL), 2022 (King’s College London, UK). [Paper]
ViT-V-Net: "ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration", ICML, 2022 (JHU). [Paper][PyTorch]
HybridStereoNet: "Deep Laparoscopic Stereo Matching with Transformers", MICCAI, 2022 (Monash University, Australia). [Paper][PyTorch]
BabyNet: "BabyNet: Residual Transformer Module for Birth Weight Prediction on Fetal Ultrasound Video", MICCAI, 2022 (Sano Centre for Computational Medicine, Poland). [Paper][PyTorch]
TLT: "Transformer Lesion Tracker", MICCAI, 2022 (InferVision Medical Technology, China). [Paper]
XMorpher: "XMorpher: Full Transformer for Deformable Medical Image Registration via Cross Attention", MICCAI, 2022 (Southeast University, China). [Paper][PyTorch]
SVoRT: "SVoRT: Iterative Transformer for Slice-to-Volume Registration in Fetal Brain MRI", MICCAI, 2022 (MIT). [Paper]
GaitForeMer: "GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation", MICCAI, 2022 (Stanford). [Paper][PyTorch]
LKU-Net: "U-Net vs Transformer: Is U-Net Outdated in Medical Image Registration?", MICCAIW, 2022 (University of Birmingham, UK). [Paper]
LVOT: "Shifted Windows Transformers for Medical Image Quality Assessment", MICCAIW, 2022 (Istanbul Technical University, Turkey). [Paper]
MINiT: "Multiple Instance Neuroimage Transformer", MICCAIW, 2022 (Stanford). [Paper][Code (in construction)]
BrainNetTF: "Brain Network Transformer", NeurIPS, 2022 (Emory University). [Paper][PyTorch]
SiT: "Surface Vision Transformers: Flexible Attention-Based Modelling of Biomedical Surfaces", arXiv, 2022 (King’s College London, UK). [Paper][PyTorch]
TransMorph: "TransMorph: Transformer for unsupervised medical image registration", arXiv, 2022 (JHU). [Paper]
SymTrans: "Symmetric Transformer-based Network for Unsupervised Image Registration", arXiv, 2022 (Jilin University). [Paper]
MMT: "One Model to Synthesize Them All: Multi-contrast Multi-scale Transformer for Missing Data Imputation", arXiv, 2022 (JHU). [Paper]
EG-ViT: "Eye-gaze-guided Vision Transformer for Rectifying Shortcut Learning", arXiv, 2022 (Northwestern Polytechnical University). [Paper]
CSM: "Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection", arXiv, 2022 (University of Adelaide, Australia). [Paper]
CASHformer: "CASHformer: Cognition Aware SHape Transformer for Longitudinal Analysis", arXiv, 2022 (TUM). [Paper]
ARST: "ARST: Auto-Regressive Surgical Transformer for Phase Recognition from Laparoscopic Videos", arXiv, 2022 (Shanghai Jiao Tong University). [Paper]
SSiT: "SSiT: Saliency-guided Self-supervised Image Transformer for Diabetic Retinopathy Grading", arXiv, 2022 (Southern University of Science and Technology, China). [Paper][Code (in construction)]
MulGT: "MulGT: Multi-task Graph-Transformer with Task-aware Knowledge Injection and Domain Knowledge-driven Pooling for Whole Slide Image Analysis", AAAI, 2023 (HKU). [Paper]
HVTSurv: "HVTSurv: Hierarchical Vision Transformer for Patient-Level Survival Prediction from Whole Slide Image", AAAI, 2023 (Tsinghua). [Paper][PyTorch]
AMIGO: "AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images", CVPR, 2023 (UBC). [Paper]
ACAT: "ACAT: Adversarial Counterfactual Attention for Classification and Detection in Medical Imaging", ICML, 2023 (University of Edinburgh, UK). [Paper]
ConSlide: "ConSlide: Asynchronous Hierarchical Interaction Transformer with Breakup-Reorganize Rehearsal for Continual Whole Slide Image Analysis", ICCV, 2023 (HKU). [Paper]
MOTCat: "Multimodal Optimal Transport-based Co-Attention Transformer with Global Structure Consistency for Survival Prediction", ICCV, 2023 (HKUST). [Paper][PyTorch]
ViT-DAE: "ViT-DAE: Transformer-driven Diffusion Autoencoder for Histopathology Image Analysis", arXiv, 2023 (Stony Brook). [Paper]
LoRKD: "Low-Rank Knowledge Decomposition for Medical Foundation Models", CVPR, 2024 (SJTU). [Paper][Code (in construction)]

[Back to Overview]

Other Tasks

Active Learning:
- TJLS: "Visual Transformer for Task-aware Active Learning", arXiv, 2021 (ICL). [Paper][PyTorch]
Agriculture:
- PlantXViT: "Explainable vision transformer enabled convolutional neural network for plant disease identification: PlantXViT", arXiv, 2022 (Indian Institute of Information Technology). [Paper]
- MMST-ViT: "MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer", ICCV, 2023 (University of Delaware, Delaware). [Paper][PyTorch]
Aesthetic:
- CSKD: "CLIP Brings Better Features to Visual Aesthetics Learners", arXiv, 2023 (OPPO). [Paper]
- AesBench: "AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception", arXiv, 2024 (Xidian University). [Paper][GitHub]
Animation-related:
- AnT: "The Animation Transformer: Visual Correspondence via Segment Matching", ICCV, 2021 (Cadmium). [Paper]
- AniFormer: "AniFormer: Data-driven 3D Animation with Transformer", BMVC, 2021 (University of Oulu, Finland). [Paper][PyTorch]
Bird's Eye View (BEV):
- ViT-BEVSeg: "ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation", IJCNN, 2022 (Maynooth University, Ireland). [Paper][Code (in construction)]
- BEVFormer: "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers", ECCV, 2022 (Shanghai AI Lab). [Paper][PyTorch]
- CoBEVT: "CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers", CoRL, 2022 (UCLA). [Paper][PyTorch]
- GKT: "Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer", arXiv, 2022 (Huazhong University of Science and Technology). [Paper][Code (in construction)]
- BEVSegFormer: "BEVSegFormer: Bird's Eye View Semantic Segmentation From Arbitrary Camera Rigs", WACV, 2023 (Nullmax, China). [Paper]
- BEVDistill: "BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection", ICLR, 2023 (USTC). [Paper][Code (in construction)]
- BEVFormer-v2: "BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision", CVPR, 2023 (Tsinghua University). [Paper]
- BEV-SAN: "BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks", CVPR, 2023 (Peking University). [Paper]
- BEVGuide: "BEV-Guided Multi-Modality Fusion for Driving Perception", CVPR, 2023 (UIUC). [Paper][Code (in construction)][Website]
- FB-OCC: "FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation", CVPRW, 2023 (NVIDIA). [Paper][Code (in construction)]
- FB-BEV: "FB-BEV: BEV Representation from Forward-Backward View Transformations", ICCV, 2023 (NVIDIA). [Paper][Code (in construction)]
- BEV-DG: "BEV-DG: Cross-Modal Learning under Bird's-Eye View for Domain Generalization of 3D Semantic Segmentation", ICCV, 2023 (Xiamen University). [Paper]
- UniTR: "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation", ICCV, 2023 (Peking). [Paper][PyTorch]
- SparseBEV: "SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos", ICCV, 2023 (Nanjing University). [Paper][Code (in construction)]
- OCBEV: "OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection", arXiv, 2023 (Shanghai AI Lab). [Paper]
- FusionFormer: "FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection", arXiv, 2023 (Cainiao Network, China). [Paper]
- Talk2BEV: "Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving", arXiv, 2023 (IIIT Hyderabad). [Paper][Code][Website]
- SparseOcc: "Fully Sparse 3D Panoptic Occupancy Prediction", arXiv, 2023 (Shanghai AI Lab). [Paper]
- CLIP-BEVFormer: "CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow", CVPR, 2024 (Bosch). [Paper]
- TaDe: "Improving Bird's Eye View Semantic Segmentation by Task Decomposition", CVPR, 2024 (Wuhan University). [Paper][Code (in construction)]
Biology:
- ?: "A State-of-the-art Survey of Object Detection Techniques in Microorganism Image Analysis: from Traditional Image Processing and Classical Machine Learning to Current Deep Convolutional Neural Networks and Potential Visual Transformers", arXiv, 2021 (Northeastern University). [Paper]
Brain Score:
- CrossViT: "Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4", CVPRW, 2022 (MIT). [Paper][PyTorch]
Camera-related:
- CTRL-C: "CTRL-C: Camera calibration TRansformer with Line-Classification", ICCV, 2021 (Kakao + Kookmin University). [Paper][PyTorch]
- MS-Transformer: "Learning Multi-Scene Absolute Pose Regression with Transformers", ICCV, 2021 (Bar-Ilan University, Israel). [Paper][PyTorch]
- GTCaR: "GTCaR: Graph Transformer for Camera Re-localization", ECCV, 2022 (Magic Leap). [Paper]
- ?: "Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer", ICCV, 2023 (ANU). [Paper]
Change Detection:
- MapFormer: "MapFormer: Boosting Change Detection by Using Pre-change Information", ICCV, 2023 (LMU Munich). [Paper][PyTorch]
Character/Text Recognition:
- BTTR: "Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer", arXiv, 2021 (Peking). [Paper]
- TrOCR: "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models", arXiv, 2021 (Microsoft). [Paper][PyTorch]
- ?: "Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks", arXiv, 2021 (Salesforce). [Paper]
- T³: "TrueType Transformer: Character and Font Style Recognition in Outline Format", Document Analysis Systems (DAS), 2022 (Kyushu University). [Paper]
- ?: "Transformer-based HTR for Historical Documents", ComHum, 2022 (University of Zurich, Switzerland). [Paper]
- ?: "SVG Vector Font Generation for Chinese Characters with Transformer", ICIP, 2022 (The University of Tokyo). [Paper]
- LP-Transformer: "Forensic License Plate Recognition with Compression-Informed Transformers", ICIP, 2022 (University of Erlangen-Nurnberg, Germany). [Paper]
- CoMER: "CoMER: Modeling Coverage for Transformer-based Handwritten Mathematical Expression Recognition", ECCV, 2022 (Peking University). [Paper][PyTorch]
- MATRN: "Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features", ECCV, 2022 (KAIST). [Paper][PyTorch]
- CONSENT: "CONSENT: Context Sensitive Transformer for Bold Words Classification", arXiv, 2022 (Amazon). [Paper]
- DeepVecFont-v2: "DeepVecFont-v2: Exploiting Transformers to Synthesize Vector Fonts with Higher Quality", CVPR, 2023 (Peking University). [Paper][Code (in construction)]
- SVGformer: "SVGformer: Representation Learning for Continuous Vector Graphics Using Transformers", CVPR, 2023 (Adobe). [Paper]
- SIGA: "Self-supervised Implicit Glyph Attention for Text Recognition", CVPR, 2023 (Shanghai Jiao Tong). [Paper]
- LISTER: "LISTER: Neighbor Decoding for Length-Insensitive Scene Text Recognition", ICCV, 2023 (Alibaba). [Paper][PyTorch]
- CCR-CLIP: "Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning", ICCV, 2023 (Fudan). [Paper][PyTorch]
- CLIPTER: "CLIPTER: Looking at the Bigger Picture in Scene Text Recognition", ICCV, 2023 (Amazon). [Paper]
- CLIP4STR: "CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model", arXiv, 2023 (Zhejiang University). [Paper]
Curriculum Learning:
- SSTN: "Spatial Transformer Networks for Curriculum Learning", arXiv, 2021 (TU Kaiserslautern, Germany). [Paper]
Defect Classification:
- MSHViT: "Multi-Scale Hybrid Vision Transformer and Sinkhorn Tokenizer for Sewer Defect Classification", CVPRW, 2022 (Aalborg University, Denmark). [Paper]
- DefT: "Defect Transformer: An Efficient Hybrid Transformer Architecture for Surface Defect Detection", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper]
Digital Holography:
- ?: "Convolutional Neural Network (CNN) vs Visual Transformer (ViT) for Digital Holography", ICCCR, 2022 (UBFC, France). [Paper]
Disentangled representation:
- VCT: "Visual Concepts Tokenization", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
E-Commerce:
- WebShop: "WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents", NeurIPS, 2022 (Princeton). [Paper][PyTorch][Website]
- ECLIP: "Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce", CVPR, 2023 (ByteDance). [Paper]
Event data:
- EvT: "Event Transformer: A sparse-aware solution for efficient event data processing", arXiv, 2022 (Universidad de Zaragoza, Spain). [Paper][PyTorch]
- ETB: "Event Transformer", arXiv, 2022 (Nanjing University). [Paper]
- RVT: "Recurrent Vision Transformers for Object Detection with Event Cameras", CVPR, 2023 (University of Zurich). [Paper]
- Eventful-Transformer: "Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers", ICCV, 2023 (UW Madison). [Paper][PyTorch][Website]
- GET: "GET: Group Event Transformer for Event-Based Vision", ICCV, 2023 (USTC). [Paper][PyTorch]
- ?: "Cross-modal Orthogonal High-rank Augmentation for RGB-Event Transformer-trackers", ICCV, 2023 (CUHK). [Paper][PyTorch]
- SODFormer: "SODFormer: Streaming Object Detection with Transformer Using Events and Frames", TPAMI, 2023 (Peking). [Paper][PyTorch]
- EventSAM: "Segment Any Events via Weighted Adaptation of Pivotal Tokens", arXiv, 2023 (Xidian University). [Paper][PyTorch (in construction)]
Fashion:
- Kaleido-BERT: "Kaleido-BERT: Vision-Language Pre-training on Fashion Domain", CVPR, 2021 (Alibaba). [Paper][Tensorflow]
- CIT: "Cloth Interactive Transformer for Virtual Try-On", arXiv, 2021 (University of Trento). [Paper][Code (in construction)]
- ClothFormer: "ClothFormer: Taming Video Virtual Try-on in All Module", CVPR, 2022 (iQIYI). [Paper][Website]
- FashionVLP: "FashionVLP: Vision Language Transformer for Fashion Retrieval With Feedback", CVPR, 2022 (Amazon). [Paper]
- FashionViL: "FashionViL: Fashion-Focused Vision-and-Language Representation Learning", ECCV, 2022 (University of Surrey, UK). [Paper][PyTorch]
- OutfitTransformer: "OutfitTransformer: Learning Outfit Representations for Fashion Recommendation", arXiv, 2022 (Amazon). [Paper]
- FaD-VLP: "FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning", EMNLP, 2022 (Meta). [Paper]
- Fashionformer: "Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition", ECCV, 2022 (Peking). [Paper][PyTorch]
- FAME-ViL: "FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks", CVPR, 2023 (University of Surrey). [Paper][PyTorch]
- FashionSAP: "FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training", CVPR, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
- OpenFashionCLIP: "OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data", ICIAP, 2023 (UniMoRE, Italy). [Paper][PyTorch]
- MVLT: "Masked Vision-Language Transformer in Fashion", Machine Intelligence Research, 2023 (Alibaba). [Paper][PyTorch]
- UniDiff: "UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning", arXiv, 2023 (Sun Yat-Sen University). [Paper]
Feature Matching:
- SuperGlue: "SuperGlue: Learning Feature Matching with Graph Neural Networks", CVPR, 2020 (Magic Leap). [Paper][PyTorch]
- LoFTR: "LoFTR: Detector-Free Local Feature Matching with Transformers", CVPR, 2021 (Zhejiang University). [Paper][PyTorch][Website]
- COTR: "COTR: Correspondence Transformer for Matching Across Images", ICCV, 2021 (UBC). [Paper]
- CATs: "CATs: Cost Aggregation Transformers for Visual Correspondence", NeurIPS, 2021 (Yonsei University + Korea University). [Paper][PyTorch][Website]
- TransforMatcher: "TransforMatcher: Match-to-Match Attention for Semantic Correspondence", CVPR, 2022 (POSTECH). [Paper]
- ASpanFormer: "ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer", ECCV, 2022 (HKUST). [Paper][Website]
- CATs++: "CATs++: Boosting Cost Aggregation with Convolutions and Transformers", arXiv, 2022 (Korea University). [Paper]
- LoFTR-TensorRT: "Local Feature Matching with Transformers for low-end devices", arXiv, 2022 (?). [Paper][PyTorch]
- MatchFormer: "MatchFormer: Interleaving Attention in Transformers for Feature Matching", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
- OpenGlue: "OpenGlue: Open Source Graph Neural Net Based Pipeline for Image Matching", arXiv, 2022 (Ukrainian Catholic University). [Paper][PyTorch]
- ParaFormer: "ParaFormer: Parallel Attention Transformer for Efficient Feature Matching", AAAI, 2023 (Southeast University, China). [Paper]
- ASTR: "Adaptive Spot-Guided Transformer for Consistent Local Feature Matching", CVPR, 2023 (USTC). [Paper][Website]
- ACTR: "Correspondence Transformers with Asymmetric Feature Learning and Matching Flow Super-Resolution", CVPR, 2023 (Fudan). [Paper][Code (in construction)]
- D²Former: "D²Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-based Transformers", CVPR, 2023 (USTC). [Paper]
- PMatch: "PMatch: Paired Masked Image Modeling for Dense Geometric Matching", CVPR, 2023 (Michigan State). [Paper][Code (in construction)]
- 2D3D-MATR: "2D3D-MATR: 2D-3D Matching Transformer for Detection-free Registration between Images and Point Clouds", ICCV, 2023 (National University of Defense Technology, China). [Paper][PyTorch (in construction)]
- CasMTR: "Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints", ICCV, 2023 (Fudan). [Paper][PyTorch]
- Fuse-ViT: "A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence", NeurIPS, 2023 (Google). [Paper][Website]
- Diffusion-Hyperfeature: "Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence", NeurIPS, 2023 (Berkeley). [Paper][PyTorch][Website]
- LDM-correspondence: "Unsupervised Semantic Correspondence Using Stable Diffusion", NeurIPS, 2023 (UBC). [Paper][PyTorch][Website]
- VSFormer: "VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning", AAAI, 2024 (Wenzhou University). [Paper][Code (in construction)]
- Efficient-LoFTR: "Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed", CVPR, 2024 (Zhejiang). [Paper][Code (in construction)][Website]
- OmniGlue: "OmniGlue: Generalizable Feature Matching with Foundation Model Guidance", CVPR, 2024 (Google). [Paper][Tensorflow][Website]
Fine-grained:
- ViT-FGVC: "Exploring Vision Transformers for Fine-grained Classification", CVPRW, 2021 (Universidad de Valladolid). [Paper]
- FFVT: "Feature Fusion Vision Transformer for Fine-Grained Visual Categorization", BMVC, 2021 (Griffith University, Australia). [Paper][PyTorch]
- TPSKG: "Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition", arXiv, 2021 (Beihang University). [Paper]
- AFTrans: "A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition", arXiv, 2021 (Peking University). [Paper]
- TransFG: "TransFG: A Transformer Architecture for Fine-grained Recognition", AAAI, 2022 (Johns Hopkins). [Paper][PyTorch]
- DynamicMLP: "Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information", CVPR, 2022 (Megvii). [Paper][PyTorch]
- SIM-Trans: "SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization", ACMMM, 2022 (Peking University). [Paper][PyTorch]
- MetaFormer: "MetaFormer: A Unified Meta Framework for Fine-Grained Recognition", arXiv, 2022 (ByteDance). [Paper][PyTorch]
- ViT-FOD: "ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator", arXiv, 2022 (Shandong University). [Paper]
- PLEor: "Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator", CVPR, 2023 (Dalian University of Technology). [Paper]
- MultitaskVLFM: "Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks", arXiv, 2023 (Conservatoire National des Arts et Métiers (CEDRIC) France). [Paper][PyTorch]
- M2Former: "M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition", arXiv, 2023 (Dongguk University, Korea). [Paper]
- MP-FGVC: "Delving into Multimodal Prompting for Fine-grained Visual Classification", arXiv, 2023 (Nanjing University of Science and Technology). [Paper]
- HGCLIP: "HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding", arXiv, 2023 (Monash). [Paper][PyTorch]
- FineR: "Democratizing Fine-grained Visual Recognition with Large Language Models", ICLR, 2024 (University of Trento). [Paper][Code (in construction)][Website]
- Finer: "Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models", arXiv, 2024 (UIUC). [Paper]
Gait:
- Gait-TR: "Spatial Transformer Network on Skeleton-based Gait Recognition", arXiv, 2022 (South China University of Technology). [Paper]
- MMGaitFormer: "Multi-Modal Gait Recognition via Effective Spatial-Temporal Feature Fusion", CVPR, 2023 (Beihang University). [Paper]
Gaze:
- GazeTR: "Gaze Estimation using Transformer", arXiv, 2021 (Beihang University). [Paper][PyTorch]
- HGTTR: "End-to-End Human-Gaze-Target Detection with Transformers", CVPR, 2022 (Shanghai Jiao Tong). [Paper]
- MGTR: "MGTR: End-to-End Mutual Gaze Detection with Transformer", ACCV, 2022 (Nankai University). [Paper][PyTorch]
- GLC: "In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation", arXiv, 2022 (Georgia Tech). [Paper][Website]
- Gazeformer: "Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention", CVPR, 2023 (Stony Brook). [Paper][PyTorch]
- Sharingan: "Sharingan: A Transformer-based Architecture for Gaze Following", arXiv, 2023 (EPFL). [Paper]
- TransGOP: "TransGOP: Transformer-Based Gaze Object Prediction", AAAI, 2024 (Xi'an University of Architecture and Technology). [Paper]
- CLIP-Gaze: "CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model", AAAI, 2024 (Hikvision). [Paper]
- IG: "Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition", CVPR, 2024 (Sun Yat-sen University). [Paper][Website]
Geo-Localization:
- EgoTR: "Cross-view Geo-localization with Evolving Transformer", arXiv, 2021 (Shenzhen University). [Paper]
- TransGeo: "TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization", CVPR, 2022 (UCF). [Paper][PyTorch]
- GAMa: "GAMa: Cross-view Video Geo-localization", ECCV, 2022 (UCF). [Paper][Code (in construction)]
- TransLocator: "Where in the World is this Image? Transformer-based Geo-localization in the Wild", ECCV, 2022 (JHU). [Paper]
- TransGCNN: "Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization", arXiv, 2022 (Southeast University, China). [Paper]
- MGTL: "Mutual Generative Transformer Learning for Cross-view Geo-localization", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
- GeoGuessNet: "Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes", CVPR, 2023 (UCF). [Paper][PyTorch (in construction)]
- GeoCLIP: "GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization", NeurIPS, 2023 (UCF). [Paper]
Homography Estimation:
- LocalTrans: "LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation", ICCV, 2021 (Tsinghua). [Paper]
Image Registration:
- AiR: "Attention for Image Registration (AiR): an unsupervised Transformer approach", arXiv, 2021 (INRIA). [Paper]
Image Retrieval:
- RRT: "Instance-level Image Retrieval using Reranking Transformers", ICCV, 2021 (University of Virginia). [Paper][PyTorch]
- SwinFGHash: "SwinFGHash: Fine-grained Image Retrieval via Transformer-based Hashing Network", BMVC, 2021 (Tsinghua). [Paper]
- ViT-Retrieval: "Investigating the Vision Transformer Model for Image Retrieval Tasks", arXiv, 2021 (Democritus University of Thrace). [Paper]
- IRT: "Training Vision Transformers for Image Retrieval", arXiv, 2021 (Facebook + INRIA). [Paper]
- TransHash: "TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval", arXiv, 2021 (Shanghai Jiao Tong University). [Paper]
- VTS: "Vision Transformer Hashing for Image Retrieval", arXiv, 2021 (IIIT-Allahabad). [Paper]
- GTZSR: "Zero-Shot Sketch Based Image Retrieval using Graph Transformer", arXiv, 2022 (IIT Bombay). [Paper]
- EViT: "EViT: Privacy-Preserving Image Retrieval via Encrypted Vision Transformer in Cloud Computing", arXiv, 2022 (Jinan University). [Paper][PyTorch (in construction)]
- ?: "Transformers and CNNs both Beat Humans on SBIR", arXiv, 2022 (University of Mons, Belgium). [Paper]
- DToP: "Boosting vision transformers for image retrieval", WACV, 2023 (Dealicious, Korea). [Paper][Code (in construction)]
- ?: "A Light Touch Approach to Teaching Transformers Multi-view Geometry", CVPR, 2023 (Oxford). [Paper]
- IRGen: "IRGen: Generative Modeling for Image Retrieval", arXiv, 2023 (Microsoft). [Paper]
- CIReVL: "Vision-by-Language for Training-Free Compositional Image Retrieval", arXiv, 2023 (University of Tübingen, Germany). [Paper]
Layout Generation:
- VTN: "Variational Transformer Networks for Layout Generation", CVPR, 2021 (Google). [Paper]
- LayoutTransformer: "LayoutTransformer: Scene Layout Generation With Conceptual and Spatial Diversity", CVPR, 2021 (NTU). [Paper][PyTorch]
- LayoutTransformer: "LayoutTransformer: Layout Generation and Completion with Self-attention", ICCV, 2021 (Amazon). [Paper][Website]
- LGT-Net: "LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
- CADTransformer: "CADTransformer: Panoptic Symbol Spotting Transformer for CAD Drawings", CVPR, 2022 (UT Austin). [Paper]
- GAT-CADNet: "GAT-CADNet: Graph Attention Network for Panoptic Symbol Spotting in CAD Drawings", CVPR, 2022 (TUM + Alibaba). [Paper]
- LayoutBERT: "LayoutBERT: Masked Language Layout Model for Object Insertion", CVPRW, 2022 (Adobe). [Paper]
- ICVT: "Geometry Aligned Variational Transformer for Image-conditioned Layout Generation", ACMMM, 2022 (Alibaba). [Paper]
- BLT: "BLT: Bidirectional Layout Transformer for Controllable Layout Generation", ECCV, 2022 (Google). [Paper][Tensorflow][Website]
- ATEK: "ATEK: Augmenting Transformers with Expert Knowledge for Indoor Layout Synthesis", arXiv, 2022 (New Jersey Institute of Technology). [Paper]
- ?: "Extreme Floorplan Reconstruction by Structure-Hallucinating Transformer Cascades", arXiv, 2022 (Simon Fraser). [Paper]
- LayoutFormer++: "LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction", CVPR, 2023 (Microsoft). [Paper]
- RoomFormer: "Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries", CVPR, 2023 (ETH Zurich). [Paper][PyTorch][Website]
- LayoutDM: "LayoutDM: Transformer-based Diffusion Model for Layout Generation", CVPR, 2023 (USTC). [Paper]
- DLT: "DLT: Conditioned layout generation with Joint Discrete-Continuous Diffusion Layout Transformer", ICCV, 2023 (Wix.com). [Paper]
Livestock Monitoring:
- STARFormer: "Livestock Monitoring with Transformer", BMVC, 2021 (IIT Dhanbad). [Paper]
Metric Learning:
- Hyp-ViT: "Hyperbolic Vision Transformers: Combining Improvements in Metric Learning", CVPR, 2022 (University of Trento, Italy). [Paper][PyTorch]
- BGFormer: "Rethinking Batch Sample Relationships for Data Representation: A Batch-Graph Transformer based Approach", arXiv, 2022 (Anhui University). [Paper]
- ?: "Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning", CVPR, 2023 (LMU Munich). [Paper]
Multi-Input:
- MixViT: "Adapting Multi-Input Multi-Output schemes to Vision Transformers", CVPRW, 2022 (Sorbonne Universite, France). [Paper]
Multi-label:
- C-Tran: "General Multi-label Image Classification with Transformers", CVPR, 2021 (University of Virginia). [Paper]
- TDRG: "Transformer-Based Dual Relation Graph for Multi-Label Image Recognition", ICCV, 2021 (Tencent). [Paper]
- MlTr: "MlTr: Multi-label Classification with Transformer", arXiv, 2021 (KuaiShou). [Paper]
- GATN: "Graph Attention Transformer Network for Multi-Label Image Classification", arXiv, 2022 (Southeast University, China). [Paper]
- CDUL: "CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification", ICCV, 2023 (University of Southern Mississippi, Mississippi). [Paper]
- TagCLIP: "TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training", AAAI, 2024 (Zhejiang). [Paper][PyTorch]
Multi-task:
- MulT: "MulT: An End-to-End Multitask Learning Transformer", CVPR, 2022 (EPFL). [Paper]
- UFO: "UFO: Unified Feature Optimization", ECCV, 2022 (Baidu). [Paper][PaddlePaddle]
- Painter: "Images Speak in Images: A Generalist Painter for In-Context Visual Learning", CVPR, 2023 (BAAI). [Paper][Code (in construction)]
- MTLoRA: "MTLoRA: A Low-Rank Adaptation Approach for Efficient Multi-Task Learning", CVPR, 2024 (Brown). [Paper][PyTorch]
Open Set:
- OSR-ViT: "Open Set Recognition using Vision Transformer with an Additional Detection Head", arXiv, 2022 (Vanderbilt University, Tennessee). [Paper]
Operator Learning for PDEs:
- Galerkin Transformer: "Choose a Transformer: Fourier or Galerkin", NeurIPS, 2021 (Washington University, St. Louis). [Paper][PyTorch]
- Coupled Attention: "Learning operators with coupled attention", JMLR, 2022 (University of Pennsylvania). [Paper]
- HT-Net: "HT-Net: Hierarchical Transformer based Operator Learning Model for Multiscale PDEs", arXiv, 2022 (KAUST). [Paper]
- Relative-PE: "Transformer for Partial Differential Equations' Operator Learning", arXiv, 2022 (CMU). [Paper]
Out-Of-Distribution (OOD):
- OODformer: "OODformer: Out-Of-Distribution Detection Transformer", BMVC, 2021 (LMU Munich). [Paper][PyTorch]
- MCM: "Delving into Out-of-Distribution Detection with Vision-Language Representations", NeurIPS, 2022 (UW-Madison). [Paper]
- MOOD: "Rethinking Out-of-distribution (OOD) Detection: Masked Image Modeling is All You Need", CVPR, 2023 (CUHK). [Paper][PyTorch]
- ?: "Masked Images Are Counterfactual Samples for Robust Fine-tuning", CVPR, 2023 (Sun Yat-sen University). [Paper][PyTorch]
- CLIPood: "CLIPood: Generalizing CLIP to Out-of-Distributions", ICML, 2023 (Tsinghua). [Paper]
- CLIPN: "CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No", ICCV, 2023 (HKUST). [Paper][PyTorch]
- ?: "Distilling Large Vision-Language Model with Out-of-Distribution Generalizability", ICCV, 2023 (UCSD). [Paper][PyTorch]
- DREAM-OOD: "Dream the Impossible: Outlier Imagination with Diffusion Models", NeurIPS, 2023 (UW-Madison). [Paper]
- LoCoOp: "LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning", NeurIPS, 2023 (The University of Tokyo). [Paper][PyTorch]
- ?: "A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)", NeurIPS, 2023 (ANU). [Paper]
- GL-MCM: "Zero-Shot In-Distribution Detection in Multi-Object Settings Using Vision-Language Foundation Models", arXiv, 2023 (The University of Tokyo). [Paper]
- CLIP-OOD: "Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?", arXiv, 2023 (University of Tübingen). [Paper]
- MOODv2: "MOODv2: Masked Image Modeling for Out-of-Distribution Detection", arXiv, 2024 (CUHK). [Paper]
- AutoFT: "AutoFT: Robust Fine-Tuning by Optimizing Hyperparameters on OOD Data", arXiv, 2024 (Stanford). [Paper]
Pedestrian Intention:
- IntFormer: "IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture", arXiv, 2021 (Universidad de Alcala). [Paper]
Physics Simulation:
- TIE: "Transformer with Implicit Edges for Particle-based Physics Simulation", ECCV, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
Place Recognition:
- SVT-Net: "SVT-Net: A Super Light-Weight Network for Large Scale Place Recognition using Sparse Voxel Transformers", AAAI, 2022 (Renmin University of China). [Paper]
- TransVPR: "TransVPR: Transformer-based place recognition with multi-level attention aggregation", CVPR, 2022 (Xi'an Jiaotong). [Paper]
- OverlapTransformer: "OverlapTransformer: An Efficient and Rotation-Invariant Transformer Network for LiDAR-Based Place Recognition", IROS, 2022 (HAOMO.AI, China). [Paper][PyTorch]
- SeqOT: "SeqOT: A Spatial-Temporal Transformer Network for Place Recognition Using Sequential LiDAR Data", arXiv, 2022 (National University of Defense Technology, China). [Paper][PyTorch]
- R²Former: "R²Former: Unified Retrieval and Reranking Transformer for Place Recognition", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
- BoQ: "BoQ: A Place is Worth a Bag of Learnable Queries", CVPR, 2024 (Universite Laval, Canada). [Paper][Code (in construction)]
Remote Sensing/Hyperspectral/Satellite:
- DCFAM: "Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images", arXiv, 2021 (Wuhan University). [Paper]
- WiCNet: "Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images", arXiv, 2021 (University of Trento). [Paper]
- ?: "Vision Transformers For Weeds and Crops Classification Of High Resolution UAV Images", arXiv, 2021 (University of Orleans, France). [Paper]
- Satellite-ViT: "Manipulation Detection in Satellite Images Using Vision Transformer", arXiv, 2021 (Purdue). [Paper]
- ?: "Self-supervised Vision Transformers for Joint SAR-optical Representation Learning", IGARSS, 2022 (German Aerospace Center). [Paper]
- VBFusion: "Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing", SPIE Remote Sensing, 2022 (Technische Universitat Berlin, Germany). [Paper][PyTorch]
- SatMAE: "SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery", NeurIPS, 2022 (Stanford). [Paper]
- ANDT: "Anomaly Detection in Aerial Videos with Transformers", IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2022 (TUM). [Paper]
- RNGDet: "RNGDet: Road Network Graph Detection by Transformer in Aerial Images", arXiv, 2022 (HKUST). [Paper]
- FSRA: "A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization", arXiv, 2022 (China Jiliang University). [Paper][PyTorch]
- ?: "Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification", arXiv, 2022 (Shenzhen University). [Paper]
- ?: "Deep Hyperspectral Unmixing using Transformer Network", arXiv, 2022 (Jalpaiguri Engineering College, India). [Paper]
- SiamixFormer: "SiamixFormer: A Siamese Transformer Network For Building Detection And Change Detection From Bi-Temporal Remote Sensing Images", arXiv, 2022 (Tarbiat Modares University, Iran). [Paper]
- DAHiTrA: "DAHiTrA: Damage Assessment Using a Novel Hierarchical Transformer Architecture", arXiv, 2022 (Simon Fraser University, Canada). [Paper]
- RVSA: "Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model", arXiv, 2022 (Wuhan University + The University of Sydney). [Paper]
- SatViT: "Transfer Learning with Pretrained Remote Sensing Transformers", arXiv, 2022 (?). [Paper][PyTorch]
- FTN: "Fully Transformer Network for Change Detection of Remote Sensing Images", arXiv, 2022 (Dalian University of Technology). [Paper]
- MCTNet: "MCTNet: A Multi-Scale CNN-Transformer Network for Change Detection in Optical Remote Sensing Images", arXiv, 2022 (Tsinghua University). [Paper]
- ?: "Transformers For Recognition In Overhead Imagery: A Reality Check", arXiv, 2022 (Duke University). [Paper]
- TSViT: "ViTs for SITS: Vision Transformers for Satellite Image Time Series", CVPR, 2023 (ICL). [Paper][PyTorch]
- MethaneMapper: "MethaneMapper: Spectral Absorption aware Hyperspectral Transformer for Methane Detection", CVPR, 2023 (UCSB). [Paper]
- GFM: "Towards Geospatial Foundation Models via Continual Pretraining", ICCV, 2023 (Amazon). [Paper][PyTorch (in construction)]
- Scale-MAE: "Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning", ICCV, 2023 (Berkeley). [Paper]
- SAMRS: "SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model", NeurIPS (Datasets and Benchmarks), 2023 (iFlytek, China). [Paper][PyTorch]
- RS5M: "RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model", arXiv, 2023 (Zhejiang University). [Paper][Code (in construction)]
- RSGPT: "RSGPT: A Remote Sensing Vision Language Model and Benchmark", arXiv, 2023 (Alibaba). [Paper][Code (in construction)]
- EarthGPT: "EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain", arXiv, 2024 (Beijing Institute of Technology). [Paper]
- AnyChange: "Segment Any Change", arXiv, 2024 (Stanford). [Paper]
- MMEarth: "MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning", arXiv, 2024 (University of Copenhagen, Denmark). [Paper][PyTorch][Dataset][Website]
Robotics:
- TF-Grasp: "When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection", arXiv, 2022 (University of Science and Technology of China). [Paper][Code (in construction)]
- BeT: "Behavior Transformers: Cloning k modes with one stone", arXiv, 2022 (NYU). [Paper][PyTorch]
- Perceiver-Actor: "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation", Conference on Robot Learning (CoRL), 2022 (NVIDIA). [Paper][Website]
- PACT: "PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training", arXiv, 2022 (Microsoft). [Paper]
- ?: "A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers", arXiv, 2022 (University of Groningen, The Netherlands). [Paper]
- ?: "Grounding Language with Visual Affordances over Unstructured Data", arXiv, 2022 (University of Freiburg, Germany). [Paper][Website]
- ?: "Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation", ICLR, 2023 (DeepMind). [Paper]
- LOCATE: "LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding", CVPR, 2023 (University of Edinburgh, UK). [Paper][PyTorch][Website]
- Afformer: "Affordance Grounding from Demonstration Video to Target Image", CVPR, 2023 (NUS). [Paper][PyTorch]
- MV-MWM: "Multi-View Masked World Models for Visual Robotic Manipulation", ICML, 2023 (KAIST). [Paper][Tensorflow2][Website]
- MTM: "Masked Trajectory Models for Prediction, Representation, and Control", ICML, 2023 (Meta). [Paper][PyTorch][Website]
- Skill-Transformer: "Skill Transformer: A Monolithic Policy for Mobile Manipulation", ICCV, 2023 (Georgia Tech). [Paper]
- RUPs: "Nonrigid Object Contact Estimation With Regional Unwrapping Transformer", ICCV, 2023 (Southeast University, China). [Paper]
- IAG: "Grounding 3D Object Affordance from 2D Interactions in Images", ICCV, 2023 (USTC). [Paper][Website][PyTorch]
- RVT: "RVT: Robotic View Transformer for 3D Object Manipulation", CoRL, 2023 (NVIDIA). [Paper][PyTorch][Website]
- M2T2: "M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place", CoRL, 2023 (NVIDIA). [Paper][PyTorch][Website]
- ?: "Humanoid Locomotion as Next Token Prediction", arXiv, 2024 (Berkeley). [Paper]
Scene Decomposition:
- SRT: "Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations", CVPR, 2022 (Google). [Paper][PyTorch (stelzner)][Website]
- OSRT: "Object Scene Representation Transformer", NeurIPS, 2022 (Google). [Paper][Website]
- Prompter: "Prompter: Utilizing Large Language Model Prompting for a Data Efficient Embodied Instruction Following", arXiv, 2022 (Hitachi). [Paper]
- RePAST: "RePAST: Relative Pose Attention Scene Representation Transformer", arXiv, 2023 (Google). [Paper]
- GTA: "GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers", arXiv, 2023 (University of Tübingen). [Paper]
Scene Text Recognition:
- ViTSTR: "Vision Transformer for Fast and Efficient Scene Text Recognition", ICDAR, 2021 (University of the Philippines). [Paper]
- STKM: "Self-attention based Text Knowledge Mining for Text Detection", CVPR, 2021 (?). [Paper][Code (in construction)]
- I2C2W: "I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition", arXiv, 2021 (NTU Singapore). [Paper]
- CornerTransformer: "Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
- CUTE: "Contextual Text Block Detection towards Scene Text Understanding", ECCV, 2022 (NTU Singapore). [Paper][Website]
- PARSeq: "Scene Text Recognition with Permuted Autoregressive Sequence Models", ECCV, 2022 (University of the Philippines). [Paper][PyTorch]
- PTIE: "Pure Transformer with Integrated Experts for Scene Text Recognition", ECCV, 2022 (NTU Singapore). [Paper]
- MGP-STR: "Multi-Granularity Prediction for Scene Text Recognition", ECCV, 2022 (Alibaba). [Paper]
- VLAMD: "Vision-Language Adaptive Mutual Decoder for OOV-STR", ECCVW, 2022 (iFLYTEK, China). [Paper]
- MVLT: "Masked Vision-Language Transformers for Scene Text Recognition", BMVC, 2022 (Westone Information Industry Inc., China). [Paper][PyTorch]
Sign Language:
- LWTA: "Stochastic Transformer Networks with Linear Competing Units: Application to end-to-end SL Translation", ICCV, 2021 (Cyprus University of Technology). [Paper]
- CiCo: "CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
- GFSLT-VLP: "Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining", ICCV, 2023 (Macau University of Science and Technology (MUST)). [Paper][Code (in construction)]
- IP-SLT: "Sign Language Translation with Iterative Prototype", ICCV, 2023 (USTC). [Paper]
- SignBERT+: "SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding", TPAMI, 2023 (USTC). [Paper][Website]
- Sign2GPT: "Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation", ICLR, 2024 (University of Surrey). [Paper]
- SignLLM: "SignLLM: Sign Languages Production Large Language Models", arXiv, 2024 (Rutgers). [Paper][Website]
Spike:
- Spikformer: "Spikformer: When Spiking Neural Network Meets Transformer", arXiv, 2022 (Peking). [Paper]
- SDSA: "Spike-driven Transformer", NeurIPS, 2023 (CAS). [Paper][PyTorch]
- Meta-SpikeFormer: "Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips", ICLR, 2024 (CAS). [Paper][PyTorch]
Stereo:
- STTR: "Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers", ICCV, 2021 (Johns Hopkins). [Paper][PyTorch]
- PS-Transformer: "PS-Transformer: Learning Sparse Photometric Stereo Network using Self-Attention Mechanism", BMVC, 2021 (National Institute of Informatics, JAPAN). [Paper][PyTorch]
- ChiTransformer: "ChiTransformer: Towards Reliable Stereo from Cues", CVPR, 2022 (GSU). [Paper]
- TransMVSNet: "TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers", CVPR, 2022 (Megvii). [Paper][Code (in construction)]
- MVSTER: "MVSTER: Epipolar Transformer for Efficient Multi-View Stereo", ECCV, 2022 (CAS). [Paper][PyTorch]
- CEST: "Context-Enhanced Stereo Transformer", ECCV, 2022 (CAS). [Paper][PyTorch]
- WT-MVSNet: "WT-MVSNet: Window-based Transformers for Multi-view Stereo", NeurIPS, 2022 (Tsinghua University). [Paper]
- MVSFormer: "MVSFormer: Learning Robust Image Representations via Transformers and Temperature-based Depth for Multi-View Stereo", arXiv, 2022 (Fudan University). [Paper]
- MVSFormer++: "MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo", ICLR, 2024 (Fudan). [Paper][Code (in construction)]
Tactile:
- UniTouch: "Binding Touch to Everything: Learning Unified Multimodal Tactile Representations", arXiv, 2024 (Yale). [Paper][Code (in construction)][Website]
Time Series:
- MissFormer: "MissFormer: (In-)attention-based handling of missing observations for trajectory filtering and prediction", arXiv, 2021 (Fraunhofer IOSB, Germany). [Paper]
Traffic:
- NEAT: "NEAT: Neural Attention Fields for End-to-End Autonomous Driving", ICCV, 2021 (MPI). [Paper][PyTorch]
- ViTAL: "Novelty Detection and Analysis of Traffic Scenario Infrastructures in the Latent Space of a Vision Transformer-Based Triplet Autoencoder", IV, 2021 (Technische Hochschule Ingolstadt). [Paper]
- ?: "Predicting Vehicles Trajectories in Urban Scenarios with Transformer Networks and Augmented Information", IVS, 2021 (Universidad de Alcala). [Paper]
- ?: "Translating Images into Maps", ICRA, 2022 (University of Surrey, UK). [Paper][PyTorch (in construction)]
- Crossview-Transformer: "Cross-view Transformers for real-time Map-view Semantic Segmentation", CVPR, 2022 (UT Austin). [Paper][PyTorch]
- MSF3DDETR: "MSF3DDETR: Multi-Sensor Fusion 3D Detection Transformer for Autonomous Driving", ICPRW, 2022 (University of Coimbra, Portugal). [Paper]
- TransLPC: "Transformers for Object Detection in Large Point Clouds", ITSC, 2022 (Bosch). [Paper]
- PicT: "PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress Classification", ACMMM, 2022 (Chongqing University). [Paper][PyTorch (in construction)]
- JPerceiver: "JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes", ECCV, 2022 (The University of Sydney). [Paper][PyTorch]
- V2X-ViT: "V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer", ECCV, 2022 (UCLA). [Paper]
- ?: "Can Transformer Attention Spread Give Insights Into Uncertainty of Detected and Tracked Objects?", IROSW, 2022 (Bosch). [Paper]
- MTR: "Motion Transformer with Global Intention Localization and Local Movement Refinement", NeurIPS, 2022 (MPI). [Paper][Code (in construction)]
- PlanT: "PlanT: Explainable Planning Transformers via Object-Level Representations", Conference on Robot Learning (CoRL), 2022 (TUM). [Paper][PyTorch][Website]
- ParkPredict+: "ParkPredict+: Multimodal Intent and Motion Prediction for Vehicles in Parking Lots with CNN and Transformer", arXiv, 2022 (Berkeley). [Paper]
- ?: "Pyramid Transformer for Traffic Sign Detection", arXiv, 2022 (Iran University of Science and Technology). [Paper]
- STrajNet: "STrajNet: Occupancy Flow Prediction via Multi-modal Swin Transformer", arXiv, 2022 (NTU, Singapore). [Paper]
- MTPP: "Multi-modal Transformer Path Prediction for Autonomous Vehicle", arXiv, 2022 (National Central University). [Paper]
- DCT: "A Dual-Cycled Cross-View Transformer Network for Unified Road Layout Estimation and 3D Object Detection in the Bird's-Eye-View", arXiv, 2022 (Gwang-ju Institute of Science and Technology). [Paper]
- C-ViT: "Traffic Accident Risk Forecasting using Contextual Vision Transformers", arXiv, 2022 (University of Technology Sydney). [Paper]
- MapTR: "MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction", ICLR, 2023 (Horizon Robotics). [Paper][PyTorch]
- VE-Prompt: "Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving", CVPR, 2023 (Sun Yat-sen University). [Paper]
- TPVFormer: "Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction", CVPR, 2023 (Tsinghua University). [Paper][PyTorch]
- TBP-Former: "TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving", CVPR, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
- BAEFormer: "BAEFormer: Bi-directional and Early Interaction Transformers for Bird’s Eye View Semantic Segmentation", CVPR, 2023 (Horizon Robotics). [Paper]
- BAAM: "BAAM: Monocular 3D Pose and Shape Reconstruction With Bi-Contextual Attention Module and Attention-Guided Modeling", CVPR, 2023 (Chungnam National University, Korea). [Paper][PyTorch]
- Pix2Map: "Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images", CVPR, 2023 (CMU). [Paper][Website]
- UniAD: "Planning-oriented Autonomous Driving", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
- Multiverse-Transformer: "Multiverse Transformer: 1st Place Solution for Waymo Open Sim Agents Challenge 2023", CVPRW, 2023 (Pegasus). [Paper][Website]
- UniFormer: "UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View", ICCV, 2023 (Zhejiang University). [Paper]
- SegMiF: "Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation", ICCV, 2023 (Dalian University of Technology). [Paper][Code (in construction)]
- VTD: "Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving", ICCV, 2023 (ETHZ). [Paper]
- HM-ViT: "HM-ViT: Hetero-modal Vehicle-to-Vehicle Cooperative perception with vision transformer", ICCV, 2023 (UCLA). [Paper][Code (in construction)]
- UP-VL: "Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving", ICCV, 2023 (Waymo). [Paper]
- GameFormer: "GameFormer: Game-theoretic Modeling and Learning of Transformer-based Interactive Prediction and Planning for Autonomous Driving", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
- GeoMIM: "GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding", ICCV, 2023 (CUHK). [Paper][PyTorch]
- LiDARFormer: "LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR Perception", arXiv, 2023 (TuSimple). [Paper]
- VoxelFormer: "VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
- LCTGen: "Language Conditioned Traffic Generation", arXiv, 2023 (NVIDIA). [Paper][Website]
- UniWorld: "UniWorld: Autonomous Driving Pre-training via World Models", arXiv, 2023 (Peking). [Paper][Code (in construction)]
- PromptTrack: "Language Prompt for Autonomous Driving", arXiv, 2023 (Megvii). [Paper]
- HiLM-D: "HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving", arXiv, 2023 (Huawei). [Paper]
- DiffPrompter: "DiffPrompter: Differentiable Implicit Visual Prompts for Semantic-Segmentation in Adverse Conditions", arXiv, 2023 (IIIT Hyderabad). [Paper][PyTorch][Website]
- OccWorld: "OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving", arXiv, 2023 (Tsinghua). [Paper][PyTorch][Website]
- VehicleMAE: "Structural Information Guided Multimodal Pre-training for Vehicle-centric Perception", AAAI, 2024 (Anhui University). [Paper]
- STT: "STT: Stateful Tracking with Transformers for Autonomous Driving", ICRA, 2024 (Waymo). [Paper]
- MM-AU: "Abductive Ego-View Accident Video Understanding for Safe Driving Perception", CVPR, 2024 (Xi'an Jiaotong University). [Paper][Website]
- GenAD: "Generalized Predictive Model for Autonomous Driving", CVPR, 2024 (Shanghai AI Lab). [Paper][Code (in construction)]
- DriveWorld: "DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving", CVPR, 2024 (Peking). [Paper]
Traffic (LLM-based):
- AVIS: "AVIS: Autonomous Visual Information Seeking with Large Language Models", NeurIPS, 2023 (Google). [Paper]
- DriveGPT4: "DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model", arXiv, 2023 (HKU). [Paper][Website]
- ?: "Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving", arXiv, 2023 (Wayve). [Paper][PyTorch]
- GPT4V-AD: "On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- Agent-Driver: "A Language Agent for Autonomous Driving", arXiv, 2023 (USC). [Paper][Code (in construction)][Website]
- ADriver-I: "ADriver-I: A General World Model for Autonomous Driving", arXiv, 2023 (Megvii). [Paper]
- Dolphins: "Dolphins: Multimodal Language Model for Driving", arXiv, 2023 (NVIDIA). [Paper][Code (in construction)][Website]
- LMDrive: "LMDrive: Closed-Loop End-to-End Driving with Large Language Models", arXiv, 2023 (CUHK). [Paper]
- DriveMLM: "DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- DriveLM: "DriveLM: Driving with Graph Visual Question Answering", arXiv, 2023 (OpenDriveLab, China). [Paper][Code]
- VLP: "VLP: Vision Language Planning for Autonomous Driving", arXiv, 2024 (Bosch). [Paper]
- DriveVLM: "DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models", arXiv, 2024 (Tsinghua). [Paper][Website]
- DriveDreamer-2: "DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation", arXiv, 2024 (CAS). [Paper][Code (in construction)][Website]
- OmniDrive: "OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning", arXiv, 2024 (NVIDIA). [Paper][Code (in construction)]
- DriveSim: "Probing Multimodal LLMs as World Models for Driving", arXiv, 2024 (MIT). [Paper][Code (in construction)]
Trajectory Prediction:
- mmTransformer: "Multimodal Motion Prediction with Stacked Transformers", CVPR, 2021 (CUHK + SenseTime). [Paper][Code (in construction)][Website]
- AgentFormer: "AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting", ICCV, 2021 (CMU). [Paper][PyTorch][Website]
- S2TNet: "S2TNet: Spatio-Temporal Transformer Networks for Trajectory Prediction in Autonomous Driving", ACML, 2021 (Xi'an Jiaotong University). [Paper][PyTorch]
- MRT: "Multi-Person 3D Motion Prediction with Multi-Range Transformers", NeurIPS, 2021 (UCSD + Berkeley). [Paper][PyTorch][Website]
- ?: "Latent Variable Sequential Set Transformers for Joint Multi-Agent Motion Prediction", ICLR, 2022 (MILA). [Paper]
- Scene-Transformer: "Scene Transformer: A unified architecture for predicting multiple agent trajectories", ICLR, 2022 (Google). [Paper]
- ST-MR: "Graph-based Spatial Transformer with Memory Replay for Multi-Future Pedestrian Trajectory Prediction", CVPR, 2022 (University of New South Wales, Australia). [Paper][Tensorflow]
- HiVT: "HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction", CVPR, 2022 (CUHK). [Paper]
- EF-Transformer: "Entry-Flipped Transformer for Inference and Prediction of Participant Behavior", ECCV, 2022 (NTU, Singapore). [Paper]
- Social-SSL: "Social-SSL: Self-Supervised Cross-Sequence Representation Learning Based on Transformers for Multi-Agent Trajectory Prediction", ECCV, 2022 (NYCU). [Paper][PyTorch]
- LatentFormer: "LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction", arXiv, 2022 (Huawei). [Paper]
- PreTR: "PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer", arXiv, 2022 (Stellantis, France). [Paper]
- Wayformer: "Wayformer: Motion Forecasting via Simple & Efficient Attention Networks", arXiv, 2022 (Waymo). [Paper]
- LaTTe: "LaTTe: Language Trajectory TransformEr", arXiv, 2022 (TUM). [Paper][Tensorflow]
- SoMoFormer: "SoMoFormer: Social-Aware Motion Transformer for Multi-Person Motion Prediction", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
- ViewBirdiformer: "ViewBirdiformer: Learning to recover ground-plane crowd trajectories and ego-motion from a single ego-centric view", arXiv, 2022 (Kyoto University). [Paper]
- PedFormer: "PedFormer: Pedestrian Behavior Prediction via Cross-Modal Attention Modulation and Gated Multitask Learning", arXiv, 2022 (Huawei). [Paper]
- TAMFormer: "TAMFormer: Multi-Modal Transformer with Learned Attention Mask for Early Intent Prediction", arXiv, 2022 (University of Padova, Italy). [Paper]
- QCNet: "Query-Centric Trajectory Prediction", CVPR, 2023 (CUHK). [Paper][Code (in construction)]
- ViP3D: "ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries", CVPR, 2023 (Tsinghua). [Paper][PyTorch][Website]
- USST: "Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting", ICCV, 2023 (OPPO). [Paper][PyTorch][Website]
- JRTransformer: "Joint-Relation Transformer for Multi-Person Motion Prediction", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
- Forecast-MAE: "Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders", ICCV, 2023 (HKUST). [Paper][PyTorch]
- MotionLM: "MotionLM: Multi-Agent Motion Forecasting as Language Modeling", ICCV, 2023 (Waymo). [Paper]
- OccFormer: "OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction", ICCV, 2023 (PhiGent Robotics, China). [Paper][PyTorch]
- Traj-MAE: "Traj-MAE: Masked Autoencoders for Trajectory Prediction", ICCV, 2023 (CUHK). [Paper]
- R-Pred: "R-Pred: Two-Stage Motion Prediction Via Tube-Query Attention-Based Trajectory Refinement", ICCV, 2023 (Hanyang University, Korea). [Paper]
- MacFormer: "MacFormer: Map-Agent Coupled Transformer for Real-time and Robust Trajectory Prediction", RAL, 2023 (HKUST). [Paper]
- HPTR: "Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding", NeurIPS, 2023 (ETHZ). [Paper][PyTorch]
- InCrowdFormer: "InCrowdFormer: On-Ground Pedestrian World Model From Egocentric Views", arXiv, 2023 (Kyoto University). [Paper]
- MTR++: "MTR++: Multi-Agent Motion Prediction with Symmetric Scene Modeling and Guided Intention Querying", arXiv, 2023 (MPI). [Paper]
- T4P: "T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory", CVPR, 2024 (KAIST). [Paper][Code (in construction)]
- MoST: "MoST: Multi-modality Scene Tokenization for Motion Prediction", CVPR, 2024 (Waymo). [Paper]
Visual Counting:
- CC-AV: "Audio-Visual Transformer Based Crowd Counting", ICCVW, 2021 (University of Kansas). [Paper]
- TransCrowd: "TransCrowd: Weakly-Supervised Crowd Counting with Transformer", arXiv, 2021 (Huazhong University of Science and Technology). [Paper][PyTorch]
- TAM-RTM: "Boosting Crowd Counting with Transformers", arXiv, 2021 (ETHZ). [Paper]
- CCTrans: "CCTrans: Simplifying and Improving Crowd Counting with Transformer", arXiv, 2021 (Meituan). [Paper]
- MAN: "Boosting Crowd Counting via Multifaceted Attention", CVPR, 2022 (Xi'an Jiaotong). [Paper][PyTorch]
- CLTR: "An End-to-End Transformer Model for Crowd Localization", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch][Website]
- SAANet: "Scene-Adaptive Attention Network for Crowd Counting", arXiv, 2022 (Xi'an Jiaotong). [Paper]
- JCTNet: "Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting", arXiv, 2022 (Chongqing University). [Paper]
- CrowdMLP: "CrowdMLP: Weakly-Supervised Crowd Counting via Multi-Granularity MLP", arXiv, 2022 (University of Guelph, Canada). [Paper]
- CounTR: "CounTR: Transformer-based Generalised Visual Counting", arXiv, 2022 (Shanghai Jiao Tong University). [Paper][Website]
- CrowdCLIP: "CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model", CVPR, 2023 (Baidu). [Paper][Code (in construction)]
- PET: "Point-Query Quadtree for Crowd Counting, Localization, and More", ICCV, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
- CLIP-Count: "CLIP-Count: Towards Text-Guided Zero-Shot Object Counting", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper][Code (in construction)]
- ?: "Training-free Object Counting with Prompts", arXiv, 2023 (A*STAR). [Paper][PyTorch]
- T-Rex: "T-Rex: Counting by Visual Prompting", arXiv, 2023 (IDEA). [Paper][Website]
- VLCounter: "VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting", AAAI, 2024 (Sungkyunkwan University, Korea). [Paper][PyTorch]
- Gramformer: "Gramformer: Learning Crowd Counting via Graph-Modulated Transformer", AAAI, 2024 (Xi'an Jiaotong). [Paper][Code (in construction)]
Visual Quality Assessment:
- TRIQ: "Transformer for Image Quality Assessment", arXiv, 2020 (NORCE). [Paper][Tensorflow-Keras]
- IQT: "Perceptual Image Quality Assessment with Transformers", CVPRW, 2021 (LG). [Paper][Code (in construction)]
- MUSIQ: "MUSIQ: Multi-scale Image Quality Transformer", ICCV, 2021 (Google). [Paper]
- TranSLA: "Saliency-Guided Transformer Network Combined With Local Embedding for No-Reference Image Quality Assessment", ICCVW, 2021 (Hikvision). [Paper]
- TReS: "No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency", WACV, 2022 (CMU). [Paper]
- IQA-Conformer: "Conformer and Blind Noisy Students for Improved Image Quality Assessment", CVPRW, 2022 (University of Wurzburg, Germany). [Paper][PyTorch]
- SwinIQA: "SwinIQA: Learned Swin Distance for Compressed Image Quality Assessment", CVPRW, 2022 (USTC, China). [Paper]
- DCVQE: "DCVQE: A Hierarchical Transformer for Video Quality Assessment", ACCV, 2022 (Weibo). [Paper]
- MCAS-IQA: "Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment", arXiv, 2022 (Norwegian Research Centre, Norway). [Paper]
- MSTRIQ: "MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion", arXiv, 2022 (ByteDance). [Paper]
- DisCoVQA: "DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment", arXiv, 2022 (NTU, Singapore). [Paper]
- LIQE: "Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective", CVPR, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
- MRET: "MRET: Multi-resolution Transformer for Video Quality Assessment", arXiv, 2023 (Google). [Paper]
- SAM-IQA: "SAM-IQA: Can Segment Anything Boost Image Quality Assessment?", arXiv, 2023 (Megvii). [Paper][Code (in construction)]
- LoDa: "Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment", arXiv, 2023 (Wuhan University). [Paper]
- Q-Align: "Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
- SAMA: "Scaling and Masking: A New Paradigm of Data Sampling for Image and Video Quality Assessment", AAAI, 2024 (Xidian University). [Paper][PyTorch]
- Co-Instruct: "Towards Open-ended Visual Quality Comparison", arXiv, 2024 (NTU, Singapore). [Paper][Model]
- ?: "A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment", arXiv, 2024 (Tsinghua). [Paper][Code (in construction)]
Visual Reasoning:
- SAViR-T: "SAViR-T: Spatially Attentive Visual Reasoning with Transformers", arXiv, 2022 (Rutgers University). [Paper]
Wide-angle lenses:
- DarSwin: "DarSwin: Distortion Aware Radial Swin Transformer", ICCV, 2023 (Laval University, Canada). [Paper][PyTorch][Website]
3D Human Texture Estimation:
- Texformer: "3D Human Texture Estimation from a Single Image with Transformers", ICCV, 2021 (NTU, Singapore). [Paper][PyTorch][Website]
3D Motion Synthesis:
- ACTOR: "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE", ICCV, 2021 (Univ Gustave Eiffel). [Paper][PyTorch][Website]
- RTVAE: "Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis", CVPRW, 2022 (Amazon). [Paper]
- MotionCLIP: "MotionCLIP: Exposing Human Motion Generation to CLIP Space", ECCV, 2022 (Tel Aviv). [Paper]
- CLIP-Actor: "CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes", ECCV, 2022 (POSTECH). [Paper][PyTorch][Website]
- PoseGPT: "PoseGPT: Quantization-based 3D Human Motion Generation and Forecasting", ECCV, 2022 (NAVER). [Paper]
- TEMOS: "TEMOS: Generating diverse human motions from textual descriptions", ECCV, 2022 (MPI). [Paper][PyTorch][Website]
- TM2T: "TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts", ECCV, 2022 (University of Alberta, Canada). [Paper][PyTorch][Website]
- HUMANISE: "HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes", NeurIPS, 2022 (Beijing Institute of Technology). [Paper][GitHub][Website]
- ?: "Diverse Dance Synthesis via Keyframes with Transformer Controllers", arXiv, 2022 (Beihang University). [Paper]
- MARIONET: "Neural Marionette: A Transformer-based Multi-action Human Motion Synthesis System", arXiv, 2022 (Wuhan University). [Paper]
- Action-GPT: "Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot Action Generation", arXiv, 2022 (IIIT Hyderabad). [Paper][Website]
- MDM: "Human Motion Diffusion Model", ICLR, 2023 (Tel Aviv University). [Paper][PyTorch][Website]
- POTTER: "POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery", CVPR, 2023 (OPPO). [Paper][PyTorch][Website]
- Optimus: "Transformer-Based Learned Optimization", CVPR, 2023 (Google). [Paper]
- CITL: "Continuous Intermediate Token Learning with Implicit Motion Manifold for Keyframe Based Motion Interpolation", CVPR, 2023 (The University of Sydney). [Paper][PyTorch]
- OOHMG: "Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation with Wordless Training", CVPR, 2023 (Sun Yat-Sen University). [Paper][Code (in construction)]
- AttT2M: "AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism", ICCV, 2023 (CAS). [Paper][PyTorch]
- ActFormer: "ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation", ICCV, 2023 (SenseTime). [Paper]
- AvatarJLM: "Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling", ICCV, 2023 (ByteDance). [Paper][PyTorch][Website]
- Fg-T2M: "Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model", ICCV, 2023 (Beihang). [Paper]
- TMR: "TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis", ICCV, 2023 (Gustave Eiffel University). [Paper][PyTorch][Website]
- Make-An-Animation: "Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation", ICCV, 2023 (Meta). [Paper]
- ATOM: "Language-guided Human Motion Synthesis with Atomic Actions", ACMMM, 2023 (University at Buffalo). [Paper][Code (in construction)]
- MotionGPT: "MotionGPT: Human Motion as a Foreign Language", NeurIPS, 2023 (Fudan). [Paper][PyTorch][Website]
- FineMoGen: "FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
- DDT: "DDT: A Diffusion-Driven Transformer-based Framework for Human Mesh Recovery from a Video", arXiv, 2023 (OPPO). [Paper]
- MotionGPT: "MotionGPT: Finetuned LLMs are General-Purpose Motion Generators", arXiv, 2023 (USTC). [Paper][PyTorch (in construction)][Website]
- UNIMASK-M: "A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis", arXiv, 2023 (Technische Universitat Wien (TUWien), Austria). [Paper][Website]
- MMM: "MMM: Generative Masked Motion Model", arXiv, 2023 (UNC). [Paper][Code (in construction)][Website]
- HOI-Diff: "HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models", arXiv, 2023 (Northeastern). [Paper][Website]
- OMG: "OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers", arXiv, 2023 (ShanghaiTech). [Paper]
- LEMON: "LEMON: Learning 3D Human-Object Interaction Relation from 2D Images", arXiv, 2023 (USTC). [Paper][Code (in construction)][Website]
- MoST: "MoST: Motion Style Transformer between Diverse Action Contents", CVPR, 2024 (Korea Electronics Technology Institute). [Paper][Code (in construction)]
- AMDM: "Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance", CVPR, 2024 (BIGAI). [Paper][Code (in construction)][Website]
- ?: "Generating Human Motion in 3D Scenes from Text Descriptions", CVPR, 2024 (Zhejiang). [Paper][Website]
- RoHM: "RoHM: Robust Human Motion Reconstruction via Diffusion", arXiv, 2024 (Meta). [Paper][Code (in construction)][Website]
- STMC: "Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation", arXiv, 2024 (NVIDIA). [Paper][Website]
- Motion-Mamba: "Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM", arXiv, 2024 (Monash). [Paper][Code (in construction)][Website]
3D Object Recognition:
- MVT: "MVT: Multi-view Vision Transformer for 3D Object Recognition", BMVC, 2021 (Baidu). [Paper]
3D Reconstruction:
- PlaneTR: "PlaneTR: Structure-Guided Transformers for 3D Plane Recovery", ICCV, 2021 (Wuhan University). [Paper][PyTorch]
- CO3D: "Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction", ICCV, 2021 (Facebook). [Paper][PyTorch]
- VolT: "Multi-view 3D Reconstruction with Transformer", ICCV, 2021 (University of British Columbia). [Paper]
- 3D-RETR: "3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers", BMVC, 2021 (ETHZ). [Paper][PyTorch]
- TransformerFusion: "TransformerFusion: Monocular RGB Scene Reconstruction using Transformers", NeurIPS, 2021 (TUM). [Paper][Website]
- LegoFormer: "LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction", arXiv, 2021 (TUM + Google). [Paper]
- PlaneFormers: "PlaneFormers: From Sparse View Planes to 3D Reconstruction", ECCV, 2022 (UMich). [Paper][PyTorch][Website]
- 3D-C2FT: "3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction", arXiv, 2022 (Korea Institute of Science and Technology). [Paper]
- SDF-Former: "Monocular Scene Reconstruction with 3D SDF Transformers", ICLR, 2023 (Alibaba). [Paper][Website]
- AMVUR: "A Probabilistic Attention Model with Occlusion-aware Texture Regression for 3D Hand Reconstruction from a Single RGB Image", CVPR, 2023 (Lancaster University, UK). [Paper]
- LIST: "LIST: Learning Implicitly from Spatial Transformers for Single-View 3D Reconstruction", ICCV, 2023 (UT Arlington). [Paper]
- LRGT: "Long-Range Grouping Transformer for Multi-View 3D Reconstruction", ICCV, 2023 (Macau University of Science and Technology). [Paper][PyTorch (in construction)]
- Spectral-Graphormer: "Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images", ICCV, 2023 (Google). [Paper]
- UMIFormer: "UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction", ICCV, 2023 (Macau University of Science and Technology). [Paper][PyTorch]
- PlaneRecTR: "PlaneRecTR: Unified Query Learning for 3D Plane Recovery from a Single View", ICCV, 2023 (National University of Defense Technology, China). [Paper][PyTorch]
- HaMeR: "Reconstructing Hands in 3D with Transformers", arXiv, 2023 (Berkeley). [Paper][PyTorch][Website]
- KYN: "Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning", CVPR, 2024 (ETHZ). [Paper][PyTorch][Website]
- MCC-HO: "Reconstructing Hand-Held Objects in 3D", arXiv, 2024 (Berkeley). [Paper]
360 Scene:
- ?: "Improving 360 Monocular Depth Estimation via Non-local Dense Prediction Transformer and Joint Supervised and Self-supervised Learning", AAAI, 2022 (Seoul National University). [Paper][PyTorch]
- PAVER: "Panoramic Vision Transformer for Saliency Detection in 360° Videos", ECCV, 2022 (Seoul National University). [Paper]
- PanoFormer: "PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation", ECCV, 2022 (Beijing Jiaotong University). [Paper]
- CoVisPose: "CoVisPose: Co-Visibility Pose Transformer for Wide-Baseline Relative Pose Estimation in 360° Indoor Panoramas", ECCV, 2022 (Zillow). [Paper]
- SPH: "Spherical Transformer", arXiv, 2022 (Chung-Ang University, Korea). [Paper]
- PanoSwin: "PanoSwin: a Pano-style Swin Transformer for Panorama Understanding", CVPR, 2023 (Fudan). [Paper][PyTorch]
- SalViT360: "Spherical Vision Transformer for 360-degree Video Saliency Prediction", BMVC, 2023 (Koc University, Turkey). [Paper]
- PanoContext-Former: "PanoContext-Former: Panoramic Total Scene Understanding with a Transformer", arXiv, 2023 (Alibaba). [Paper]
Others:
- ?: "Connecting Compression Spaces with Transformer for Approximate Nearest Neighbor Search", ECCV, 2022 (Intellifusion, China). [Paper]
- ?: "Strong Gravitational Lensing Parameter Estimation with Vision Transformer", ECCVW, 2022 (CMU). [Paper][PyTorch]
- Transformer-DR: "Transformer-based dimensionality reduction", arXiv, 2022 (Chongqing Normal University, China). [Paper]
- ?: "mm-Wave Radar Hand Shape Classification Using Deformable Transformers", arXiv, 2022 (Intel). [Paper]
- ?: "Fully-attentive and interpretable: vision and video vision transformers for pain detection", NeurIPSW, 2022 (Utrecht University, Netherlands). [Paper][Code (in construction)]
- CQFormer: "Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
- CircuitFormer: "Circuit as Set of Points", NeurIPS, 2023 (Horizon Robotics). [Paper][PyTorch]
- SleepVST: "SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers", CVPR, 2024 (Oxford). [Paper]

[Back to Overview]

Attention Mechanisms in Vision/NLP

Attention for Vision

AA: "Attention Augmented Convolutional Networks", ICCV, 2019 (Google). [Paper][PyTorch (Unofficial)][Tensorflow (Unofficial)]
LR-Net: "Local Relation Networks for Image Recognition", ICCV, 2019 (Microsoft). [Paper][PyTorch (Unofficial)]
CCNet: "CCNet: Criss-Cross Attention for Semantic Segmentation", ICCV, 2019 (& TPAMI 2020) (Horizon). [Paper][PyTorch]
GCNet: "Global Context Networks", ICCVW, 2019 (& TPAMI 2020) (Microsoft). [Paper][PyTorch]
SASA: "Stand-Alone Self-Attention in Vision Models", NeurIPS, 2019 (Google). [Paper][PyTorch-1 (Unofficial)][PyTorch-2 (Unofficial)]
- key message: attention modules are more efficient than convolutions and provide comparable accuracy
Axial-Transformer: "Axial Attention in Multidimensional Transformers", arXiv, 2019 (Google). [Paper][PyTorch (Unofficial)]
Attention-CNN: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (EPFL). [Paper][PyTorch][Website]
SAN: "Exploring Self-attention for Image Recognition", CVPR, 2020 (CUHK + Intel). [Paper][PyTorch]
BA-Transform: "Non-Local Neural Networks With Grouped Bilinear Attentional Transforms", CVPR, 2020 (ByteDance). [Paper]
Axial-DeepLab: "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation", ECCV, 2020 (Google). [Paper][PyTorch]
GSA: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 (Google). [Paper][PyTorch (Unofficial)]
EA: "Efficient Attention: Attention with Linear Complexities", WACV, 2021 (SenseTime). [Paper][PyTorch]
LambdaNetworks: "LambdaNetworks: Modeling long-range Interactions without Attention", ICLR, 2021 (Google). [Paper][PyTorch-1 (Unofficial)][PyTorch-2 (Unofficial)]
GSA-Nets: "Group Equivariant Stand-Alone Self-Attention For Vision", ICLR, 2021 (EPFL). [Paper]
Hamburger: "Is Attention Better Than Matrix Decomposition?", ICLR, 2021 (Peking). [Paper][PyTorch (Unofficial)]
HaloNet: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 (Google). [Paper]
BoTNet: "Bottleneck Transformers for Visual Recognition", CVPR, 2021 (Google). [Paper]
SSAN: "SSAN: Separable Self-Attention Network for Video Representation Learning", CVPR, 2021 (Microsoft). [Paper]
CoTNet: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 (JD). [Paper][PyTorch]
Involution: "Involution: Inverting the Inherence of Convolution for Visual Recognition", CVPR, 2021 (HKUST). [Paper][PyTorch]
Perceiver: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (DeepMind). [Paper][PyTorch (lucidrains)]
SNL: "Unifying Nonlocal Blocks for Neural Networks", ICCV, 2021 (Peking + ByteDance). [Paper]
External-Attention: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (Tsinghua). [Paper]
Container: "Container: Context Aggregation Network", arXiv, 2021 (AI2). [Paper]
X-volution: "X-volution: On the unification of convolution and self-attention", arXiv, 2021 (Huawei Hisilicon). [Paper]
Invertible-Attention: "Invertible Attention", arXiv, 2021 (ANU). [Paper]
VOLO: "VOLO: Vision Outlooker for Visual Recognition", arXiv, 2021 (Sea AI Lab + NUS, Singapore). [Paper][PyTorch]
LESA: "Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms", arXiv, 2021 (Johns Hopkins). [Paper]
PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][PyTorch]
QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][JAX]
?: "Fair Comparison between Efficient Attentions", CVPRW, 2022 (Kyungpook National University, Korea). [Paper][PyTorch]
KVT: "KVT: k-NN Attention for Boosting Vision Transformers", ECCV, 2022 (Alibaba). [Paper][PyTorch]
Hydra: "Hydra Attention: Efficient Attention with Many Heads", ECCVW, 2022 (Meta). [Paper]
HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]
AttendNeXt: "Faster Attention Is What You Need: A Fast Self-Attention Neural Network Backbone Architecture for the Edge via Double-Condensing Attention Condensers", arXiv, 2022 (University of Waterloo, Canada). [Paper]
Token-Mixing-Adaptive-FNO: "Efficient Token Mixing for Transformers via Adaptive Fourier Neural Operators", ICLR, 2022 (NVIDIA + Caltech + Stanford). [Paper][PyTorch]
KV-Transformer: "Key-Value Transformer", arXiv, 2023 (Quintic AI). [Paper]
NATTEN: "Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level", arXiv, 2024 (UIUC). [Paper][PyTorch][Website]

[Back to Overview]

Attention for NLP

T-DMCA: "Generating Wikipedia by Summarizing Long Sequences", ICLR, 2018 (Google). [Paper]
LSRA: "Lite Transformer with Long-Short Range Attention", ICLR, 2020 (MIT). [Paper][PyTorch]
ETC: "ETC: Encoding Long and Structured Inputs in Transformers", EMNLP, 2020 (Google). [Paper][Tensorflow]
BlockBERT: "Blockwise Self-Attention for Long Document Understanding", EMNLP Findings, 2020 (Facebook). [Paper][GitHub]
Clustered-Attention: "Fast Transformers with Clustered Attention", NeurIPS, 2020 (Idiap). [Paper][PyTorch][Website]
BigBird: "Big Bird: Transformers for Longer Sequences", NeurIPS, 2020 (Google). [Paper][Tensorflow]
Longformer: "Longformer: The Long-Document Transformer", arXiv, 2020 (AI2). [Paper][PyTorch]
Linformer: "Linformer: Self-Attention with Linear Complexity", arXiv, 2020 (Facebook). [Paper][PyTorch (Unofficial)]
Nystromformer: "Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention", AAAI, 2021 (UW-Madison). [Paper][PyTorch]
RFA: "Random Feature Attention", ICLR, 2021 (DeepMind). [Paper]
Performer: "Rethinking Attention with Performers", ICLR, 2021 (Google). [Paper][Code][Blog]
DeLight: "DeLighT: Deep and Light-weight Transformer", ICLR, 2021 (UW). [Paper]
Synthesizer: "Synthesizer: Rethinking Self-Attention for Transformer Models", ICML, 2021 (Google). [Paper][Tensorflow][PyTorch (leaderj1001)]
Poolingformer: "Poolingformer: Long Document Modeling with Pooling Attention", ICML, 2021 (Microsoft). [Paper]
Hi-Transformer: "Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling", ACL, 2021 (Tsinghua). [Paper]
Smart-Bird: "Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer", arXiv, 2021 (Tsinghua). [Paper]
Fastformer: "Fastformer: Additive Attention is All You Need", arXiv, 2021 (Tsinghua). [Paper]
∞-former: "∞-former: Infinite Memory Transformer", arXiv, 2021 (Instituto de Telecomunicações, Portugal). [Paper]
cosFormer: "cosFormer: Rethinking Softmax In Attention", ICLR, 2022 (SenseTime). [Paper][PyTorch (davidsvy)]
MGK: "Improving Transformers with Probabilistic Attention Keys", ICML, 2022 (UCLA). [Paper]
FNet: "FNet: Mixing Tokens with Fourier Transforms", NAACL, 2022 (Google). [Paper]
RetNet: "Retentive Network: A Successor to Transformer for Large Language Models", arXiv, 2023 (Microsoft). [Paper][PyTorch (in construction)]

[Back to Overview]

Attention for Both

Sparse-Transformer: "Generating Long Sequences with Sparse Transformers", arXiv, 2019 (OpenAI). [Paper][Tensorflow][Blog]
Reformer: "Reformer: The Efficient Transformer", ICLR, 2020 (Google). [Paper][Tensorflow][Blog]
Sinkhorn-Transformer: "Sparse Sinkhorn Attention", ICML, 2020 (Google). [Paper][PyTorch (Unofficial)]
Linear-Transformer: "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", ICML, 2020 (Idiap). [Paper][PyTorch][Website]
SMYRF: "SMYRF: Efficient Attention using Asymmetric Clustering", NeurIPS, 2020 (UT Austin + Google). [Paper][PyTorch]
Routing-Transformer: "Efficient Content-Based Sparse Attention with Routing Transformers", TACL, 2021 (Google). [Paper][Tensorflow][PyTorch (Unofficial)][Slides]
LRA: "Long Range Arena: A Benchmark for Efficient Transformers", ICLR, 2021 (Google). [Paper][Tensorflow]
OmniNet: "OmniNet: Omnidirectional Representations from Transformers", ICML, 2021 (Google). [Paper]
Evolving-Attention: "Evolving Attention with Residual Convolutions", ICML, 2021 (Peking + Microsoft). [Paper]
H-Transformer-1D: "H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences", ACL, 2021 (Google). [Paper]
Combiner: "Combiner: Full Attention Transformer with Sparse Computation Cost", NeurIPS, 2021 (Google). [Paper]
Centroid-Transformer: "Centroid Transformers: Learning to Abstract with Attention", arXiv, 2021 (UT Austin). [Paper]
AFT: "An Attention Free Transformer", arXiv, 2021 (Apple). [Paper]
Luna: "Luna: Linear Unified Nested Attention", arXiv, 2021 (USC + CMU + Facebook). [Paper]
Transformer-LS: "Long-Short Transformer: Efficient Transformers for Language and Vision", arXiv, 2021 (NVIDIA). [Paper]
PoNet: "PoNet: Pooling Network for Efficient Token Mixing in Long Sequences", ICLR, 2022 (Alibaba). [Paper]
Paramixer: "Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention", CVPR, 2022 (Norwegian University of Science and Technology, Norway). [Paper]
FNet: "FNet: Mixing Tokens with Fourier Transforms", NAACL, 2022 (Google). [Paper][JAX]
ContextPool: "Efficient Representation Learning via Adaptive Context Pooling", ICML, 2022 (Apple). [Paper]
LARA: "Linear Complexity Randomized Self-attention Mechanism", ICML, 2022 (ByteDance). [Paper]
Flowformer: "Flowformer: Linearizing Transformers with Conservation Flows", ICML, 2022 (Tsinghua University). [Paper][PyTorch]
MRA: "Multi Resolution Analysis (MRA) for Approximate Self-Attention", ICML, 2022 (University of Wisconsin, Madison). [Paper][PyTorch]
EcoFormer: "EcoFormer: Energy-Saving Attention with Linear Complexity", NeurIPS, 2022 (Monash University). [Paper][PyTorch]
SBM-Transformer: "Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost", NeurIPS, 2022 (LG). [Paper][PyTorch]
?: "Horizontal and Vertical Attention in Transformers", arXiv, 2022 (University of Technology Sydney). [Paper]
MRL: "MRL: Learning to Mix with Attention and Convolutions", arXiv, 2022 (Sony). [Paper]
RSA: "Encoding Recurrence into Transformers", ICLR, 2023 (HKU). [Paper]
EVA: "Efficient Attention via Control Variates", ICLR, 2023 (HKU). [Paper]
STTABT: "Sparse Token Transformer with Attention Back Tracking", ICLR, 2023 (KAIST). [Paper]
Mega: "Mega: Moving Average Equipped Gated Attention", ICLR, 2023 (Meta). [Paper][PyTorch]
SeTformer: "SeTformer is What You Need for Vision and Language", AAAI, 2024 (East China Normal University). [Paper]

[Back to Overview]

Attention for Others

Informer: "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting", AAAI, 2021 (Beihang University). [Paper][PyTorch]
Attention-Rank-Collapse: "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth", ICML, 2021 (Google + EPFL). [Paper][PyTorch]
NPT: "Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning", arXiv, 2021 (Oxford). [Paper]
FEDformer: "FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting", ICML, 2022 (Alibaba). [Paper][PyTorch]
?: "Generalizable Memory-driven Transformer for Multivariate Long Sequence Time-series Forecasting", arXiv, 2022 (University of Technology Sydney). [Paper]

[Back to Overview]
