(back to README.md and README_multimodal.md for other categories)

Overview


Citation

If you find this repository useful, please consider citing this list:

@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}

Other High-level Vision Tasks

Point Cloud / 3D

  • PCT: "PCT: Point Cloud Transformer", arXiv, 2020 (Tsinghua). [Paper][Jittor][PyTorch (uyzhang)]
  • Point-Transformer: "Point Transformer", arXiv, 2020 (Ulm University). [Paper]
  • NDT-Transformer: "NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation", ICRA, 2021 (University of Sheffield). [Paper][PyTorch]
  • P4Transformer: "Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos", CVPR, 2021 (NUS). [Paper]
  • SnowflakeNet: "SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer", ICCV, 2021 (Tsinghua). [Paper][PyTorch]
  • PoinTr: "PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers", ICCV, 2021 (Tsinghua). [Paper][PyTorch]
  • Point-Transformer: "Point Transformer", ICCV, 2021 (Oxford + CUHK). [Paper][PyTorch (lucidrains)]
  • CT: "Cloud Transformers: A Universal Approach To Point Cloud Processing Tasks", ICCV, 2021 (Samsung). [Paper]
  • 3DVG-Transformer: "3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds", ICCV, 2021 (Beihang University). [Paper]
  • PPT-Net: "Pyramid Point Cloud Transformer for Large-Scale Place Recognition", ICCV, 2021 (Nanjing University of Science and Technology). [Paper]
  • ?: "Shape registration in the time of transformers", NeurIPS, 2021 (Sapienza University of Rome). [Paper]
  • YOGO: "You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module", arXiv, 2021 (Berkeley). [Paper][PyTorch]
  • DTNet: "Dual Transformer for Point Cloud Analysis", arXiv, 2021 (Southwest University). [Paper]
  • MLMSPT: "Point Cloud Learning with Transformer", arXiv, 2021 (Southwest University). [Paper]
  • PQ-Transformer: "PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds", arXiv, 2021 (Tsinghua). [Paper][PyTorch]
  • PST2: "Spatial-Temporal Transformer for 3D Point Cloud Sequences", WACV, 2022 (Sun Yat-sen University). [Paper]
  • SCTN: "SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation", AAAI, 2022 (KAUST). [Paper]
  • AWT-Net: "Adaptive Wavelet Transformer Network for 3D Shape Representation Learning", ICLR, 2022 (NYU). [Paper]
  • ?: "Deep Point Cloud Reconstruction", ICLR, 2022 (KAIST). [Paper]
  • PointMLP: "Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework", ICLR, 2022 (Northeastern). [Paper][PyTorch]
  • HiTPR: "HiTPR: Hierarchical Transformer for Place Recognition in Point Cloud", ICRA, 2022 (Nanjing University of Science and Technology). [Paper]
  • FastPointTransformer: "Fast Point Transformer", CVPR, 2022 (POSTECH). [Paper]
  • REGTR: "REGTR: End-to-end Point Cloud Correspondences with Transformers", CVPR, 2022 (NUS, Singapore). [Paper][PyTorch]
  • ShapeFormer: "ShapeFormer: Transformer-based Shape Completion via Sparse Representation", CVPR, 2022 (Shenzhen University). [Paper][Website]
  • PatchFormer: "PatchFormer: An Efficient Point Transformer with Patch Attention", CVPR, 2022 (Hangzhou Dianzi University). [Paper]
  • ?: "An MIL-Derived Transformer for Weakly Supervised Point Cloud Segmentation", CVPR, 2022 (NTU + NYCU). [Paper][Code (in construction)]
  • Point-BERT: "Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling", CVPR, 2022 (Tsinghua). [Paper][PyTorch][Website]
  • GeoTransformer: "Geometric Transformer for Fast and Robust Point Cloud Registration", CVPR, 2022 (National University of Defense Technology, China). [Paper][PyTorch]
  • PointCLIP: "PointCLIP: Point Cloud Understanding by CLIP", CVPR, 2022 (Shanghai AI Lab). [Paper][PyTorch]
  • ?: "3D Part Assembly Generation with Instance Encoded Transformer", IROS, 2022 (Tongji University). [Paper]
  • SeedFormer: "SeedFormer: Patch Seeds based Point Cloud Completion with Upsample Transformer", ECCV, 2022 (Tencent). [Paper][PyTorch]
  • MeshMAE: "MeshMAE: Masked Autoencoders for 3D Mesh Data Analysis", ECCV, 2022 (JD). [Paper]
  • PPTr: "Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding", ECCV, 2022 (Tsinghua University). [Paper]
  • Geodesic-Former: "Geodesic-Former: a Geodesic-Guided Few-shot 3D Point Cloud Instance Segmenter", ECCV, 2022 (VinAI Research, Vietnam). [Paper]
  • LaplacianMesh-Transformer: "Laplacian Mesh Transformer: Dual Attention and Topology Aware Network for 3D Mesh Classification and Segmentation", ECCV, 2022 (CAS). [Paper]
  • Point-MixSwap: "Point MixSwap: Attentional Point Cloud Mixing via Swapping Matched Structural Divisions", ECCV, 2022 (NYCU + NTU). [Paper][PyTorch]
  • PointMixer: "PointMixer: MLP-Mixer for Point Cloud Understanding", ECCV, 2022 (KAIST). [Paper]
  • Point-Transformer-V2: "Point Transformer V2: Grouped Vector Attention and Partition-based Pooling", NeurIPS, 2022 (HKU). [Paper][PyTorch (in construction)]
  • SPoVT: "SPoVT: Semantic-Prototype Variational Transformer for Dense Point Cloud Semantic Completion", NeurIPS, 2022 (NTU). [Paper][PyTorch][Website]
  • GSA: "Geodesic Self-Attention for 3D Point Clouds", NeurIPS, 2022 (East China Normal University). [Paper]
  • P2P: "P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch][Website]
  • 3DTRL: "Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space", NeurIPS, 2022 (Stony Brook). [Paper][PyTorch][Website]
  • ShapeCrafter: "ShapeCrafter: A Recursive Text-Conditioned 3D Shape Generation Model", NeurIPS, 2022 (Brown). [Paper]
  • XMFnet: "Cross-modal Learning for Image-Guided Point Cloud Shape Completion", NeurIPS, 2022 (Politecnico di Torino, Italy). [Paper]
  • Point-M2AE: "Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training", NeurIPS, 2022 (CUHK). [Paper][PyTorch]
  • LighTN: "LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling", arXiv, 2022 (Beijing Jiaotong University). [Paper]
  • PMP-Net++: "PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths", arXiv, 2022 (Tsinghua). [Paper]
  • SnowflakeNet: "Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
  • 3DCTN: "3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification", arXiv, 2022 (University of Waterloo, Canada). [Paper]
  • VNT-Net: "VNT-Net: Rotational Invariant Vector Neuron Transformers", arXiv, 2022 (Ben-Gurion University of the Negev, Israel). [Paper]
  • CompleteDT: "CompleteDT: Point Cloud Completion with Dense Augment Inference Transformers", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • VN-Transformer: "VN-Transformer: Rotation-Equivariant Attention for Vector Neurons", arXiv, 2022 (Waymo). [Paper]
  • Voxel-MAE: "Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds", arXiv, 2022 (Chalmers University of Technology, Sweden). [Paper]
  • MAE3D: "Masked Autoencoders in 3D Point Cloud Representation Learning", arXiv, 2022 (Northwest A&F University, China). [Paper]
  • Pix4Point: "Pix4Point: Image Pretrained Transformers for 3D Point Cloud Understanding", arXiv, 2022 (KAUST). [Paper][Code (in construction)]
  • MVP: "Multiple View Performers for Shape Completion", arXiv, 2022 (Columbia University). [Paper]
  • Simple3D-Former: "Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?", arXiv, 2022 (UT Austin). [Paper][PyTorch]
  • 3DPCT: "3DPCT: 3D Point Cloud Transformer with Dual Self-attention", arXiv, 2022 (University of Waterloo, Canada). [Paper]
  • PS-Former: "Point Cloud Recognition with Position-to-Structure Attention Transformers", arXiv, 2022 (UCSD). [Paper]
  • LCPFormer: "LCPFormer: Towards Effective 3D Point Cloud Analysis via Local Context Propagation in Transformers", arXiv, 2022 (Aberystwyth University, UK). [Paper]
  • R2-MLP: "R2-MLP: Round-Roll MLP for Multi-View 3D Object Recognition", arXiv, 2022 (Baidu). [Paper]
  • PVT3D: "PVT3D: Point Voxel Transformers for Place Recognition from Sparse Lidar Scans", arXiv, 2022 (TUM). [Paper]
  • EPCL: "Frozen CLIP Model is Efficient Point Cloud Backbone", arXiv, 2022 (Shanghai AI Lab). [Paper]
  • CAT: "Context-Aware Transformer for 3D Point Cloud Automatic Annotation", AAAI, 2023 (HKU). [Paper]
  • ACT: "Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?", ICLR, 2023 (Megvii). [Paper][PyTorch]
  • AnalogicalNets: "Analogy-Forming Transformers for Few-Shot 3D Parsing", ICLR, 2023 (CMU). [Paper][Website]
  • ViPFormer: "ViPFormer: Efficient Vision-and-Pointcloud Transformer for Unsupervised Pointcloud Understanding", ICRA, 2023 (Renmin University of China). [Paper][PyTorch]
  • ProxyFormer: "ProxyFormer: Proxy Alignment Assisted Point Cloud Completion with Missing Part Sensitive Transformer", CVPR, 2023 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
  • I2P-MAE: "Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • RoITr: "Rotation-Invariant Transformer for Point Cloud Matching", CVPR, 2023 (TUM). [Paper]
  • SphereFormer: "Spherical Transformer for LiDAR-based 3D Recognition", CVPR, 2023 (CUHK). [Paper][PyTorch]
  • SPoTr: "Self-positioning Point-based Transformer for Point Cloud Understanding", CVPR, 2023 (Korea University). [Paper][PyTorch (in construction)]
  • PointCMP: "PointCMP: Contrastive Mask Prediction for Self-supervised Learning on Point Cloud Videos", CVPR, 2023 (Shanghai Jiao Tong). [Paper]
  • GeoMAE: "GeoMAE: Masked Geometric Target Prediction for Self-supervised Point Cloud Pre-Training", CVPR, 2023 (Tsinghua). [Paper][Code (in construction)]
  • ULIP: "ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding", CVPR, 2023 (Salesforce). [Paper][PyTorch][Website]
  • PointConvFormer: "PointConvFormer: Revenge of the Point-based Convolution", CVPR, 2023 (Apple). [Paper]
  • AnchorFormer: "AnchorFormer: Point Cloud Completion from Discriminative Nodes", CVPR, 2023 (USTC). [Paper][PyTorch]
  • FlatFormer: "FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer", CVPR, 2023 (MIT). [Paper][Website]
  • PEAL: "PEAL: Prior-Embedded Explicit Attention Learning for Low-Overlap Point Cloud Registration", CVPR, 2023 (Hangzhou Dianzi University). [Paper]
  • APES: "Attention-based Point Cloud Edge Sampling", CVPR, 2023 (Karlsruhe Institute of Technology, Germany). [Paper]
  • GD-MAE: "GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • ShapeClipper: "ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency", CVPR, 2023 (Georgia Tech). [Paper][Code (in construction)][Website]
  • MSC: "Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning", CVPR, 2023 (HKU). [Paper][PyTorch]
  • MSP: "Self-supervised Pre-training with Masked Shape Prediction for 3D Scene Understanding", CVPR, 2023 (MPI). [Paper]
  • MM-3DScene: "MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency", CVPR, 2023 (CAS). [Paper][PyTorch][Website]
  • ?: "Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction", ICML, 2023 (UT Austin). [Paper]
  • ReCon: "Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining", ICML, 2023 (Megvii). [Paper][PyTorch]
  • OctFormer: "OctFormer: Octree-based Transformers for 3D Point Clouds", SIGGRAPH, 2023 (Peking University). [Paper][Code (in construction)][Website]
  • SVDFormer: "SVDFormer: Complementing Point Cloud via Self-view Augmentation and Self-structure Dual-generator", ICCV, 2023 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
  • TAP: "Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
  • MATE: "MATE: Masked Autoencoders are Online 3D Test-Time Learners", ICCV, 2023 (Graz University of Technology, Austria). [Paper][PyTorch]
  • DeFormer: "DeFormer: Integrating Transformers with Deformable Models for 3D Shape Abstraction from a Single Image", ICCV, 2023 (Rutgers). [Paper]
  • RegFormer: "RegFormer: An Efficient Projection-Aware Transformer Network for Large-Scale Point Cloud Registration", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
  • PointCLIP-V2: "PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning", ICCV, 2023 (CUHK). [Paper][PyTorch]
  • CLIP2Point: "CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training", ICCV, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
  • IDPT: "Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
  • JM3D: "Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation", ACMMM, 2023 (NetEase, China). [Paper][PyTorch]
  • Bridge3D: "Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models", NeurIPS, 2023 (Clemson). [Paper][Code (in construction)]
  • ConDaFormer: "ConDaFormer: Disassembled Transformer with Local Structure Enhancement for 3D Point Cloud Understanding", NeurIPS, 2023 (JD). [Paper][PyTorch]
  • DiT-3D: "DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation", NeurIPS, 2023 (Huawei). [Paper][PyTorch][Website]
  • OpenShape: "OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding", NeurIPS, 2023 (UCSD). [Paper][PyTorch][Website]
  • PointGPT: "PointGPT: Auto-regressively Generative Pre-training from Point Clouds", NeurIPS, 2023 (Beijing Institute of Technology). [Paper]
  • PIC: "Explore In-Context Learning for 3D Point Cloud Understanding", NeurIPS, 2023 (Sun Yat-sen University). [Paper][PyTorch]
  • GeoTransformer: "GeoTransformer: Fast and Robust Point Cloud Registration with Geometric Transformer", TPAMI, 2023 (National University of Defense Technology, China). [Paper][PyTorch]
  • Text4Point: "Joint Representation Learning for Text and 3D Point Cloud", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
  • FullFormer: "FullFormer: Generating Shapes Inside Shapes", arXiv, 2023 (University of Siegen, Germany). [Paper]
  • Joint-MAE: "Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training", arXiv, 2023 (CUHK). [Paper]
  • PointCAT: "PointCAT: Cross-Attention Transformer for point cloud", arXiv, 2023 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • MGT: "Multi-scale Geometry-aware Transformer for 3D Point Cloud Classification", arXiv, 2023 (TUM). [Paper]
  • Swin3D: "Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
  • ViewFormer: "ViewFormer: View Set Attention for Multi-view 3D Shape Understanding", arXiv, 2023 (Renmin University of China). [Paper]
  • ULIP-2: "ULIP-2: Towards Scalable Multimodal Pre-training For 3D Understanding", arXiv, 2023 (Salesforce). [Paper]
  • CDFormer: "Collect-and-Distribute Transformer for 3D Point Cloud Analysis", arXiv, 2023 (The University of Sydney). [Paper][PyTorch]
  • PointCAM: "Self-supervised adversarial masking for 3D point cloud representation learning", arXiv, 2023 (Wrocław University of Science and Technology, Poland). [Paper][PyTorch]
  • PPT: "Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training", arXiv, 2023 (HKU). [Paper][PyTorch]
  • Uni3D: "Uni3D: Exploring Unified 3D Representation at Scale", arXiv, 2023 (BAAI). [Paper][PyTorch]
  • JM3D: "JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues", arXiv, 2023 (Xiamen University). [Paper][PyTorch]
  • PonderV2: "PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
  • MeshGPT: "MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers", arXiv, 2023 (TUM). [Paper][Website]
  • PTv3: "Point Transformer V3: Simpler, Faster, Stronger", arXiv, 2023 (HKU). [Paper][Code (in construction)]
  • 3D-LFM: "3D-LFM: Lifting Foundation Model", arXiv, 2023 (CMU). [Paper][Code][Website]
  • LAST-PCL: "Language-Assisted 3D Scene Understanding", AAAI, 2024 (Peking). [Paper][Code (in construction)]
  • MM-Point: "MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding", AAAI, 2024 (Southeast University, China). [Paper]
  • Point-PEFT: "Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models", AAAI, 2024 (Shanghai AI Lab). [Paper][PyTorch]
  • DAPT: "Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis", CVPR, 2024 (Huazhong University of Science and Technology (HUST)). [Paper][PyTorch]
  • UniPVU-Human: "A Unified Framework for Human-centric Point Cloud Video Understanding", CVPR, 2024 (ShanghaiTech). [Paper]
  • PointMamba: "PointMamba: A Simple State Space Model for Point Cloud Analysis", arXiv, 2024 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • Swin3D++: "Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding", arXiv, 2024 (Microsoft). [Paper]
  • PCM: "Point Cloud Mamba: Point Cloud Learning via State Space Model", arXiv, 2024 (Skywork AI, China). [Paper][Code (in construction)]
  • Point-Mamba: "Point Mamba: A Novel Point Cloud Backbone Based on State Space Model with Octree-Based Ordering Strategy", arXiv, 2024 (Shanghai Jiao Tong). [Paper][PyTorch]
  • PIC-S: "Point-In-Context: Understanding Point Cloud via In-Context Learning", arXiv, 2024 (Peking). [Paper][Website][PyTorch]
  • ?: "Pose Priors from Language Models", arXiv, 2024 (Berkeley). [Paper]

[Back to Overview]

Pose Estimation

  • Human-body:
    • HOT-Net: "HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation", ACMMM, 2020 (Kwai). [Paper]
    • TransPose: "TransPose: Towards Explainable Human Pose Estimation by Transformer", arXiv, 2020 (Southeast University). [Paper][PyTorch]
    • PTF: "Locally Aware Piecewise Transformation Fields for 3D Human Mesh Registration", CVPR, 2021 (ETHZ). [Paper][Code (in construction)][Website]
    • METRO: "End-to-End Human Pose and Mesh Reconstruction with Transformers", CVPR, 2021 (Microsoft). [Paper][PyTorch]
    • PRTR: "Pose Recognition with Cascade Transformers", CVPR, 2021 (UCSD). [Paper][PyTorch]
    • Mesh-Graphormer: "Mesh Graphormer", ICCV, 2021 (Microsoft). [Paper][PyTorch]
    • THUNDR: "THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers", ICCV, 2021 (Google). [Paper]
    • PoseFormer: "3D Human Pose Estimation with Spatial and Temporal Transformers", ICCV, 2021 (UCF). [Paper][PyTorch]
    • TransPose: "TransPose: Keypoint Localization via Transformer", ICCV, 2021 (Southeast University, China). [Paper][PyTorch]
    • POTR: "Pose Transformers (POTR): Human Motion Prediction With Non-Autoregressive Transformers", ICCVW, 2021 (Idiap). [Paper]
    • TransFusion: "TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation", BMVC, 2021 (UC Irvine). [Paper][PyTorch]
    • HRT: "HRFormer: High-Resolution Transformer for Dense Prediction", NeurIPS, 2021 (CAS). [Paper][PyTorch]
    • POET: "End-to-End Trainable Multi-Instance Pose Estimation with Transformers", arXiv, 2021 (EPFL). [Paper]
    • Lifting-Transformer: "Lifting Transformer for 3D Human Pose Estimation in Video", arXiv, 2021 (Peking). [Paper]
    • TFPose: "TFPose: Direct Human Pose Estimation with Transformers", arXiv, 2021 (The University of Adelaide). [Paper][PyTorch]
    • Skeletor: "Skeletor: Skeletal Transformers for Robust Body-Pose Estimation", arXiv, 2021 (University of Surrey). [Paper]
    • HandsFormer: "HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction", arXiv, 2021 (Graz University of Technology). [Paper]
    • TTP: "Test-Time Personalization with a Transformer for Human Pose Estimation", NeurIPS, 2021 (UCSD). [Paper][PyTorch][Website]
    • GraFormer: "GraFormer: Graph Convolution Transformer for 3D Pose Estimation", arXiv, 2021 (CAS). [Paper]
    • GCT: "Geometry-Contrastive Transformer for Generalized 3D Pose Transfer", AAAI, 2022 (University of Oulu). [Paper][PyTorch]
    • MHFormer: "MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation", CVPR, 2022 (Peking). [Paper][PyTorch]
    • PAHMT: "Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation", CVPR, 2022 (NetEase). [Paper]
    • TCFormer: "Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • PETR: "End-to-End Multi-Person Pose Estimation With Transformers", CVPR, 2022 (Hikvision). [Paper][PyTorch]
    • GraFormer: "GraFormer: Graph-Oriented Transformer for 3D Pose Estimation", CVPR, 2022 (CAS). [Paper]
    • Keypoint-Transformer: "Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation", CVPR, 2022 (Graz University of Technology, Austria). [Paper][PyTorch][Website]
    • MPS-Net: "Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video", CVPR, 2022 (Academia Sinica). [Paper][Website]
    • Ego-STAN: "Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation", CVPRW, 2022 (University of Waterloo, Canada). [Paper]
    • AggPose: "AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation", IJCAI, 2022 (Shenzhen Baoan Women’s and Children’s Hospital). [Paper][Code (in construction)]
    • MotionMixer: "MotionMixer: MLP-based 3D Human Body Pose Forecasting", IJCAI, 2022 (Ulm University, Germany). [Paper][Code (in construction)]
    • Jointformer: "Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation", ICPR, 2022 (Trinity College Dublin, Ireland). [Paper]
    • IVT: "IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation", ACMMM, 2022 (Baidu). [Paper]
    • FastMETRO: "Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers", ECCV, 2022 (POSTECH). [Paper][PyTorch][Website]
    • PPT: "PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation", ECCV, 2022 (UC Irvine). [Paper][PyTorch]
    • Poseur: "Poseur: Direct Human Pose Regression with Transformers", ECCV, 2022 (The University of Adelaide, Australia). [Paper]
    • ViTPose: "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation", NeurIPS, 2022 (The University of Sydney). [Paper][PyTorch]
    • Swin-Pose: "Swin-Pose: Swin Transformer Based Human Pose Estimation", arXiv, 2022 (UMass Lowell). [Paper]
    • HeadPosr: "HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders", arXiv, 2022 (ETHZ). [Paper]
    • CrossFormer: "CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation", arXiv, 2022 (Canberra University, Australia). [Paper]
    • VTP: "VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
    • FeatER: "FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER", CVPR, 2023 (UCF). [Paper][Code (in construction)][Website]
    • GraphMLP: "GraphMLP: A Graph MLP-Like Architecture for 3D Human Pose Estimation", arXiv, 2022 (Peking University). [Paper]
    • siMLPe: "Back to MLP: A Simple Baseline for Human Motion Prediction", arXiv, 2022 (INRIA). [Paper][PyTorch]
    • Snipper: "Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet", arXiv, 2022 (University of Alberta, Canada). [Paper][PyTorch]
    • OTPose: "OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos", arXiv, 2022 (Korea University). [Paper]
    • PoseBERT: "PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling", arXiv, 2022 (NAVER). [Paper][PyTorch]
    • KOG-Transformer: "K-Order Graph-oriented Transformer with GraAttention for 3D Pose and Shape Estimation", arXiv, 2022 (CAS). [Paper]
    • SoMoFormer: "SoMoFormer: Multi-Person Pose Forecasting with Transformers", arXiv, 2022 (Stanford). [Paper]
    • DPIT: "DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation", arXiv, 2022 (Shanghai University). [Paper]
    • Uplift-Upsample: "Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers", WACV, 2023 (University of Augsburg, Germany). [Paper][Tensorflow]
    • TORE: "TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer", ICCV, 2023 (HKU). [Paper][Code (in construction)][Website]
    • MPT: "MPT: Mesh Pre-Training with Transformers for Human Pose and Mesh Reconstruction", arXiv, 2022 (Microsoft). [Paper]
    • ViTPose+: "ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation", arXiv, 2022 (The University of Sydney). [Paper][PyTorch]
    • POT: "Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation", AAAI, 2023 (Shanghai Jiao Tong). [Paper]
    • INT: "Capturing the Motion of Every Joint: 3D Human Pose and Shape Estimation with Independent Tokens", ICLR, 2023 (Southeast University). [Paper]
    • TBIFormer: "Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting", CVPR, 2023 (Hangzhou Dianzi University). [Paper]
    • PSVT: "PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers", CVPR, 2023 (Baidu). [Paper]
    • PCT: "Human Pose as Compositional Tokens", CVPR, 2023 (Microsoft). [Paper][PyTorch][Website]
    • OSX: "One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer", CVPR, 2023 (IDEA). [Paper][PyTorch][Website]
    • PoseFormerV2: "PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation", CVPR, 2023 (UCF). [Paper][PyTorch][Website]
    • SA-HMR: "Learning Human Mesh Recovery in 3D Scenes", CVPR, 2023 (Zhejiang University). [Paper][Code (in construction)][Website]
    • DeFormer: "Deformable Mesh Transformer for 3D Human Mesh Recovery", CVPR, 2023 (National Institute of Advanced Industrial Science and Technology (AIST), Japan). [Paper]
    • STCFormer: "3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention", CVPR, 2023 (Hefei University of Technology). [Paper]
    • DistilPose: "DistilPose: Tokenized Pose Regression With Heatmap Distillation", CVPR, 2023 (Tencent). [Paper][PyTorch]
    • LPFormer: "LPFormer: LiDAR Pose Estimation Transformer with Multi-Task Network", CVPRW, 2023 (UCF). [Paper]
    • LAMP: "LAMP: Leveraging Language Prompts for Multi-person Pose Estimation", IROS, 2023 (UCF). [Paper][PyTorch]
    • DiffPose: "DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation", ICCV, 2023 (Jilin University). [Paper]
    • JOTR: "JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery", ICCV, 2023 (Alibaba). [Paper]
    • GroupPose: "Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation", ICCV, 2023 (Baidu). [Paper][Paddle][PyTorch]
    • CoordFormer: "Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos", ICCV, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
    • PoseFix: "PoseFix: Correcting 3D Human Poses with Natural Language", ICCV, 2023 (NAVER). [Paper][Website]
    • 4D-Humans: "Humans in 4D: Reconstructing and Tracking Humans with Transformers", ICCV, 2023 (Berkeley). [Paper][PyTorch][Website]
    • HopFIR: "HopFIR: Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human Pose Estimation", ICCV, 2023 (Hefei University of Technology). [Paper]
    • HumanMAC: "HumanMAC: Masked Motion Completion for Human Motion Prediction", ICCV, 2023 (Tsinghua). [Paper][PyTorch][Website]
    • XFormer: "XFormer: Fast and Accurate Monocular 3D Body Capture", arXiv, 2023 (Huya Inc, China). [Paper]
    • PGformer: "PGformer: Proxy-Bridged Game Transformer for Multi-Person Extremely Interactive Motion Prediction", arXiv, 2023 (Alibaba). [Paper]
    • ?: "Scene-aware Human Pose Generation using Transformer", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • HoT: "Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation", arXiv, 2023 (Peking). [Paper]
    • Pose-Anything: "Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation", arXiv, 2023 (Tel Aviv). [Paper][Code (in construction)][Website]
    • PoseGPT: "PoseGPT: Chatting about 3D Human Pose", arXiv, 2023 (MPI). [Paper][Code (in construction)][Website]
    • TEMP3D: "TEMP3D: Temporally Continuous 3D Human Pose Estimation Under Occlusions", arXiv, 2023 (UC Riverside). [Paper][Website]
    • FinePOSE: "FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models", CVPR, 2024 (University of Science and Technology Beijing). [Paper][PyTorch]
    • VLPose: "VLPose: Bridging the Domain Gap in Pose Estimation with Language-Vision Tuning", arXiv, 2024 (CUHK). [Paper]
    • ?: "Multi-Human Mesh Recovery with Transformers", arXiv, 2024 (Stanford). [Paper]
    • WHAC: "WHAC: World-grounded Humans and Cameras", arXiv, 2024 (SenseTime). [Paper][Code (in construction)][Website]
    • AiOS: "AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation", arXiv, 2024 (SenseTime). [Paper][Code (in construction)][Website]
    • EgoPoseFormer: "EgoPoseFormer: A Simple Baseline for Egocentric 3D Human Pose Estimation", arXiv, 2024 (Meta). [Paper]
  • Hands:
    • Hand-Transformer: "Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation", ECCV, 2020 (Kwai). [Paper]
    • SCAT: "SCAT: Stride Consistency With Auto-Regressive Regressor and Transformer for Hand Pose Estimation", ICCVW, 2021 (Alibaba). [Paper]
    • SeTHPose: "Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation", arXiv, 2022 (Queen's University, Canada). [Paper]
    • HTT: "Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos", CVPR, 2023 (HKU). [Paper][PyTorch][Website]
    • ?: "Image-free Domain Generalization via CLIP for 3D Hand Pose Estimation", arXiv, 2022 (UNIST, Korea). [Paper]
    • A2J-Transformer: "A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation from a Single RGB Image", CVPR, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • H2OTR: "Transformer-Based Unified Recognition of Two Hands Manipulating Objects", CVPR, 2023 (Ulsan National Institute of Science & Technology (UNIST), Korea). [Paper]
    • Deformer: "Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation", ICCV, 2023 (CMU). [Paper][Code (in construction)][Website]
    • CLIP-Hand3D: "CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting", ACMMM, 2023 (Ocean University of China). [Paper]
  • Others:
    • TAPE: "Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry", arXiv, 2020 (Tianjin University). [Paper]
    • T6D-Direct: "T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression", GCPR, 2021 (University of Bonn). [Paper]
    • 6D-ViT: "6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning", arXiv, 2021 (University of Science and Technology of China). [Paper]
    • RayTran: "RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers", ECCV, 2022 (Google). [Paper]
    • DProST: "DProST: Dynamic Projective Spatial Transformer Network for 6D Pose Estimation", ECCV, 2022 (Seoul National University). [Paper][PyTorch]
    • AFT-VO: "AFT-VO: Asynchronous Fusion Transformers for Multi-View Visual Odometry Estimation", arXiv, 2022 (University of Surrey, UK). [Paper]
    • DPT-VO: "Dense Prediction Transformer for Scale Estimation in Monocular Visual Odometry", arXiv, 2022 (Aeronautics Institute of Technology, Brazil). [Paper]
    • ?: "Video based Object 6D Pose Estimation using Transformers", arXiv, 2022 (Georgia Tech). [Paper][PyTorch]
    • PoET: "PoET: Pose Estimation Transformer for Single-View, Multi-Object 6D Pose Estimation", arXiv, 2022 (Infineon Technologies Austria AG). [Paper][PyTorch]
    • CRT-6D: "CRT-6D: Fast 6D Object Pose Estimation with Cascaded Refinement Transformers", WACV, 2023 (ICL, UK). [Paper][Code (in construction)]
    • TokenHPE: "TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers", CVPR, 2023 (Central China Normal University). [Paper][Code (in construction)]
    • CLAMP: "CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose", CVPR, 2023 (The University of Sydney). [Paper][Code (in construction)]
    • DFTr: "Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation", ICCV, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
    • c2f-MS-Trans: "Coarse-to-Fine Multi-Scene Pose Regression with Transformers", TPAMI, 2023 (Bar-Ilan University (BIU), Israel). [Paper]
    • TransPoser: "TransPoser: Transformer as an Optimizer for Joint Object Shape and Pose Estimation", arXiv, 2023 (Kyoto University). [Paper]
    • RelPose++: "RelPose++: Recovering 6D Poses from Sparse-view Observations", arXiv, 2023 (CMU). [Paper][PyTorch][Website]
    • KDSM: "Language-driven Open-Vocabulary Keypoint Detection for Animal Body and Face", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • UniPose: "UniPose: Detecting Any Keypoints", arXiv, 2023 (IDEA). [Paper][Code (in construction)][Website]
    • SAM-6D: "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation", arXiv, 2023 (CUHK). [Paper][PyTorch (in construction)]
    • ?: "Open-vocabulary object 6D pose estimation", arXiv, 2023 (Fondazione Bruno Kessler (FBK), Italy). [Paper]
    • FoundationPose: "FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects", arXiv, 2023 (NVIDIA). [Paper][Code (in construction)][Website]
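
Most of the 2D-to-3D "lifting" methods listed under Human-body share one backbone pattern: each detected 2D joint becomes a token, a transformer encoder lets the joints attend to each other, and a regression head emits 3D coordinates. A minimal sketch of that pattern, assuming PyTorch; hyperparameters and names are illustrative, not any paper's exact design.

    import torch
    import torch.nn as nn

    class LiftingTransformer(nn.Module):
        def __init__(self, num_joints=17, dim=128, depth=4, heads=8):
            super().__init__()
            self.joint_embed = nn.Linear(2, dim)          # (x, y) -> token
            self.pos_embed = nn.Parameter(torch.zeros(1, num_joints, dim))
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(dim, 3)                 # per-joint 3D output

        def forward(self, joints_2d):
            # joints_2d: (B, J, 2) image-space keypoints from a 2D detector.
            tokens = self.joint_embed(joints_2d) + self.pos_embed
            tokens = self.encoder(tokens)                 # joints attend to joints
            return self.head(tokens)                      # (B, J, 3)

    # e.g. LiftingTransformer()(torch.randn(8, 17, 2)) -> (8, 17, 3)

Video-based variants such as the spatio-temporal designs above typically add a second transformer over the time axis; the per-frame spatial module looks essentially like this.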

[Back to Overview]

Tracking

  • General:
    • TransTrack: "TransTrack: Multiple-Object Tracking with Transformer", arXiv, 2020 (HKU + ByteDance). [Paper][PyTorch]
    • TransformerTrack: "Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking", CVPR, 2021 (USTC). [Paper][PyTorch]
    • TransT: "Transformer Tracking", CVPR, 2021 (Dalian University of Technology). [Paper][PyTorch]
    • STARK: "Learning Spatio-Temporal Transformer for Visual Tracking", ICCV, 2021 (Microsoft). [Paper][PyTorch]
    • HiFT: "HiFT: Hierarchical Feature Transformer for Aerial Tracking", ICCV, 2021 (Tongji University). [Paper][PyTorch]
    • DTT: "High-Performance Discriminative Tracking With Transformers", ICCV, 2021 (CAS). [Paper]
    • DualTFR: "Learning Tracking Representations via Dual-Branch Fully Transformer Networks", ICCVW, 2021 (Microsoft). [Paper][PyTorch (in construction)]
    • TransCenter: "TransCenter: Transformers with Dense Queries for Multiple-Object Tracking", arXiv, 2021 (INRIA + MIT). [Paper]
    • TransMOT: "TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking", arXiv, 2021 (Microsoft). [Paper]
    • TREG: "Target Transformed Regression for Accurate Tracking", arXiv, 2021 (Nanjing University). [Paper][Code (in construction)]
    • TrTr: "TrTr: Visual Tracking with Transformer", arXiv, 2021 (University of Tokyo). [Paper][PyTorch]
    • RelationTrack: "RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation", arXiv, 2021 (Huazhong University of Science and Technology). [Paper]
    • SiamTPN: "Siamese Transformer Pyramid Networks for Real-Time UAV Tracking", WACV, 2022 (New York University). [Paper]
    • MixFormer: "MixFormer: End-to-End Tracking with Iterative Mixed Attention", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
    • ToMP: "Transforming Model Prediction for Tracking", CVPR, 2022 (ETHZ). [Paper][PyTorch]
    • GTR: "Global Tracking Transformers", CVPR, 2022 (UT Austin). [Paper][PyTorch]
    • UTT: "Unified Transformer Tracker for Object Tracking", CVPR, 2022 (Meta). [Paper][Code (in construction)]
    • MeMOT: "MeMOT: Multi-Object Tracking with Memory", CVPR, 2022 (Amazon). [Paper]
    • CSwinTT: "Transformer Tracking with Cyclic Shifting Window Attention", CVPR, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • STNet: "Spiking Transformers for Event-Based Single Object Tracking", CVPR, 2022 (Dalian University of Technology). [Paper]
    • TrackFormer: "TrackFormer: Multi-Object Tracking with Transformers", CVPR, 2022 (Facebook). [Paper][PyTorch]
    • SBT: "Correlation-Aware Deep Tracking", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • SparseTT: "SparseTT: Visual Tracking with Sparse Transformers", IJCAI, 2022 (Beihang University). [Paper][Code (in construction)]
    • AiATrack: "AiATrack: Attention in Attention for Transformer Visual Tracking", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • MOTR: "MOTR: End-to-End Multiple-Object Tracking with TRansformer", ECCV, 2022 (Megvii). [Paper][PyTorch]
    • SwinTrack: "SwinTrack: A Simple and Strong Baseline for Transformer Tracking", NeurIPS, 2022 (South China University of Technology). [Paper][PyTorch]
    • ModaMixer: "Divert More Attention to Vision-Language Tracking", NeurIPS, 2022 (Beijing Jiaotong University). [Paper][PyTorch]
    • TransMOT: "Transformers for Multi-Object Tracking on Point Clouds", IV, 2022 (Bosch). [Paper]
    • TransT-M: "High-Performance Transformer Tracking", arXiv, 2022 (Dalian University of Technology). [Paper]
    • HCAT: "Efficient Visual Tracking via Hierarchical Cross-Attention Transformer", arXiv, 2022 (Dalian University of Technology). [Paper]
    • ?: "Keypoints Tracking via Transformer Networks", arXiv, 2022 (KAIST). [Paper][PyTorch]
    • TranSTAM: "Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
    • TransFiner: "TransFiner: A Full-Scale Refinement Approach for Multiple Object Tracking", arXiv, 2022 (China University of Geosciences). [Paper]
    • LPAT: "Local Perception-Aware Transformer for Aerial Tracking", arXiv, 2022 (Tongji University). [Paper][PyTorch]
    • TADN: "Transformer-based assignment decision network for multiple object tracking", arXiv, 2022 (National Technical University of Athens, Greece). [Paper][Code (in construction)]
    • Strong-TransCenter: "Strong-TransCenter: Improved Multi-Object Tracking based on Transformers with Dense Representations", arXiv, 2022 (Tel-Aviv University). [Paper][PyTorch]
    • MQT: "End-to-end Tracking with a Multi-query Transformer", arXiv, 2022 (Oxford). [Paper]
    • ProContEXT: "ProContEXT: Exploring Progressive Context Transformer for Tracking", arXiv, 2022 (Alibaba). [Paper]
    • ?: "Efficient Joint Detection and Multiple Object Tracking with Spatially Aware Transformer", arXiv, 2022 (Sony). [Paper]
    • MOTRv2: "MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors", CVPR, 2023 (Megvii). [Paper][PyTorch]
    • ViPT: "Visual Prompt Multi-Modal Tracking", CVPR, 2023 (Dalian University of Technology). [Paper][PyTorch]
    • GRM: "Generalized Relation Modeling for Transformer Tracking", CVPR, 2023 (HKUST). [Paper][PyTorch]
    • DropMAE: "DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks", CVPR, 2023 (CUHK). [Paper][PyTorch]
    • OVTrack: "OVTrack: Open-Vocabulary Multiple Object Tracking", CVPR, 2023 (ETHZ). [Paper][Website]
    • SeqTrack: "SeqTrack: Sequence to Sequence Learning for Visual Object Tracking", CVPR, 2023 (Dalian University of Technology). [Paper][PyTorch]
    • TCOW: "Tracking through Containers and Occluders in the Wild", CVPR, 2023 (Columbia). [Paper][Code (in construction)][Website]
    • VideoTrack: "VideoTrack: Learning to Track Objects via Video Transformer", CVPR, 2023 (Microsoft). [Paper]
    • MAT: "Representation Learning for Visual Object Tracking by Masked Appearance Transfer", CVPR, 2023 (Dalian University of Technology). [Paper][PyTorch]
    • MeMOTR: "MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking", ICCV, 2023 (Nanjing University). [Paper][PyTorch]
    • ROMTrack: "Robust Object Modeling for Visual Tracking", ICCV, 2023 (Nanjing University). [Paper][PyTorch]
    • HiT: "Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking", ICCV, 2023 (Dalian University of Technology). [Paper][PyTorch]
    • OC-MOT: "Object-Centric Multiple Object Tracking", ICCV, 2023 (Amazon). [Paper]
    • ColTrack: "Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking", ICCV, 2023 (ByteDance). [Paper][PyTorch]
    • MVT: "Mobile Vision Transformer-based Visual Object Tracking", BMVC, 2023 (Concordia University, Canada). [Paper][PyTorch]
    • MixFormerV2: "MixFormerV2: Efficient Fully Transformer Tracking", NeurIPS, 2023 (Nanjing University). [Paper][PyTorch]
    • MENDER: "Type-to-Track: Retrieve Any Object via Prompt-based Tracking", NeurIPS, 2023 (University of Arkansas). [Paper][Code][Website]
    • MOTRv3: "MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking", arXiv, 2023 (Megvii). [Paper]
    • OmniMotion: "Tracking Everything Everywhere All at Once", arXiv, 2023 (Cornell). [Paper][PyTorch][Website]
    • ?: "A Dual-Source Attention Transformer for Multi-Person Pose Tracking", arXiv, 2023 (University of Bonn, Germany). [Paper]
    • TAM: "Track Anything: Segment Anything Meets Videos", arXiv, 2023 (SUSTech). [Paper][PyTorch]
    • SAM-Track: "Segment and Track Anything", arXiv, 2023 (Zhejiang University). [Paper][PyTorch]
    • CoTracker: "CoTracker: It is Better to Track Together", arXiv, 2023 (Meta). [Paper][PyTorch]
    • OVTracktor: "Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models", arXiv, 2023 (CMU). [Paper][Website]
    • Un-Track: "Single-Model and Any-Modality for Video Object Tracking", arXiv, 2023 (University of Wurzburg (JMU), Germany). [Paper][Code (in construction)]
    • TAO-Amodal: "Tracking Any Object Amodally", arXiv, 2023 (CMU). [Paper][Code (in construction)][Website]
    • ARTrackV2: "ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe", arXiv, 2023 (Xi'an Jiaotong University). [Paper][Website]
    • TrackGPT: "Tracking with Human-Intent Reasoning", arXiv, 2024 (Alibaba). [Paper][Code (in construction)]
    • SMAT: "Separable Self and Mixed Attention Transformers for Efficient Object Tracking", WACV, 2024 (Concordia University, Canada). [Paper][PyTorch]
    • ContrasTR: "Contrastive Learning for Multi-Object Tracking with Transformers", WACV, 2024 (KU Leuven). [Paper]
    • M3SOT: "M3SOT: Multi-frame, Multi-field, Multi-space 3D Single Object Tracking", AAAI, 2024 (Xidian University). [Paper]
    • EVPTrack: "Explicit Visual Prompts for Visual Object Tracking", AAAI, 2024 (Guangxi Normal University). [Paper]
    • OneTracker: "OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning", CVPR, 2024 (Fudan). [Paper]
    • SBT: "Correlation-Embedded Transformer Tracking: A Single-Branch Framework", arXiv, 2024 (SJTU). [Paper][PyTorch]
    • TAPTR: "TAPTR: Tracking Any Point with Transformers as Detection", arXiv, 2024 (IDEA). [Paper][Website]
    • DINO-Tracker: "DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video", arXiv, 2024 (Weizmann Institute of Science, Israel). [Paper][Website]
  • 3D:
    • PTT: "PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds", IROS, 2021 (Northeastern University). [Paper][PyTorch (in construction)]
    • LTTR: "3D Object Tracking with Transformer", BMVC, 2021 (Northeastern University, China). [Paper][Code (in construction)]
    • PTTR: "PTTR: Relational 3D Point Cloud Object Tracking with Transformer", CVPR, 2022 (SenseTime). [Paper][PyTorch]
    • STNet: "3D Siamese Transformer Network for Single Object Tracking on Point Clouds", ECCV, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
    • CMT: "CMT: Context-Matching-Guided Transformer for 3D Tracking in Point Clouds", ECCV, 2022 (USTC). [Paper]
    • PTT: "Real-time 3D Single Object Tracking with Transformer", TMM, 2022 (Northeastern University, China). [Paper][PyTorch]
    • InterTrack: "InterTrack: Interaction Transformer for 3D Multi-Object Tracking", arXiv, 2022 (University of Toronto). [Paper]
    • PTTR++: "Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
    • GLT-T: "GLT-T: Global-Local Transformer Voting for 3D Single Object Tracking in Point Clouds", AAAI, 2023 (Hangzhou Dianzi University). [Paper][PyTorch]
    • 3DMOTFormer: "3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking", ICCV, 2023 (University of Bonn, Germany). [Paper][PyTorch]
    • CiteTracker: "CiteTracker: Correlating Image and Text for Visual Tracking", ICCV, 2023 (Peng Cheng Lab). [Paper][PyTorch]
    • MoMA-M3T: "Delving into Motion-Aware Matching for Monocular 3D Object Tracking", ICCV, 2023 (UC Merced). [Paper][Code (in construction)]
    • TrajectoryFormer: "TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses", ICCV, 2023 (CUHK). [Paper][Code (in construction)]
    • SyncTrack: "Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking", ICCV, 2023 (Zhejiang University). [Paper]
    • MBPTrack: "MBPTrack: Improving 3D Point Cloud Tracking with Memory Networks and Box Priors", ICCV, 2023 (Tsinghua). [Paper]
    • DQTrack: "End-to-end 3D Tracking with Decoupled Queries", ICCV, 2023 (NVIDIA). [Paper][PyTorch]
    • GLT-T++: "GLT-T++: Global-Local Transformer for 3D Siamese Tracking with Ranking Loss", arXiv, 2023 (Hangzhou Dianzi University). [Paper][PyTorch]
    • BOTT: "BOTT: Box Only Transformer Tracker for 3D Object Tracking", arXiv, 2023 (Motional). [Paper]
    • ADA-Track: "ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association", CVPR, 2024 (Mercedes-Benz). [Paper][PyTorch]
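
A common core in the Siamese transformer trackers above is cross-attention between template and search-region features, which replaces the classical correlation operation. A minimal sketch under that assumption, in PyTorch; module and argument names are illustrative, not taken from any specific tracker.

    import torch
    import torch.nn as nn

    class TemplateSearchFusion(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, search_tokens, template_tokens):
            # search_tokens:   (B, Hs*Ws, C) flattened search-region features
            # template_tokens: (B, Ht*Wt, C) flattened target-template features
            fused, _ = self.cross_attn(query=search_tokens,
                                       key=template_tokens,
                                       value=template_tokens)
            return self.norm(search_tokens + fused)       # residual + norm

    # A light head on the fused search tokens then predicts the target box,
    # e.g. per-token foreground classification plus box regression.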

[Back to Overview]

Re-ID

  • PAT: "Diverse Part Discovery: Occluded Person Re-Identification With Part-Aware Transformer", CVPR, 2021 (University of Science and Technology of China). [Paper]
  • HAT: "HAT: Hierarchical Aggregation Transformers for Person Re-identification", ACMMM, 2021 (Dalian University of Technology). [Paper]
  • TransReID: "TransReID: Transformer-based Object Re-Identification", ICCV, 2021 (Alibaba). [Paper][PyTorch]
  • APD: "Transformer Meets Part Model: Adaptive Part Division for Person Re-Identification", ICCVW, 2021 (Meituan). [Paper]
  • Pirt: "Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification", ACMMM, 2021 (Beihang University). [Paper]
  • TransMatcher: "Transformer-Based Deep Image Matching for Generalizable Person Re-identification", NeurIPS, 2021 (IIAI). [Paper][PyTorch]
  • STT: "Spatiotemporal Transformer for Video-based Person Re-identification", arXiv, 2021 (Beihang University). [Paper]
  • AAformer: "AAformer: Auto-Aligned Transformer for Person Re-Identification", arXiv, 2021 (CAS). [Paper]
  • TMT: "A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification", arXiv, 2021 (Dalian University of Technology). [Paper]
  • LA-Transformer: "Person Re-Identification with a Locally Aware Transformer", arXiv, 2021 (University of Maryland Baltimore County). [Paper]
  • DRL-Net: "Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification", arXiv, 2021 (Peking University). [Paper]
  • GiT: "GiT: Graph Interactive Transformer for Vehicle Re-identification", arXiv, 2021 (Huaqiao University). [Paper]
  • OH-Former: "OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification", arXiv, 2021 (Shanghaitech University). [Paper]
  • CMTR: "CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification", arXiv, 2021 (Beijing Jiaotong University). [Paper]
  • PFD: "Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer", AAAI, 2022 (Peking). [Paper][PyTorch]
  • NFormer: "NFormer: Robust Person Re-identification with Neighbor Transformer", CVPR, 2022 (University of Amsterdam, Netherlands). [Paper][Code (in construction)]
  • DCAL: "Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification", CVPR, 2022 (Advanced Micro Devices, China). [Paper]
  • CMT: "Cross-Modality Transformer for Visible-Infrared Person Re-identification", ECCV, 2022 (USTC). [Paper]
  • CAViT: "CAViT: Contextual Alignment Vision Transformer for Video Object Re-identification", ECCV, 2022 (CAS). [Paper][PyTorch]
  • PiT: "Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval", IEEE Transactions on Industrial Informatics, 2022 (Peking). [Paper]
  • ?: "Motion-Aware Transformer For Occluded Person Re-identification", arXiv, 2022 (NetEase, China). [Paper]
  • PFT: "Short Range Correlation Transformer for Occluded Person Re-Identification", arXiv, 2022 (Nanjing University of Posts and Telecommunications). [Paper]
  • ?: "CLIP-Driven Fine-grained Text-Image Person Re-identification", arXiv, 2022 (Nanjing University of Science and Technology). [Paper]
  • SeqTR: "Sequential Transformer for End-to-End Person Search", arXiv, 2022 (East China Normal University). [Paper]
  • CLIP-ReID: "CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels", arXiv, 2022 (East China Normal University). [Paper]
  • TMGF: "Transformer Based Multi-Grained Features for Unsupervised Person Re-Identification", WACVW, 2023 (Zhejiang University). [Paper][Code (in construction)]
  • PMT: "Learning Progressive Modality-shared Transformers for Effective Visible-Infrared Person Re-identification", AAAI, 2023 (Jiangsu University). [Paper][Code (in construction)]
  • DC-Former: "DC-Former: Diverse and Compact Transformer for Person Re-Identification", AAAI, 2023 (Ant Group). [Paper][PyTorch]
  • PHA: "PHA: Patch-Wise High-Frequency Augmentation for Transformer-Based Person Re-Identification", CVPR, 2023 (Beihang University). [Paper]
  • TranSG: "TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch]
  • UNIReID: "Towards Modality-Agnostic Person Re-Identification With Descriptive Query", CVPR, 2023 (Wuhan University). [Paper][PyTorch]
  • UniPT: "Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification", ICCV, 2023 (Baidu). [Paper][PyTorch]
  • PAT: "Part-Aware Transformer for Generalizable Person Re-identification", ICCV, 2023 (UESTC). [Paper][PyTorch]
  • HAP: "HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception", NeurIPS, 2023 (Baidu). [Paper][PyTorch][Website]
  • TP-TPS: "Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search", arXiv, 2023 (Tencent). [Paper]
  • PLIP: "PLIP: Language-Image Pre-training for Person Representation Learning", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][Code (in construction)]
  • SSCP: "Selecting Learnable Training Samples is All DETRs Need in Crowded Pedestrian Detection", arXiv, 2023 (Chongqing University of Posts and Telecommunications). [Paper]
  • PI-VL: "Exploring Part-Informed Visual-Language Learning for Person Re-Identification", arXiv, 2023 (iFLYTEK, China). [Paper]
  • TBPS-CLIP: "An Empirical Study of CLIP for Text-based Person Search", arXiv, 2023 (Soochow University, China). [Paper][PyTorch]
  • PersonMAE: "PersonMAE: Person Re-Identification Pre-Training with Masked AutoEncoders", arXiv, 2023 (Microsoft). [Paper]
  • TF-CLIP: "TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification", AAAI, 2024 (Dalian University of Technology). [Paper]
  • TOP-ReID: "TOP-ReID: Multi-spectral Object Re-Identification with Token Permutation", AAAI, 2024 (Dalian University of Technology). [Paper][PyTorch]
  • MP-ReID: "Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification", AAAI, 2024 (Eastern Institute of Technology, China). [Paper]
  • EDITOR: "Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification", CVPR, 2024 (Dalian University of Technology). [Paper][PyTorch]
  • VDT: "View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network", CVPR, 2024 (Sun Yat-Sen University). [Paper][PyTorch]
  • MLLM4Text-ReID: "Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID", CVPR, 2024 (South China University of Technology). [Paper][PyTorch]
  • AIO: "All in One Framework for Multimodal Re-identification in the Wild", CVPR, 2024 (Wuhan University). [Paper]
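
The transformer-based Re-ID pipelines above mostly follow one recipe: a ViT backbone whose class token serves as the identity embedding, trained with an ID-classification loss (usually combined with a triplet loss) and compared by cosine similarity at retrieval time. A minimal sketch, assuming PyTorch/torchvision; the BNNeck and the Market-1501-sized classifier are illustrative choices, not any specific paper's code.

    import torch
    import torch.nn as nn
    from torchvision.models import vit_b_16

    class ReIDEmbedder(nn.Module):
        def __init__(self, num_ids=751, embed_dim=768):
            super().__init__()
            vit = vit_b_16(weights=None)       # pretrained weights in practice
            vit.heads = nn.Identity()          # expose the class-token feature
            self.backbone = vit
            self.bnneck = nn.BatchNorm1d(embed_dim)   # widely used "BNNeck"
            self.classifier = nn.Linear(embed_dim, num_ids, bias=False)

        def forward(self, images):
            # images: (B, 3, 224, 224)
            feat = self.backbone(images)                 # (B, 768) class token
            logits = self.classifier(self.bnneck(feat))  # for ID cross-entropy
            return feat, logits                         # feat: retrieval embedding

    # At test time, rank gallery images by cosine similarity of `feat`.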

[Back to Overview]

Face

  • General:
    • FAU-Transformer: "Facial Action Unit Detection With Transformers", CVPR, 2021 (Rakuten Institute of Technology). [Paper]
    • TADeT: "Mitigating Bias in Visual Transformers via Targeted Alignment", BMVC, 2021 (Georgia Tech). [Paper]
    • ViT-Face: "Face Transformer for Recognition", arXiv, 2021 (Beijing University of Posts and Telecommunications). [Paper]
    • FaceT: "Learning to Cluster Faces via Transformer", arXiv, 2021 (Alibaba). [Paper]
    • VidFace: "VidFace: A Full-Transformer Solver for Video Face Hallucination with Unaligned Tiny Snapshots", arXiv, 2021 (Zhejiang University). [Paper]
    • FAA: "Shuffle Transformer with Feature Alignment for Video Face Parsing", arXiv, 2021 (Tencent). [Paper]
    • FaRL: "General Facial Representation Learning in a Visual-Linguistic Manner", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • FaceFormer: "FaceFormer: Speech-Driven 3D Facial Animation with Transformers", CVPR, 2022 (HKU). [Paper][PyTorch][Website]
    • PhysFormer: "PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer", CVPR, 2022 (University of Oulu, Finland). [Paper][PyTorch]
    • VTP: "Sub-word Level Lip Reading With Visual Attention", CVPR, 2022 (Oxford). [Paper]
    • Label2Label: "Label2Label: A Language Modeling Framework for Multi-Attribute Learning", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
    • FPVT: "Face Pyramid Vision Transformer", BMVC, 2022 (FloppyDisk.AI, Pakistan). [Paper][PyTorch][Website]
    • fViT: "Part-based Face Recognition with Vision Transformers", BMVC, 2022 (Queen Mary University of London). [Paper]
    • EventFormer: "EventFormer: AU Event Transformer for Facial Action Unit Event Detection", arXiv, 2022 (Peking). [Paper]
    • MFT: "Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers", arXiv, 2022 (SUNY Binghamton). [Paper]
    • VC-TRSF: "Self-supervised Video-centralised Transformer for Video Face Clustering", arXiv, 2022 (ICL). [Paper]
    • MARLIN: "MARLIN: Masked Autoencoder for facial video Representation LearnINg", CVPR, 2023 (Monash University, Australia). [Paper][PyTorch]
    • TransFace: "TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective", ICCV, 2023 (Alibaba). [Paper][PyTorch]
    • FaceXFormer: "FaceXFormer: A Unified Transformer for Facial Analysis", arXiv, 2024 (JHU). [Paper][PyTorch][Website]
    • Arc2Face: "Arc2Face: A Foundation Model of Human Faces", arXiv, 2024 (ICL). [Paper][PyTorch][Website]
  • Facial Landmark:
    • Clusformer: "Clusformer: A Transformer Based Clustering Approach to Unsupervised Large-Scale Face and Visual Landmark Recognition", CVPR, 2021 (VinAI Research, Vietnam). [Paper]
    • LOTR: "LOTR: Face Landmark Localization Using Localization Transformer", arXiv, 2021 (Sertis, Thailand). [Paper]
    • SLPT: "Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning", CVPR, 2022 (University of Technology Sydney). [Paper][PyTorch]
    • DTLD: "Towards Accurate Facial Landmark Detection via Cascaded Transformers", CVPR, 2022 (Samsung). [Paper]
    • RePFormer: "RePFormer: Refinement Pyramid Transformer for Robust Facial Landmark Detection", arXiv, 2022 (CUHK). [Paper]
  • Face Low-Level Vision:
    • Latent-Transformer: "A Latent Transformer for Disentangled Face Editing in Images and Videos", ICCV, 2021 (Institut Polytechnique de Paris). [Paper][PyTorch]
    • TANet: "TANet: A new Paradigm for Global Face Super-resolution via Transformer-CNN Aggregation Network", arXiv, 2021 (Wuhan Institute of Technology). [Paper]
    • FAT: "Facial Attribute Transformers for Precise and Robust Makeup Transfer", WACV, 2022 (University of Rochester). [Paper]
    • SSAT: "SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal", AAAI, 2022 (Wuhan University). [Paper][PyTorch]
    • TransEditor: "TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing", CVPR, 2022 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • RestoreFormer: "RestoreFormer: High-Quality Blind Face Restoration From Undegraded Key-Value Pairs", CVPR, 2022 (HKU). [Paper]
    • HairCLIP: "HairCLIP: Design Your Hair by Text and Reference Image", CVPR, 2022 (USTC). [Paper][PyTorch]
    • AnyFace: "AnyFace: Free-style Text-to-Face Synthesis and Manipulation", CVPR, 2022 (CAS). [Paper]
    • CodeFormer: "Towards Robust Blind Face Restoration with Codebook Lookup Transformer", NeurIPS, 2022 (NTU, Singapore). [Paper][PyTorch (in construction)][Website]
    • Cycle-Text2Face: "Cycle Text2Face: Cycle Text-to-face GAN via Transformers", arXiv, 2022 (Shahed University, Iran). [Paper]
    • FaceFormer: "FaceFormer: Scale-aware Blind Face Restoration with Transformers", arXiv, 2022 (Tencent). [Paper]
    • text2StyleGAN: "Text-Free Learning of a Natural Language Interface for Pretrained Face Generators", arXiv, 2022 (Toyota Technological Institute, Chicago). [Paper][PyTorch]
    • ManiCLIP: "ManiCLIP: Multi-Attribute Face Manipulation from Text", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
    • FEAT: "FEAT: Face Editing with Attention", arXiv, 2022 (Shenzhen University). [Paper]
    • CoralStyleCLIP: "CoralStyleCLIP: Co-optimized Region and Layer Selection for Image Editing", CVPR, 2023 (Adobe). [Paper]
    • CLIP2Protect: "CLIP2Protect: Protecting Facial Privacy Using Text-Guided Makeup via Adversarial Latent Search", CVPR, 2023 (MBZUAI). [Paper][Code (in construction)]
    • PATMAT: "PATMAT: Person Aware Tuning of Mask-Aware Transformer for Face Inpainting", ICCV, 2023 (CMU). [Paper]
    • HairCLIPv2: "HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending", ICCV, 2023 (USTC). [Paper][Code (in construction)]
    • RestoreFormer++: "RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs", arXiv, 2023 (HKU). [Paper]
  • Facial Expression:
    • TransFER: "TransFER: Learning Relation-aware Facial Expression Representations with Transformers", ICCV, 2021 (CAS). [Paper]
    • CVT-Face: "Robust Facial Expression Recognition with Convolutional Visual Transformers", arXiv, 2021 (Hunan University). [Paper]
    • MViT: "MViT: Mask Vision Transformer for Facial Expression Recognition in the wild", arXiv, 2021 (University of Science and Technology of China). [Paper]
    • ViT-SE: "Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition", arXiv, 2021 (CentraleSupélec, France). [Paper]
    • EST: "Expression Snippet Transformer for Robust Video-based Facial Expression Recognition", arXiv, 2021 (China University of Geosciences). [Paper][PyTorch]
    • MFEViT: "MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition", arXiv, 2021 (University of Science and Technology of China). [Paper]
    • F-PDLS: "Vision Transformer Equipped with Neural Resizer on Facial Expression Recognition Task", ICASSP, 2022 (KAIST). [Paper]
    • ?: "Transformer-based Multimodal Information Fusion for Facial Expression Analysis", arXiv, 2022 (Netease, China). [Paper]
    • ?: "Facial Expression Recognition with Swin Transformer", arXiv, 2022 (Dongguk University, Korea). [Paper]
    • POSTER: "POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition", arXiv, 2022 (UCF). [Paper]
    • STT: "Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild", arXiv, 2022 (Hunan University). [Paper]
    • FaceMAE: "FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders", arXiv, 2022 (NUS). [Paper][Code (in construction)]
    • TransFA: "TransFA: Transformer-based Representation for Face Attribute Evaluation", arXiv, 2022 (Xidian University). [Paper]
    • AU-CVT: "AU-Supervised Convolutional Vision Transformers for Synthetic Facial Expression Recognition", arXiv, 2022 (Shenzhen Technology University). [Paper][PyTorch]
    • ?: "Multi-Task Transformer with uncertainty modelling for Face Based Affective Computing", arXiv, 2022 (Datakalab, France). [Paper]
    • APViT: "Vision Transformer with Attentive Pooling for Robust Facial Expression Recognition", arXiv, 2022 (Baidu). [Paper]
    • Micron-BERT: "Micron-BERT: BERT-based Facial Micro-Expression Recognition", CVPR, 2023 (University of Arkansas). [Paper][PyTorch (in construction)]
    • FRL-DGT: "Feature Representation Learning with Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition", CVPR, 2023 (Wuhan University). [Paper]
    • Text2Listen: "Can Language Models Learn to Listen?", ICCV, 2023 (Berkeley). [Paper][Website]
    • CLEF: "Weakly-Supervised Text-driven Contrastive Learning for Facial Behavior Understanding", ICCV, 2023 (Binghamton University). [Paper]
    • EmoCLIP: "EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition", arXiv, 2023 (Queen Mary University of London). [Paper][PyTorch]
  • Attack-related:
    • ?: "Video Transformer for Deepfake Detection with Incremental Learning", ACMMM, 2021 (MBZUAI). [Paper]
    • ViTranZFAS: "On the Effectiveness of Vision Transformers for Zero-shot Face Anti-Spoofing", International Joint Conference on Biometrics (IJCB), 2021 (Idiap). [Paper]
    • MTSS: "Multi-Teacher Single-Student Visual Transformer with Multi-Level Attention for Face Spoofing Detection", BMVC, 2021 (National Taiwan Ocean University). [Paper]
    • TransRPPG: "TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection", arXiv, 2021 (University of Oulu). [Paper]
    • CViT: "Deepfake Video Detection Using Convolutional Vision Transformer", arXiv, 2021 (Jimma University). [Paper]
    • ViT-Distill: "Deepfake Detection Scheme Based on Vision Transformer and Distillation", arXiv, 2021 (Sookmyung Women’s University). [Paper]
    • M2TR: "M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection", arXiv, 2021 (Fudan University). [Paper]
    • Cross-ViT: "Combining EfficientNet and Vision Transformers for Video Deepfake Detection", arXiv, 2021 (University of Pisa). [Paper][PyTorch]
    • ICT: "Protecting Celebrities from DeepFake with Identity Consistency Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • GGViT: "GGViT: Multistream Vision Transformer Network in Face2Face Facial Reenactment Detection", ICPR, 2022 (CAS). [Paper]
    • ?: "Hybrid Transformer Network for Deepfake Detection", International Conference on Content-Based Multimedia Indexing (CBMI), 2022 (MediaFutures, Norway). [Paper]
    • ViTAF: "Adaptive Transformers for Robust Few-shot Cross-domain Face Anti-spoofing", ECCV, 2022 (Google). [Paper]
    • UIA-ViT: "UIA-ViT: Unsupervised Inconsistency-Aware Method Based on Vision Transformer for Face Forgery Detection", ECCV, 2022 (USTC). [Paper]
    • ?: "Multi-Scale Wavelet Transformer for Face Forgery Detection", ACCV, 2022 (Hikvision). [Paper]
    • ?: "Self-supervised Transformer for Deepfake Detection", arXiv, 2022 (USTC, China). [Paper]
    • ViTransPAD: "ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection", arXiv, 2022 (University of La Rochelle, France). [Paper]
    • ?: "Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection", arXiv, 2022 (National Research Council, Italy). [Paper]
    • STDT: "Deepfake Video Detection with Spatiotemporal Dropout Transformer", arXiv, 2022 (CAS). [Paper]
    • ?: "Deep Convolutional Pooling Transformer for Deepfake Detection", arXiv, 2022 (HKU). [Paper]
    • DGM4: "Detecting and Grounding Multi-Modal Media Manipulation", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • FLIP: "FLIP: Cross-domain Face Anti-spoofing with Language Guidance", ICCV, 2023 (MBZUAI). [Paper][PyTorch][Website]
    • Face-Transformer: "Face Transformer: Towards High Fidelity and Accurate Face Swapping", arXiv, 2023 (NTU, Singapore). [Paper]
    • DGM4: "Detecting and Grounding Multi-Modal Media Manipulation and Beyond", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
    • AntifakePrompt: "AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors", arXiv, 2023 (NYCU). [Paper]
    • MMDG: "Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing", CVPR, 2024 (Beihang University). [Paper][Code (in construction)]
  • Fairness:
    • TADeT: "Mitigating Bias in Visual Transformers via Targeted Alignment", BMVC, 2021 (Georgia Tech). [Paper]
  • Generation:
    • Describe3D: "High-Fidelity 3D Face Generation from Natural Language Descriptions", CVPR, 2023 (Nanjing University). [Paper][PyTorch]
    • LipFormer: "LipFormer: High-Fidelity and Generalizable Talking Face Generation With a Pre-Learned Facial Codebook", CVPR, 2023 (Alibaba). [Paper]
    • ?: "High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning", CVPR, 2023 (Zhejiang University). [Paper]
  • 3D:
    • CodeTalker: "CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior", CVPR, 2023 (CUHK). [Paper][PyTorch][Website]
  • Age:
    • DAA: "DAA: A Delta Age AdaIN operation for age estimation via binary code transformer", CVPR, 2023 (Jiayu Intelligent Technology, China). [Paper][PyTorch]

[Back to Overview]

Neural Architecture Search

  • HR-NAS: "HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers", CVPR, 2021 (HKU). [Paper][PyTorch]
  • CATE: "CATE: Computation-aware Neural Architecture Encoding with Transformers", ICML, 2021 (Michigan State). [Paper]
  • AutoFormer: "AutoFormer: Searching Transformers for Visual Recognition", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • GLiT: "GLiT: Neural Architecture Search for Global and Local Image Transformer", ICCV, 2021 (The University of Sydney + SenseTime). [Paper]
  • BossNAS: "BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search", ICCV, 2021 (Monash University). [Paper][PyTorch]
  • ViT-ResNAS: "Searching for Efficient Multi-Stage Vision Transformers", ICCVW, 2021 (MIT). [Paper][PyTorch]
  • AutoformerV2: "Searching the Search Space of Vision Transformer", NeurIPS, 2021 (Microsoft). [Paper][PyTorch]
  • TNASP: "TNASP: A Transformer-based NAS Predictor with a Self-evolution Framework", NeurIPS, 2021 (CAS + Kuaishou). [Paper]
  • PSViT: "PSViT: Better Vision Transformer via Token Pooling and Attention Sharing", arXiv, 2021 (The University of Sydney + SenseTime). [Paper]
  • As-ViT: "Auto-scaling Vision Transformers without Training", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • NASViT: "NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training", ICLR, 2022 (Facebook). [Paper]
  • TF-TAS: "Training-free Transformer Architecture Search", CVPR, 2022 (Tencent). [Paper]
  • ViT-Slim: "Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space", CVPR, 2022 (MBZUAI). [Paper][PyTorch]
  • BurgerFormer: "Searching for BurgerFormer with Micro-Meso-Macro Space Design", ICML, 2022 (CAS). [Paper][Code (in construction)]
  • UniNet: "UniNet: Unified Architecture Search with Convolution, Transformer, and MLP", ECCV, 2022 (CUHK + SenseTime). [Paper]
  • ViTAS: "Vision Transformer Architecture Search", ECCV, 2022 (The University of Sydney + SenseTime). [Paper]
  • VTCAS: "Vision Transformer with Convolutions Architecture Search", arXiv, 2022 (Donghua University). [Paper]
  • NOAH: "Neural Prompt Search", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
  • FocusFormer: "FocusFormer: Focusing on What We Need via Architecture Sampler", arXiv, 2022 (Monash University, Australia). [Paper]
  • NAR-Former: "NAR-Former: Neural Architecture Representation Learning towards Holistic Attributes Prediction", CVPR, 2023 (Xidian University, China). [Paper][PyTorch]
  • MDL-NAS: "MDL-NAS: A Joint Multi-Domain Learning Framework for Vision Transformer", CVPR, 2023 (SenseTime). [Paper]
  • AutoTaskFormer: "AutoTaskFormer: Searching Vision Transformers for Multi-task Learning", arXiv, 2023 (Microsoft). [Paper]
  • GPT-NAS: "GPT-NAS: Neural Architecture Search with the Generative Pre-Trained Model", arXiv, 2023 (Sichuan University). [Paper]
  • NAR-Former-V2: "NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning", arXiv, 2023 (Intellifusion, China). [Paper]
  • AutoST: "AutoST: Training-free Neural Architecture Search for Spiking Transformers", arXiv, 2023 (NC State). [Paper]
  • TurboViT: "TurboViT: Generating Fast Vision Transformers via Generative Architecture Search", arXiv, 2023 (University of Waterloo). [Paper]
  • FLORA: "FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer", WACV, 2024 (NYCU). [Paper][PyTorch]
  • Auto-Prox: "Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery", AAAI, 2024 (National University of Defense Technology, China). [Paper][Code (in construction)]
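
The training-free entries above (TF-TAS, Auto-Prox, AutoST) rank candidate transformers with zero-cost proxies instead of training each one. A minimal sketch in that spirit, assuming timm's `VisionTransformer`, a tiny random search space, and a SNIP-style |weight * gradient| proxy; the actual proxies and search spaces in those papers differ:

```python
# Minimal training-free ViT architecture search (illustrative sketch only).
import random
import torch
import torch.nn.functional as F
from timm.models.vision_transformer import VisionTransformer

def snip_score(model, images, labels):
    """SNIP-style zero-cost proxy: sum of |weight * grad| after one backward pass."""
    model.zero_grad()
    F.cross_entropy(model(images), labels).backward()
    return sum((p * p.grad).abs().sum().item()
               for p in model.parameters() if p.grad is not None)

images = torch.randn(8, 3, 224, 224)      # one random proxy batch
labels = torch.randint(0, 10, (8,))

best = None
for _ in range(5):                        # tiny search budget for illustration
    cfg = dict(embed_dim=random.choice([192, 256, 384]),
               depth=random.choice([4, 6, 8]),
               num_heads=random.choice([4, 8]))
    score = snip_score(VisionTransformer(num_classes=10, **cfg), images, labels)
    if best is None or score > best[0]:
        best = (score, cfg)
print("highest-scoring config:", best[1])
```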

[Back to Overview]

Scene Graph

  • BGT-Net: "BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation", CVPRW, 2021 (ETHZ). [Paper]
  • STTran: "Spatial-Temporal Transformer for Dynamic Scene Graph Generation", ICCV, 2021 (Leibniz University Hannover, Germany). [Paper][PyTorch]
  • SGG-NLS: "Learning to Generate Scene Graph from Natural Language Supervision", ICCV, 2021 (University of Wisconsin-Madison). [Paper][PyTorch]
  • SGG-Seq2Seq: "Context-Aware Scene Graph Generation With Seq2Seq Transformers", ICCV, 2021 (Layer 6 AI, Canada). [Paper][PyTorch]
  • RELAX: "Image-Text Alignment using Adaptive Cross-attention with Transformer Encoder for Scene Graphs", BMVC, 2021 (Samsung). [Paper]
  • Relation-Transformer: "Scenes and Surroundings: Scene Graph Generation using Relation Transformer", arXiv, 2021 (LMU Munich). [Paper]
  • SGTR: "SGTR: End-to-end Scene Graph Generation with Transformer", CVPR, 2022 (ShanghaiTech). [Paper][Code (in construction)]
  • GCL: "Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation", CVPR, 2022 (Shandong University). [Paper][PyTorch]
  • Relationformer: "Relationformer: A Unified Framework for Image-to-Graph Generation", ECCV, 2022 (TUM). [Paper][Code (in construction)]
  • SVRP: "Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning", ECCV, 2022 (Monash University). [Paper]
  • RelTR: "RelTR: Relation Transformer for Scene Graph Generation", arXiv, 2022 (Leibniz University Hannover, Germany). [Paper][PyTorch]
  • SG-Shuffle: "SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation", arXiv, 2022 (The University of Sydney). [Paper]
  • IS-GGT: "Iterative Scene Graph Generation with Generative Transformers", CVPR, 2023 (Oklahoma State University). [Paper]
  • SQUAT: "Devil's on the Edges: Selective Quad Attention for Scene Graph Generation", CVPR, 2023 (POSTECH). [Paper][PyTorch][Website]
  • VS3: "Learning to Generate Language-supervised and Open-vocabulary Scene Graph using Pre-trained Visual-Semantic Space", CVPR, 2023 (CUHK). [Paper]
  • PVSG: "Panoptic Video Scene Graph Generation", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
  • VETO-MEET: "Vision Relation Transformer for Unbiased Scene Graph Generation", ICCV, 2023 (Technical University of Darmstadt, Germany). [Paper][PyTorch]
  • TextPSG: "TextPSG: Panoptic Scene Graph Generation from Textual Descriptions", ICCV, 2023 (IBM). [Paper][Code (in construction)][Website]
  • HiLo: "HiLo: Exploiting High Low Frequency Relations for Unbiased Panoptic Scene Graph Generation", ICCV, 2023 (King's College London). [Paper][PyTorch]
  • PSG4DFormer: "4D Panoptic Scene Graph Generation", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch]
  • SGTR+: "SGTR+: End-to-end Scene Graph Generation with Transformer", TPAMI, 2023 (ShanghaiTech). [Paper][PyTorch]
  • SGT: "Revisiting Transformer for Point Cloud-based 3D Scene Graph Generation", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper]
  • EGTR: "EGTR: Extracting Graph from Transformer for Scene Graph Generation", CVPR, 2024 (NAVER). [Paper][Code (in construction)]

[Back to Overview]

Transfer / X-Supervised / X-Shot / Continual Learning

  • Transfer Learning/Adapter:
    • AdaptFormer: "AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition", NeurIPS, 2022 (HKU). [Paper][PyTorch][Website]
    • Convpass: "Convolutional Bypasses Are Better Vision Transformer Adapters", arXiv, 2022 (Peking University). [Paper][PyTorch]
    • FacT: "FacT: Factor-Tuning for Lightweight Adaptation on Vision Transformer", AAAI, 2023 (Peking). [Paper][PyTorch]
    • Consolidator: "Consolidator: Mergable Adapter with Group Connections for Vision Transformer", ICLR, 2023 (Tsinghua). [Paper]
    • REACT: "Learning Customized Visual Models with Retrieval-Augmented Knowledge", CVPR, 2023 (Microsoft). [Paper][Code (in construction)][Website]
    • MP: "Tuning Pre-trained Model via Moment Probing", ICCV, 2023 (Tianjin University). [Paper][PyTorch]
    • ARC: "Efficient Adaptation of Large Vision Transformer via Adapter Re-Composing", NeurIPS, 2023 (Xi'an University of Architecture and Technology). [Paper][PyTorch]
    • Res-Tuning: "Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone", NeurIPS, 2023 (Alibaba). [Paper][Website]
    • E3VA: "Parameter-efficient is not sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions", arXiv, 2023 (Alibaba + Microsoft). [Paper]
    • Minimax: "Task-Robust Pre-Training for Worst-Case Downstream Adaptation", arXiv, 2023 (Peking). [Paper]
    • HST: "Hierarchical Side-Tuning for Vision Transformers", arXiv, 2023 (Alibaba). [Paper][Code (in construction)]
    • PELA: "PELA: Learning Parameter-Efficient Models with Low-Rank Approximation", arXiv, 2023 (NUS). [Paper][PyTorch]
    • Mona: "Adapter is All You Need for Tuning Visual Tasks", arXiv, 2023 (CAS). [Paper][PyTorch]
    • ?: "Label-efficient Training of Small Task-specific Models by Leveraging Vision Foundation Models", arXiv, 2023 (Apple). [Paper]
    • GIFT: "GIFT: Generative Interpretable Fine-Tuning Transformers", arXiv, 2023 (NC State). [Paper][Code (in construction)]
    • FAPFT: "Partial Fine-Tuning: A Successor to Full Fine-Tuning for Vision Transformers", arXiv, 2023 (Fudan). [Paper]
    • VMT-Adapter: "VMT-Adapter: Parameter-Efficient Transfer Learning for Multi-Task Dense Scene Understanding", AAAI, 2024 (Tencent). [Paper]
    • VPTSP: "Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning", ICLR, 2024 (UCF). [Paper][Code (in construction)]
    • LORS: "LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking", CVPR, 2024 (Tencent). [Paper]
    • Dr2Net: "Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning", arXiv, 2024 (KAUST). [Paper]
    • ViSFT: "Supervised Fine-tuning in turn Improves Visual Foundation Models", arXiv, 2024 (Tencent). [Paper][PyTorch]
    • LAST: "Low-rank Attention Side-Tuning for Parameter-Efficient Fine-Tuning", arXiv, 2024 (Nanjing University). [Paper]
    • LoSA: "Time-, Memory- and Parameter-Efficient Visual Adaptation", arXiv, 2024 (Google). [Paper]
  • Domain Adaptation/Domain Generalization/Federated Learning:
    • TransDA: "Transformer-Based Source-Free Domain Adaptation", arXiv, 2021 (Haerbin Institute of Technology). [Paper][PyTorch]
    • ResTran: "Discovering Spatial Relationships by Transformers for Domain Generalization", arXiv, 2021 (MBZUAI). [Paper]
    • WinTR: "Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation", arXiv, 2021 (Beijing Institute of Technology). [Paper]
    • CDTrans: "CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation", ICLR, 2022 (Alibaba). [Paper][PyTorch]
    • SSRT: "Safe Self-Refinement for Transformer-based Domain Adaptation", CVPR, 2022 (Stony Brook). [Paper]
    • DOT: "Making the Best of Both Worlds: A Domain-Oriented Transformer for Unsupervised Domain Adaptation", ACMMM, 2022 (Beijing Institute of Technology). [Paper]
    • GVRT: "Grounding Visual Representations with Texts for Domain Generalization", ECCV, 2022 (LG). [Paper][PyTorch]
    • PACMAC: "Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency", NeurIPS, 2022 (Georgia Tech). [Paper][PyTorch]
    • ERM-ViT: "Self-Distilled Vision Transformer for Domain Generalization", ACCV, 2022 (MBZUAI). [Paper][PyTorch]
    • BCAT: "Domain Adaptation via Bidirectional Cross-Attention Transformer", arXiv, 2022 (Southern University of Science and Technology). [Paper]
    • DoTNet: "Towards Unsupervised Domain Adaptation via Domain-Transformer", arXiv, 2022 (Sun Yat-Sen University). [Paper]
    • TransDA: "Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation", arXiv, 2022 (Tsinghua). [Paper][PyTorch)]
    • FAMLP: "FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization", arXiv, 2022 (University of Science and Technology of China). [Paper]
    • DePT: "Visual Prompt Tuning for Test-time Domain Adaptation", arXiv, 2022 (Amazon). [Paper]
    • LADS: "Using Language to Extend to Unseen Domains", arXiv, 2022 (Berkeley). [Paper]
    • MetaPrompt: "Learning Domain Invariant Prompt for Vision-Language Models", arXiv, 2022 (Tongji University + Microsoft). [Paper]
    • TVT: "TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation", WACV, 2023 (UT Arlington + Kuaishou). [Paper][PyTorch]
    • GMoE: "Sparse Mixture-of-Experts are Domain Generalizable Learners", ICLR, 2023 (NTU, Singapore). [Paper]
    • PMTrans: "Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective", CVPR, 2023 (HKUST). [Paper][Website (in construction)]
    • ALOFT: "ALOFT: A Lightweight MLP-like Architecture with Dynamic Low-frequency Transform for Domain Generalization", CVPR, 2023 (Nanjing University). [Paper][PyTorch]
    • PromptStyler: "PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization", ICCV, 2023 (Agency for Defense Development, Korea). [Paper][Website]
    • DSiT: "Domain-Specificity Inducing Transformers for Source-Free Domain Adaptation", ICCV, 2023 (Indian Institute of Science). [Paper][Website]
    • pFedPG: "Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation", ICCV, 2023 (NVIDIA). [Paper]
    • FedPerfix: "FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning", ICCV, 2023 (UCF). [Paper][Code (in construction)]
    • RISE: "A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance", ICCV, 2023 (UW-Madison). [Paper][PyTorch]
    • PØDA: "PØDA: Prompt-driven Zero-shot Domain Adaptation", ICCV, 2023 (INRIA). [Paper][PyTorch][Website]
    • AD-CLIP: "AD-CLIP: Adapting Domains in Prompt Space Using CLIP", ICCVW, 2023 (IIT Bombay). [Paper]
    • MPA: "Multi-Prompt Alignment for Multi-source Unsupervised Domain Adaptation", NeurIPS, 2023 (Fudan University). [Paper]
    • FedCLIP: "FedCLIP: Fast Generalization and Personalization for CLIP in Federated Learning", arXiv, 2023 (CAS). [Paper][PyTorch]
    • UniOOD: "Universal Domain Adaptation from Foundation Models", arXiv, 2023 (South China University of Technology). [Paper][Code (in construction)]
    • PEST: "Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation", arXiv, 2023 (NTU, Singapore). [Paper]
    • ?: "Open-Set Domain Adaptation with Visual-Language Foundation Models", arXiv, 2023 (The University of Tokyo). [Paper]
    • VPA: "VPA: Fully Test-Time Visual Prompt Adaptation", arXiv, 2023 (Meta). [Paper]
    • FedTPG: "Text-driven Prompt Generation for Vision-Language Models in Federated Learning", arXiv, 2023 (Bosch). [Paper]
    • StyLIP: "StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization", WACV, 2024 (TUM). [Paper]
    • ReCLIP: "ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation", WACV, 2024 (Amazon). [Paper][PyTorch]
    • FedLGT: "Language-Guided Transformer for Federated Multi-Label Classification", AAAI, 2024 (NTU). [Paper][Code (in construction)]
    • FedAPT: "Cross-domain Federated Adaptive Prompt Tuning for CLIP", AAAI, 2024 (Fudan University). [Paper][PyTorch]
    • VDPG: "Adapting to Distribution Shift by Visual Domain Prompt Generation", ICLR, 2024 (University of Toronto). [Paper][PyTorch][Website]
    • UniMoS: "Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation", CVPR, 2024 (University of Electronic Science and Technology of China). [Paper][PyTorch]
    • DiPrompT: "DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning", CVPR, 2024 (SenseTime). [Paper]
    • ULDA: "Unified Language-driven Zero-shot Domain Adaptation", CVPR, 2024 (CUHK). [Paper]
    • LLaVO: "Large Language Models as Visual Cross-Domain Learners", arXiv, 2024 (Southern University of Science and Technology). [Paper][Website][PyTorch]
    • LaGTrAn: "Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos", arXiv, 2024 (UCSD). [Paper][Website]
    • DGMamba: "DGMamba: Domain Generalization via Generalized State Space Model", arXiv, 2024 (SJTU). [Paper][Code (in construction)]
  • X-Supervised:
    • Semiformer: "Semi-Supervised Vision Transformers", ECCV, 2022 (Fudan University). [Paper][PyTorch]
    • SVL-Adapter: "SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models", BMVC, 2022 (UCL). [Paper][Code (in construction)]
    • Semi-ViT: "Semi-supervised Vision Transformers at Scale", NeurIPS, 2022 (Amazon). [Paper]
    • DPT: "Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels", NeurIPS, 2023 (Renmin University of China). [Paper][PyTorch]
    • DiversitySSL: "On Pretraining Data Diversity for Self-Supervised Learning", arXiv, 2024 (KAUST). [Paper][Code (in construction)]
  • Zero-Shot:
    • ViT-ZSL: "Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning", IMVIP, 2021 (University of Exeter, UK). [Paper]
    • TransZero: "TransZero: Attribute-guided Transformer for Zero-Shot Learning", AAAI, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • ?: "Zero-shot Visual Commonsense Immorality Prediction", BMVC, 2022 (Korea University). [Paper][PyTorch]
    • I2DFormer: "I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification", NeurIPS, 2022 (ETHZ). [Paper]
    • HRT: "Hybrid Routing Transformer for Zero-Shot Learning", arXiv, 2022 (Xidian University). [Paper]
    • CuPL: "What does a platypus look like? Generating customized prompts for zero-shot image classification", arXiv, 2022 (University of Washington). [Paper][PyTorch]
    • VL-Taboo: "VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models", arXiv, 2022 (Goethe University Frankfurt, Germany). [Paper][Code (in construction)]
    • CALIP: "CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention", arXiv, 2022 (Peking University). [Paper]
    • PromptCompVL: "Prompting Large Pre-trained Vision-Language Models For Compositional Concept Learning", arXiv, 2022 (Michigan State). [Paper]
    • MUST: "Masked Unsupervised Self-training for Zero-shot Image Classification", ICLR, 2023 (Salesforce). [Paper][PyTorch]
    • I2MVFormer: "I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification", CVPR, 2023 (ETHZ). [Paper]
    • ADE: "Learning Attention as Disentangler for Compositional Zero-shot Learning", CVPR, 2023 (HKU). [Paper][PyTorch][Website]
    • CHiLS: "CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets", ICML, 2023 (UCSD). [Paper][PyTorch]
    • CoT: "Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning", ICCV, 2023 (Yonsei). [Paper][PyTorch]
    • diffusion-classifier: "Your Diffusion Model is Secretly a Zero-Shot Classifier", ICCV, 2023 (CMU). [Paper][PyTorch][Website]
    • SuS-X: "SuS-X: Training-Free Name-Only Transfer of Vision-Language Models", ICCV, 2023 (Cambridge). [Paper][PyTorch]
    • ?: "Text-to-Image Diffusion Models are Zero-Shot Classifiers", NeurIPS, 2023 (DeepMind). [Paper]
    • AutoCLIP: "AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models", arXiv, 2023 (Bosch). [Paper]
    • MMPT: "Prompt Tuning for Zero-shot Compositional Learning", arXiv, 2023 (Samsung). [Paper]
    • ZSLViT: "Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning", CVPR, 2024 (MBZUAI). [Paper]
  • X-Shot:
    • CrossTransformer: "CrossTransformers: spatially-aware few-shot transfer", NeurIPS, 2020 (DeepMind). [Paper][Tensorflow]
    • URT: "A Universal Representation Transformer Layer for Few-Shot Image Classification", ICLR, 2021 (Mila). [Paper][PyTorch]
    • TRX: "Temporal-Relational CrossTransformers for Few-Shot Action Recognition", CVPR, 2021 (University of Bristol). [Paper][PyTorch]
    • Few-shot-Transformer: "Few-Shot Transformation of Common Actions into Time and Space", arXiv, 2021 (University of Amsterdam). [Paper]
    • HCTransformers: "Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning", CVPR, 2022 (Fudan University). [Paper][PyTorch]
    • STRM: "Spatio-temporal Relation Modeling for Few-shot Action Recognition", CVPR, 2022 (MBZUAI). [Paper][PyTorch][Website]
    • HyperTransformer: "HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning", ICML, 2022 (Google). [Paper][PyTorch][Website]
    • CPM: "Compound Prototype Matching for Few-shot Action Recognition", ECCV, 2022 (The University of Tokyo). [Paper]
    • SUN: "Self-Promoted Supervision for Few-Shot Transformer", ECCV, 2022 (Harbin Institute of Technology + NUS). [Paper][PyTorch]
    • tSF: "tSF: Transformer-Based Semantic Filter for Few-Shot Learning", ECCV, 2022 (Tencent). [Paper]
    • TransVLAD: "TransVLAD: Focusing on Locally Aggregated Descriptors for Few-Shot Learning", ECCV, 2022 (Southern University of Science and Technology, China). [Paper]
    • BaseTransformers: "BaseTransformers: Attention over base data-points for One Shot Learning", BMVC, 2022 (Dublin City University, Ireland). [Paper][PyTorch]
    • FPTrans: "Feature-Proxy Transformer for Few-Shot Segmentation", NeurIPS, 2022 (Baidu). [Paper][Code (in construction)]
    • MM-Former: "Mask Matching Transformer for Few-Shot Segmentation", NeurIPS, 2022 (Picsart). [Paper][PyTorch]
    • MG-ViT: "Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
    • QSFormer: "Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification", arXiv, 2022 (Anhui University). [Paper]
    • FS-CT: "Enhancing Few-shot Image Classification with Cosine Transformer", arXiv, 2022 (VinUniversity, Vietnam). [Paper][PyTorch]
    • CoCa-CNI: "Exploiting Category Names for Few-Shot Classification with Vision-Language Models", arXiv, 2022 (Google). [Paper]
    • SP: "Semantic Prompt for Few-Shot Image Recognition", CVPR, 2023 (USTC). [Paper]
    • SMKD: "Supervised Masked Knowledge Distillation for Few-Shot Transformers", CVPR, 2023 (Columbia). [Paper][PyTorch]
    • CST: "Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation", CVPR, 2023 (Meta). [Paper]
    • Hint-Aug: "Hint-Aug: Drawing Hints from Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning", CVPR, 2023 (Georgia Tech). [Paper]
    • ProD: "ProD: Prompting-To-Disentangle Domain Knowledge for Cross-Domain Few-Shot Image Classification", CVPR, 2023 (University of Technology Sydney). [Paper]
    • PVP: "PVP: Pre-trained Visual Parameter-Efficient Tuning", arXiv, 2023 (Defense Innovation Institute, China). [Paper]
    • AMU-Tuning: "AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning", CVPR, 2024 (Tianjin University). [Paper]
  • Continual Learning:
    • MEAT: "Meta-attention for ViT-backed Continual Learning", CVPR, 2022 (Zhejiang University). [Paper][Code (in construction)]
    • DyTox: "DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion", CVPR, 2022 (Sorbonne Universite, France). [Paper][PyTorch]
    • LVT: "Continual Learning With Lifelong Vision Transformer", CVPR, 2022 (The University of Sydney). [Paper]
    • L2P: "Learning to Prompt for Continual Learning", CVPR, 2022 (Google). [Paper][Tensorflow]
    • ?: "Simpler is Better: off-the-shelf Continual Learning Through Pretrained Backbones", CVPRW, 2022 (Ca' Foscari University, Italy). [Paper][PyTorch]
    • ADA: "Continual Learning with Transformers for Image Classification", CVPRW, 2022 (Amazon). [Paper]
    • ?: "Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization", CVPRW, 2022 (Ca' Foscari University, Italy). [Paper]
    • DualPrompt: "DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning", ECCV, 2022 (Google). [Paper][Tensorflow]
    • CVT: "Online Continual Learning with Contrastive Vision Transformer", ECCV, 2022 (The University of Sydney). [Paper]
    • IncCLIP: "Generative Negative Text Replay for Continual Vision-Language Pretraining", ECCV, 2022 (ShanghaiTech). [Paper]
    • S-Prompts: "S-Prompts Learning with Pre-trained Transformers: An Occam's Razor for Domain Incremental Learning", NeurIPS, 2022 (Singapore Management University). [Paper]
    • ADA: "Memory Efficient Continual Learning with Transformers", NeurIPS, 2022 (Amazon). [Paper]
    • BMU-MoCo: "BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling", NeurIPS, 2022 (Renmin University of China). [Paper]
    • CLiMB: "CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks", NeurIPS (Datasets and Benchmarks), 2022 (USC). [Paper][PyTorch]
    • COLT: "Transformers Are Better Continual Learners", arXiv, 2022 (Hikvision). [Paper]
    • D3Former: "D3Former: Debiased Dual Distilled Transformer for Incremental Learning", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
    • Continual-CLIP: "CLIP model is an Efficient Continual Learner", arXiv, 2022 (MBZUAI). [Paper][Code (in construction)]
    • GCAB-CFDC: "Gated Class-Attention with Cascaded Feature Drift Compensation for Exemplar-free Continual Learning of Vision Transformers", arXiv, 2022 (University of Pavia, Italy). [Paper][Code (in construction)]
    • PIVOT: "PIVOT: Prompting for Video Continual Learning", arXiv, 2022 (KAUST). [Paper]
    • AttriCLIP: "AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning", CVPR, 2023 (Beihang University). [Paper][PyTorch]
    • DKT: "DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning", CVPR, 2023 (Xi'an Jiaotong). [Paper][Code (in construction)]
    • CODA-Prompt: "CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning", CVPR, 2023 (IBM). [Paper][PyTorch]
    • BiRT: "BiRT: Bio-inspired Replay in Vision Transformers for Continual Learning", ICML, 2023 (NavInfo, Netherlands). [Paper]
    • CLR: "CLR: Channel-wise Lightweight Reprogramming for Continual Learning", ICCV, 2023 (USC). [Paper][Code (in construction)]
    • CTP: "CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation", ICCV, 2023 (Peng Cheng Lab). [Paper][PyTorch]
    • APG: "When Prompt-based Incremental Learning Does Not Meet Strong Pretraining", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
    • MAE-CIL: "Masked Autoencoders are Efficient Class Incremental Learners", ICCV, 2023 (Nankai University). [Paper][PyTorch (in construction)]
    • ConTraCon: "Exemplar-Free Continual Transformer with Convolutions", ICCV, 2023 (IIT Kharagpur). [Paper][PyTorch][Website]
    • LGCL: "Introducing Language Guidance in Prompt-based Continual Learning", ICCV, 2023 (RPTU Kaiserslautern, Germany). [Paper]
    • MVP: "Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning", ICCV, 2023 (Kyung Hee University, Korea). [Paper][PyTorch]
    • ZSCL: "Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models", ICCV, 2023 (NUS). [Paper][PyTorch]
    • C-LN: "On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers", ICCVW, 2023 (University of Trento). [Paper][PyTorch]
    • PromptFusion: "PromptFusion: Decoupling Stability and Plasticity for Continual Learning", arXiv, 2023 (Fudan). [Paper]
    • MSc-iNCD: "Large-scale Pre-trained Models are Surprisingly Strong in Incremental Novel Class Discovery", arXiv, 2023 (University of Trento, Italy). [Paper][PyTorch (in construction)]
    • PROOF: "Learning without Forgetting for Vision-Language Models", arXiv, 2023 (Nanjing University). [Paper]
    • HePCo: "HePCo: Data-Free Heterogeneous Prompt Consolidation for Continual Federated Learning", arXiv, 2023 (Georgia Tech). [Paper]
    • ?: "Continual Learning in Open-vocabulary Classification with Complementary Memory Systems", arXiv, 2023 (UIUC). [Paper]
    • MoP-CLIP: "MoP-CLIP: A Mixture of Prompt-Tuned CLIP Models for Domain Incremental Learning", arXiv, 2023 (ETS Montreal, Canada). [Paper]
    • TiC-CLIP: "TiC-CLIP: Continual Training of CLIP Models", arXiv, 2023 (Apple). [Paper]
    • TIER: "Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning", AAAI, 2024 (Peking). [Paper][Code (in construction)]
    • OVOR: "OVOR: OnePrompt with Virtual Outlier Regularization for Rehearsal-Free Class-Incremental Learning", ICLR, 2024 (JPMorgan Chase). [Paper]
    • CPrompt: "Consistent Prompting for Rehearsal-Free Continual Learning", CVPR, 2024 (Sun Yat-sen University). [Paper][PyTorch]
    • GMM: "Generative Multi-modal Models are Good Class-Incremental Learners", CVPR, 2024 (Nankai University). [Paper][PyTorch]
    • GS-LoRA: "Continual Forgetting for Pre-trained Vision Models", CVPR, 2024 (CAS). [Paper][PyTorch]
    • MoE-Adapters: "Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters", CVPR, 2024 (Dalian University of Technology). [Paper][PyTorch]
    • ConvPrompt: "Convolutional Prompting meets Language Models for Continual Learning", CVPR, 2024 (IIT Kharagpur). [Paper][Website]
    • PriViLege: "Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners", CVPR, 2024 (Kyung Hee University). [Paper][Code (in construction)]
    • Multi-LANE: "Less is more: Summarizing Patch Tokens for efficient Multi-Label Class-Incremental Learning", Conference on Lifelong Learning Agents (CoLLAs), 2024 (University of Trento). [Paper][PyTorch]
  • Long-tail/Imbalanced:
    • BatchFormer: "BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning", CVPR, 2022 (The University of Sydney). [Paper][PyTorch]
    • BatchFormerV2: "BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning", arXiv, 2022 (The University of Sydney). [Paper]
    • LPT: "LPT: Long-tailed Prompt Tuning for Image Classification", ICLR, 2023 (Harbin Institute of Technology). [Paper]
    • PDC: "Rethink Long-tailed Recognition with Vision Transforms", ICASSP, 2023 (Tsinghua University). [Paper]
    • ?: "Exploring Vision-Language Models for Imbalanced Learning", arXiv, 2023 (Peking University). [Paper][PyTorch]
    • LMPT: "LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition", arXiv, 2023 (Monash University, Australia). [Paper][PyTorch]
    • LTGC: "LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content", CVPR, 2024 (Beijing University of Chemical Technology). [Paper]
    • DeiT-LT: "DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets", CVPR, 2024 (Indian Institute of Science, India). [Paper][PyTorch][Website]
  • Knowledge Distillation:
    • ?: "Knowledge Distillation via the Target-aware Transformer", CVPR, 2022 (Alibaba). [Paper]
    • DearKD: "DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers", CVPR, 2022 (JD). [Paper]
    • AttnDistill: "Attention Distillation: self-supervised vision transformer students need more guidance", BMVC, 2022 (UAB, Spain). [Paper][PyTorch]
    • ViTKD: "ViTKD: Practical Guidelines for ViT feature knowledge distillation", arXiv, 2022 (IDEA). [Paper][PyTorch (in construction)]
    • ?: "Adaptive Attention Link-based Regularization for Vision Transformers", arXiv, 2022 (Chung-Ang University, Korea). [Paper]
    • LiVT: "Learning Imbalanced Data with Vision Transformers", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
    • G2SD: "Generic-to-Specific Distillation of Masked Autoencoders", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • SLaK: "Are Large Kernels Better Teachers than Transformers for ConvNets?", ICML, 2023 (Eindhoven University of Technology, Netherlands). [Paper][PyTorch]
    • CSKD: "Cumulative Spatial Knowledge Distillation for Vision Transformers", ICCV, 2023 (Megvii). [Paper]
    • TinyCLIP: "TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance", ICCV, 2023 (Microsoft). [Paper][PyTorch]
    • DIME-FM: "DIME-FM: DIstilling Multimodal and Efficient Foundation Models", ICCV, 2023 (Meta). [Paper]
    • MaskedKD: "MaskedKD: Efficient Distillation of Vision Transformers with Masked Images", arXiv, 2023 (POSTECH). [Paper]
    • AM-RADIO: "AM-RADIO: Agglomerative Model -- Reduce All Domains Into One", arXiv, 2023 (NVIDIA). [Paper]
  • Clustering:
    • VTCC: "Vision Transformer for Contrastive Clustering", arXiv, 2022 (Sun Yat-sen University, China). [Paper]
  • Novel Category Discovery:
    • PromptCAL: "PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
    • CLIP-GCD: "CLIP-GCD: Simple Language Guided Generalized Category Discovery", arXiv, 2023 (Georgia Tech). [Paper]
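
To make the adapter idea above concrete (cf. AdaptFormer, Convpass, FacT), here is a minimal bottleneck-adapter sketch: freeze a pretrained ViT and train only small zero-initialized residual adapters attached after each MLP block, plus the classifier head. The bottleneck width and the sequential (rather than parallel) placement are illustrative assumptions, not the exact design of any single paper above.

```python
# Minimal bottleneck adapter tuning for a timm ViT (illustrative sketch only).
import timm
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)       # zero-init: starts as the identity map
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

vit = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=10)
for p in vit.parameters():
    p.requires_grad = False                  # the backbone stays frozen

for blk in vit.blocks:                       # attach an adapter after each MLP
    blk.mlp = nn.Sequential(blk.mlp, Adapter(vit.embed_dim))
for p in vit.head.parameters():              # the task head is also trained
    p.requires_grad = True

n_train = sum(p.numel() for p in vit.parameters() if p.requires_grad)
print(f"trainable parameters: {n_train} of {sum(p.numel() for p in vit.parameters())}")
```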

[Back to Overview]

Low-level Vision Tasks

Image Restoration

  • General:
    • NLRN: "Non-Local Recurrent Network for Image Restoration", NeurIPS, 2018 (UIUC). [Paper][Tensorflow]
    • RNAN: "Residual Non-local Attention Networks for Image Restoration", ICLR, 2019 (Northeastern University). [Paper][PyTorch]
    • PANet: "Pyramid Attention Networks for Image Restoration", arXiv, 2020 (UIUC). [Paper][PyTorch]
    • IPT: "Pre-Trained Image Processing Transformer", CVPR, 2021 (Huawei). [Paper][PyTorch (in construction)]
    • SwinIR: "SwinIR: Image Restoration Using Swin Transformer", ICCVW, 2021 (ETHZ). [Paper][PyTorch]
    • SiamTrans: "SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers", AAAI, 2022 (Huawei). [Paper]
    • Uformer: "Uformer: A General U-Shaped Transformer for Image Restoration", CVPR, 2022 (University of Science and Technology of China). [Paper][PyTorch]
    • MAXIM: "MAXIM: Multi-Axis MLP for Image Processing", CVPR, 2022 (Google). [Paper][Tensorflow]
    • Restormer: "Restormer: Efficient Transformer for High-Resolution Image Restoration", CVPR, 2022 (IIAI, UAE). [Paper][PyTorch]
    • TransWeather: "TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions", CVPR, 2022 (JHU). [Paper][PyTorch][Website]
    • KiT: "KNN Local Attention for Image Restoration", CVPR, 2022 (Yonsei University). [Paper]
    • ELMformer: "ELMformer: Efficient Raw Image Restoration with a Locally Multiplicative Transformer", ACMMM, 2022 (Horizon Robotics). [Paper][Code (in construction)]
    • EDT: "On Efficient Transformer-Based Image Pre-training for Low-Level Vision", arXiv, 2022 (CUHK). [Paper][PyTorch]
    • ?: "Transform your Smartphone into a DSLR Camera: Learning the ISP in the Wild", arXiv, 2022 (ETHZ). [Paper]
    • TMT: "Imaging through the Atmosphere using Turbulence Mitigation Transformer", arXiv, 2022 (Purdue). [Paper][Code (in construction)][Website]
    • LRT: "LRT: An Efficient Low-Light Restoration Transformer for Dark Light Field Images", arXiv, 2022 (HKU). [Paper]
    • ART: "Accurate Image Restoration with Attention Retractable Transformer", ICLR, 2023 (Shanghai Jiao Tong University). [Paper][PyTorch]
    • Burstormer: "Burstormer: Burst Image Restoration and Enhancement Transformer", CVPR, 2023 (MBZUAI). [Paper][Code (in construction)]
    • ?: "Comprehensive and Delicate: An Efficient Transformer for Image Restoration", CVPR, 2023 (Sichuan University). [Paper]
    • ShuffleFormer: "Random Shuffle Transformer for Image Restoration", ICML, 2023 (USTC). [Paper][PyTorch (in construction)]
    • PromptIR: "PromptIR: Prompting for All-in-One Blind Image Restoration", NeurIPS, 2023 (MBZUAI). [Paper][PyTorch]
    • UCDIR: "A Unified Conditional Framework for Diffusion-based Image Restoration", NeurIPS, 2023 (CUHK). [Paper][Code (in construction)][Website]
    • MAEIP: "Masked Autoencoders as Image Processors", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
    • RAMiT: "RAMiT: Reciprocal Attention Mixing Transformer for Lightweight Image Restoration", arXiv, 2023 (Sogang University, Korea). [Paper]
    • RAP: "Restore Anything Pipeline: Segment Anything Meets Image Restoration", arXiv, 2023 (ETHZ). [Paper][Code (in construction)]
    • ProRes: "ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration", arXiv, 2023 (Horizon Robotics). [Paper][PyTorch (in construction)]
    • C2F-DFT: "Learning A Coarse-to-Fine Diffusion Transformer for Image Restoration", arXiv, 2023 (Dalian University of Technology). [Paper][PyTorch (in construction)]
    • AutoDIR: "AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion", arXiv, 2023 (CUHK). [Paper]
    • MPerceiver: "Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration", arXiv, 2023 (CAS). [Paper]
    • TIP: "TIP: Text-Driven Image Processing with Semantic and Restoration Instructions", arXiv, 2023 (Google). [Paper][Website]
    • DA-CLIP: "Controlling Vision-Language Models for Universal Image Restoration", ICLR, 2024 (Uppsala University, Sweden). [Paper][PyTorch][Website]
    • VmambaIR: "VmambaIR: Visual State Space Model for Image Restoration", arXiv, 2024 (ByteDance). [Paper][Code (in construction)]
    • DyNet: "Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration", arXiv, 2024 (MBZUAI). [Paper][Code (in construction)]
    • LIPT: "LIPT: Latency-aware Image Processing Transformer", arXiv, 2024 (Huawei). [Paper]
  • Super-Resolution:
    • SAN: "Second-Order Attention Network for Single Image Super-Resolution", CVPR, 2019 (Tsinghua). [Paper][PyTorch]
    • CS-NL: "Image Super-Resolution with Cross-Scale Non-Local Attention and Exhaustive Self-Exemplars Mining", CVPR, 2020 (UIUC). [Paper][PyTorch]
    • TTSR: "Learning Texture Transformer Network for Image Super-Resolution", CVPR, 2020 (Microsoft). [Paper][PyTorch]
    • HAN: "Single Image Super-Resolution via a Holistic Attention Network", ECCV, 2020 (Northeastern University). [Paper][PyTorch]
    • NLSN: "Image Super-Resolution With Non-Local Sparse Attention", CVPR, 2021 (UIUC). [Paper]
    • ITSRN: "Implicit Transformer Network for Screen Content Image Continuous Super-Resolution", NeurIPS, 2021 (Tianjin University). [Paper][PyTorch]
    • FPAN: "Feedback Pyramid Attention Networks for Single Image Super-Resolution", arXiv, 2021 (Nanjing University of Science and Technology). [Paper]
    • ESRT: "Efficient Transformer for Single Image Super-Resolution", arXiv, 2021 (Peking University). [Paper]
    • Fusformer: "Fusformer: A Transformer-based Fusion Approach for Hyperspectral Image Super-resolution", arXiv, 2021 (University of Electronic Science and Technology of China). [Paper]
    • DPT: "Detail-Preserving Transformer for Light Field Image Super-Resolution", AAAI, 2022 (Beijing Institute of Technology). [Paper][PyTorch]
    • BSRT: "BSRT: Improving Burst Super-Resolution with Swin Transformer and Flow-Guided Deformable Alignment", CVPRW, 2022 (Megvii). [Paper][PyTorch]
    • TATT: "A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution", CVPR, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
    • LBNet: "Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer", IJCAI, 2022 (Nanjing University of Posts and Telecommunications). [Paper][PyTorch (in construction)]
    • DATSR: "Reference-based Image Super-Resolution with Deformable Attention Transformer", ECCV, 2022 (ETHZ). [Paper][Code (in construction)]
    • ELAN: "Efficient Long-Range Attention Network for Image Super-resolution", ECCV, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
    • Swin2SR: "Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration", ECCVW, 2022 (University of Wurzburg, Germany). [Paper]
    • CAT: "Cross Aggregation Transformer for Image Restoration", NeurIPS, 2022 (Shanghai Jiao Tong). [Paper][PyTorch]
    • Stoformer: "Stochastic Window Transformer for Image Restoration", NeurIPS, 2022 (USTC). [Paper][PyTorch]
    • LFT: "Light Field Image Super-Resolution with Transformers", IEEE Signal Processing Letters, 2022 (National University of Defense Technology, China). [Paper][PyTorch]
    • ACT: "Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution", arXiv, 2022 (LG). [Paper]
    • HIPA: "HIPA: Hierarchical Patch Transformer for Single Image Super Resolution", arXiv, 2022 (CUHK). [Paper]
    • CTCNet: "CTCNet: A CNN-Transformer Cooperation Network for Face Image Super-Resolution", arXiv, 2022 (Nanjing University of Posts and Telecommunications). [Paper]
    • ShuffleMixer: "ShuffleMixer: An Efficient ConvNet for Image Super-Resolution", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
    • HST: "HST: Hierarchical Swin Transformer for Compressed Image Super-resolution", ECCVW, 2022 (USTC). [Paper]
    • SwinFIR: "SwinFIR: Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution", arXiv, 2022 (Samsung). [Paper]
    • ITSRN++: "ITSRN++: Stronger and Better Implicit Transformer Network for Continuous Screen Content Image Super-Resolution", arXiv, 2022 (Tianjin University). [Paper]
    • NGswin: "N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution", CVPR, 2023 (Sogang University, Korea). [Paper][PyTorch]
    • OSRT: "OSRT: Omnidirectional Image Super-Resolution with Distortion-aware Transformer", CVPR, 2023 (CAS). [Paper]
    • HAT: "Activating More Pixels in Image Super-Resolution Transformer", CVPR, 2023 (University of Macau). [Paper][PyTorch]
    • CLIT: "Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution", CVPR, 2023 (MediaTek). [Paper]
    • CiaoSR: "CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution", CVPR, 2023 (ETHZ). [Paper][PyTorch]
    • HTCAN: "Hybrid Transformer and CNN Attention Network for Stereo Image Super-resolution", CVPRW, 2023 (ByteDance). [Paper]
    • DAT: "Dual Aggregation Transformer for Image Super-Resolution", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
    • CRAFT: "Feature Modulation Transformer: Cross-Refinement of Global Representation via High-Frequency Prior for Image Super-Resolution", ICCV, 2023 (UESTC). [Paper][Code (in construction)]
    • ESSAformer: "ESSAformer: Efficient Transformer for Hyperspectral Image Super-resolution", ICCV, 2023 (Xidian University). [Paper][PyTorch]
    • SRFormer: "SRFormer: Permuted Self-Attention for Single Image Super-Resolution", ICCV, 2023 (Nankai University). [Paper]
    • ResShift: "ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • RGT: "Recursive Generalization Transformer for Image Super-Resolution", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • SOSR: "SOSR: Source-Free Image Super-Resolution with Wavelet Augmentation Transformer", arXiv, 2023 (CAS). [Paper]
    • HAT: "HAT: Hybrid Attention Transformer for Image Restoration", arXiv, 2023 (University of Macau). [Paper][PyTorch]
    • PromptSR: "Image Super-Resolution with Text Prompt Diffusion", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
    • Inf-DiT: "Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer", arXiv, 2024 (Zhipu AI). [Paper][Code (in construction)]
  • Denoise:
    • CharFormer: "CharFormer: A Glyph Fusion based Attentive Framework for High-precision Character Image Denoising", ACMMM, 2022 (Jilin University). [Paper][PyTorch (in construction)]
    • DenSformer: "Dense residual Transformer for image denoising", arXiv, 2022 (University of Science and Technology Beijing). [Paper]
    • PoCoformer: "Polarized Color Image Denoising using Pocoformer", arXiv, 2022 (The University of Tokyo). [Paper]
    • DnSwin: "DnSwin: Toward Real-World Denoising via Continuous Wavelet Sliding-Transformer", arXiv, 2022 (Guangdong University of Technology). [Paper]
    • SST: "Spatial-Spectral Transformer for Hyperspectral Image Denoising", arXiv, 2022 (Beijing Institute of Technology). [Paper][PyTorch]
    • MaskedDenoising: "Masked Image Training for Generalizable Deep Image Denoising", CVPR, 2023 (HKUST). [Paper][Code (in construction)]
    • SERT: "Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising", CVPR, 2023 (Beijing Institute of Technology). [Paper][PyTorch]
    • HSDT: "Hybrid Spectral Denoising Transformer with Guided Attention", ICCV, 2023 (Beijing Institute of Technology). [Paper][PyTorch]
    • Xformer: "Xformer: Hybrid X-Shaped Transformer for Image Denoising", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • CLIPDenoising: "Transfer CLIP for Generalizable Image Denoising", CVPR, 2024 (Huazhong University of Science and Technology (HUST)). [Paper]
  • Others:
    • SDNet: "SDNet: multi-branch for single image deraining using swin", arXiv, 2021 (Xinjiang University). [Paper][Code (in construction)]
    • ATTSF: "Attention! Stay Focus!", arXiv, 2021 (BridgeAI, Seoul). [Paper][Tensorflow]
    • HyLoG-ViT: "Hybrid Local-Global Transformer for Image Dehazing", arXiv, 2021 (Beihang University). [Paper]
    • HyperTransformer: "HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening", CVPR, 2022 (JHU). [Paper][PyTorch]
    • DeHamer: "Image Dehazing Transformer With Transmission-Aware 3D Position Embedding", CVPR, 2022 (Nankai University). [Paper][Website]
    • PTNet: "Learning Parallax Transformer Network for Stereo Image JPEG Artifacts Removal", ACMMM, 2022 (Fudan University). [Paper]
    • TurbNet: "Single Frame Atmospheric Turbulence Mitigation: A Benchmark Study and A New Physics-Inspired Transformer Model", ECCV, 2022 (Purdue + UT Austin). [Paper][PyTorch]
    • Stripformer: "Stripformer: Strip Transformer for Fast Image Deblurring", ECCV, 2022 (NTHU). [Paper]
    • DehazeFormer: "Vision Transformers for Single Image Dehazing", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • RSTCANet: "Residual Swin Transformer Channel Attention Network for Image Demosaicing", arXiv, 2022 (Tampere University, Finland). [Paper]
    • DRT: "DRT: A Lightweight Single Image Deraining Recursive Transformer", arXiv, 2022 (ANU, Australia). [Paper][PyTorch (in construction)]
    • Cubic-Mixer: "UHD Image Deblurring via Multi-scale Cubic-Mixer", arXiv, 2022 (Nanjing University of Science and Technology). [Paper]
    • MSP-Former: "MSP-Former: Multi-Scale Projection Transformer for Single Image Desnowing", arXiv, 2022 (Jimei University). [Paper]
    • ELF: "Magic ELF: Image Deraining Meets Association Learning and Transformer", arXiv, 2022 (Wuhan University). [Paper][PyTorch (in construction)]
    • SnowFormer: "SnowFormer: Scale-aware Transformer via Context Interaction for Single Image Desnowing", arXiv, 2022 (Jimei University, China). [Paper]
    • DMTNet: "DMTNet: Dynamic Multi-scale Network for Dual-pixel Images Defocus Deblurring with Transformer", arXiv, 2022 (Samsung). [Paper]
    • LMQFormer: "LMQFormer: A Laplace-Prior-Guided Mask Query Transformer for Lightweight Snow Removal", arXiv, 2022 (Fuzhou University). [Paper]
    • Semi-UFormer: "Semi-UFormer: Semi-supervised Uncertainty-aware Transformer for Image Dehazing", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper]
    • WITT: "WITT: A Wireless Image Transmission Transformer for Semantic Communications", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][Code (in construction)]
    • BiT: "Blur Interpolation Transformer for Real-World Motion from Blur", CVPR, 2023 (The University of Tokyo). [Paper][PyTorch][Website]
    • DRSformer: "Learning A Sparse Transformer Network for Effective Image Deraining", CVPR, 2023 (Nanjing University of Science and Technology). [Paper][PyTorch]
    • FFTformer: "Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring", CVPR, 2023 (Nanjing University of Science and Technology). [Paper][PyTorch]
    • MB-TaylorFormer: "MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
    • UDR-S2Former: "Sparse Sampling Transformer with Uncertainty-Driven Ranking for Unified Removal of Raindrops and Rain Streaks", ICCV, 2023 (HKUST). [Paper][PyTorch][Website]
    • HI-Diff: "Hierarchical Integration Diffusion Model for Realistic Image Deblurring", NeurIPS, 2023 (SJTU). [Paper][PyTorch]
    • SelfPromer: "SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper]
    • ?: "A Data-Centric Solution to NonHomogeneous Dehazing via Vision Transformer", arXiv, 2023 (McMaster University, Canada). [Paper][PyTorch]

[Back to Overview]

Video Restoration

  • VSR-Transformer: "Video Super-Resolution Transformer", arXiv, 2021 (ETHZ). [Paper][PyTorch]
  • MANA: "Memory-Augmented Non-Local Attention for Video Super-Resolution", CVPR, 2022 (JD). [Paper]
  • ?: "Bringing Old Films Back to Life", CVPR, 2022 (Microsoft). [Paper][Code (in construction)]
  • TTVSR: "Learning Trajectory-Aware Transformer for Video Super-Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • Trans-SVSR: "A New Dataset and Transformer for Stereoscopic Video Super-Resolution", CVPR, 2022 (Bahcesehir University, Turkey). [Paper][PyTorch]
  • STDAN: "STDAN: Deformable Attention Network for Space-Time Video Super-Resolution", CVPRW, 2022 (Tsinghua). [Paper]
  • VRT: "VRT: A Video Restoration Transformer", arXiv, 2022 (ETHZ). [Paper][PyTorch]
  • FGST: "Flow-Guided Sparse Transformer for Video Deblurring", ICML, 2022 (Tsinghua). [Paper][Code (in construction)]
  • RSTT: "RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • FTVSR: "Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • EFNet: "Event-Based Fusion for Motion Deblurring with Cross-modal Attention", ECCV, 2022 (ETHZ). [Paper]
  • TempFormer: "TempFormer: Temporally Consistent Transformer for Video Denoising", ECCV, 2022 (Disney). [Paper]
  • RVRT: "Recurrent Video Restoration Transformer with Guided Deformable Attention", NeurIPS, 2022 (ETHZ). [Paper][PyTorch]
  • ?: "Rethinking Alignment in Video Super-Resolution Transformers", NeurIPS, 2022 (Shanghai AI Lab). [Paper][PyTorch]
  • VDTR: "VDTR: Video Deblurring with Transformer", arXiv, 2022 (Tsinghua). [Paper][Code (in construction)]
  • DSCT: "Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
  • Group-ShiftNet: "No Attention is Needed: Grouped Spatial-temporal Shift for Simple and Efficient Video Restorers", arXiv, 2022 (CUHK). [Paper][Code (in construction)][Website]
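
A minimal sketch of the per-pixel temporal self-attention that underlies many of the video restoration transformers above (VRT-style models, for instance, aggregate information across frames): each spatial location attends over its T frames independently. This toy version is an illustrative assumption, not any listed paper's exact design.

```python
# Every spatial position becomes an independent length-T sequence, so each
# frame can borrow detail from its (degraded) neighbors.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)
        out = out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return x + out  # residual over the input frames' features

frames = torch.randn(1, 5, 32, 16, 16)   # features of 5 degraded frames
restored_feats = TemporalAttention(32)(frames)
```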

[Back to Overview]

Inpainting / Completion / Outpainting

  • Contextual-Attention: "Generative Image Inpainting with Contextual Attention", CVPR, 2018 (UIUC). [Paper][Tensorflow]
  • PEN-Net: "Learning Pyramid-Context Encoder Network for High-Quality Image Inpainting", CVPR, 2019 (Microsoft). [Paper][PyTorch]
  • Copy-Paste: "Copy-and-Paste Networks for Deep Video Inpainting", ICCV, 2019 (Yonsei University). [Paper][PyTorch]
  • Onion-Peel: "Onion-Peel Networks for Deep Video Completion", ICCV, 2019 (Yonsei University). [Paper][PyTorch]
  • STTN: "Learning Joint Spatial-Temporal Transformations for Video Inpainting", ECCV, 2020 (Microsoft). [Paper][PyTorch]
  • FuseFormer: "FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting", ICCV, 2021 (CUHK + SenseTime). [Paper][PyTorch]
  • ICT: "High-Fidelity Pluralistic Image Completion with Transformers", ICCV, 2021 (CUHK). [Paper][PyTorch][Website]
  • DSTT: "Decoupled Spatial-Temporal Transformer for Video Inpainting", arXiv, 2021 (CUHK + SenseTime). [Paper][Code (in construction)]
  • TFill: "TFill: Image Completion via a Transformer-Based Architecture", arXiv, 2021 (NTU Singapore). [Paper][Code (in construction)]
  • BAT-Fill: "Diverse Image Inpainting with Bidirectional and Autoregressive Transformers", arXiv, 2021 (NTU Singapore). [Paper]
  • ?: "Image-Adaptive Hint Generation via Vision Transformer for Outpainting", WACV, 2022 (Sogang University, Korea). [Paper]
  • ZITS: "Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding", CVPR, 2022 (Fudan). [Paper][PyTorch][Website]
  • MAT: "MAT: Mask-Aware Transformer for Large Hole Image Inpainting", CVPR, 2022 (CUHK). [Paper][PyTorch]
  • PUT: "Reduce Information Loss in Transformers for Pluralistic Image Inpainting", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • DLFormer: "DLFormer: Discrete Latent Transformer for Video Inpainting", CVPR, 2022 (Tencent). [Paper][Code (in construction)]
  • T-former: "T-former: An Efficient Transformer for Image Inpainting", ACMMM, 2022 (Xi'an Jiaotong). [Paper][PyTorch]
  • QueryOTR: "Outpainting by Queries", ECCV, 2022 (University of Liverpool, UK). [Paper][PyTorch (in construction)]
  • FGT: "Flow-Guided Transformer for Video Inpainting", ECCV, 2022 (USTC). [Paper][PyTorch]
  • MAE-FAR: "Learning Prior Feature and Attention Enhanced Image Inpainting", ECCV, 2022 (Fudan University). [Paper][PyTorch (in construction)][Website]
  • ?: "Visual Prompting via Image Inpainting", NeurIPS, 2022 (Berkeley). [Paper][PyTorch][Website]
  • U-Transformer: "Generalised Image Outpainting with U-Transformer", arXiv, 2022 (Xi'an Jiaotong-Liverpool University). [Paper]
  • SpA-Former: "SpA-Former: Transformer image shadow detection and removal via spatial attention", arXiv, 2022 (Shanghai Jiao Tong University). [Paper][PyTorch]
  • CRFormer: "CRFormer: A Cross-Region Transformer for Shadow Removal", arXiv, 2022 (Beijing Jiaotong University). [Paper]
  • DeViT: "DeViT: Deformed Vision Transformers in Video Inpainting", arXiv, 2022 (Kuaishou). [Paper]
  • ZITS++: "ZITS++: Image Inpainting by Improving the Incremental Transformer on Structural Priors", arXiv, 2022 (Fudan). [Paper]
  • TPFNet: "TPFNet: A Novel Text In-painting Transformer for Text Removal", arXiv, 2022 (?). [Paper][Code (in construction)]
  • FlowLens: "FlowLens: Seeing Beyond the FoV via Flow-guided Clip-Recurrent Transformer", arXiv, 2022 (Zhejiang University). [Paper][Code (in construction)]
  • ?: "Putting People in Their Place: Affordance-Aware Human Insertion into Scenes", CVPR, 2023 (Stanford). [Paper][PyTorch (in construction)][Website]
  • Imagen-Editor: "Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting", CVPR, 2023 (Google). [Paper][Website]
  • SmartBrush: "SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model", CVPR, 2023 (Adobe). [Paper]
  • NÜWA-LIP: "NÜWA-LIP: Language Guided Image Inpainting with Defect-free VQGAN", CVPR, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
  • ProPainter: "ProPainter: Improving Propagation and Transformer for Video Inpainting", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
  • Inst-Inpaint: "Inst-Inpaint: Instructing to Remove Objects with Diffusion Models", arXiv, 2023 (Bilkent University, Turkey). [Paper]
  • Inpaint-Anything: "Inpaint Anything: Segment Anything Meets Image Inpainting", arXiv, 2023 (USTC). [Paper][PyTorch]
  • TransRef: "TransRef: Multi-Scale Reference Embedding Transformer for Reference-Guided Image Inpainting", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
  • DMT: "Deficiency-Aware Masked Transformer for Video Inpainting", arXiv, 2023 (CAS). [Paper][Code (in construction)]
  • Magicremover: "Magicremover: Tuning-free Text-guided Image inpainting with Diffusion Models", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
  • LGVI: "Towards Language-Driven Video Inpainting via Multimodal Large Language Models", arXiv, 2024 (Shanghai AI Lab). [Paper][Code (in construction)][Website]

[Back to Overview]

Image Generation

  • IT: "Image Transformer", ICML, 2018 (Google). [Paper][Tensorflow]
  • PixelSNAIL: "PixelSNAIL: An Improved Autoregressive Generative Model", ICML, 2018 (Berkeley). [Paper][Tensorflow]
  • BigGAN: "Large Scale GAN Training for High Fidelity Natural Image Synthesis", ICLR, 2019 (DeepMind). [Paper][PyTorch]
  • SAGAN: "Self-Attention Generative Adversarial Networks", ICML, 2019 (Google). [Paper][Tensorflow]
  • VQGAN: "Taming Transformers for High-Resolution Image Synthesis", CVPR, 2021 (Heidelberg University). [Paper][PyTorch][Website]
  • ?: "High-Resolution Complex Scene Synthesis with Transformers", CVPRW, 2021 (Heidelberg University). [Paper]
  • GANsformer: "Generative Adversarial Transformers", ICML, 2021 (Stanford + Facebook). [Paper][Tensorflow]
  • PixelTransformer: "PixelTransformer: Sample Conditioned Signal Generation", ICML, 2021 (Facebook). [Paper][Website]
  • HWT: "Handwriting Transformers", ICCV, 2021 (MBZUAI). [Paper][Code (in construction)]
  • Paint-Transformer: "Paint Transformer: Feed Forward Neural Painting with Stroke Prediction", ICCV, 2021 (Baidu). [Paper][Paddle][PyTorch]
  • Geometry-Free: "Geometry-Free View Synthesis: Transformers and no 3D Priors", ICCV, 2021 (Heidelberg University). [Paper][PyTorch]
  • VTGAN: "VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers", ICCVW, 2021 (University of Nevada, Reno). [Paper]
  • ATISS: "ATISS: Autoregressive Transformers for Indoor Scene Synthesis", NeurIPS, 2021 (NVIDIA). [Paper][Website]
  • GANsformer2: "Compositional Transformers for Scene Generation", NeurIPS, 2021 (Stanford + Facebook). [Paper][Tensorflow]
  • TransGAN: "TransGAN: Two Transformers Can Make One Strong GAN", NeurIPS, 2021 (UT Austin). [Paper][PyTorch]
  • HiT: "Improved Transformer for High-Resolution GANs", NeurIPS, 2021 (Google). [Paper][Tensorflow]
  • iLAT: "The Image Local Autoregressive Transformer", NeurIPS, 2021 (Fudan). [Paper]
  • TokenGAN: "Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers", NeurIPS, 2021 (Microsoft). [Paper]
  • SceneFormer: "SceneFormer: Indoor Scene Generation with Transformers", arXiv, 2021 (TUM). [Paper]
  • SNGAN: "Combining Transformer Generators with Convolutional Discriminators", arXiv, 2021 (Fraunhofer ITWM). [Paper]
  • Invertible-Attention: "Invertible Attention", arXiv, 2021 (ANU). [Paper]
  • GPA: "Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation", arXiv, 2021 (Zalando Research, Germany). [Paper][PyTorch (in construction)]
  • ViTGAN: "ViTGAN: Training GANs with Vision Transformers", ICLR, 2022 (Google). [Paper][PyTorch][PyTorch (wilile26811249)]
  • ViT-VQGAN: "Vector-quantized Image Modeling with Improved VQGAN", ICLR, 2022 (Google). [Paper]
  • Style-Transformer: "Style Transformer for Image Inversion and Editing", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
  • StyleSwin: "StyleSwin: Transformer-based GAN for High-resolution Image Generation", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • Styleformer: "Styleformer: Transformer based Generative Adversarial Networks with Style Vector", CVPR, 2022 (Seoul National University). [Paper][PyTorch]
  • ?: "User-Controllable Latent Transformer for StyleGAN Image Layout Editing", Pacific Graphics, 2022 (University of Tsukuba). [Paper][Website]
  • DynaST: "DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation", ECCV, 2022 (NUS). [Paper][PyTorch]
  • DoodleFormer: "DoodleFormer: Creative Sketch Drawing with Transformers", ECCV, 2022 (MBZUAI). [Paper][PyTorch][Website]
  • U-Attention: "Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis", arXiv, 2022 (Adobe). [Paper]
  • MaskGIT: "MaskGIT: Masked Generative Image Transformer", CVPR, 2022 (Google). [Paper][PyTorch (dome272)]
  • AttnFlow: "Generative Flows with Invertible Attentions", CVPR, 2022 (ETHZ). [Paper]
  • NÜWA: "NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion", ECCV, 2022 (Microsoft). [Paper][GitHub]
  • Trans-INR: "Transformers as Meta-Learners for Implicit Neural Representations", ECCV, 2022 (UCSD). [Paper][PyTorch][Website]
  • ViewFormer: "ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers", ECCV, 2022 (Czech Technical University in Prague). [Paper][Tensorflow]
  • Unleashing-Transformer: "Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes", ECCV, 2022 (Durham University, UK). [Paper][PyTorch]
  • CASD: "Cross Attention Based Style Distribution for Controllable Person Image Synthesis", ECCV, 2022 (East China Norma lUniversity). [Paper]
  • VQGAN-CLIP: "VQGAN-CLIP: Open Domain Image Generation and Manipulation Using Natural Language ", ECCV, 2022 (EleutherAI). [Paper][PyTorch]
  • Token-Critic: "Improved Masked Image Generation with Token-Critic", ECCV, 2022 (Google). [Paper]
  • PromptGen: "Generative Visual Prompt: Unifying Distributional Control of Pre-Trained Generative Models", NeurIPS, 2022 (CMU). [Paper][PyTorch]
  • Contextual-RQ-Transformer: "Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer", NeurIPS, 2022 (POSTECH + Kakao). [Paper]
  • ViT-Patch: "A Robust Framework of Chromosome Straightening with ViT-Patch GAN", arXiv, 2022 (Xi'an Jiaotong-Liverpool University). [Paper]
  • ?: "Transforming Image Generation from Scene Graphs", arXiv, 2022 (University of Catania, Italy). [Paper]
  • VisionNeRF: "Vision Transformer for NeRF-Based View Synthesis from a Single Input Image", arXiv, 2022 (Google). [Paper][Website]
  • NUWA-Infinity: "NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis", arXiv, 2022 (Microsoft). [Paper][GitHub][Website]
  • Diffusion-ViT: "Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model", arXiv, 2022 (Etsy, NY). [Paper]
  • ?: "Visual Prompt Tuning for Generative Transfer Learning", CVPR, 2023 (Google). [Paper][JAX]
  • SeQ-GAN: "Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis", arXiv, 2022 (Tencent). [Paper][Code (in construction)]
  • ?: "Style-Guided Inference of Transformer for High-resolution Image Synthesis", WACV, 2023 (NCSOFT, Korea). [Paper]
  • Frido: "Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis", AAAI, 2023 (Microsoft). [Paper][PyTorch]
  • GNT: "Is Attention All That NeRF Needs?", ICLR, 2023 (UT Austin). [Paper][PyTorch][Website]
  • DPC: "Discrete Predictor-Corrector Diffusion Models for Image Synthesis", ICLR, 2023 (Google). [Paper]
  • LayoutDM: "LayoutDM: Discrete Diffusion Model for Controllable Layout Generation", CVPR, 2023 (CyberAgent, Japan). [Paper][PyTorch][Website]
  • GTGAN: "Graph Transformer GANs for Graph-Constrained House Generation", CVPR, 2023 (ETHZ). [Paper]
  • U-ViT: "All are Worth Words: A ViT Backbone for Diffusion Models", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
  • MQ-VAE: "Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation", CVPR, 2023 (USTC). [Paper][PyTorch]
  • MaskSketch: "MaskSketch: Unpaired Structure-guided Masked Image Generation", CVPR, 2023 (Google). [Paper][JAX][Website]
  • GAN-MAE: "Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond", CVPR, 2023 (Meituan). [Paper]
  • Reg-VQ: "Regularized Vector Quantization for Tokenized Image Synthesis", CVPR, 2023 (NTU, Singapore). [Paper]
  • LCP-GAN: "Exploring Intra-Class Variation Factors With Learnable Cluster Prompts for Semi-Supervised Image Synthesis", CVPR, 2023 (South China University of Technology). [Paper]
  • Slot-VAE: "Slot-VAE: Object-Centric Scene Generation with Slot Attention", ICML, 2023 (Delft University of Technology, Netherland). [Paper]
  • Efficient-VQGAN: "Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers", ICCV, 2023 (Alibaba). [Paper]
  • MDT: "Masked Diffusion Transformer is a Strong Image Synthesizer", ICCV, 2023 (Sea AI Lab). [Paper][PyTorch]
  • LayoutPrompter: "LayoutPrompter: Awaken the Design Ability of Large Language Models", NeurIPS, 2023 (Microsoft). [Paper]
  • LayoutGPT: "LayoutGPT: Compositional Visual Planning and Generation with Large Language Models", NeurIPS, 2023 (UCSB). [Paper][PyTorch][Website]
  • Diff-Instruct: "Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models", NeurIPS, 2023 (Huawei). [Paper]
  • VQ3D: "VQ3D: Learning a 3D-Aware Generative Model on ImageNet", arXiv, 2023 (Stanford). [Paper][Website]
  • LayoutDiffuse: "LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation", arXiv, 2023 (Amazon). [Paper]
  • StraIT: "StraIT: Non-autoregressive Generation with Stratified Image Transformer", arXiv, 2023 (Google). [Paper]
  • MMoT: "MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis", arXiv, 2023 (South China University of Technology). [Paper][PyTorch (in construction)][Website]
  • MaskDiT: "Fast Training of Diffusion Models with Masked Transformers", arXiv, 2023 (NVIDIA). [Paper]
  • Dolfin: "Dolfin: Diffusion Layout Transformers without Autoencoder", arXiv, 2023 (UCSD). [Paper]
  • RALF: "Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation", arXiv, 2023 (The University of Tokyo). [Paper][Website]
  • GIVT: "GIVT: Generative Infinite-Vocabulary Transformers", arXiv, 2023 (Google). [Paper]
  • DiffiT: "DiffiT: Diffusion Vision Transformers for Image Generation", arXiv, 2023 (NVIDIA). [Paper][Code (in construction)]
  • RCG: "Self-conditioned Image Generation via Generating Representations", arXiv, 2023 (MIT). [Paper][PyTorch]
  • GSN: "GSN: Generalisable Segmentation in Neural Radiance Field", AAAI, 2024 (IIIT Hyderabad). [Paper][PyTorch][Website]
  • HDiT: "Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers", arXiv, 2024 (Stability AI). [Paper][Website]
  • ZigMa: "ZigMa: Zigzag Mamba Diffusion Model", arXiv, 2024 (LMU Munich). [Paper][Code (in construction)][Website]
  • VAR: "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction", arXiv, 2024 (Bytedance). [Paper][PyTorch (in construction)][Website]

[Back to Overview]

Video Generation

  • Subscale: "Scaling Autoregressive Video Models", ICLR, 2020 (Google). [Paper][Website]
  • ConvTransformer: "ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis", arXiv, 2020 (Southeast University). [Paper]
  • OCVT: "Generative Video Transformer: Can Objects be the Words?", ICML, 2021 (Rutgers University). [Paper]
  • AIST++: "Learn to Dance with AIST++: Music Conditioned 3D Dance Generation", arXiv, 2021 (Google). [Paper][Code][Website]
  • VideoGPT: "VideoGPT: Video Generation using VQ-VAE and Transformers", arXiv, 2021 (Berkeley). [Paper][PyTorch][Website]
  • DanceFormer: "DanceFormer: Music Conditioned 3D Dance Generation with Parametric Motion Transformer", AAAI, 2022 (Huiye Technology, China). [Paper]
  • VFIformer: "Video Frame Interpolation with Transformer", CVPR, 2022 (CUHK). [Paper][PyTorch]
  • VFIT: "Video Frame Interpolation Transformer", CVPR, 2022 (McMaster Univeristy, Canada). [Paper][PyTorch]
  • MoTrans: "Motion Transformer for Unsupervised Image Animation", ECCV, 2022 (Alibaba). [Paper][PyTorch]
  • Transframer: "Transframer: Arbitrary Frame Prediction with Generative Models", arXiv, 2022 (DeepMind). [Paper]
  • TATS: "Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer", ECCV, 2022 (Maryland). [Paper][Website]
  • POVT: "Patch-based Object-centric Transformers for Efficient Video Generation", arXiv, 2022 (Berkeley). [Paper][PyTorch][Website]
  • TAIN: "Cross-Attention Transformer for Video Interpolation", arXiv, 2022 (Duke). [Paper]
  • TTVFI: "TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation", arXiv, 2022 (Microsoft). [Paper]
  • SlotFormer: "SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models", arXiv, 2022 (University of Toronto). [Paper][Website]
  • Human-MotionFormer: "Human MotionFormer: Transferring Human Motions with Vision Transformers", ICLR, 2023 (HKUST + Huya). [Paper][Code (in construction)]
  • MAGVIT: "MAGVIT: Masked Generative Video Transformer", CVPR, 2023 (Google). [Paper][Code (in construction)][Website]
  • MeBT: "Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers", CVPR, 2023 (Kakao). [Paper][PyTorch][Website]
  • BiFormer: "BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation", CVPR, 2023 (Korea University). [Paper][PyTorch (in construction)]
  • AMT: "AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation", CVPR, 2023 (Nankai University). [Paper][PyTorch][Website]
  • ?: "Frame Interpolation Transformer and Uncertainty Guidance", CVPR, 2023 (Disney). [Paper]
  • EMA-VFI: "Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation", CVPR, 2023 (Nanjing University). [Paper][PyTorch] (see the inter-frame-attention sketch after this list)
  • EIF-BiOFNet: "Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields", CVPR, 2023 (KAIST). [Paper]
  • TECO: "Temporally Consistent Video Transformer for Long-Term Video Prediction", ICML, 2023 (Berkeley). [Paper][JAX][Website]
  • VFIFT: "Video Frame Interpolation with Flow Transformer", ACMMM, 2023 (Nanjing University of Aeronautics and Astronautics). [Paper]
  • ConvSSM: "Convolutional State Space Models for Long-Range Spatiotemporal Modeling", NeurIPS, 2023 (NVIDIA). [Paper][JAX]
  • NUWA-XL: "NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation", arXiv, 2023 (Microsoft). [Paper][Website (in construction)]
  • CAT-NeRF: "CAT-NeRF: Constancy-Aware Tx2Former for Dynamic Body Modeling", arXiv, 2023 (USC). [Paper]
  • IconShop: "IconShop: Text-Based Vector Icon Synthesis with Autoregressive Transformers", arXiv, 2023 (CUHK). [Paper][Code (in construction)][Website]
  • VDT: "VDT: An Empirical Study on Video Diffusion with Transformers", arXiv, 2023 (Renmin University of China). [Paper][PyTorch]
  • MAGVIT-v2: "Language Model Beats Diffusion - Tokenizer is Key to Visual Generation", arXiv, 2023 (Google). [Paper][Website]
  • UVDv1: "Sequential Modeling Enables Scalable Learning for Large Vision Models", arXiv, 2023 (Berkeley). [Paper][Code (in construction)][Website]
  • W.A.L.T: "Photorealistic Video Generation with Diffusion Models", arXiv, 2023 (Google). [Paper][Website]
  • ?: "SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces", ICLRW, 2024 (University of Tokyo). [Paper][PyTorch]
  • ?: "Video as the New Language for Real-World Decision Making", arXiv, 2024 (DeepMind). [Paper]
  • Exo2Ego: "Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos", arXiv, 2024 (Meta). [Paper]
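
A minimal sketch of inter-frame cross-attention for frame interpolation, loosely inspired by entries such as EMA-VFI above: middle-frame features are initialized from the average of the two input frames and refined by attending to tokens from both. The initialization, dimensions, and single-layer design are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class InterFrameAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, f0, f1):        # f0, f1: (B, N, C) frame tokens
        query = 0.5 * (f0 + f1)       # crude initial guess of the middle frame
        context = torch.cat([f0, f1], dim=1)   # keys/values from both frames
        mid, _ = self.attn(query, context, context)
        return query + mid            # refined middle-frame features

f0 = torch.randn(1, 256, 64)          # 16x16 patch tokens of frame t
f1 = torch.randn(1, 256, 64)          # 16x16 patch tokens of frame t+1
mid_feats = InterFrameAttention()(f0, f1)  # decode with a small head to pixels
```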

[Back to Overview]

Transfer / Translation / Manipulation

  • AdaAttN: "AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer", ICCV, 2021 (Baidu). [Paper][Paddle][PyTorch]
  • StyleCLIP: "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery", ICCV, 2021 (Hebrew University of Jerusalem). [Paper][PyTorch]
  • StyTr2: "StyTr^2: Unbiased Image Style Transfer with Transformers", CVPR, 2022 (CAS). [Paper][PyTorch] (see the cross-attention sketch after this list)
  • InstaFormer: "InstaFormer: Instance-Aware Image-to-Image Translation with Transformer", CVPR, 2022 (Korea University). [Paper]
  • ManiTrans: "ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation", CVPR, 2022 (Huawei). [Paper][Website]
  • QS-Attn: "QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation", CVPR, 2022 (Shanghai Key Laboratory). [Paper][PyTorch]
  • Splice: "Splicing ViT Features for Semantic Appearance Transfer", CVPR, 2022 (Weizmann Institute of Science, Israel). [Paper][PyTorch][Website]
  • ASSET: "ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions", SIGGRAPH, 2022 (Adobe). [Paper][PyTorch][Website]
  • SCAM: "SCAM! Transferring humans between images with Semantic Cross Attention Modulation", ECCV, 2022 (Univ Gustave Eiffel, France). [Paper][PyTorch][Website]
  • TargetCLIP: "Image-Based CLIP-Guided Essence Transfer", ECCV, 2022 (Tel Aviv). [Paper][PyTorch]
  • FFCLIP: "One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations", NeurIPS, 2022 (Tencent). [Paper][Code (in construction)]
  • STTR: "Fine-Grained Image Style Transfer with Visual Transformers", ACCV, 2022 (The Univerisity of Tokyo). [Paper][PyTorch (in construction)]
  • UVCGAN: "UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation", arXiv, 2022 (Brookhaven National Laboratory, NY). [Paper]
  • ITTR: "ITTR: Unpaired Image-to-Image Translation with Transformers", arXiv, 2022 (Kuaishou). [Paper]
  • CLIPasso: "CLIPasso: Semantically-Aware Object Sketching", arXiv, 2022 (EPFL). [Paper][PyTorch][Website]
  • CTrGAN: "CTrGAN: Cycle Transformers GAN for Gait Transfer", arXiv, 2022 (Ariel University, Israel). [Paper]
  • PI-Trans: "PI-Trans: Parallel-ConvMLP and Implicit-Transformation Based GAN for Cross-View Image Translation", arXiv, 2022 (University of Trento, Italy). [Paper][PyTorch (in construction)]
  • CSLA: "Bridging CLIP and StyleGAN through Latent Alignment for Image Editing", arXiv, 2022 (Kuaishou). [Paper]
  • CLIP-PAE: "CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Image Manipulation", arXiv, 2022 (University of Cambridge). [Paper]
  • S2WAT: "S2WAT: Image Style Transfer via Hierarchical Vision Transformer using Strips Window Attention", arXiv, 2022 (Sichuan Normal University). [Paper]
  • DiffuseIT: "Diffusion-based Image Translation using Disentangled Style and Content Representation", ICLR, 2023 (KAIST). [Paper]
  • MATEBIT: "Masked and Adaptive Transformer for Exemplar Based Image Translation", CVPR, 2023 (Hangzhou Dianzi University). [Paper][PyTorch]
  • IPL: "Zero-shot Generative Model Adaptation via Image-specific Prompt Learning", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
  • Master: "Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer", CVPR, 2023 (NUS). [Paper]
  • LENeRF: "Local 3D Editing via 3D Distillation of CLIP Knowledge", CVPR, 2023 (Kakao). [Paper]
  • SINE: "SINE: SINgle Image Editing with Text-to-Image Diffusion Models", CVPR, 2023 (Rutgers). [Paper][PyTorch]
  • Imagic: "Imagic: Text-Based Real Image Editing with Diffusion Models", CVPR, 2023 (Google). [Paper][Website]
  • DATID-3D: "DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model", CVPR, 2023 (SNU). [Paper][PyTorch][Website]
  • Null-text-Inversion: "Null-text Inversion for Editing Real Images using Guided Diffusion Models", CVPR, 2023 (Google). [Paper]
  • LANIT: "LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data", CVPR, 2023 (Korea University). [Paper][PyTorch]
  • StylerDALLE: "StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model", ICCV, 2023 (University of Trento, Italy). [Paper][PyTorch]
  • ****: "Disentangling Structure and Appearance in ViT Feature Space", ACM ToG, 2023 (Weizmann Institute of Science (WIS), Israel). [Paper][PyTorch][Website]
  • pix2pix-zero: "Zero-shot Image-to-Image Translation", arXiv, 2023 (Adobe). [Paper][Code (in construction)][Website]
  • SpectralCLIP: "SpectralCLIP: Preventing Artifacts in Text-Guided Style Transfer from a Spectral Perspective", arXiv, 2023 (University of Trento, Italy). [Paper][Code (in construction)]
  • PGIC: "A Unified Prompt-Guided In-Context Inpainting Framework for Reference-based Image Manipulations", arXiv, 2023 (Fudan). [Paper][Code (in construction)]

[Back to Overview]

Other Low-Level Tasks

  • Colorization:
    • ColTran: "Colorization Transformer", ICLR, 2021 (Google). [Paper][Tensorflow]
    • ViT-I-GAN: "ViT-Inception-GAN for Image Colourising", arXiv, 2021 (D.Y Patil College of Engineering, India). [Paper]
    • CT2: "CT2: Colorization Transformer via Color Tokens", ECCV, 2022 (Peking University). [Paper][PyTorch]
    • L-CoDer: "L-CoDer: Language-based Colorization with Color-object Decoupling Transformer", ECCV, 2022 (Beijing University of Posts and Telecommunications). [Paper]
    • ColorFormer: "ColorFormer: Image Colorization via Color Memory assisted Hybrid-attention Transformer", ECCV, 2022 (Tencent). [Paper]
    • UniColor: "UniColor: A Unified Framework for Multi-Modal Colorization with Transformer", SIGGRAPH Asia, 2022 (CUHK). [Paper][Website]
    • iColoriT: "iColoriT: Towards Propagating Local Hint to the Right Region in Interactive Colorization by Leveraging Vision Transformer", arXiv, 2022 (KAIST). [Paper]
    • L-CoIns: "L-CoIns: Language-based Colorization with Instance Awareness", CVPR, 2023 (Beijing University of Posts and Telecommunications). [Paper]
    • L-CAD: "L-CAD: Language-based Colorization with Any-level Descriptions using Diffusion Priors", NeurIPS, 2023 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • Enhancement:
    • PanFormer: "PanFormer: a Transformer Based Model for Pan-sharpening", ICME, 2022 (Beihang University). [Paper][PyTorch]
    • URSCT-UIE: "Reinforced Swin-Convs Transformer for Underwater Image Enhancement", arXiv, 2022 (Ningbo University). [Paper]
    • IAT: "Illumination Adaptive Transformer", arXiv, 2022 (The University of Tokyo). [Paper][PyTorch]
    • SPGAT: "Structural Prior Guided Generative Adversarial Transformers for Low-Light Image Enhancement", arXiv, 2022 (The Hong Kong Polytechnic University). [Paper]
    • SSTF: "End-to-end Transformer for Compressed Video Quality Enhancement", arXiv, 2022 (Nanjing University of Information Science and Technology). [Paper]
    • CLIP-LiT: "Iterative Prompt Learning for Unsupervised Backlit Image Enhancement", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • Retinexformer: "Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
  • High Dynamic Range (HDR):
    • CA-ViT: "Ghost-free High Dynamic Range Imaging with Context-aware Transformer", ECCV, 2022 (Megvii). [Paper][PyTorch]
    • Selective-TransHDR: "Selective TransHDR: Transformer-Based Selective HDR Imaging Using Ghost Region Mask", ECCV, 2022 (Sogang University, Korea). [Paper]
    • Text2Light: "Text2Light: Zero-Shot Text-Driven HDR Panorama Generation", SIGGRAPH Asia, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
    • SMAE: "SMAE: Few-shot Learning for HDR Deghosting with Saturation-Aware Masked Autoencoders", CVPR, 2023 (Northwestern Polytechnical University). [Paper]
    • SCTNet: "Alignment-free HDR Deghosting with Semantics Consistent Transformer", ICCV, 2023 (University of Bourgogne, France). [Paper][Website]
    • ?: "Online Overexposed Pixels Hallucination in Videos with Adaptive Reference Frame Selection", arXiv, 2023 (NVIDIA). [Paper]
    • IFT: "IFT: Image Fusion Transformer for Ghost-free High Dynamic Range Imaging", arXiv, 2023 (Huawei). [Paper]
  • Harmonization:
    • HT: "Image Harmonization With Transformer", ICCV, 2021 (Ocean University of China). [Paper]
    • LEMaRT: "LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization", CVPR, 2023 (Amazon). [Paper]
  • Compression:
    • ?: "Towards End-to-End Image Compression and Analysis with Transformers", AAAI, 2022 (1Harbin Institute of Technology). [Paper][PyTorch]
    • Entroformer: "Entroformer: A Transformer-based Entropy Model for Learned Image Compression", ICLR, 2022 (Alibaba). [Paper]
    • STF: "The Devil Is in the Details: Window-based Attention for Image Compression", CVPR, 2022 (CAS). [Paper][PyTorch]
    • Contextformer: "Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression", ECCV, 2022 (TUM). [Paper]
    • VCT: "VCT: A Video Compression Transformer", NeurIPS, 2022 (Google). [Paper]
    • MIMT: "MIMT: Masked Image Modeling Transformer for Video Compression", ICLR, 2023 (Tencent). [Paper]
    • TCM: "Learned Image Compression with Mixed Transformer-CNN Architectures", CVPR, 2023 (Waseda University). [Paper][PyTorch]
    • TransTIC: "TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception", ICCV, 2023 (NYCU). [Paper]
    • Prompt-ICM: "Prompt-ICM: A Unified Framework towards Image Coding for Machines with Task-driven Prompts", arXiv, 2023 (USTC). [Paper]
    • FAT-LIC: "Frequency-Aware Transformer for Learned Image Compression", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • Matting:
    • MatteFormer: "MatteFormer: Transformer-Based Image Matting via Prior-Tokens", CVPR, 2022 (SNU + NAVER). [Paper][PyTorch]
    • TransMatting: "TransMatting: Enhancing Transparent Objects Matting with Transformers", ECCV, 2022 (CAS). [Paper][Code (in construction)]
    • VMFormer: "VMFormer: End-to-End Video Matting with Transformer", arXiv, 2022 (PicsArt). [Paper][PyTorch][Website]
    • CLIPMat: "Referring Image Matting", CVPR, 2023 (The University of Sydney). [Paper][Code (in construction)]
    • ViTMatte: "ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers", arXiv, 2023 (Xiaobing.AI). [Paper]
    • MAM: "Matting Anything", arXiv, 2023 (UIUC). [Paper][PyTorch][Website]
    • MaGGIe: "MaGGIe: Masked Guided Gradual Human Instance Matting", CVPR, 2024 (Adobe). [Paper][Website]
  • Reconstruction:
    • ET-Net: "Event-Based Video Reconstruction Using Transformer", ICCV, 2021 (University of Science and Technology of China). [Paper][PyTorch]
    • GradViT: "GradViT: Gradient Inversion of Vision Transformers", CVPR, 2022 (NVIDIA). [Paper][Website]
    • MST: "Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction", CVPR, 2022 (Tsinghua). [Paper][PyTorch]
    • MST++: "MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction", CVPRW, 2022 (Tsinghua). [Paper][PyTorch]
    • CST: "Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
    • DAUHST: "Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch]
    • S2-Transformer: "S2-Transformer for Mask-Aware Hyperspectral Image Reconstruction", arXiv, 2022 (Rochester Institute of Technology). [Paper]
    • NLOST: "NLOST: Non-Line-of-Sight Imaging with Transformer", CVPR, 2023 (USTC). [Paper][Code (in construction)]
    • MinD-Vis: "Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding", CVPR, 2023 (NUS). [Paper][PyTorch][Website]
    • PADUT: "Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction", ICCV, 2023 (Beijing Institute of Technology). [Paper][PyTorch]
    • GTA: "Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction", NeurIPS, 2023 (Zhejiang). [Paper]
  • Radiance Fields:
    • NeXT: "NeXT: Towards High Quality Neural Radiance Fields via Multi-Skip Transformer", ECCV, 2022 (Tsinghua University). [Paper][JAX]
    • TransNeRF: "Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer", arXiv, 2022 (UBC). [Paper]
    • ABLE-NeRF: "ABLE-NeRF: Attention-Based Rendering with Learnable Embeddings for Neural Radiance Field", CVPR, 2023 (NTU, Singapore). [Paper]
    • TransHuman: "TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering", ICCV, 2023 (Alibaba). [Paper][PyTorch][Website]
    • GNT-MOVE: "Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts", ICCV, 2023 (UT Austin). [Paper][PyTorch]
    • ReTR: "ReTR: Modeling Rendering Via Transformer for Generalizable Neural Surface Reconstruction", NeurIPS, 2023 (HKUST). [Paper][PyTorch]
  • 3D:
    • MNSRNet: "MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution", CVPR, 2022 (Shenzhen University). [Paper]
  • Others:
    • TransMEF: "TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning", AAAI, 2022 (Fudan). [Paper]
    • MS-Unet: "Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer", CVPR, 2022 (Megvii). [Paper][Code (in construction)]
    • TransCL: "TransCL: Transformer Makes Strong and Flexible Compressive Learning", TPAMI, 2022 (Peking University). [Paper][Code (in construction)]
    • GAP-CSCoT: "Spectral Compressive Imaging Reconstruction Using Convolution and Spectral Contextual Transformer", arXiv, 2022 (CAS). [Paper]
    • MatFormer: "MatFormer: A Generative Model for Procedural Materials", arXiv, 2022 (Adobe). [Paper]
    • FishFormer: "FishFormer: Annulus Slicing-based Transformer for Fisheye Rectification with Efficacy Domain Exploration", arXiv, 2022 (Beijing Jiaotong University). [Paper]
    • STFormer: "Spatial-Temporal Transformer for Video Snapshot Compressive Imaging", arXiv, 2022 (CAS). [Paper][PyTorch]
    • OCTUF: "Optimization-Inspired Cross-Attention Transformer for Compressive Sensing", CVPR, 2023 (Peking University). [Paper][PyTorch]
    • TopNet: "TopNet: Transformer-based Object Placement Network for Image Compositing", CVPR, 2023 (Adobe). [Paper]
    • RHWF: "Recurrent Homography Estimation Using Homography-Guided Image Warping and Focus Transformer", CVPR, 2023 (Zhejiang University). [Paper][Code (in construction)]
    • M2T: "M2T: Masking Transformers Twice for Faster Decoding", ICCV, 2023 (Google). [Paper]
    • CTM: "Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging", ICCV, 2023 (CAS). [Paper]
    • PromptGIP: "Unifying Image Processing as Visual Prompting Question Answering", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • FILM: "Image Fusion via Vision-Language Model", arXiv, 2024 (Xi'an Jiaotong University). [Paper]

[Back to Overview]

Reinforcement Learning

Navigation

  • VTNet: "VTNet: Visual Transformer Network for Object Goal Navigation", ICLR, 2021 (ANU). [Paper]
  • MaAST: "MaAST: Map Attention with Semantic Transformersfor Efficient Visual Navigation", ICRA, 2021 (SRI). [Paper]
  • TransFuser: "Multi-Modal Fusion Transformer for End-to-End Autonomous Driving", CVPR, 2021 (MPI). [Paper][PyTorch]
  • CMTP: "Topological Planning With Transformers for Vision-and-Language Navigation", CVPR, 2021 (Stanford). [Paper]
  • VLN-BERT: "VLN-BERT: A Recurrent Vision-and-Language BERT for Navigation", CVPR, 2021 (ANU). [Paper][PyTorch]
  • E.T.: "Episodic Transformer for Vision-and-Language Navigation", ICCV, 2021 (Google). [Paper][PyTorch]
  • HAMT: "History Aware Multimodal Transformer for Vision-and-Language Navigation", NeurIPS, 2021 (INRIA). [Paper][PyTorch][Website]
  • SOAT: "SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation", NeurIPS, 2021 (Georgia Tech). [Paper]
  • OMT: "Object Memory Transformer for Object Goal Navigation", ICRA, 2022 (AIST, Japan). [Paper]
  • ADAPT: "ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts", CVPR, 2022 (Huawei). [Paper]
  • DUET: "Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation", CVPR, 2022 (INRIA). [Paper][Website]
  • LSA: "Local Slot Attention for Vision-and-Language Navigation", ICMR, 2022 (Fudan). [Paper]
  • ?: "Learning from Unlabeled 3D Environments for Vision-and-Language Navigation", ECCV, 2022 (INRIA). [Paper][Website]
  • MTVM: "Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation", ECCV, 2022 (ByteDance). [Paper][PyTorch]
  • DDL: "Learning Disentanglement with Decoupled Labels for Vision-Language Navigation", ECCV, 2022 (Beijing Institute of Technology). [Paper][PyTorch]
  • Sim2Sim: "Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments", ECCV, 2022 (Oregon State University). [Paper][PyTorch][Website]
  • AVLEN: "AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments", NeurIPS, 2022 (UC Riverside). [Paper]
  • ZSON: "ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings", NeurIPS, 2022 (Georgia Tech). [Paper]
  • WS-MGMap: "Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation", NeurIPS, 2022 (South China University of Technology). [Paper][PyTorch (in construction)]
  • CLIP-Nav: "CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation", CoRLW, 2022 (Amazon). [Paper]
  • TransFuser: "TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving", arXiv, 2022 (MPI). [Paper]
  • TD-STP: "Target-Driven Structured Transformer Planner for Vision-Language Navigation", arXiv, 2022 (Beihang University). [Paper][Code (in construction)]
  • DAVIS: "Anticipating the Unseen Discrepancy for Vision and Language Navigation", arXiv, 2022 (UCSB). [Paper]
  • LOViS: "LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation", arXiv, 2022 (Michigan State). [Paper]
  • BEVBert: "BEVBert: Topo-Metric Map Pre-training for Language-guided Navigation", arXiv, 2022 (CAS). [Paper]
  • Meta-Explore: "Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding", CVPR, 2023 (Seoul National University). [Paper][Website]
  • LANA: "Lana: A Language-Capable Navigator for Instruction Following and Generation", CVPR, 2023 (Zhejiang University). [Paper][PyTorch (in construction)]
  • KERM: "KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation", CVPR, 2023 (CAS). [Paper][PyTorch]
  • VLN-SIG: "Improving Vision-and-Language Navigation by Generating Future-View Image Semantics", CVPR, 2023 (UNC). [Paper][PyTorch][Website]
  • GeoVLN: "GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation", CVPR, 2023 (Fudan). [Paper]
  • IVLN: "Iterative Vision-and-Language Navigation", CVPR, 2023 (Oregon State University). [Paper]
  • AZHP: "Adaptive Zone-aware Hierarchical Planner for Vision-Language Navigation", CVPR, 2023 (Beihang University). [Paper][Code (in construction)]
  • MARVAL: "A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning", CVPR, 2023 (Google). [Paper]
  • VO-Transformer: "Modality-invariant Visual Odometry for Embodied Vision", CVPR, 2023 (EPFL). [Paper][Website]
  • VLN-Behave: "Behavioral Analysis of Vision-and-Language Navigation Agents", CVPR, 2023 (Oregon State). [Paper][Code]
  • Lily: "Learning Vision-and-Language Navigation from YouTube Videos", ICCV, 2023 (South China University of Technology). [Paper][PyTorch]
  • ScaleVLN: "Scaling Data Generation in Vision-and-Language Navigation", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • BSG: "Bird's-Eye-View Scene Graph for Vision-Language Navigation", ICCV, 2023 (Zhejiang University). [Paper][Code (in construction)]
  • AerialVLN: "AerialVLN: Vision-and-Language Navigation for UAVs", ICCV, 2023 (Northwestern Polytechnical University). [Paper][PyTorch]
  • DREAMWALKER: "DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation", ICCV, 2023 (Beijing Institute of Technology). [Paper][Code (in construction)]
  • VLN-PETL: "VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation", ICCV, 2023 (The University of Adelaide, Australia). [Paper][Code (in construction)]
  • MiC: "March in Chat: Interactive Prompting for Remote Embodied Referring Expression", ICCV, 2023 (The University of Adelaide, Australia). [Paper][Code (in construction)]
  • GELA: "Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation", ICCV, 2023 (Chinese Academy of Military Science). [Paper][PyTorch]
  • GridMM: "GridMM: Grid Memory Map for Vision-and-Language Navigation", ICCV, 2023 (CAS). [Paper][PyTorch]
  • LLM-Planner: "LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models", ICCV, 2023 (OSU). [Paper][Code (in construction)][Website]
  • Le-RNR-Map: "Language-enhanced RNR-Map: Querying Renderable Neural Radiance Field maps with natural language", ICCVW, 2023 (University of Verona, Italy). [Paper][Code (in construction)][Website]
  • LACMA: "LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following", EMNLP, 2023 (Microsoft). [Paper][PyTorch]
  • FGPrompt: "FGPrompt: Fine-grained Goal Prompting for Image-goal Navigation", NeurIPS, 2023 (South China University of Technology). [Paper][PyTorch][Website]
  • PanoGen: "PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation", NeurIPS, 2023 (UNC). [Paper][PyTorch][Website]
  • MLANet: "MLANet: Multi-Level Attention Network with Sub-instruction for Continuous Vision-and-Language Navigation", arXiv, 2023 (Tongji University). [Paper][PyTorch]
  • ENTL: "ENTL: Embodied Navigation Trajectory Learner", arXiv, 2023 (AI2). [Paper]
  • MPM: "Masked Path Modeling for Vision-and-Language Navigation", arXiv, 2023 (UCLA). [Paper]
  • NavGPT: "NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models", arXiv, 2023 (The University of Adelaide, Australia). [Paper]
  • MO-VLN: "MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation", arXiv, 2023 (Sun Yat-Sen University). [Paper][Code (in construction)][Website]
  • ViNT: "ViNT: A Foundation Model for Visual Navigation", arXiv, 2023 (Berkeley). [Paper][Code (in construction)][Website]
  • A2Nav: "A2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models", arXiv, 2023 (South China University of Technology). [Paper]
  • LangNav: "LangNav: Language as a Perceptual Representation for Navigation", arXiv, 2023 (MIT). [Paper]
  • ?: "Multimodal Large Language Model for Visual Navigation", arXiv, 2023 (Apple). [Paper]
  • VLN-Video: "VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation", AAAI, 2024 (Amazon). [Paper]
  • MemoNav: "MemoNav: Working Memory Model for Visual Navigation", CVPR, 2024 (CAS). [Paper]
  • OVER-NAV: "OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation", CVPR, 2024 (HKU). [Paper]
  • HNR: "Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation", CVPR, 2024 (CAS). [Paper][Code (in construction)]
  • GOAT: "Vision-and-Language Navigation via Causal Learning", CVPR, 2024 (Tongji University). [Paper][PyTorch]
  • MapGPT: "MapGPT: Map-Guided Prompting for Unified Vision-and-Language Navigation", arXiv, 2024 (HKU). [Paper]
  • V-IRL: "V-IRL: Grounding Virtual Intelligence in Real Life", arXiv, 2024 (NYU). [Paper][PyTorch (in construction)][Website]

[Back to Overview]

Other RL Tasks

  • SVEA: "Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation", arXiv, 2021 (UCSD). [Paper][GitHub][Website]
  • LocoTransformer: "Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers", ICLR, 2022 (UCSD). [Paper][Website]
  • STAM: "Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes", CVPR, 2022 (McGill University, Canada). [Paper][PyTorch]
  • CtrlFormer: "CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer", ICML, 2022 (HKU). [Paper][PyTorch][Website]
  • PromptDT: "Prompting Decision Transformer for Few-Shot Policy Generalization", ICML, 2022 (CMU). [Paper][Website] (see the decision-transformer sketch after this list)
  • StARformer: "StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning", ECCV, 2022 (Stony Brook). [Paper][PyTorch]
  • RAD: "Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels", arXiv, 2022 (UBC, Canada). [Paper]
  • MWM: "Masked World Models for Visual Control", arXiv, 2022 (Berkeley). [Paper][Tensorflow][Website]
  • IRIS: "Transformers are Sample Efficient World Models", arXiv, 2022 (University of Geneva, Switzerland). [Paper][PyTorch]
  • InstructRL: "Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models", arXiv, 2022 (Google). [Paper]
  • STG-Transformer: "Learning from Visual Observation via Offline Pretrained State-to-Go Transformer", NeurIPS, 2023 (BAAI). [Paper][Code (in construction)][Website]
  • RL4VLM: "Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning", arXiv, 2024 (Berkeley + NYU). [Paper][PyTorch][Website]
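
Decision-transformer-style agents such as PromptDT and StARformer above build on an interleaved (return-to-go, state, action) token layout: a causal transformer reads ..., R_t, s_t, a_t, ... and predicts the next action from the state token. A minimal sketch under that convention; sizes and the linear state encoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyDecisionTransformer(nn.Module):
    def __init__(self, state_dim=16, num_actions=4, dim=64, horizon=10):
        super().__init__()
        self.embed_rtg = nn.Linear(1, dim)            # return-to-go scalar
        self.embed_state = nn.Linear(state_dim, dim)  # flat state vector
        self.embed_action = nn.Embedding(num_actions, dim)
        self.pos = nn.Parameter(torch.zeros(1, 3 * horizon, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(dim, num_actions)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T) long
        b, t = actions.shape
        tok = torch.stack([self.embed_rtg(rtg),
                           self.embed_state(states),
                           self.embed_action(actions)], dim=2)
        tok = tok.reshape(b, 3 * t, -1) + self.pos[:, : 3 * t]
        causal = torch.triu(torch.full((3 * t, 3 * t), float("-inf")),
                            diagonal=1)            # no peeking at the future
        h = self.backbone(tok, mask=causal)
        return self.action_head(h[:, 1::3])        # predict a_t from s_t token

model = TinyDecisionTransformer()
logits = model(torch.randn(2, 10, 1), torch.randn(2, 10, 16),
               torch.randint(0, 4, (2, 10)))       # (2, 10, num_actions)
```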

[Back to Overview]

Medical

Medical Segmentation

  • Cross-Transformer: "The entire network structure of Crossmodal Transformer", ICBSIP, 2021 (Capital Medical University). [Paper]
  • Segtran: "Medical Image Segmentation using Squeeze-and-Expansion Transformers", IJCAI, 2021 (A*STAR). [Paper]
  • i-ViT: "Instance-based Vision Transformer for Subtyping of Papillary Renal Cell Carcinoma in Histopathological Image", MICCAI, 2021 (Xi'an Jiaotong University). [Paper][PyTorch][Website]
  • UTNet: "UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation", MICCAI, 2021 (Rutgers). [Paper]
  • MCTrans: "Multi-Compound Transformer for Accurate Biomedical Image Segmentation", MICCAI, 2021 (HKU + CUHK). [Paper][Code (in construction)]
  • Polyformer: "Few-Shot Domain Adaptation with Polymorphic Transformers", MICCAI, 2021 (A*STAR). [Paper][PyTorch]
  • BA-Transformer: "Boundary-aware Transformers for Skin Lesion Segmentation", MICCAI, 2021 (Xiamen University). [Paper][PyTorch]
  • GT-U-Net: "GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation", MICCAIW, 2021 (Hangzhou Dianzi University). [Paper][PyTorch]
  • STN: "Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation", ISBI, 2021 (Institut Polytechnique de Paris). [Paper]
  • T-AutoML: "T-AutoML: Automated Machine Learning for Lesion Segmentation Using Transformers in 3D Medical Imaging", ICCV, 2021 (NVIDIA). [Paper]
  • MedT: "Medical Transformer: Gated Axial-Attention for Medical Image Segmentation", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
  • Convolution-Free: "Convolution-Free Medical Image Segmentation using Transformers", arXiv, 2021 (Harvard). [Paper]
  • CoTR: "CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation", arXiv, 2021 (Northwestern Polytechnical University). [Paper][PyTorch]
  • TransBTS: "TransBTS: Multimodal Brain Tumor Segmentation Using Transformer", arXiv, 2021 (University of Science and Technology Beijing). [Paper][PyTorch]
  • SpecTr: "SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation", arXiv, 2021 (East China Normal University). [Paper][Code (in construction)]
  • U-Transformer: "U-Net Transformer: Self and Cross Attention for Medical Image Segmentation", arXiv, 2021 (CEDRIC). [Paper]
  • TransUNet: "TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
  • PMTrans: "Pyramid Medical Transformer for Medical Image Segmentation", arXiv, 2021 (Washington University in St. Louis). [Paper]
  • PBT-Net: "Anatomy-Guided Parallel Bottleneck Transformer Network for Automated Evaluation of Root Canal Therapy", arXiv, 2021 (Hangzhou Dianzi University). [Paper]
  • Swin-Unet: "Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation", arXiv, 2021 (Huawei). [Paper][Code (in construction)]
  • MBT-Net: "A Multi-Branch Hybrid Transformer Networkfor Corneal Endothelial Cell Segmentation", arXiv, 2021 (Southern University of Science and Technology). [Paper]
  • WAD: "More than Encoder: Introducing Transformer Decoder to Upsample", arXiv, 2021 (South China University of Technology). [Paper]
  • LeViT-UNet: "LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation", arXiv, 2021 (Wuhan Institute of Technology). [Paper]
  • ?: "Evaluating Transformer based Semantic Segmentation Networks for Pathological Image Segmentation", arXiv, 2021 (Vanderbilt University). [Paper]
  • nnFormer: "nnFormer: Interleaved Transformer for Volumetric Segmentation", arXiv, 2021 (HKU + Xiamen University). [Paper][PyTorch]
  • MISSFormer: "MISSFormer: An Effective Medical Image Segmentation Transformer", arXiv, 2021 (Beijing University of Posts and Telecommunications). [Paper]
  • TUnet: "Transformer-Unet: Raw Image Processing with Unet", arXiv, 2021 (Beijing Zoezen Robot + Beihang University). [Paper]
  • BiTr-Unet: "BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation", arXiv, 2021 (New York University). [Paper]
  • ?: "Transformer Assisted Convolutional Network for Cell Instance Segmentation", arXiv, 2021 (IIT Dhanbad). [Paper]
  • ?: "Combining CNNs With Transformer for Multimodal 3D MRI Brain Tumor Segmentation With Self-Supervised Pretraining", arXiv, 2021 (Ukrainian Catholic University). [Paper]
  • UNETR: "UNETR: Transformers for 3D Medical Image Segmentation", WACV, 2022 (NVIDIA). [Paper][PyTorch]
  • AFTer-UNet: "AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation", WACV, 2022 (UC Irvine). [Paper]
  • UCTransNet: "UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer", AAAI, 2022 (Northeastern University, China). [Paper][PyTorch]
  • Swin-UNETR: "Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis", CVPR, 2022 (NVIDIA). [Paper][PyTorch]
  • ?: "Transformer-based out-of-distribution detection for clinically safe segmentation", Medical Imaging with Deep Learning (MIDL), 2022 (King’s College London). [Paper]
  • ScaleFormer: "ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation", IJCAI, 2022 (Zhejiang University). [Paper][Code (in construction)]
  • FCBFormer: "FCN-Transformer Feature Fusion for Polyp Segmentation", Annual Conference on Medical Image Understanding and Analysis (MIUA), 2022 (University of Central Lancashire, UK). [Paper][PyTorch]
  • UAMT-ViT: "An uncertainty-aware transformer for MRI cardiac semantic segmentation via mean teachers", Medical Image Understanding and Analysis (MIUA), 2022 (Oxford). [Paper][PyTorch]
  • VDFormer: "View-Disentangled Transformer for Brain Lesion Detection", ISBI, 2022 (CUHK). [Paper][PyTorch]
  • TFCNs: "TFCNs: A CNN-Transformer Hybrid Network for Medical Image Segmentation", International Conference on Artificial Neural Networks (ICANN), 2022 (Xiamen University). [Paper][PyTorch (in construction)]
  • MIL: "Transformer based multiple instance learning for weakly supervised histopathology image segmentation", MICCAI, 2022 (Beihang University). [Paper]
  • mmFormer: "mmFormer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation", MICCAI, 2022 (CAS). [Paper][PyTorch]
  • Patcher: "Patcher: Patch Transformers with Mixture of Experts for Precise Medical Image Segmentation", MICCAI, 2022 (Pennsylvania State University). [Paper]
  • NestedFormer: "NestedFormer: Nested Modality-Aware Transformer for Brain Tumor Segmentation", MICCAI, 2022 (Tianjin University). [Paper][Code (in construction)]
  • TransDeepLab: "TransDeepLab: Convolution-Free Transformer-based DeepLab v3+ for Medical Image Segmentation", MICCAIW, 2022 (RWTH Aachen University, Germany). [Paper][PyTorch]
  • CESSViT: "Computationally-Efficient Vision Transformer for Medical Image Semantic Segmentation via Dual Pseudo-Label Supervision", ICIP, 2022 (Oxford). [Paper][PyTorch]
  • S4CVNet: "When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class Medical Image Semantic Segmentation", ECCVW, 2022 (Oxford). [Paper][PyTorch]
  • Video-TransUNet: "Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation", International Conference on Machine Vision (ICMV), 2022 (University of Bristol, UK). [Paper]
  • TransResNet: "TransResNet: Integrating the Strengths of ViTs and CNNs for High Resolution Medical Image Segmentation via Feature Grafting", BMVC, 2022 (MBZUAI). [Paper]
  • CAAViT: "Adversarial Vision Transformer for Medical Image Semantic Segmentation with Limited Annotations", BMVC, 2022 (Oxford). [Paper][PyTorch][Supp]
  • CASTformer: "Class-Aware Adversarial Transformers for Medical Image Segmentation", NeurIPS, 2022 (Yale). [Paper]
  • TransNorm: "TransNorm: Transformer Provides a Strong Spatial Normalization Mechanism for a Deep Segmentation Model", IEEE Access, 2022 (Aachen University, Germany). [Paper][PyTorch]
  • Tempera: "Tempera: Spatial Transformer Feature Pyramid Network for Cardiac MRI Segmentation", arXiv, 2022 (ICL). [Paper]
  • UTNetV2: "A Multi-scale Transformer for Medical Image Segmentation: Architectures, Model Efficiency, and Benchmarks", arXiv, 2022 (Rutgers). [Paper]
  • UNesT: "Characterizing Renal Structures with 3D Block Aggregate Transformers", arXiv, 2022 (Vanderbilt University, Tennessee). [Paper]
  • PHTrans: "PHTrans: Parallelly Aggregating Global and Local Representations for Medical Image Segmentation", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
  • UNeXt: "UNeXt: MLP-based Rapid Medical Image Segmentation Network", arXiv, 2022 (JHU). [Paper][PyTorch]
  • TransFusion: "TransFusion: Multi-view Divergent Fusion for Medical Image Segmentation with Transformers", arXiv, 2022 (Rutgers). [Paper]
  • UNetFormer: "UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation", arXiv, 2022 (NVIDIA). [Paper][GitHub]
  • 3D-Shuffle-Mixer: "3D Shuffle-Mixer: An Efficient Context-Aware Vision Learner of Transformer-MLP Paradigm for Dense Prediction in Medical Volume", arXiv, 2022 (Xi'an Jiaotong University). [Paper]
  • ?: "Continual Hippocampus Segmentation with Transformers", arXiv, 2022 (Technical University of Darmstadt, Germany). [Paper]
  • TranSiam: "TranSiam: Fusing Multimodal Visual Features Using Transformer for Medical Image Segmentation", arXiv, 2022 (Tianjin University). [Paper]
  • ColonFormer: "ColonFormer: An Efficient Transformer based Method for Colon Polyp Segmentation", arXiv, 2022 (Hanoi University of Science and Technology). [Paper]
  • ?: "Transformer based Generative Adversarial Network for Liver Segmentation", arXiv, 2022 (Northwestern University). [Paper]
  • FCT: "The Fully Convolutional Transformer for Medical Image Segmentation", arXiv, 2022 (University of Glasgow, UK). [Paper]
  • XBound-Former: "XBound-Former: Toward Cross-scale Boundary Modeling in Transformers", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
  • Polyp-PVT: "Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers", arXiv, 2022 (IIAI). [Paper][PyTorch]
  • SeATrans: "SeATrans: Learning Segmentation-Assisted diagnosis model via Transformer", arXiv, 2022 (Baidu). [Paper]
  • TransResU-Net: "TransResU-Net: Transformer based ResU-Net for Real-Time Colonoscopy Polyp Segmentation", arXiv, 2022 (Indira Gandhi National Open University). [Paper][Code (in construction)]
  • LViT: "LViT: Language meets Vision Transformer in Medical Image Segmentation", arXiv, 2022 (Alibaba). [Paper][Code (in construction)]
  • APFormer: "The Lighter The Better: Rethinking Transformers in Medical Image Segmentation Through Adaptive Pruning", arXiv, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • ?: "Transformer based Models for Unsupervised Anomaly Segmentation in Brain MR Images", arXiv, 2022 (University of Rennes, France). [Paper][Tensorflow]
  • CKD-TransBTS: "CKD-TransBTS: Clinical Knowledge-Driven Hybrid Transformer with Modality-Correlated Cross-Attention for Brain Tumor Segmentation", arXiv, 2022 (South China University of Technology). [Paper]
  • ?: "Contextual Attention Network: Transformer Meets U-Net", arXiv, 2022 (RWTH Aachen University). [Paper][PyTorch]
  • HRSTNet: "High-Resolution Swin Transformer for Automatic Medical Image Segmentation", arXiv, 2022 (Xi'an University of Posts and Telecommunications). [Paper][Code (in construction)]
  • CM-MLP: "CM-MLP: Cascade Multi-scale MLP with Axial Context Relation Encoder for Edge Segmentation of Medical Image", arXiv, 2022 (Zhengzhou University). [Paper]
  • CATS: "CATS: Complementary CNN and Transformer Encoders for Segmentation", arXiv, 2022 (Vanderbilt University, Nashville). [Paper]
  • TFusion: "TFusion: Transformer based N-to-One Multimodal Fusion Block", arXiv, 2022 (South China University of Technology). [Paper]
  • AutoPET: "AutoPET Challenge: Combining nn-Unet with Swin UNETR Augmented by Maximum Intensity Projection Classifier", arXiv, 2022 (University Hospital Essen, Germany). [Paper]
  • SPAN: "Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers", arXiv, 2022 (Berkeley). [Paper]
  • TMSS: "TMSS: An End-to-End Transformer-based Multimodal Network for Segmentation and Survival Prediction", arXiv, 2022 (MBZUAI). [Paper]
  • CR-Swin2-VT: "Hybrid Window Attention Based Transformer Architecture for Brain Tumor Segmentation", arXiv, 2022 (Monash University). [Paper][PyTorch]
  • FocalUNETR: "FocalUNETR: A Focal Transformer for Boundary-aware Segmentation of CT Images", arXiv, 2022 (Wayne State University, Detroit). [Paper]
  • LAPFormer: "LAPFormer: A Light and Accurate Polyp Segmentation Transformer", arXiv, 2022 (Sun* Inc, Hanoi). [Paper]
  • FINE: "Memory transformers for full context and high-resolution 3D Medical Segmentation", arXiv, 2022 (National Conservatory of Arts and Crafts, France). [Paper]
  • ConvTransSeg: "ConvTransSeg: A Multi-resolution Convolution-Transformer Network for Medical Image Segmentation", arXiv, 2022 (University of Nottingham, UK). [Paper]
  • CS-Unet: "Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation", arXiv, 2022 (University of Glasgow, UK). [Paper]
  • UNETR++: "UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
  • HiFormer: "HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation", WACV, 2023 (Iran University of Science and Technology). [Paper][PyTorch]
  • Att-SwinU-Net: "Attention Swin U-Net: Cross-Contextual Attention Mechanism for Skin Lesion Segmentation", ISBI, 2023 (Shahid Beheshti University, Iran). [Paper][PyTorch]
  • 3DUX-Net: "3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation", ICLR, 2023 (Vanderbilt University). [Paper][PyTorch]
  • ?: "Devil is in the Queries: Advancing Mask Transformers for Real-world Medical Image Segmentation and Out-of-Distribution Localization", CVPR, 2023 (Alibaba). [Paper]
  • CVM: "Weakly supervised segmentation with point annotations for histopathology images via contrast-based variational model", CVPR, 2023 (University of Liverpool, UK). [Paper]
  • MAESTER: "MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition", CVPR, 2023 (University of Toronto). [Paper]
  • Universal-Model: "CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection", ICCV, 2023 (JHU). [Paper][PyTorch]
  • MDViT: "MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets", MICCAI, 2023 (UBC). [Paper][PyTorch]
  • ConvFormer: "ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation", MICCAI, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • TP-SIS: "Text Promptable Surgical Instrument Segmentation with Vision-Language Models", NeurIPS, 2023 (King's College London). [Paper][PyTorch]
  • UniSeg: "UniSeg: A Prompt-driven Universal Segmentation Model as well as A Strong Representation Learner", arXiv, 2023 (Northwestern Polytechnical University, China). [Paper][PyTorch (in construction)]
  • UniverSeg: "UniverSeg: Universal Medical Image Segmentation", arXiv, 2023 (MIT). [Paper][PyTorch][Website]
  • 3DSAM-adapter: "3DSAM-adapter: Holistic Adaptation of SAM from 2D to 3D for Promptable Medical Image Segmentation", arXiv, 2023 (CUHK). [Paper]
  • CMCL: "Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training", arXiv, 2023 (NVIDIA). [Paper]
  • AdaptiveSAM: "AdaptiveSAM: Towards Efficient Tuning of SAM for Surgical Scene Segmentation", arXiv, 2023 (JHU). [Paper][PyTorch]
  • SAM-Med2D: "SAM-Med2D", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • SAM-Med3D: "SAM-Med3D", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • H-SAM: "Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding", CVPR, 2024 (East China Normal University). [Paper][PyTorch]
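
Most of the segmentation entries above instantiate one recurring pattern: a transformer encoder over image patches feeding a convolutional upsampling decoder (the TransUNet/UNETR family). Below is a deliberately minimal sketch of that pattern; `ViTSegSketch` and all sizes are invented for illustration and do not reproduce any listed paper's implementation.

```python
# Minimal sketch (illustrative only): ViT encoder over patches + conv decoder,
# the hybrid layout shared by many TransUNet/UNETR-style segmenters.
import torch
import torch.nn as nn

class ViTSegSketch(nn.Module):
    def __init__(self, img=224, patch=16, dim=256, depth=4, heads=8, classes=2):
        super().__init__()
        self.grid = img // patch                               # tokens per side
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)    # patchify
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.Sequential(                          # 16x upsampling head
            nn.ConvTranspose2d(dim, dim // 2, 2, 2), nn.ReLU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, 2, 2), nn.ReLU(),
            nn.ConvTranspose2d(dim // 4, dim // 8, 2, 2), nn.ReLU(),
            nn.ConvTranspose2d(dim // 8, classes, 2, 2),
        )

    def forward(self, x):                                      # x: (B, 3, H, W)
        t = self.embed(x).flatten(2).transpose(1, 2)           # (B, N, dim)
        t = self.encoder(t + self.pos)
        f = t.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
        return self.decoder(f)                                 # (B, classes, H, W)

mask_logits = ViTSegSketch()(torch.randn(1, 3, 224, 224))      # -> (1, 2, 224, 224)
```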

[Back to Overview]

Medical Classification

  • COVID19T: "A Transformer-Based Framework for Automatic COVID19 Diagnosis in Chest CTs", ICCVW, 2021 (?). [Paper][PyTorch]
  • TransMIL: "TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification", NeurIPS, 2021 (Tsinghua University). [Paper][PyTorch]
  • TransMed: "TransMed: Transformers Advance Multi-modal Medical Image Classification", arXiv, 2021 (Northeastern University). [Paper]
  • CXR-ViT: "Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification", arXiv, 2021 (KAIST). [Paper]
  • ViT-TSA: "Shoulder Implant X-Ray Manufacturer Classification: Exploring with Vision Transformer", arXiv, 2021 (Queen’s University). [Paper]
  • GasHis-Transformer: "GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification", arXiv, 2021 (Northeastern University). [Paper]
  • POCFormer: "POCFormer: A Lightweight Transformer Architecture for Detection of COVID-19 Using Point of Care Ultrasound", arXiv, 2021 (The Ohio State University). [Paper]
  • COVID-ViT: "COVID-VIT: Classification of COVID-19 from CT chest images based on vision transformer models", arXiv, 2021 (Middlesex University, UK). [Paper][PyTorch]
  • EEG-ConvTransformer: "EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification", arXiv, 2021 (IIT Ropar). [Paper]
  • CCAT: "Visual Transformer with Statistical Test for COVID-19 Classification", arXiv, 2021 (NCKU). [Paper]
  • M3T: "M3T: Three-Dimensional Medical Image Classifier Using Multi-Plane and Multi-Slice Transformer", CVPR, 2022 (Yonsei University). [Paper]
  • ?: "A comparative study between vision transformers and CNNs in digital pathology", CVPRW, 2022 (Roche, Switzerland). [Paper]
  • SCT: "Context-Aware Transformers For Spinal Cancer Detection and Radiological Grading", MICCAI, 2022 (Oxford). [Paper]
  • KAT: "Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification", MICCAI, 2022 (Beihang University). [Paper][PyTorch]
  • SEViT: "Self-Ensembling Vision Transformer (SEViT) for Robust Medical Image Classification", MICCAI, 2022 (MBZUAI). [Paper][PyTorch]
  • MF-ViT: "Multi-Feature Vision Transformer via Self-Supervised Representation Learning for Improvement of COVID-19 Diagnosis", MICCAIW, 2022 (Rutgers University). [Paper][PyTorch]
  • SB-SSL: "SB-SSL: Slice-Based Self-Supervised Transformers for Knee Abnormality Classification from MRI", MICCAIW, 2022 (University of Surrey, UK). [Paper]
  • RadioTransformer: "RadioTransformer: A Cascaded Global-Focal Transformer for Visual Attention-guided Disease Classification", ECCV, 2022 (Stony Brook). [Paper][Tensorflow (in construction)]
  • ScoreNet: "ScoreNet: Learning Non-Uniform Attention and Augmentation for Transformer-Based Histopathological Image Classification", arXiv, 2022 (EPFL). [Paper]
  • LA-MIL: "Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction", arXiv, 2022 (TUM). [Paper]
  • HoVer-Trans: "HoVer-Trans: Anatomy-aware HoVer-Transformer for ROI-free Breast Cancer Diagnosis in Ultrasound Images", arXiv, 2022 (South China University of Technology). [Paper]
  • GTP: "A graph-transformer for whole slide image classification", IEEE Transactions on Medical Imaging (TMI), 2022 (Boston University). [Paper][PyTorch]
  • ?: "Zero-Shot and Few-Shot Learning for Lung Cancer Multi-Label Classification using Vision Transformer", arXiv, 2022 (Harvard). [Paper]
  • SwinCheX: "SwinCheX: Multi-label classification on chest X-ray images with transformers", arXiv, 2022 (Sharif University of Technology, Iran). [Paper]
  • SGT: "Rectify ViT Shortcut Learning by Visual Saliency", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
  • IPMN-ViT: "Neural Transformers for Intraductal Papillary Mucosal Neoplasms (IPMN) Classification in MRI images", arXiv, 2022 (University of Catania, Italy). [Paper]
  • ?: "Multi-Label Retinal Disease Classification using Transformers", arXiv, 2022 (Khalifa University, UAE). [Paper][PyTorch]
  • TractoFormer: "TractoFormer: A Novel Fiber-level Whole Brain Tractography Analysis Framework Using Spectral Embedding and Vision Transformers", arXiv, 2022 (Harvard). [Paper]
  • BrainFormer: "BrainFormer: A Hybrid CNN-Transformer Model for Brain fMRI Data Classification", arXiv, 2022 (Chinese PLA General Hospital). [Paper]
  • SI-ViT: "Shuffle Instances-based Vision Transformer for Pancreatic Cancer ROSE Image Classification", arXiv, 2022 (Beihang University). [Paper][PyTorch]
  • IPS: "Iterative Patch Selection for High-Resolution Image Recognition", ICLR, 2023 (Hasso Plattner Institute, Germany). [Paper]
  • ILRA-MIL: "Exploring Low-Rank Property in Multiple Instance Learning for Whole Slide Image Classification", ICLR, 2023 (Tencent). [Paper]
  • BolT: "BolT: Fused window transformers for fMRI time series analysis", Medical Image Analysis, 2023 (Bilkent University). [Paper][PyTorch]
  • TOP: "The Rise of AI Language Pathologists: Exploring Two-level Prompt Learning for Few-shot Weakly-supervised Whole Slide Image Classification", NeurIPS, 2023 (Fudan). [Paper][Code (in construction)]
  • DreaMR: "DreaMR: Diffusion-driven Counterfactual Explanation for Functional MRI", arXiv, 2023 (Bilkent University). [Paper][PyTorch]
  • LongViT: "When an Image is Worth 1,024 x 1,024 Words: A Case Study in Computational Pathology", arXiv, 2023 (Microsoft). [Paper][PyTorch]
  • FiVE: "Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction", CVPR, 2024 (Xiamen University). [Paper]
  • FocusMAE: "FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders", CVPR, 2024 (IIT Delhi). [Paper][Code (in construction)]
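
A large fraction of the classification entries above (TransMIL, KAT, GTP, ILRA-MIL, ...) are whole-slide multiple-instance learners: pre-extracted patch features become tokens and a transformer aggregates them into one bag-level prediction. The sketch below shows that aggregation step in its simplest form, assuming frozen patch features; `MILTransformerSketch` is a made-up name and the design is illustrative only.

```python
# Minimal sketch of transformer-based multiple-instance learning: patch
# features are tokens, a learnable class token aggregates them through
# self-attention, and only the bag (slide) label supervises the output.
import torch
import torch.nn as nn

class MILTransformerSketch(nn.Module):
    def __init__(self, feat_dim=1024, dim=256, heads=8, depth=2, classes=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)           # project frozen patch features
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, bag):                            # bag: (B, N_instances, feat_dim)
        x = self.proj(bag)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))   # class token attends to instances
        return self.head(x[:, 0])                      # bag-level logits

logits = MILTransformerSketch()(torch.randn(2, 500, 1024))  # -> (2, 2)
```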

[Back to Overview]

Medical Detection

  • COTR: "COTR: Convolution in Transformer Network for End to End Polyp Detection", arXiv, 2021 (Fuzhou University). [Paper]
  • TR-Net: "Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries", arXiv, 2021 (Harbin Institute of Technology). [Paper]
  • CAE-Transformer: "CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans", arXiv, 2021 (Concordia University, Canada). [Paper]
  • SwinFPN: "SwinFPN: Leveraging Vision Transformers for 3D Organs-At-Risk Detection", MIDL, 2022 (TUM). [Paper][PyTorch]
  • DATR: "DATR: Domain-adaptive transformer for multi-domain landmark detection", arXiv, 2022 (CAS). [Paper]
  • SATr: "SATr: Slice Attention with Transformer for Universal Lesion Detection", arXiv, 2022 (CAS). [Paper]
  • AC-Former: "Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
  • PGT: "Prompt-based Grouping Transformer for Nucleus Detection and Classification", MICCAI, 2023 (Sun Yat-sen University). [Paper][PyTorch]
  • Focused-Decoder: "Focused Decoding Enables 3D Anatomical Detection by Transformers", MELBA, 2023 (University of Zurich). [Paper][PyTorch][Website]
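
Several of the detection entries above adopt DETR-style decoding: a fixed set of learnable object queries cross-attends to image features, and each query is read out as a class score plus a box. A minimal sketch of that query-decoding step follows (Hungarian matching and the training losses are omitted); `QueryDetSketch` is a hypothetical name, not any listed model.

```python
# Minimal sketch of DETR-style query decoding: learnable queries attend to
# image feature tokens; each query yields class logits and a normalized box.
import torch
import torch.nn as nn

class QueryDetSketch(nn.Module):
    def __init__(self, dim=256, heads=8, depth=2, n_queries=20, classes=3):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(1, n_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)
        self.cls = nn.Linear(dim, classes + 1)         # +1 for "no object"
        self.box = nn.Linear(dim, 4)                   # (cx, cy, w, h) in [0, 1]

    def forward(self, feats):                          # feats: (B, N, dim)
        q = self.decoder(self.queries.expand(feats.size(0), -1, -1), feats)
        return self.cls(q), self.box(q).sigmoid()

scores, boxes = QueryDetSketch()(torch.randn(1, 196, 256))  # (1, 20, 4) each here
```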

[Back to Overview]

Medical Reconstruction

  • T2Net: "Task Transformer Network for Joint MRI Reconstruction and Super-Resolution", MICCAI, 2021 (Harbin Institute of Technology). [Paper][PyTorch]
  • FIT: "Fourier Image Transformer", arXiv, 2021 (MPI). [Paper][PyTorch]
  • SLATER: "Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers", arXiv, 2021 (Bilkent University). [Paper]
  • MTrans: "MTrans: Multi-Modal Transformer for Accelerated MR Imaging", arXiv, 2021 (Harbin Institute of Technology). [Paper][PyTorch]
  • SDAUT: "Swin Deformable Attention U-Net Transformer (SDAUT) for Explainable Fast MRI", MICCAI, 2022 (ICL). [Paper]
  • ?: "Adaptively Re-weighting Multi-Loss Untrained Transformer for Sparse-View Cone-Beam CT Reconstruction", arXiv, 2022 (Zhejiang Lab). [Paper]
  • K-Space-Transformer: "K-Space Transformer for Fast MRI Reconstruction with Implicit Representation", arXiv, 2022 (Shanghai Jiao Tong University). [Paper][Code (in construction)][Website]
  • McSTRA: "Multi-head Cascaded Swin Transformers with Attention to k-space Sampling Pattern for Accelerated MRI Reconstruction", arXiv, 2022 (Monash University, Australia). [Paper]
  • ?: "Colonoscopy Landmark Detection using Vision Transformers", arXiv, 2022 (Intuitive Surgical, CA). [Paper]
  • FedPR: "Learning Federated Visual Prompt in Null Space for MRI Reconstruction", CVPR, 2023 (A*STAR). [Paper][PyTorch]
  • ?: "Contrast, Attend and Diffuse to Decode High-Resolution Images from Brain Activities", NeurIPS, 2023 (KU Leuven). [Paper][PyTorch]
  • ?: "Brain encoding models based on multimodal transformers can transfer across language and vision", NeurIPS, 2023 (UT Austin). [Paper]
  • MinD-Video: "Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity", NeurIPS, 2023 (NUS). [Paper][PyTorch][Website]
  • SAX-NeRF: "Structure-Aware Sparse-View X-ray 3D Reconstruction", CVPR, 2024 (JHU). [Paper][PyTorch]
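
Many reconstruction entries (e.g., the SLATER and K-Space Transformer lines of work) apply attention in the measurement domain: undersampled k-space is tokenized, refined by self-attention, and mapped back to image space. The sketch below is a heavily simplified illustration of that idea, treating each k-space row as one token; real methods add data-consistency terms and operate on complex multi-coil data, and `KSpaceSketch` is an invented name.

```python
# Minimal sketch: k-space rows as tokens -> transformer -> inverse FFT image.
import torch
import torch.nn as nn

class KSpaceSketch(nn.Module):
    def __init__(self, width=64, dim=256, heads=8, depth=4):
        super().__init__()
        self.proj_in = nn.Linear(width * 2, dim)       # real+imag per k-space row
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.proj_out = nn.Linear(dim, width * 2)

    def forward(self, kspace):                         # kspace: (B, H, W) complex
        B, H, W = kspace.shape
        rows = torch.view_as_real(kspace).reshape(B, H, W * 2)
        rows = self.proj_out(self.encoder(self.proj_in(rows)))
        k = torch.view_as_complex(rows.reshape(B, H, W, 2).contiguous())
        return torch.fft.ifft2(k).abs()                # magnitude image (B, H, W)

img = KSpaceSketch()(torch.fft.fft2(torch.randn(1, 64, 64)))  # -> (1, 64, 64)
```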

[Back to Overview]

Medical Low-Level Vision

  • Eformer: "Eformer: Edge Enhancement based Transformer for Medical Image Denoising", ICCV, 2021 (BITS Pilani, India). [Paper]
  • PTNet: "PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer", arXiv, 2021 (Columbia). [Paper]
  • ResViT: "ResViT: Residual vision transformers for multi-modal medical image synthesis", arXiv, 2021 (Bilkent University, Turkey). [Paper]
  • CyTran: "CyTran: Cycle-Consistent Transformers for Non-Contrast to Contrast CT Translation", arXiv, 2021 (University Politehnica of Bucharest, Romania). [Paper][PyTorch]
  • McMRSR: "Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution", CVPR, 2022 (Yantai University, China). [Paper][PyTorch]
  • RPLHR-CT: "RPLHR-CT Dataset and Transformer Baseline for Volumetric Super-Resolution from CT Scans", MICCAI, 2022 (Infervision Medical Technology, China). [Paper][Code (in construction)]
  • W-G2L-ART: "Wide Range MRI Artifact Removal with Transformers", BMVC, 2022 (KTH). [Paper]
  • RFormer: "RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark", arXiv, 2022 (Tsinghua). [Paper]
  • CTformer: "CTformer: Convolution-free Token2Token Dilated Vision Transformer for Low-dose CT Denoising", arXiv, 2022 (UMass Lowell). [Paper][PyTorch]
  • Cohf-T: "Cross-Modality High-Frequency Transformer for MR Image Super-Resolution", arXiv, 2022 (Xidian University). [Paper]
  • SIST: "Low-Dose CT Denoising via Sinogram Inner-Structure Transformer", arXiv, 2022 (?). [Paper]
  • Spach-Transformer: "Spach Transformer: Spatial and Channel-wise Transformer Based on Local and Global Self-attentions for PET Image Denoising", arXiv, 2022 (Harvard). [Paper]
  • ConvFormer: "ConvFormer: Combining CNN and Transformer for Medical Image Segmentation", arXiv, 2022 (University of Notre Dame). [Paper]
  • ?: "Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion using Transformers", ICCV, 2023 (Durham University, UK). [Paper]

[Back to Overview]

Medical Vision-Language

  • CGT: "Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation", CVPR, 2022 (University of Technology Sydney). [Paper]
  • MCGN: "A Medical Semantic-Assisted Transformer for Radiographic Report Generation", MICCAI, 2022 (University of Sydney). [Paper]
  • M3AE: "Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training", MICCAI, 2022 (CUHK). [Paper][PyTorch]
  • BioViL: "Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing", ECCV, 2022 (Microsoft). [Paper][Code]
  • MGCA: "Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning", NeurIPS, 2022 (HKU). [Paper]
  • MedCLIP: "MedCLIP: Contrastive Learning from Unpaired Medical Images and Text", EMNLP, 2022 (UIUC). [Paper][PyTorch]
  • MDBERT: "Hierarchical BERT for Medical Document Understanding", arXiv, 2022 (IQVIA, NC). [Paper]
  • Surgical-VQA: "Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer", arXiv, 2022 (NUS). [Paper][PyTorch (in construction)]
  • SwinMLP-TranCAP: "Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches", arXiv, 2022 (CUHK). [Paper][PyTorch]
  • SAT: "Medical Image Captioning via Generative Pretrained Transformers", arXiv, 2022 (Philips Innovation Labs Rus, Russia). [Paper]
  • RepsNet: "RepsNet: Combining Vision with Language for Automated Medical Reports", arXiv, 2022 (Google). [Paper][Website]
  • MF2-MVQA: "MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering", arXiv, 2022 (University of Science and Technology Beijing). [Paper]
  • RoentGen: "RoentGen: Vision-Language Foundation Model for Chest X-ray Generation", arXiv, 2022 (Stanford). [Paper]
  • ?: "Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study", ICLR, 2023 (Sichuan University). [Paper]
  • METransformer: "METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens", CVPR, 2023 (University of Sydney). [Paper]
  • MI-Zero: "Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images", CVPR, 2023 (Harvard). [Paper]
  • KiUT: "KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation", CVPR, 2023 (Shanghai AI Lab). [Paper]
  • BioViL-T: "Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing", CVPR, 2023 (Microsoft). [Paper]
  • ?: "Evidential Interactive Learning for Medical Image Captioning", ICML, 2023 (Rochester Institute of Technology, NY). [Paper]
  • PRIOR: "PRIOR: Prototype Representation Joint Learning from Medical Images and Reports", ICCV, 2023 (Southern University of Science and Technology). [Paper][Code (in construction)]
  • MedKLIP: "MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
  • PTUnifier: "Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts", ICCV, 2023 (CUHK). [Paper][PyTorch]
  • ?: "Localized Questions in Medical Visual Question Answering", MICCAI, 2023 (University of Bern, Switzerland). [Paper]
  • CXR-CLIP: "CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training", MICCAI, 2023 (Kakao). [Paper]
  • LLaVA-Med: "LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day", NeurIPS (Datasets and Benchmarks), 2023 (Microsoft). [Paper][PyTorch]
  • Med-UniC: "Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias", NeurIPS, 2023 (OSU). [Paper][PyTorch]
  • EHRXQA: "EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images", NeurIPS (Datasets and Benchmarks), 2023 (KAIST). [Paper][Code]
  • Quilt: "Quilt-1M: One Million Image-Text Pairs for Histopathology", NeurIPS, 2023 (UW). [Paper]
  • RAMM: "RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training", arXiv, 2023 (Alibaba). [Paper]
  • PT: "Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models", arXiv, 2023 (University of Amsterdam). [Paper]
  • PMC-CLIP: "PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • Q2ATransformer: "Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder", arXiv, 2023 (The University of Sydney). [Paper]
  • PMC-VQA: "PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)][Website]
  • MedBLIP: "MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
  • GTGM: "Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation", arXiv, 2023 (USTC). [Paper]
  • XrayGPT: "XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models", arXiv, 2023 (MBZUAI). [Paper][PyTorch]
  • CONCH: "Towards a Visual-Language Foundation Model for Computational Pathology", arXiv, 2023 (Harvard). [Paper]
  • Med-Flamingo: "Med-Flamingo: a Multimodal Medical Few-shot Learner", arXiv, 2023 (Stanford). [Paper][PyTorch]
  • ?: "Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis", arXiv, 2023 (Shanghai AI Lab). [Paper][GitHub]
  • CLIP-MUSED: "CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding", ICLR, 2024 (CAS). [Paper][PyTorch]
  • MAVL: "Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Matching Framework", CVPR, 2024 (University of Adelaide). [Paper][PyTorch]
  • FairCLIP: "FairCLIP: Harnessing Fairness in Vision-Language Learning", CVPR, 2024 (Harvard). [Paper]
  • RAD-DINO: "RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision", arXiv, 2024 (Microsoft). [Paper]
  • Med-Gemini: "Advancing Multimodal Medical Capabilities of Gemini", arXiv, 2024 (Google). [Paper]
  • EVA-X: "EVA-X: A Foundation Model for General Chest X-ray Analysis with Self-supervised Learning", arXiv, 2024 (Huazhong University of Science and Technology). [Paper][PyTorch]
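
Most of the pretraining entries above (MedCLIP, PMC-CLIP, CXR-CLIP, ...) optimize some variant of the symmetric image-text contrastive objective popularized by CLIP. For reference, here is a minimal sketch of that loss, assuming a batch of paired image/text embeddings (an assumption that unpaired variants such as MedCLIP deliberately relax); `img_emb` and `txt_emb` stand in for arbitrary encoder outputs.

```python
# Minimal sketch of the symmetric CLIP-style contrastive loss: matched
# image/text pairs sit on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)                 # (B, D)
    txt = F.normalize(txt_emb, dim=-1)                 # (B, D)
    logits = img @ txt.t() / temperature               # (B, B) similarities
    target = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, target) +          # image -> text
            F.cross_entropy(logits.t(), target)) / 2   # text -> image

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```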

[Back to Overview]

Medical Others

  • LAT: "Lesion-Aware Transformers for Diabetic Retinopathy Grading", CVPR, 2021 (USTC). [Paper]
  • UVT: "Ultrasound Video Transformers for Cardiac Ejection Fraction Estimation", MICCAI, 2021 (ICL). [Paper][PyTorch]
  • ?: "Surgical Instruction Generation with Transformers", MICCAI, 2021 (Bournemouth University, UK). [Paper]
  • AlignTransformer: "AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation", MICCAI, 2021 (Peking University). [Paper]
  • MCAT: "Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images", ICCV, 2021 (Harvard). [Paper][PyTorch]
  • ?: "Is it Time to Replace CNNs with Transformers for Medical Images?", ICCVW, 2021 (KTH, Sweden). [Paper]
  • HAT-Net: "HAT-Net: A Hierarchical Transformer Graph Neural Network for Grading of Colorectal Cancer Histology Images", BMVC, 2021 (Beijing University of Posts and Telecommunications). [Paper]
  • ?: "Federated Split Vision Transformer for COVID-19 CXR Diagnosis using Task-Agnostic Training", NeurIPS, 2021 (KAIST). [Paper]
  • ViT-Path: "Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology", NeurIPSW, 2021 (Microsoft). [Paper]
  • Global-Local-Transformer: "Global-Local Transformer for Brain Age Estimation", IEEE Transactions on Medical Imaging, 2021 (Harvard). [Paper][PyTorch]
  • CE-TFE: "Deep Transformers for Fast Small Intestine Grounding in Capsule Endoscope Video", arXiv, 2021 (Sun Yat-Sen University). [Paper]
  • DeepProg: "DeepProg: A Transformer-based Framework for Predicting Disease Prognosis", arXiv, 2021 (University of Oulu). [Paper]
  • Medical-Transformer: "Medical Transformer: Universal Brain Encoder for 3D MRI Analysis", arXiv, 2021 (Korea University). [Paper]
  • RATCHET: "RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting", arXiv, 2021 (ICL). [Paper]
  • C2FViT: "Affine Medical Image Registration with Coarse-to-Fine Vision Transformer", CVPR, 2022 (HKUST). [Paper][Code (in construction)]
  • HIPT: "Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning", CVPR, 2022 (Harvard). [Paper]
  • SiT: "Surface Analysis with Vision Transformers", CVPRW, 2022 (King’s College London, UK). [Paper][PyTorch]
  • SiT: "Surface Vision Transformers: Attention-Based Modelling applied to Cortical Analysis", Medical Imaging with Deep Learning (MIDL), 2022 (King’s College London, UK). [Paper]
  • ViT-V-Net: "ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration", ICML, 2022 (JHU). [Paper][PyTorch]
  • HybridStereoNet: "Deep Laparoscopic Stereo Matching with Transformers", MICCAI, 2022 (Monash University, Australia). [Paper][PyTorch]
  • BabyNet: "BabyNet: Residual Transformer Module for Birth Weight Prediction on Fetal Ultrasound Video", MICCAI, 2022 (Sano Centre for Computational Medicine, Poland). [Paper][PyTorch]
  • TLT: "Transformer Lesion Tracker", MICCAI, 2022 (InferVision Medical Technology, China). [Paper]
  • XMorpher: "XMorpher: Full Transformer for Deformable Medical Image Registration via Cross Attention", MICCAI, 2022 (Southeast University, China). [Paper][PyTorch]
  • SVoRT: "SVoRT: Iterative Transformer for Slice-to-Volume Registration in Fetal Brain MRI", MICCAI, 2022 (MIT). [Paper]
  • GaitForeMer: "GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation", MICCAI, 2022 (Stanford). [Paper][PyTorch]
  • LKU-Net: "U-Net vs Transformer: Is U-Net Outdated in Medical Image Registration?", MICCAIW, 2022 (University of Birmingham, UK). [Paper]
  • LVOT: "Shifted Windows Transformers for Medical Image Quality Assessment", MICCAIW, 2022 (Istanbul Technical University, Turkey). [Paper]
  • MINiT: "Multiple Instance Neuroimage Transformer", MICCAIW, 2022 (Stanford). [Paper][Code (in construction)]
  • BrainNetTF: "Brain Network Transformer", NeurIPS, 2022 (Emory University). [Paper][PyTorch]
  • SiT: "Surface Vision Transformers: Flexible Attention-Based Modelling of Biomedical Surfaces", arXiv, 2022 (King’s College London, UK). [Paper][PyTorch]
  • TransMorph: "TransMorph: Transformer for unsupervised medical image registration", arXiv, 2022 (JHU). [Paper]
  • SymTrans: "Symmetric Transformer-based Network for Unsupervised Image Registration", arXiv, 2022 (Jilin University). [Paper]
  • MMT: "One Model to Synthesize Them All: Multi-contrast Multi-scale Transformer for Missing Data Imputation", arXiv, 2022 (JHU). [Paper]
  • EG-ViT: "Eye-gaze-guided Vision Transformer for Rectifying Shortcut Learning", arXiv, 2022 (Northwestern Polytechnical University). [Paper]
  • CSM: "Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection", arXiv, 2022 (University of Adelaide, Australia). [Paper]
  • CASHformer: "CASHformer: Cognition Aware SHape Transformer for Longitudinal Analysis", arXiv, 2022 (TUM). [Paper]
  • ARST: "ARST: Auto-Regressive Surgical Transformer for Phase Recognition from Laparoscopic Videos", arXiv, 2022 (Shanghai Jiao Tong University). [Paper]
  • SSiT: "SSiT: Saliency-guided Self-supervised Image Transformer for Diabetic Retinopathy Grading", arXiv, 2022 (Southern University of Science and Technology, China). [Paper][Code (in construction)]
  • MulGT: "MulGT: Multi-task Graph-Transformer with Task-aware Knowledge Injection and Domain Knowledge-driven Pooling for Whole Slide Image Analysis", AAAI, 2023 (HKU). [Paper]
  • HVTSurv: "HVTSurv: Hierarchical Vision Transformer for Patient-Level Survival Prediction from Whole Slide Image", AAAI, 2023 (Tsinghua). [Paper][PyTorch]
  • AMIGO: "AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images", CVPR, 2023 (UBC). [Paper]
  • ACAT: "ACAT: Adversarial Counterfactual Attention for Classification and Detection in Medical Imaging", ICML, 2023 (University of Edinburgh, UK). [Paper]
  • ConSlide: "ConSlide: Asynchronous Hierarchical Interaction Transformer with Breakup-Reorganize Rehearsal for Continual Whole Slide Image Analysis", ICCV, 2023 (HKU). [Paper]
  • MOTCat: "Multimodal Optimal Transport-based Co-Attention Transformer with Global Structure Consistency for Survival Prediction", ICCV, 2023 (HKUST). [Paper][PyTorch]
  • ViT-DAE: "ViT-DAE: Transformer-driven Diffusion Autoencoder for Histopathology Image Analysis", arXiv, 2023 (Stony Brook). [Paper]
  • LoRKD: "Low-Rank Knowledge Decomposition for Medical Foundation Models", CVPR, 2024 (SJTU). [Paper][Code (in construction)]
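
The registration entries above (ViT-V-Net, TransMorph, XMorpher, C2FViT, ...) share a common skeleton: features of the moving image attend to features of the fixed image, a head predicts a dense displacement field, and the moving image is warped with it. Below is a minimal 2D sketch of that skeleton with invented names and toy dimensions; real models are 3D and multi-resolution.

```python
# Minimal sketch: moving-image tokens cross-attend to fixed-image tokens,
# a head predicts per-patch displacements, and grid_sample warps the image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegSketch(nn.Module):
    def __init__(self, size=32, patch=4, dim=128, heads=8):
        super().__init__()
        self.g, self.patch = size // patch, patch
        self.embed = nn.Conv2d(1, dim, patch, stride=patch)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.flow = nn.Linear(dim, 2 * patch * patch)  # displacements, normalized coords

    def forward(self, moving, fixed):                  # (B, 1, H, W) each
        tm = self.embed(moving).flatten(2).transpose(1, 2)
        tf = self.embed(fixed).flatten(2).transpose(1, 2)
        t, _ = self.cross(tm, tf, tf)                  # moving queries fixed
        B, H, W = moving.size(0), moving.size(2), moving.size(3)
        flow = self.flow(t).reshape(B, self.g, self.g, self.patch, self.patch, 2)
        flow = flow.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, 2)
        # identity sampling grid in [-1, 1], displaced by the predicted flow
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing='ij')
        grid = torch.stack([xs, ys], -1).to(flow) + flow
        return F.grid_sample(moving, grid, align_corners=True)

warped = RegSketch()(torch.randn(1, 1, 32, 32), torch.randn(1, 1, 32, 32))
```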

[Back to Overview]

Other Tasks

  • Active Learning:
    • TJLS: "Visual Transformer for Task-aware Active Learning", arXiv, 2021 (ICL). [Paper][PyTorch]
  • Agriculture:
    • PlantXViT: "Explainable vision transformer enabled convolutional neural network for plant disease identification: PlantXViT", arXiv, 2022 (Indian Institute of Information Technology). [Paper]
    • MMST-ViT: "MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer", ICCV, 2023 (University of Delaware, Delaware). [Paper][PyTorch]
  • Aesthetic:
    • CSKD: "CLIP Brings Better Features to Visual Aesthetics Learners", arXiv, 2023 (OPPO). [Paper]
    • AesBench: "AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception", arXiv, 2024 (Xidian University). [Paper][GitHub]
  • Animation-related:
    • AnT: "The Animation Transformer: Visual Correspondence via Segment Matching", ICCV, 2021 (Cadmium). [Paper]
    • AniFormer: "AniFormer: Data-driven 3D Animation with Transformer", BMVC, 2021 (University of Oulu, Finland). [Paper][PyTorch]
  • Bird's Eye View (BEV):
    • ViT-BEVSeg: "ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation", IJCNN, 2022 (Maynooth University, Ireland). [Paper][Code (in construction)]
    • BEVFormer: "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers", ECCV, 2022 (Shanghai AI Lab). [Paper][PyTorch]
    • CoBEVT: "CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers", CoRL, 2022 (UCLA). [Paper][PyTorch]
    • GKT: "Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer", arXiv, 2022 (Huazhong University of Science and Technology). [Paper][Code (in construction)]
    • BEVSegFormer: "BEVSegFormer: Bird's Eye View Semantic Segmentation From Arbitrary Camera Rigs", WACV, 2023 (Nullmax, China). [Paper]
    • BEVDistill: "BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection", ICLR, 2023 (USTC). [Paper][Code (in construction)]
    • BEVFormer-v2: "BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision", CVPR, 2023 (Tsinghua University). [Paper]
    • BEV-SAN: "BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks", CVPR, 2023 (Peking University). [Paper]
    • BEVGuide: "BEV-Guided Multi-Modality Fusion for Driving Perception", CVPR, 2023 (UIUC). [Paper][Code (in construction)][Website]
    • FB-OCC: "FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation", CVPRW, 2023 (NVIDIA). [Paper][Code (in construction)]
    • FB-BEV: "FB-BEV: BEV Representation from Forward-Backward View Transformations", ICCV, 2023 (NVIDIA). [Paper][Code (in construction)]
    • BEV-DG: "BEV-DG: Cross-Modal Learning under Bird's-Eye View for Domain Generalization of 3D Semantic Segmentation", ICCV, 2023 (Xiamen University). [Paper]
    • UniTR: "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation", ICCV, 2023 (Peking). [Paper][PyTorch]
    • SparseBEV: "SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos", ICCV, 2023 (Nanjing University). [Paper][Code (in construction)]
    • OCBEV: "OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • FusionFormer: "FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection", arXiv, 2023 (Cainiao Network, China). [Paper]
    • Talk2BEV: "Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving", arXiv, 2023 (IIIT Hyderabad). [Paper][Code][Website]
    • SparseOcc: "Fully Sparse 3D Panoptic Occupancy Prediction", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • CLIP-BEVFormer: "CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow", CVPR, 2024 (Bosch). [Paper]
    • TaDe: "Improving Bird's Eye View Semantic Segmentation by Task Decomposition", CVPR, 2024 (Wuhan University). [Paper][Code (in construction)]
  • Biology:
    • ?: "A State-of-the-art Survey of Object Detection Techniques in Microorganism Image Analysis: from Traditional Image Processing and Classical Machine Learning to Current Deep Convolutional Neural Networks and Potential Visual Transformers", arXiv, 2021 (Northeastern University). [Paper]
  • Brain Score:
    • CrossViT: "Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4", CVPRW, 2022 (MIT). [Paper][PyTorch]
  • Camera-related:
    • CTRL-C: "CTRL-C: Camera calibration TRansformer with Line-Classification", ICCV, 2021 (Kakao + Kookmin University). [Paper][PyTorch]
    • MS-Transformer: "Learning Multi-Scene Absolute Pose Regression with Transformers", ICCV, 2021 (Bar-Ilan University, Israel). [Paper][PyTorch]
    • GTCaR: "GTCaR: Graph Transformer for Camera Re-localization", ECCV, 2022 (Magic Leap). [Paper]
    • ?: "Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer", ICCV, 2023 (ANU). [Paper]
  • Change Detection:
    • MapFormer: "MapFormer: Boosting Change Detection by Using Pre-change Information", ICCV, 2023 (LMU Munich). [Paper][PyTorch]
  • Character/Text Recognition:
    • BTTR: "Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer", arXiv, 2021 (Peking). [Paper]
    • TrOCR: "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models", arXiv, 2021 (Microsoft). [Paper][PyTorch]
    • ?: "Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks", arXiv, 2021 (Salesforce). [Paper]
    • T3: "TrueType Transformer: Character and Font Style Recognition in Outline Format", Document Analysis Systems (DAS), 2022 (Kyushu University). [Paper]
    • ?: "Transformer-based HTR for Historical Documents", ComHum, 2022 (University of Zurich, Switzerland). [Paper]
    • ?: "SVG Vector Font Generation for Chinese Characters with Transformer", ICIP, 2022 (The University of Tokyo). [Paper]
    • LP-Transformer: "Forensic License Plate Recognition with Compression-Informed Transformers", ICIP, 2022 (University of Erlangen-Nurnberg, Germany). [Paper]
    • CoMER: "CoMER: Modeling Coverage for Transformer-based Handwritten Mathematical Expression Recognition", ECCV, 2022 (Peking University). [Paper][PyTorch]
    • MATRN: "Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features", ECCV, 2022 (KAIST). [Paper][PyTorch]
    • CONSENT: "CONSENT: Context Sensitive Transformer for Bold Words Classification", arXiv, 2022 (Amazon). [Paper]
    • DeepVecFont-v2: "DeepVecFont-v2: Exploiting Transformers to Synthesize Vector Fonts with Higher Quality", CVPR, 2023 (Peking University). [Paper][Code (in construction)]
    • SVGformer: "SVGformer: Representation Learning for Continuous Vector Graphics Using Transformers", CVPR, 2023 (Adobe). [Paper]
    • SIGA: "Self-supervised Implicit Glyph Attention for Text Recognition", CVPR, 2023 (Shanghai Jiao Tong). [Paper]
    • LISTER: "LISTER: Neighbor Decoding for Length-Insensitive Scene Text Recognition", ICCV, 2023 (Alibaba). [Paper][PyTorch]
    • CCR-CLIP: "Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning", ICCV, 2023 (Fudan). [Paper][PyTorch]
    • CLIPTER: "CLIPTER: Looking at the Bigger Picture in Scene Text Recognition", ICCV, 2023 (Amazon). [Paper]
    • CLIP4STR: "CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model", arXiv, 2023 (Zhejiang University). [Paper]
  • Curriculum Learning:
    • SSTN: "Spatial Transformer Networks for Curriculum Learning", arXiv, 2021 (TU Kaiserslautern, Germany). [Paper]
  • Defect Classification:
    • MSHViT: "Multi-Scale Hybrid Vision Transformer and Sinkhorn Tokenizer for Sewer Defect Classification", CVPRW, 2022 (Aalborg University, Denmark). [Paper]
    • DefT: "Defect Transformer: An Efficient Hybrid Transformer Architecture for Surface Defect Detection", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper]
  • Digital Holography:
    • ?: "Convolutional Neural Network (CNN) vs Visual Transformer (ViT) for Digital Holography", ICCCR, 2022 (UBFC, France). [Paper]
  • Disentangled representation:
    • VCT: "Visual Concepts Tokenization", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
  • E-Commerce:
    • WebShop: "WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents", NeurIPS, 2022 (Princeton). [Paper][PyTorch][Website]
    • ECLIP: "Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce", CVPR, 2023 (ByteDance). [Paper]
  • Event data:
    • EvT: "Event Transformer: A sparse-aware solution for efficient event data processing", arXiv, 2022 (Universidad de Zaragoza, Spain). [Paper][PyTorch]
    • ETB: "Event Transformer", arXiv, 2022 (Nanjing University). [Paper]
    • RVT: "Recurrent Vision Transformers for Object Detection with Event Cameras", CVPR, 2023 (University of Zurich). [Paper]
    • Eventful-Transformer: "Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers", ICCV, 2023 (UW Madison). [Paper][PyTorch][Website]
    • GET: "GET: Group Event Transformer for Event-Based Vision", ICCV, 2023 (USTC). [Paper][PyTorch]
    • ?: "Cross-modal Orthogonal High-rank Augmentation for RGB-Event Transformer-trackers", ICCV, 2023 (CUHK). [Paper][PyTorch]
    • SODFormer: "SODFormer: Streaming Object Detection with Transformer Using Events and Frames", TPAMI, 2023 (Peking). [Paper][PyTorch]
    • EventSAM: "Segment Any Events via Weighted Adaptation of Pivotal Tokens", arXiv, 2023 (Xidian University). [Paper][PyTorch (in construction)]
  • Fashion:
    • Kaleido-BERT: "Kaleido-BERT: Vision-Language Pre-training on Fashion Domain", CVPR, 2021 (Alibaba). [Paper][Tensorflow]
    • CIT: "Cloth Interactive Transformer for Virtual Try-On", arXiv, 2021 (University of Trento). [Paper][Code (in construction)]
    • ClothFormer: "ClothFormer: Taming Video Virtual Try-on in All Module", CVPR, 2022 (iQIYI). [Paper][Website]
    • FashionVLP: "FashionVLP: Vision Language Transformer for Fashion Retrieval With Feedback", CVPR, 2022 (Amazon). [Paper]
    • FashionViL: "FashionViL: Fashion-Focused Vision-and-Language Representation Learning", ECCV, 2022 (University of Surrey, UK). [Paper][PyTorch]
    • OutfitTransformer: "OutfitTransformer: Learning Outfit Representations for Fashion Recommendation", arXiv, 2022 (Amazon). [Paper]
    • FaD-VLP: "FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning", EMNLP, 2022 (Meta). [Paper]
    • Fashionformer: "Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition", ECCV, 2022 (Peking). [Paper][PyTorch]
    • FAME-ViL: "FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks", CVPR, 2023 (University of Surrey). [Paper][PyTorch]
    • FashionSAP: "FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training", CVPR, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
    • OpenFashionCLIP: "OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data", ICIAP, 2023 (UniMoRE, Italy). [Paper][PyTorch]
    • MVLT: "Masked Vision-Language Transformer in Fashion", Machine Intelligence Research, 2023 (Alibaba). [Paper][PyTorch]
    • UniDiff: "UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning", arXiv, 2023 (Sun Yat-Sen University). [Paper]
  • Feature Matching:
    • SuperGlue: "SuperGlue: Learning Feature Matching with Graph Neural Networks", CVPR, 2020 (Magic Leap). [Paper][PyTorch]
    • LoFTR: "LoFTR: Detector-Free Local Feature Matching with Transformers", CVPR, 2021 (Zhejiang University). [Paper][PyTorch][Website]
    • COTR: "COTR: Correspondence Transformer for Matching Across Images", ICCV, 2021 (UBC). [Paper]
    • CATs: "CATs: Cost Aggregation Transformers for Visual Correspondence", NeurIPS, 2021 (Yonsei University + Korea University). [Paper][PyTorch][Website]
    • TransforMatcher: "TransforMatcher: Match-to-Match Attention for Semantic Correspondence", CVPR, 2022 (POSTECH). [Paper]
    • ASpanFormer: "ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer", ECCV, 2022 (HKUST). [Paper][Website]
    • CATs++: "CATs++: Boosting Cost Aggregation with Convolutions and Transformers", arXiv, 2022 (Korea University). [Paper]
    • LoFTR-TensorRT: "Local Feature Matching with Transformers for low-end devices", arXiv, 2022 (?). [Paper][PyTorch]
    • MatchFormer: "MatchFormer: Interleaving Attention in Transformers for Feature Matching", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
    • OpenGlue: "OpenGlue: Open Source Graph Neural Net Based Pipeline for Image Matching", arXiv, 2022 (Ukrainian Catholic University). [Paper][PyTorch]
    • ParaFormer: "ParaFormer: Parallel Attention Transformer for Efficient Feature Matching", AAAI, 2023 (Southeast University, China). [Paper]
    • ASTR: "Adaptive Spot-Guided Transformer for Consistent Local Feature Matching", CVPR, 2023 (USTC). [Paper][Website]
    • ACTR: "Correspondence Transformers with Asymmetric Feature Learning and Matching Flow Super-Resolution", CVPR, 2023 (Fudan). [Paper][Code (in construction)]
    • D2Former: "D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-based Transformers", CVPR, 2023 (USTC). [Paper]
    • PMatch: "PMatch: Paired Masked Image Modeling for Dense Geometric Matching", CVPR, 2023 (Michigan State). [Paper][Code (in construction)]
    • 2D3D-MATR: "2D3D-MATR: 2D-3D Matching Transformer for Detection-free Registration between Images and Point Clouds", ICCV, 2023 (National University of Defense Technology, China). [Paper][PyTorch (in construction)]
    • CasMTR: "Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints", ICCV, 2023 (Fudan). [Paper][PyTorch]
    • Fuse-ViT: "A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence", NeurIPS, 2023 (Google). [Paper][Website]
    • Diffusion-Hyperfeature: "Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence", NeurIPS, 2023 (Berkeley). [Paper][PyTorch][Website]
    • LDM-correspondence: "Unsupervised Semantic Correspondence Using Stable Diffusion", NeurIPS, 2023 (UBC). [Paper][PyTorch][Website]
    • VSFormer: "VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning", AAAI, 2024 (Wenzhou University). [Paper][Code (in construction)]
    • Efficient-LoFTR: "Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed", CVPR, 2024 (Zhejiang). [Paper][Code (in construction)][Website]
    • OmniGlue: "OmniGlue: Generalizable Feature Matching with Foundation Model Guidance", CVPR, 2024 (Google). [Paper][Tensorflow][Website]
  • Fine-grained:
    • ViT-FGVC: "Exploring Vision Transformers for Fine-grained Classification", CVPRW, 2021 (Universidad de Valladolid). [Paper]
    • FFVT: "Feature Fusion Vision Transformer for Fine-Grained Visual Categorization", BMVC, 2021 (Griffith University, Australia). [Paper][PyTorch]
    • TPSKG: "Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition", arXiv, 2021 (Beihang University). [Paper]
    • AFTrans: "A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition", arXiv, 2021 (Peking University). [Paper]
    • TransFG: "TransFG: A Transformer Architecture for Fine-grained Recognition", AAAI, 2022 (Johns Hopkins). [Paper][PyTorch]
    • DynamicMLP: "Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information", CVPR, 2022 (Megvii). [Paper][PyTorch]
    • SIM-Trans: "SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization", ACMMM, 2022 (Peking University). [Paper][PyTorch]
    • MetaFormer: "MetaFormer: A Unified Meta Framework for Fine-Grained Recognition", arXiv, 2022 (ByteDance). [Paper][PyTorch]
    • ViT-FOD: "ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator", arXiv, 2022 (Shandong University). [Paper]
    • PLEor: "Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator", CVPR, 2023 (Dalian University of Technology). [Paper]
    • MultitaskVLFM: "Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks", arXiv, 2023 (Conservatoire National des Arts et Métiers (CEDRIC), France). [Paper][PyTorch]
    • M2Former: "M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition", arXiv, 2023 (Dongguk University, Korea). [Paper]
    • MP-FGVC: "Delving into Multimodal Prompting for Fine-grained Visual Classification", arXiv, 2023 (Nanjing University of Science and Technology). [Paper]
    • HGCLIP: "HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding", arXiv, 2023 (Monash). [Paper][PyTorch]
    • FineR: "Democratizing Fine-grained Visual Recognition with Large Language Models", ICLR, 2024 (University of Trento). [Paper][Code (in construction)][Website]
    • Finer: "Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models", arXiv, 2024 (UIUC). [Paper]
  • Gait:
    • Gait-TR: "Spatial Transformer Network on Skeleton-based Gait Recognition", arXiv, 2022 (South China University of Technology). [Paper]
    • MMGaitFormer: "Multi-Modal Gait Recognition via Effective Spatial-Temporal Feature Fusion", CVPR, 2023 (Beihang University). [Paper]
  • Gaze:
    • GazeTR: "Gaze Estimation using Transformer", arXiv, 2021 (Beihang University). [Paper][PyTorch]
    • HGTTR: "End-to-End Human-Gaze-Target Detection with Transformers", CVPR, 2022 (Shanghai Jiao Tong). [Paper]
    • MGTR: "MGTR: End-to-End Mutual Gaze Detection with Transformer", ACCV, 2022 (Nankai University). [Paper][PyTorch]
    • GLC: "In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation", arXiv, 2022 (Georgia Tech). [Paper][Website]
    • Gazeformer: "Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention", CVPR, 2023 (Stony Brook). [Paper][PyTorch]
    • Sharingan: "Sharingan: A Transformer-based Architecture for Gaze Following", arXiv, 2023 (EPFL). [Paper]
    • TransGOP: "TransGOP: Transformer-Based Gaze Object Prediction", AAAI, 2024 (Xi'an University of Architecture and Technology). [Paper]
    • CLIP-Gaze: "CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model", AAAI, 2024 (Hikvision). [Paper]
    • IG: "Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition", CVPR, 2024 (Sun Yat-sen University). [Paper][Website]
  • Geo-Localization:
    • EgoTR: "Cross-view Geo-localization with Evolving Transformer", arXiv, 2021 (Shenzhen University). [Paper]
    • TransGeo: "TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization", CVPR, 2022 (UCF). [Paper][PyTorch]
    • GAMa: "GAMa: Cross-view Video Geo-localization", ECCV, 2022 (UCF). [Paper][Code (in construction)]
    • TransLocator: "Where in the World is this Image? Transformer-based Geo-localization in the Wild", ECCV, 2022 (JHU). [Paper]
    • TransGCNN: "Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization", arXiv, 2022 (Southeast University, China). [Paper]
    • MGTL: "Mutual Generative Transformer Learning for Cross-view Geo-localization", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
    • GeoGuessNet: "Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes", CVPR, 2023 (UCF). [Paper][PyTorch (in construction)]
    • GeoCLIP: "GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization", NeurIPS, 2023 (UCF). [Paper]
  • Homography Estimation:
    • LocalTrans: "LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation", ICCV, 2021 (Tsinghua). [Paper]
  • Image Registration:
    • AiR: "Attention for Image Registration (AiR): an unsupervised Transformer approach", arXiv, 2021 (INRIA). [Paper]
  • Image Retrieval:
    • RRT: "Instance-level Image Retrieval using Reranking Transformers", ICCV, 2021 (University of Virginia). [Paper][PyTorch]
    • SwinFGHash: "SwinFGHash: Fine-grained Image Retrieval via Transformer-based Hashing Network", BMVC, 2021 (Tsinghua). [Paper]
    • ViT-Retrieval: "Investigating the Vision Transformer Model for Image Retrieval Tasks", arXiv, 2021 (Democritus University of Thrace). [Paper]
    • IRT: "Training Vision Transformers for Image Retrieval", arXiv, 2021 (Facebook + INRIA). [Paper]
    • TransHash: "TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval", arXiv, 2021 (Shanghai Jiao Tong University). [Paper]
    • VTS: "Vision Transformer Hashing for Image Retrieval", arXiv, 2021 (IIIT-Allahabad). [Paper]
    • GTZSR: "Zero-Shot Sketch Based Image Retrieval using Graph Transformer", arXiv, 2022 (IIT Bombay). [Paper]
    • EViT: "EViT: Privacy-Preserving Image Retrieval via Encrypted Vision Transformer in Cloud Computing", arXiv, 2022 (Jinan University). [Paper][PyTorch (in construction)]
    • ?: "Transformers and CNNs both Beat Humans on SBIR", arXiv, 2022 (University of Mons, Belgium). [Paper]
    • DToP: "Boosting vision transformers for image retrieval", WACV, 2023 (Dealicious, Korea). [Paper][Code (in construction)]
    • ?: "A Light Touch Approach to Teaching Transformers Multi-view Geometry", CVPR, 2023 (Oxford). [Paper]
    • IRGen: "IRGen: Generative Modeling for Image Retrieval", arXiv, 2023 (Microsoft). [Paper]
    • CIReVL: "Vision-by-Language for Training-Free Compositional Image Retrieval", arXiv, 2023 (University of Tübingen, Germany). [Paper]
  • Layout Generation:
    • VTN: "Variational Transformer Networks for Layout Generation", CVPR, 2021 (Google). [Paper]
    • LayoutTransformer: "LayoutTransformer: Scene Layout Generation With Conceptual and Spatial Diversity", CVPR, 2021 (NTU). [Paper][PyTorch]
    • LayoutTransformer: "LayoutTransformer: Layout Generation and Completion with Self-attention", ICCV, 2021 (Amazon). [Paper][Website]
    • LGT-Net: "LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
    • CADTransformer: "CADTransformer: Panoptic Symbol Spotting Transformer for CAD Drawings", CVPR, 2022 (UT Austin). [Paper]
    • GAT-CADNet: "GAT-CADNet: Graph Attention Network for Panoptic Symbol Spotting in CAD Drawings", CVPR, 2022 (TUM + Alibaba). [Paper]
    • LayoutBERT: "LayoutBERT: Masked Language Layout Model for Object Insertion", CVPRW, 2022 (Adobe). [Paper]
    • ICVT: "Geometry Aligned Variational Transformer for Image-conditioned Layout Generation", ACMMM, 2022 (Alibaba). [Paper]
    • BLT: "BLT: Bidirectional Layout Transformer for Controllable Layout Generation", ECCV, 2022 (Google). [Paper][Tensorflow][Website]
    • ATEK: "ATEK: Augmenting Transformers with Expert Knowledge for Indoor Layout Synthesis", arXiv, 2022 (New Jersey Institute of Technology). [Paper]
    • ?: "Extreme Floorplan Reconstruction by Structure-Hallucinating Transformer Cascades", arXiv, 2022 (Simon Fraser). [Paper]
    • LayoutFormer++: "LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction", CVPR, 2023 (Microsoft). [Paper]
    • RoomFormer: "Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries", CVPR, 2023 (ETH Zurich). [Paper][PyTorch][Website]
    • LayoutDM: "LayoutDM: Transformer-based Diffusion Model for Layout Generation", CVPR, 2023 (USTC). [Paper]
    • DLT: "DLT: Conditioned layout generation with Joint Discrete-Continuous Diffusion Layout Transformer", ICCV, 2023 (Wix.com). [Paper]
  • Livestock Monitoring:
    • STARFormer: "Livestock Monitoring with Transformer", BMVC, 2021 (IIT Dhanbad). [Paper]
  • Metric Learning:
    • Hyp-ViT: "Hyperbolic Vision Transformers: Combining Improvements in Metric Learning", CVPR, 2022 (University of Trento, Italy). [Paper][PyTorch]
    • BGFormer: "Rethinking Batch Sample Relationships for Data Representation: A Batch-Graph Transformer based Approach", arXiv, 2022 (Anhui University). [Paper]
    • ?: "Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning", CVPR, 2023 (LMU Munich). [Paper]
  • Multi-Input:
    • MixViT: "Adapting Multi-Input Multi-Output schemes to Vision Transformers", CVPRW, 2022 (Sorbonne Universite, France). [Paper]
  • Multi-label:
    • C-Tran: "General Multi-label Image Classification with Transformers", CVPR, 2021 (University of Virginia). [Paper]
    • TDRG: "Transformer-Based Dual Relation Graph for Multi-Label Image Recognition", ICCV, 2021 (Tencent). [Paper]
    • MlTr: "MlTr: Multi-label Classification with Transformer", arXiv, 2021 (KuaiShou). [Paper]
    • GATN: "Graph Attention Transformer Network for Multi-Label Image Classification", arXiv, 2022 (Southeast University, China). [Paper]
    • CDUL: "CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification", ICCV, 2023 (University of Southern Mississippi, Mississippi). [Paper]
    • TagCLIP: "TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training", AAAI, 2024 (Zhejiang). [Paper][PyTorch]
  • Multi-task:
    • MulT: "MulT: An End-to-End Multitask Learning Transformer", CVPR, 2022 (EPFL). [Paper]
    • UFO: "UFO: Unified Feature Optimization", ECCV, 2022 (Baidu). [Paper][PaddlePaddle]
    • Painter: "Images Speak in Images: A Generalist Painter for In-Context Visual Learning", CVPR, 2023 (BAAI). [Paper][Code (in construction)]
    • MTLoRA: "MTLoRA: A Low-Rank Adaptation Approach for Efficient Multi-Task Learning", CVPR, 2024 (Brown). [Paper][PyTorch]
  • Open Set:
    • OSR-ViT: "Open Set Recognition using Vision Transformer with an Additional Detection Head", arXiv, 2022 (Vanderbilt University, Tennessee). [Paper]
  • Operator Learning for PDEs:
    • Galerkin Transformer: "Choose a Transformer: Fourier or Galerkin", NeurIPS, 2021 (Washington University, St. Louis). [Paper][PyTorch]
    • Coupled Attention: "Learning operators with coupled attention", JMLR, 2022 (University of Pennsylvania). [Paper]
    • HT-Net: "HT-Net: Hierarchical Transformer based Operator Learning Model for Multiscale PDEs", arXiv, 2022 (KAUST). [Paper]
    • Relative-PE: "Transformer for Partial Differential Equations' Operator Learning", arXiv, 2022 (CMU). [Paper]
  • Out-Of-Distribution (OOD):
    • OODformer: "OODformer: Out-Of-Distribution Detection Transformer", BMVC, 2021 (LMU Munich). [Paper][PyTorch]
    • MCM: "Delving into Out-of-Distribution Detection with Vision-Language Representations", NeurIPS, 2022 (UW-Madison). [Paper]
    • MOOD: "Rethinking Out-of-distribution (OOD) Detection: Masked Image Modeling is All You Need", CVPR, 2023 (CUHK). [Paper][PyTorch]
    • ?: "Masked Images Are Counterfactual Samples for Robust Fine-tuning", CVPR, 2023 (Sun Yat-sen University). [Paper][PyTorch]
    • CLIPood: "CLIPood: Generalizing CLIP to Out-of-Distributions", ICML, 2023 (Tsinghua). [Paper]
    • CLIPN: "CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No", ICCV, 2023 (HKUST). [Paper][PyTorch]
    • ?: "Distilling Large Vision-Language Model with Out-of-Distribution Generalizability", ICCV, 2023 (UCSD). [Paper][PyTorch]
    • DREAM-OOD: "Dream the Impossible: Outlier Imagination with Diffusion Models", NeurIPS, 2023 (UW Madison). [Paper]
    • LoCoOp: "LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning", NeurIPS, 2023 (The University of Tokyo). [Paper][PyTorch]
    • ?: "A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)", NeurIPS, 2023 (ANU). [Paper]
    • GL-MCM: "Zero-Shot In-Distribution Detection in Multi-Object Settings Using Vision-Language Foundation Models", arXiv, 2023 (The University of Tokyo). [Paper]
    • CLIP-OOD: "Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?", arXiv, 2023 (University of Tübingen). [Paper]
    • MOODv2: "MOODv2: Masked Image Modeling for Out-of-Distribution Detection", arXiv, 2024 (CUHK). [Paper]
    • AutoFT: "AutoFT: Robust Fine-Tuning by Optimizing Hyperparameters on OOD Data", arXiv, 2024 (Stanford). [Paper]
  • Pedestrian Intention:
    • IntFormer: "IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture", arXiv, 2021 (Universidad de Alcala). [Paper]
  • Physics Simulation:
    • TIE: "Transformer with Implicit Edges for Particle-based Physics Simulation", ECCV, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
  • Place Recognition:
    • SVT-Net: "SVT-Net: A Super Light-Weight Network for Large Scale Place Recognition using Sparse Voxel Transformers", AAAI, 2022 (Renmin University of China). [Paper]
    • TransVPR: "TransVPR: Transformer-based place recognition with multi-level attention aggregation", CVPR, 2022 (Xi'an Jiaotong). [Paper]
    • OverlapTransformer: "OverlapTransformer: An Efficient and Rotation-Invariant Transformer Network for LiDAR-Based Place Recognition", IROS, 2022 (HAOMO.AI, China). [Paper][PyTorch]
    • SeqOT: "SeqOT: A Spatial-Temporal Transformer Network for Place Recognition Using Sequential LiDAR Data", arXiv, 2022 (National University of Defense Technology, China). [Paper][PyTorch]
    • R2Former: "R2Former: Unified Retrieval and Reranking Transformer for Place Recognition", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
    • BoQ: "BoQ: A Place is Worth a Bag of Learnable Queries", CVPR, 2024 (Universite Laval, Canada). [Paper][Code (in construction)]
  • Remote Sensing/Hyperspectral/Satellite:
    • DCFAM: "Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images", arXiv, 2021 (Wuhan University). [Paper]
    • WiCNet: "Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images", arXiv, 2021 (University of Trento). [Paper]
    • ?: "Vision Transformers For Weeds and Crops Classification Of High Resolution UAV Images", arXiv, 2021 (University of Orleans, France). [Paper]
    • Satellite-ViT: "Manipulation Detection in Satellite Images Using Vision Transformer", arXiv, 2021 (Purdue). [Paper]
    • ?: "Self-supervised Vision Transformers for Joint SAR-optical Representation Learning", IGARSS, 2022 (German Aerospace Center). [Paper]
    • VBFusion: "Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing", SPIE Remote Sensing, 2022 (Technische Universitat Berlin, Germany). [Paper][PyTorch]
    • SatMAE: "SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery", NeurIPS, 2022 (Stanford). [Paper]
    • ANDT: "Anomaly Detection in Aerial Videos with Transformers", IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2022 (TUM). [Paper]
    • RNGDet: "RNGDet: Road Network Graph Detection by Transformer in Aerial Images", arXiv, 2022 (HKUST). [Paper]
    • FSRA: "A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization", arXiv, 2022 (China Jiliang University). [Paper][PyTorch]
    • ?: "Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification", arXiv, 2022 (Shenzhen University). [Paper]
    • ?: "Deep Hyperspectral Unmixing using Transformer Network", arXiv, 2022 (Jalpaiguri Engineering College, India). [Paper]
    • SiamixFormer: "SiamixFormer: A Siamese Transformer Network For Building Detection And Change Detection From Bi-Temporal Remote Sensing Images", arXiv, 2022 (Tarbiat Modares University, Iran). [Paper]
    • DAHiTrA: "DAHiTrA: Damage Assessment Using a Novel Hierarchical Transformer Architecture", arXiv, 2022 (Simon Fraser University, Canada). [Paper]
    • RVSA: "Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model", arXiv, 2022 (Wuhan University + The University of Sydney). [Paper]
    • SatViT: "Transfer Learning with Pretrained Remote Sensing Transformers", arXiv, 2022 (?). [Paper][PyTorch]
    • FTN: "Fully Transformer Network for Change Detection of Remote Sensing Images", arXiv, 2022 (Dalian University of Technology). [Paper]
    • MCTNet: "MCTNet: A Multi-Scale CNN-Transformer Network for Change Detection in Optical Remote Sensing Images", arXiv, 2022 (Tsinghua University). [Paper]
    • ?: "Transformers For Recognition In Overhead Imagery: A Reality Check", arXiv, 2022 (Duke University). [Paper]
    • TSViT: "ViTs for SITS: Vision Transformers for Satellite Image Time Series", CVPR, 2023 (ICL). [Paper][PyTorch]
    • MethaneMapper: "MethaneMapper: Spectral Absorption aware Hyperspectral Transformer for Methane Detection", CVPR, 2023 (UCSB). [Paper]
    • GFM: "Towards Geospatial Foundation Models via Continual Pretraining", ICCV, 2023 (Amazon). [Paper][PyTorch (in construction)]
    • Scale-MAE: "Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning", ICCV, 2023 (Berkeley). [Paper]
    • SAMRS: "SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model", NeurIPS (Datasets and Benchmarks), 2023 (iFlytek, China). [Paper][PyTorch]
    • RS5M: "RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model", arXiv, 2023 (Zhejiang University). [Paper][Code (in construction)]
    • RSGPT: "RSGPT: A Remote Sensing Vision Language Model and Benchmark", arXiv, 2023 (Alibaba). [Paper][Code (in construction)]
    • EarthGPT: "EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain", arXiv, 2024 (Beijing Institute of Technology). [Paper]
    • AnyChange: "Segment Any Change", arXiv, 2024 (Stanford). [Paper]
    • MMEarth: "MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning", arXiv, 2024 (University of Copenhagen, Denmark). [Paper][PyTorch][Dataset][Website]
  • Robotics:
    • TF-Grasp: "When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection", arXiv, 2022 (University of Science and Technology of China). [Paper][Code (in construction)]
    • BeT: "Behavior Transformers: Cloning k modes with one stone", arXiv, 2022 (NYU). [Paper][PyTorch]
    • Perceiver-Actor: "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation", Conference on Robot Learning (CoRL), 2022 (NVIDIA). [Paper][Website]
    • PACT: "PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training", arXiv, 2022 (Microsoft). [Paper]
    • ?: "A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers", arXiv, 2022 (University of Groningen, The Netherlands). [Paper]
    • ?: "Grounding Language with Visual Affordances over Unstructured Data", arXiv, 2022 (University of Freiburg, Germany). [Paper][Website]
    • ?: "Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation", ICLR, 2023 (DeepMind). [Paper]
    • LOCATE: "LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding", CVPR, 2023 (University of Edinburgh, UK). [Paper][PyTorch][Website]
    • Afformer: "Affordance Grounding from Demonstration Video to Target Image", CVPR, 2023 (NUS). [Paper][PyTorch]
    • MV-MWM: "Multi-View Masked World Models for Visual Robotic Manipulation", ICML, 2023 (KAIST). [Paper][Tensorflow2][Website]
    • MTM: "Masked Trajectory Models for Prediction, Representation, and Control", ICML, 2023 (Meta). [Paper][PyTorch][Website]
    • Skill-Transformer: "Skill Transformer: A Monolithic Policy for Mobile Manipulation", ICCV, 2023 (Georgia Tech). [Paper]
    • RUPs: "Nonrigid Object Contact Estimation With Regional Unwrapping Transformer", ICCV, 2023 (Southeast University, China). [Paper]
    • IAG: "Grounding 3D Object Affordance from 2D Interactions in Images", ICCV, 2023 (USTC). [Paper][Website][PyTorch]
    • RVT: "RVT: Robotic View Transformer for 3D Object Manipulation", CoRL, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • M2T2: "M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place", CoRL, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • ?: "Humanoid Locomotion as Next Token Prediction", arXiv, 2024 (Berkeley). [Paper]
  • Scene Decomposition:
    • SRT: "Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations", CVPR, 2022 (Google). [Paper][PyTorch (stelzner)][Website]
    • OSRT: "Object Scene Representation Transformer", NeurIPS, 2022 (Google). [Paper][Website]
    • Prompter: "Prompter: Utilizing Large Language Model Prompting for a Data Efficient Embodied Instruction Following", arXiv, 2022 (Hitachi). [Paper]
    • RePAST: "RePAST: Relative Pose Attention Scene Representation Transformer", arXiv, 2023 (Google). [Paper]
    • GTA: "GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers", arXiv, 2023 (University of Tubingen). [Paper]
  • Scene Text Recognition:
    • ViTSTR: "Vision Transformer for Fast and Efficient Scene Text Recognition", ICDAR, 2021 (University of the Philippines). [Paper]
    • STKM: "Self-attention based Text Knowledge Mining for Text Detection", CVPR, 2021 (?). [Paper][Code (in construction)]
    • I2C2W: "I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition", arXiv, 2021 (NTU Singapore). [Paper]
    • CornerTransformer: "Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • CUTE: "Contextual Text Block Detection towards Scene Text Understanding", ECCV, 2022 (NTU Singapore). [Paper][Website]
    • PARSeq: "Scene Text Recognition with Permuted Autoregressive Sequence Models", ECCV, 2022 (University of the Philippines). [Paper][PyTorch]
    • PTIE: "Pure Transformer with Integrated Experts for Scene Text Recognition", ECCV, 2022 (NTU Singapore). [Paper]
    • MGP-STR: "Multi-Granularity Prediction for Scene Text Recognition", ECCV, 2022 (Alibaba). [Paper]
    • VLAMD: "Vision-Language Adaptive Mutual Decoder for OOV-STR", ECCVW, 2022 (iFLYTEK, China). [Paper]
    • MVLT: "Masked Vision-Language Transformers for Scene Text Recognition", BMVC, 2022 (Westone Information Industry Inc., China). [Paper][PyTorch]
  • Sign Language:
    • LWTA: "Stochastic Transformer Networks with Linear Competing Units: Application to end-to-end SL Translation", ICCV, 2021 (Cyprus University of Technology). [Paper]
    • CiCo: "CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
    • GFSLT-VLP: "Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining", ICCV, 2023 (Macau University of Science and Technology (MUST)). [Paper][Code (in construction)]
    • IP-SLT: "Sign Language Translation with Iterative Prototype", ICCV, 2023 (USTC). [Paper]
    • SignBERT+: "SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding", TPAMI, 2023 (USTC). [Paper][Website]
    • Sign2GPT: "Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation", ICLR, 2024 (University of Surrey). [Paper]
    • SignLLM: "SignLLM: Sign Languages Production Large Language Models", arXiv, 2024 (Rutgers). [Paper][Website]
  • Spike:
    • Spikformer: "Spikformer: When Spiking Neural Network Meets Transformer", arXiv, 2022 (Peking). [Paper]
    • SDSA: "Spike-driven Transformer", NeurIPS, 2023 (CAS). [Paper][PyTorch]
    • Meta-SpikeFormer: "Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips", ICLR, 2024 (CAS). [Paper][PyTorch]
  • Stereo:
    • STTR: "Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers", ICCV, 2021 (Johns Hopkins). [Paper][PyTorch]
    • PS-Transformer: "PS-Transformer: Learning Sparse Photometric Stereo Network using Self-Attention Mechanism", BMVC, 2021 (National Institute of Informatics, Japan). [Paper][PyTorch]
    • ChiTransformer: "ChiTransformer: Towards Reliable Stereo from Cues", CVPR, 2022 (GSU). [Paper]
    • TransMVSNet: "TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers", CVPR, 2022 (Megvii). [Paper][Code (in construction)]
    • MVSTER: "MVSTER: Epipolar Transformer for Efficient Multi-View Stereo", ECCV, 2022 (CAS). [Paper][PyTorch]
    • CEST: "Context-Enhanced Stereo Transformer", ECCV, 2022 (CAS). [Paper][PyTorch]
    • WT-MVSNet: "WT-MVSNet: Window-based Transformers for Multi-view Stereo", NeurIPS, 2022 (Tsinghua University). [Paper]
    • MVSFormer: "MVSFormer: Learning Robust Image Representations via Transformers and Temperature-based Depth for Multi-View Stereo", arXiv, 2022 (Fudan University). [Paper]
    • MVSFormer++: "MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo", ICLR, 2024 (Fudan). [Paper][Code (in construction)]
  • Time Series:
    • MissFormer: "MissFormer: (In-)attention-based handling of missing observations for trajectory filtering and prediction", arXiv, 2021 (Fraunhofer IOSB, Germany). [Paper]
  • Traffic:
    • NEAT: "NEAT: Neural Attention Fields for End-to-End Autonomous Driving", ICCV, 2021 (MPI). [Paper][PyTorch]
    • ViTAL: "Novelty Detection and Analysis of Traffic Scenario Infrastructures in the Latent Space of a Vision Transformer-Based Triplet Autoencoder", IV, 2021 (Technische Hochschule Ingolstadt). [Paper]
    • ?: "Predicting Vehicles Trajectories in Urban Scenarios with Transformer Networks and Augmented Information", IVS, 2021 (Universidad de Alcala). [Paper]
    • ?: "Translating Images into Maps", ICRA, 2022 (University of Surrey, UK). [Paper][PyTorch (in construction)]
    • Crossview-Transformer: "Cross-view Transformers for real-time Map-view Semantic Segmentation", CVPR, 2022 (UT Austin). [Paper][PyTorch]
    • MSF3DDETR: "MSF3DDETR: Multi-Sensor Fusion 3D Detection Transformer for Autonomous Driving", ICPRW, 2022 (University of Coimbra, Portugal). [Paper]
    • TransLPC: "Transformers for Object Detection in Large Point Clouds", ITSC, 2022 (Bosch). [Paper]
    • PicT: "PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress Classification", ACMMM, 2022 (Chongqing University). [Paper][PyTorch (in construction)]
    • JPerceiver: "JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes", ECCV, 2022 (The University of Sydney). [Paper][PyTorch]
    • V2X-ViT: "V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer", ECCV, 2022 (UCLA). [Paper]
    • ?: "Can Transformer Attention Spread Give Insights Into Uncertainty of Detected and Tracked Objects?", IROSW, 2022 (Bosch). [Paper]
    • MTR: "Motion Transformer with Global Intention Localization and Local Movement Refinement", NeurIPS, 2022 (MPI). [Paper][Code (in construction)]
    • PlanT: "PlanT: Explainable Planning Transformers via Object-Level Representations", Conference on Robot Learning (CoRL), 2022 (TUM). [Paper][PyTorch][Website]
    • ParkPredict+: "ParkPredict+: Multimodal Intent and Motion Prediction for Vehicles in Parking Lots with CNN and Transformer", arXiv, 2022 (Berkeley). [Paper]
    • ?: "Pyramid Transformer for Traffic Sign Detection", arXiv, 2022 (Iran University of Science and Technology). [Paper]
    • STrajNet: "STrajNet: Occupancy Flow Prediction via Multi-modal Swin Transformer", arXiv, 2022 (NTU, Singapore). [Paper]
    • MTPP: "Multi-modal Transformer Path Prediction for Autonomous Vehicle", arXiv, 2022 (National Central University). [Paper]
    • DCT: "A Dual-Cycled Cross-View Transformer Network for Unified Road Layout Estimation and 3D Object Detection in the Bird's-Eye-View", arXiv, 2022 (Gwangju Institute of Science and Technology). [Paper]
    • C-ViT: "Traffic Accident Risk Forecasting using Contextual Vision Transformers", arXiv, 2022 (University of Technology Sydney). [Paper]
    • MapTR: "MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction", ICLR, 2023 (Horizon Robotics). [Paper][PyTorch]
    • VE-Prompt: "Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving", CVPR, 2023 (Sun Yat-sen University). [Paper]
    • TPVFormer: "Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction", CVPR, 2023 (Tsinghua University). [Paper][PyTorch]
    • TBP-Former: "TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving", CVPR, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
    • BAEFormer: "BAEFormer: Bi-directional and Early Interaction Transformers for Bird’s Eye View Semantic Segmentation", CVPR, 2023 (Horizon Robotics). [Paper]
    • BAAM: "BAAM: Monocular 3D Pose and Shape Reconstruction With Bi-Contextual Attention Module and Attention-Guided Modeling", CVPR, 2023 (Chungnam National University, Korea). [Paper][PyTorch]
    • Pix2Map: "Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images", CVPR, 2023 (CMU). [Paper][Website]
    • UniAD: "Planning-oriented Autonomous Driving", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • Multiverse-Transformer: "Multiverse Transformer: 1st Place Solution for Waymo Open Sim Agents Challenge 2023", CVPRW, 2023 (Pegasus). [Paper][Website]
    • UniFormer: "UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View", ICCV, 2023 (Zhejiang University). [Paper]
    • SegMiF: "Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation", ICCV, 2023 (Dalian University of Technology). [Paper][Code (in construction)]
    • VTD: "Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving", ICCV, 2023 (ETHZ). [Paper]
    • HM-ViT: "HM-ViT: Hetero-modal Vehicle-to-Vehicle Cooperative perception with vision transformer", ICCV, 2023 (UCLA). [Paper][Code (in construction)]
    • UP-VL: "Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving", ICCV, 2023 (Waymo). [Paper]
    • GameFormer: "GameFormer: Game-theoretic Modeling and Learning of Transformer-based Interactive Prediction and Planning for Autonomous Driving", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • GeoMIM: "GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding", ICCV, 2023 (CUHK). [Paper][PyTorch]
    • LiDARFormer: "LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR Perception", arXiv, 2023 (TuSimple). [Paper]
    • VoxelFormer: "VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
    • LCTGen: "Language Conditioned Traffic Generation", arXiv, 2023 (NVIDIA). [Paper][Website]
    • UniWorld: "UniWorld: Autonomous Driving Pre-training via World Models", arXiv, 2023 (Peking). [Paper][Code (in construction)]
    • PromptTrack: "Language Prompt for Autonomous Driving", arXiv, 2023 (Megvii). [Paper]
    • HiLM-D: "HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving", arXiv, 2023 (Huawei). [Paper]
    • DiffPrompter: "DiffPrompter: Differentiable Implicit Visual Prompts for Semantic-Segmentation in Adverse Conditions", arXiv, 2023 (IIIT Hyderabad). [Paper][PyTorch][Website]
    • OccWorld: "OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving", arXiv, 2023 (Tsinghua). [Paper][PyTorch][Website]
    • VehicleMAE: "Structural Information Guided Multimodal Pre-training for Vehicle-centric Perception", AAAI, 2024 (Anhui University). [Paper]
    • STT: "STT: Stateful Tracking with Transformers for Autonomous Driving", ICRA, 2024 (Waymo). [Paper]
    • MM-AU: "Abductive Ego-View Accident Video Understanding for Safe Driving Perception", CVPR, 2024 (Xi'an Jiaotong University). [Paper][Website]
    • GenAD: "Generalized Predictive Model for Autonomous Driving", CVPR, 2024 (Shanghai AI Lab). [Paper][Code (in construction)]
    • DriveWorld: "DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving", CVPR, 2024 (Peking). [Paper]
  • Traffic (LLM-based):
    • AVIS: "AVIS: Autonomous Visual Information Seeking with Large Language Models", NeurIPS, 2023 (Google). [Paper]
    • DriveGPT4: "DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model", arXiv, 2023 (HKU). [Paper][Website]
    • ?: "Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving", arXiv, 2023 (Wayve). [Paper][PyTorch]
    • GPT4V-AD: "On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • Agent-Driver: "A Language Agent for Autonomous Driving", arXiv, 2023 (USC). [Paper][Code (in construction)][Website]
    • ADriver-I: "ADriver-I: A General World Model for Autonomous Driving", arXiv, 2023 (Megvii). [Paper]
    • Dolphins: "Dolphins: Multimodal Language Model for Driving", arXiv, 2023 (NVIDIA). [Paper][Code (in construction)][Website]
    • LMDrive: "LMDrive: Closed-Loop End-to-End Driving with Large Language Models", arXiv, 2023 (CUHK). [Paper]
    • DriveMLM: "DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • DriveLM: "DriveLM: Driving with Graph Visual Question Answering", arXiv, 2023 (OpenDriveLab, China). [Paper][Code]
    • VLP: "VLP: Vision Language Planning for Autonomous Driving", arXiv, 2024 (Bosch). [Paper]
    • DriveVLM: "DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models", arXiv, 2024 (Tsinghua). [Paper][Website]
    • DriveDreamer-2: "DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation", arXiv, 2024 (CAS). [Paper][Code (in construction)][Website]
    • OmniDrive: "OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning", arXiv, 2024 (NVIDIA). [Paper][Code (in construction)]
    • DriveSim: "Probing Multimodal LLMs as World Models for Driving", arXiv, 2024 (MIT). [Paper][Code (in construction)]
  • Trajectory Prediction:
    • mmTransformer: "Multimodal Motion Prediction with Stacked Transformers", CVPR, 2021 (CUHK + SenseTime). [Paper][Code (in construction)][Website]
    • AgentFormer: "AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting", ICCV, 2021 (CMU). [Paper][PyTorch][Website]
    • S2TNet: "S2TNet: Spatio-Temporal Transformer Networks for Trajectory Prediction in Autonomous Driving", ACML, 2021 (Xi'an Jiaotong University). [Paper][PyTorch]
    • MRT: "Multi-Person 3D Motion Prediction with Multi-Range Transformers", NeurIPS, 2021 (UCSD + Berkeley). [Paper][PyTorch][Website]
    • ?: "Latent Variable Sequential Set Transformers for Joint Multi-Agent Motion Prediction", ICLR, 2022 (MILA). [Paper]
    • Scene-Transformer: "Scene Transformer: A unified architecture for predicting multiple agent trajectories", ICLR, 2022 (Google). [Paper]
    • ST-MR: "Graph-based Spatial Transformer with Memory Replay for Multi-Future Pedestrian Trajectory Prediction", CVPR, 2022 (University of New South Wales, Australia). [Paper][Tensorflow]
    • HiVT: "HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction", CVPR, 2022 (CUHK). [Paper]
    • EF-Transformer: "Entry-Flipped Transformer for Inference and Prediction of Participant Behavior", ECCV, 2022 (NTU, Singapore). [Paper]
    • Social-SSL: "Social-SSL: Self-Supervised Cross-Sequence Representation Learning Based on Transformers for Multi-Agent Trajectory Prediction", ECCV, 2022 (NYCU). [Paper][PyTorch]
    • LatentFormer: "LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction", arXiv, 2022 (Huawei). [Paper]
    • PreTR: "PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer", arXiv, 2022 (Stellantis, France). [Paper]
    • Wayformer: "Wayformer: Motion Forecasting via Simple & Efficient Attention Networks", arXiv, 2022 (Waymo). [Paper]
    • LaTTe: "LaTTe: Language Trajectory TransformEr", arXiv, 2022 (TUM). [Paper][Tensorflow]
    • SoMoFormer: "SoMoFormer: Social-Aware Motion Transformer for Multi-Person Motion Prediction", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
    • ViewBirdiformer: "ViewBirdiformer: Learning to recover ground-plane crowd trajectories and ego-motion from a single ego-centric view", arXiv, 2022 (Kyoto University). [Paper]
    • PedFormer: "PedFormer: Pedestrian Behavior Prediction via Cross-Modal Attention Modulation and Gated Multitask Learning", arXiv, 2022 (Huawei). [Paper]
    • TAMFormer: "TAMFormer: Multi-Modal Transformer with Learned Attention Mask for Early Intent Prediction", arXiv, 2022 (University of Padova, Italy). [Paper]
    • QCNet: "Query-Centric Trajectory Prediction", CVPR, 2023 (CUHK). [Paper][Code (in construction)]
    • ViP3D: "ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries", CVPR, 2023 (Tsinghua). [Paper][PyTorch][Website]
    • USST: "Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting", ICCV, 2023 (OPPO). [Paper][PyTorch][Website]
    • JRTransformer: "Joint-Relation Transformer for Multi-Person Motion Prediction", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
    • Forecast-MAE: "Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders", ICCV, 2023 (HKUST). [Paper][PyTorch]
    • MotionLM: "MotionLM: Multi-Agent Motion Forecasting as Language Modeling", ICCV, 2023 (Waymo). [Paper]
    • OccFormer: "OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction", ICCV, 2023 (PhiGent Robotics, China). [Paper][PyTorch]
    • Traj-MAE: "Traj-MAE: Masked Autoencoders for Trajectory Prediction", ICCV, 2023 (CUHK). [Paper]
    • R-Pred: "R-Pred: Two-Stage Motion Prediction Via Tube-Query Attention-Based Trajectory Refinement", ICCV, 2023 (Hanyang University, Korea). [Paper]
    • MacFormer: "MacFormer: Map-Agent Coupled Transformer for Real-time and Robust Trajectory Prediction", RAL, 2023 (HKUST). [Paper]
    • HPTR: "Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding", NeurIPS, 2023 (ETHZ). [Paper][PyTorch]
    • InCrowdFormer: "InCrowdFormer: On-Ground Pedestrian World Model From Egocentric Views", arXiv, 2023 (Kyoto University). [Paper]
    • MTR++: "MTR++: Multi-Agent Motion Prediction with Symmetric Scene Modeling and Guided Intention Querying", arXiv, 2023 (MPI). [Paper]
    • T4P: "T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory", CVPR, 2024 (KAIST). [Paper][Code (in construction)]
    • MoST: "MoST: Multi-modality Scene Tokenization for Motion Prediction", CVPR, 2024 (Waymo). [Paper]
  • Visual Counting:
    • CC-AV: "Audio-Visual Transformer Based Crowd Counting", ICCVW, 2021 (University of Kansas). [Paper]
    • TransCrowd: "TransCrowd: Weakly-Supervised Crowd Counting with Transformer", arXiv, 2021 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • TAM-RTM: "Boosting Crowd Counting with Transformers", arXiv, 2021 (ETHZ). [Paper]
    • CCTrans: "CCTrans: Simplifying and Improving Crowd Counting with Transformer", arXiv, 2021 (Meituan). [Paper]
    • MAN: "Boosting Crowd Counting via Multifaceted Attention", CVPR, 2022 (Xi'an Jiaotong). [Paper][PyTorch]
    • CLTR: "An End-to-End Transformer Model for Crowd Localization", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch][Website]
    • SAANet: "Scene-Adaptive Attention Network for Crowd Counting", arXiv, 2022 (Xi'an Jiaotong). [Paper]
    • JCTNet: "Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting", arXiv, 2022 (Chongqing University). [Paper]
    • CrowdMLP: "CrowdMLP: Weakly-Supervised Crowd Counting via Multi-Granularity MLP", arXiv, 2022 (University of Guelph, Canada). [Paper]
    • CounTR: "CounTR: Transformer-based Generalised Visual Counting", arXiv, 2022 (Shanghai Jiao Tong University). [Paper][Website]
    • CrowdCLIP: "CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model", CVPR, 2023 (Baidu). [Paper][Code (in construction)]
    • PET: "Point-Query Quadtree for Crowd Counting, Localization, and More", ICCV, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • CLIP-Count: "CLIP-Count: Towards Text-Guided Zero-Shot Object Counting", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper][Code (in construction)]
    • ?: "Training-free Object Counting with Prompts", arXiv, 2023 (A*STAR). [Paper][PyTorch]
    • T-Rex: "T-Rex: Counting by Visual Prompting", arXiv, 2023 (IDEA). [Paper][Website]
    • VLCounter: "VLCounter: Text-aware VIsual Representation for Zero-Shot Object Counting", AAAI, 2024 (Sungkyunkwan University, Korea). [Paper][PyTorch]
    • Gramformer: "Gramformer: Learning Crowd Counting via Graph-Modulated Transformer", AAAI, 2024 (Xi'an Jiaotong). [Paper][Code (in construction)]
  • Visual Quality Assessment:
    • TRIQ: "Transformer for Image Quality Assessment", arXiv, 2020 (NORCE). [Paper][Tensorflow-Keras]
    • IQT: "Perceptual Image Quality Assessment with Transformers", CVPRW, 2021 (LG). [Paper][Code (in construction)]
    • MUSIQ: "MUSIQ: Multi-scale Image Quality Transformer", ICCV, 2021 (Google). [Paper]
    • TranSLA: "Saliency-Guided Transformer Network Combined With Local Embedding for No-Reference Image Quality Assessment", ICCVW, 2021 (Hikvision). [Paper]
    • TReS: "No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency", WACV, 2022 (CMU). [Paper]
    • IQA-Conformer: "Conformer and Blind Noisy Students for Improved Image Quality Assessment", CVPRW, 2022 (University of Wurzburg, Germany). [Paper][PyTorch]
    • SwinIQA: "SwinIQA: Learned Swin Distance for Compressed Image Quality Assessment", CVPRW, 2022 (USTC, China). [Paper]
    • DCVQE: "DCVQE: A Hierarchical Transformer for Video Quality Assessment", ACCV, 2022 (Weibo). [Paper]
    • MCAS-IQA: "Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment", arXiv, 2022 (Norwegian Research Centre, Norway). [Paper]
    • MSTRIQ: "MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion", arXiv, 2022 (ByteDance). [Paper]
    • DisCoVQA: "DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment", arXiv, 2022 (NTU, Singapore). [Paper]
    • LIQE: "Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective", CVPR, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
    • MRET: "MRET: Multi-resolution Transformer for Video Quality Assessment", arXiv, 2023 (Google). [Paper]
    • SAM-IQA: "SAM-IQA: Can Segment Anything Boost Image Quality Assessment?", arXiv, 2023 (Megvii). [Paper][Code (in construction)]
    • LoDa: "Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment", arXiv, 2023 (Wuhan University). [Paper]
    • Q-Align: "Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • SAMA: "Scaling and Masking: A New Paradigm of Data Sampling for Image and Video Quality Assessment", AAAI, 2024 (Xidian University). [Paper][PyTorch]
    • Co-Instruct: "Towards Open-ended Visual Quality Comparison", arXiv, 2024 (NTU, Singapore). [Paper][Model]
    • ?: "A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment", arXiv, 2024 (Tsinghua). [Paper][Code (in construction)]
  • Visual Reasoning:
    • SAViR-T: "SAViR-T: Spatially Attentive Visual Reasoning with Transformers", arXiv, 2022 (Rutgers University). [Paper]
  • Wide-angle lenses:
    • DarSwin: "DarSwin: Distortion Aware Radial Swin Transformer", ICCV, 2023 (Laval University, Canada). [Paper][PyTorch][Website]
  • 3D Human Texture Estimation:
    • Texformer: "3D Human Texture Estimation from a Single Image with Transformers", ICCV, 2021 (NTU, Singapore). [Paper][PyTorch][Website]
  • 3D Motion Synthesis:
    • ACTOR: "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE", ICCV, 2021 (Univ Gustave Eiffel). [Paper][PyTorch][Website]
    • RTVAE: "Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis", CVPRW, 2022 (Amazon). [Paper]
    • MotionCLIP: "MotionCLIP: Exposing Human Motion Generation to CLIP Space", ECCV, 2022 (Tel Aviv). [Paper]
    • CLIP-Actor: "CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes", ECCV, 2022 (POSTECH). [Paper][PyTorch][Website]
    • PoseGPT: "PoseGPT: Quantization-based 3D Human Motion Generation and Forecasting", ECCV, 2022 (NAVER). [Paper]
    • TEMOS: "TEMOS: Generating diverse human motions from textual descriptions", ECCV, 2022 (MPI). [Paper][PyTorch][Website]
    • TM2T: "TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts", ECCV, 2022 (University of Alberta, Canada). [Paper][PyTorch][Website]
    • HUMANISE: "HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes", NeurIPS, 2022 (Beijing Institute of Technology). [Paper][GitHub][Website]
    • ?: "Diverse Dance Synthesis via Keyframes with Transformer Controllers", arXiv, 2022 (Beihang University). [Paper]
    • MARIONET: "NEURAL MARIONETTE: A Transformer-based Multi-action Human Motion Synthesis System", arXiv, 2022 (Wuhan University). [Paper]
    • Action-GPT: "Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot Action Generation", arXiv, 2022 (IIIT Hyderabad). [Paper][Website]
    • MDM: "Human Motion Diffusion Model", ICLR, 2023 (Tel Aviv University). [Paper][PyTorch][Website]
    • POTTER: "POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery", CVPR, 2023 (OPPO). [Paper][PyTorch][Website]
    • Optimus: "Transformer-Based Learned Optimization", CVPR, 2023 (Google). [Paper]
    • CITL: "Continuous Intermediate Token Learning with Implicit Motion Manifold for Keyframe Based Motion Interpolation", CVPR, 2023 (The University of Sydney). [Paper][PyTorch]
    • OOHMG: "Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation with Wordless Training", CVPR, 2023 (Sun Yat-Sen University). [Paper][Code (in construction)]
    • AttT2M: "AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism", ICCV, 2023 (CAS). [Paper][PyTorch]
    • ActFormer: "ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation", ICCV, 2023 (SenseTime). [Paper]
    • AvatarJLM: "Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling", ICCV, 2023 (ByteDance). [Paper][PyTorch][Website]
    • Fg-T2M: "Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model", ICCV, 2023 (Beihang). [Paper]
    • TMR: "TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis", ICCV, 2023 (Gustave Eiffel University). [Paper][PyTorch][Website]
    • Make-An-Animation: "Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation", ICCV, 2023 (Meta). [Paper]
    • ATOM: "Language-guided Human Motion Synthesis with Atomic Actions", ACMMM, 2023 (University at Buffalo). [Paper][Code (in construction)]
    • MotionGPT: "MotionGPT: Human Motion as a Foreign Language", NeurIPS, 2023 (Fudan). [Paper][PyTorch][Website]
    • FineMoGen: "FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • DDT: "DDT: A Diffusion-Driven Transformer-based Framework for Human Mesh Recovery from a Video", arXiv, 2023 (OPPO). [Paper]
    • MotionGPT: "MotionGPT: Finetuned LLMs are General-Purpose Motion Generators", arXiv, 2023 (USTC). [Paper][PyTorch (in construction)][Website]
    • UNIMASK-M: "A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis", arXiv, 2023 (Technische Universitat Wien (TUWien), Austria). [Paper][Website]
    • MMM: "MMM: Generative Masked Motion Model", arXiv, 2023 (UNC). [Paper][Code (in construction)][Website]
    • HOI-Diff: "HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models", arXiv, 2023 (Northeastern). [Paper][Website]
    • OMG: "OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers", arXiv, 2023 (ShanghaiTech). [Paper]
    • LEMON: "LEMON: Learning 3D Human-Object Interaction Relation from 2D Images", arXiv, 2023 (USTC). [Paper][Code (in construction)][Website]
    • MoST: "MoST: Motion Style Transformer between Diverse Action Contents", CVPR, 2024 (Korea Electronics Technology Institute). [Paper][Code (in construction)]
    • AMDM: "Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance", CVPR, 2024 (BIGAI). [Paper][Code (in construction)][Website]
    • ?: "Generating Human Motion in 3D Scenes from Text Descriptions", CVPR, 2024 (Zhejiang). [Paper][Website]
    • RoHM: "RoHM: Robust Human Motion Reconstruction via Diffusion", arXiv, 2024 (Meta). [Paper][Code (in construction)][Website]
    • STMC: "Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation", arXiv, 2024 (NVIDIA). [Paper][Website]
    • Motion-Mamba: "Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM", arXiv, 2024 (Monash). [Paper][Code (in construction)][Website]
  • 3D Object Recognition:
    • MVT: "MVT: Multi-view Vision Transformer for 3D Object Recognition", BMVC, 2021 (Baidu). [Paper]
  • 3D Reconstruction:
    • PlaneTR: "PlaneTR: Structure-Guided Transformers for 3D Plane Recovery", ICCV, 2021 (Wuhan University). [Paper][PyTorch]
    • CO3D: "CommonObjects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction", ICCV, 2021 (Facebook). [Paper][PyTorch]
    • VolT: "Multi-view 3D Reconstruction with Transformer", ICCV, 2021 (University of British Columbia). [Paper]
    • 3D-RETR: "3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers", BMVC, 2021 (ETHZ). [Paper][PyTorch]
    • TransformerFusion: "TransformerFusion: Monocular RGB Scene Reconstruction using Transformers", NeurIPS, 2021 (TUM). [Paper][Website]
    • LegoFormer: "LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction", arXiv, 2021 (TUM + Google). [Paper]
    • PlaneFormers: "PlaneFormers: From Sparse View Planes to 3D Reconstruction", ECCV, 2022 (UMich). [Paper][PyTorch][Website]
    • 3D-C2FT: "3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction", arXiv, 2022 (Korea Institute of Science and Technology). [Paper]
    • SDF-Former: "Monocular Scene Reconstruction with 3D SDF Transformers", ICLR, 2023 (Alibaba). [Paper][Website]
    • AMVUR: "A Probabilistic Attention Model with Occlusion-aware Texture Regression for 3D Hand Reconstruction from a Single RGB Image", CVPR, 2023 (Lancaster University, UK). [Paper]
    • LIST: "LIST: Learning Implicitly from Spatial Transformers for Single-View 3D Reconstruction", ICCV, 2023 (UT Arlington). [Paper]
    • LRGT: "Long-Range Grouping Transformer for Multi-View 3D Reconstruction", ICCV, 2023 (Macau University of Science and Technology). [Paper][PyTorch (in construction)]
    • Spectral-Graphormer: "Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images", ICCV, 2023 (Google). [Paper]
    • UMIFormer: "UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction", ICCV, 2023 (Macau University of Science and Technology). [Paper][PyTorch]
    • PlaneRecTR: "PlaneRecTR: Unified Query Learning for 3D Plane Recovery from a Single View", ICCV, 2023 (National University of Defense Technology, China). [Paper][PyTorch]
    • HaMeR: "Reconstructing Hands in 3D with Transformers", arXiv, 2023 (Berkeley). [Paper][PyTorch][Website]
    • KYN: "Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning", CVPR, 2024 (ETHZ). [Paper][PyTorch][Website]
    • MCC-HO: "Reconstructing Hand-Held Objects in 3D", arXiv, 2024 (Berkeley). [Paper]
  • 360 Scene:
    • ?: "Improving 360 Monocular Depth Estimation via Non-local Dense Prediction Transformer and Joint Supervised and Self-supervised Learning", AAAI, 2022 (Seoul National University). [Paper][PyTorch]
    • PAVER: "Panoramic Vision Transformer for Saliency Detection in 360° Videos", ECCV, 2022 (Seoul National University). [Paper]
    • PanoFormer: "PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation", ECCV, 2022 (Beijing Jiaotong University). [Paper]
    • CoVisPose: "CoVisPose: Co-Visibility Pose Transformer for Wide-Baseline Relative Pose Estimation in 360° Indoor Panoramas", ECCV, 2022 (Zillow). [Paper]
    • SPH: "Spherical Transformer", arXiv, 2022 (Chung-Ang University, Korea). [Paper]
    • PanoSwin: "PanoSwin: a Pano-style Swin Transformer for Panorama Understanding", CVPR, 2023 (Fudan). [Paper][PyTorch]
    • SalViT360: "Spherical Vision Transformer for 360-degree Video Saliency Prediction", BMVC, 2023 (Koc University, Turkey). [Paper]
    • PanoContext-Former: "PanoContext-Former: Panoramic Total Scene Understanding with a Transformer", arXiv, 2023 (Alibaba). [Paper]
  • Others:
    • ?: "Connecting Compression Spaces with Transformer for Approximate Nearest Neighbor Search", ECCV, 2022 (Intellifusion, China). [Paper]
    • ?: "Strong Gravitational Lensing Parameter Estimation with Vision Transformer", ECCVW, 2022 (CMU). [Paper][PyTorch]
    • Transformer-DR: "Transformer-based dimensionality reduction", arXiv, 2022 (Chongqing Normal University, China). [Paper]
    • ?: "mm-Wave Radar Hand Shape Classification Using Deformable Transformers", arXiv, 2022 (Intel). [Paper]
    • ?: "Fully-attentive and interpretable: vision and video vision transformers for pain detection", NeurIPSW, 2022 (Utrecht University, Netherlands). [Paper][Code (in construction)]
    • CQFormer: "Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
    • CircuitFormer: "Circuit as Set of Points", NeurIPS, 2023 (Horizon Robotics). [Paper][PyTorch]
    • SleepVST: "SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers", CVPR, 2024 (Oxford). [Paper]

[Back to Overview]


Attention Mechanisms in Vision/NLP

Attention for Vision

  • AA: "Attention Augmented Convolutional Networks", ICCV, 2019 (Google). [Paper][PyTorch (Unofficial)][Tensorflow (Unofficial)]
  • LR-Net: "Local Relation Networks for Image Recognition", ICCV, 2019 (Microsoft). [Paper][PyTorch (Unofficial)]
  • CCNet: "CCNet: Criss-Cross Attention for Semantic Segmentation", ICCV, 2019 (& TPAMI 2020) (Horizon). [Paper][PyTorch]
  • GCNet: "Global Context Networks", ICCVW, 2019 (& TPAMI 2020) (Microsoft). [Paper][PyTorch]
  • SASA: "Stand-Alone Self-Attention in Vision Models", NeurIPS, 2019 (Google). [Paper][PyTorch-1 (Unofficial)][PyTorch-2 (Unofficial)]
    • key message: the attention module is more efficient than convolution and provides comparable accuracy
  • Axial-Transformer: "Axial Attention in Multidimensional Transformers", arXiv, 2019 (Google). [Paper][PyTorch (Unofficial)] (see the axial-attention sketch after this list)
  • Attention-CNN: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (EPFL). [Paper][PyTorch][Website]
  • SAN: "Exploring Self-attention for Image Recognition", CVPR, 2020 (CUHK + Intel). [Paper][PyTorch]
  • BA-Transform: "Non-Local Neural Networks With Grouped Bilinear Attentional Transforms", CVPR, 2020 (ByteDance). [Paper]
  • Axial-DeepLab: "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation", ECCV, 2020 (Google). [Paper][PyTorch]
  • GSA: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 (Google). [Paper][PyTorch (Unofficial)]
  • EA: "Efficient Attention: Attention with Linear Complexities", WACV, 2021 (SenseTime). [Paper][PyTorch]
  • LambdaNetworks: "LambdaNetworks: Modeling long-range Interactions without Attention", ICLR, 2021 (Google). [Paper][PyTorch-1 (Unofficial)][PyTorch-2 (Unofficial)]
  • GSA-Nets: "Group Equivariant Stand-Alone Self-Attention For Vision", ICLR, 2021 (EPFL). [Paper]
  • Hamburger: "Is Attention Better Than Matrix Decomposition?", ICLR, 2021 (Peking). [Paper][PyTorch (Unofficial)]
  • HaloNet: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 (Google). [Paper]
  • BoTNet: "Bottleneck Transformers for Visual Recognition", CVPR, 2021 (Google). [Paper]
  • SSAN: "SSAN: Separable Self-Attention Network for Video Representation Learning", CVPR, 2021 (Microsoft). [Paper]
  • CoTNet: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 (JD). [Paper][PyTorch]
  • Involution: "Involution: Inverting the Inherence of Convolution for Visual Recognition", CVPR, 2021 (HKUST). [Paper][PyTorch]
  • Perceiver: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (DeepMind). [Paper][PyTorch (lucidrains)]
  • SNL: "Unifying Nonlocal Blocks for Neural Networks", ICCV, 2021 (Peking + Bytedance). [Paper]
  • External-Attention: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (Tsinghua). [Paper]
  • Container: "Container: Context Aggregation Network", arXiv, 2021 (AI2). [Paper]
  • X-volution: "X-volution: On the unification of convolution and self-attention", arXiv, 2021 (Huawei Hisilicon). [Paper]
  • Invertible-Attention: "Invertible Attention", arXiv, 2021 (ANU). [Paper]
  • VOLO: "VOLO: Vision Outlooker for Visual Recognition", arXiv, 2021 (Sea AI Lab + NUS, Singapore). [Paper][PyTorch]
  • LESA: "Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms", arXiv, 2021 (Johns Hopkins). [Paper]
  • PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
  • QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][PyTorch]
  • QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][JAX]
  • ?: "Fair Comparison between Efficient Attentions", CVPRW, 2022 (Kyungpook National University, Korea). [Paper][PyTorch]
  • KVT: "KVT: k-NN Attention for Boosting Vision Transformers", ECCV, 2022 (Alibaba). [Paper][PyTorch]
  • Hydra: "Hydra Attention: Efficient Attention with Many Heads", ECCVW, 2022 (Meta). [Paper]
  • HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]
  • AttendNeXt: "Faster Attention Is What You Need: A Fast Self-Attention Neural Network Backbone Architecture for the Edge via Double-Condensing Attention Condensers", arXiv, 2022 (University of Waterloo, Canada). [Paper]
  • Token-Mixing-Adaptive-FNO: "Efficient Token Mixing for Transformers via Adaptive Fourier Neural Operators", ICLR, 2022 (NVIDIA + Caltech + Stanford). [Paper][PyTorch]
  • KV-Transformer: "Key-Value Transformer", arXiv, 2023 (Quintic AI). [Paper]
  • NATTEN: "Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level", arXiv, 2024 (UIUC). [Paper][PyTorch][Website]
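
To make the factorized-attention idea above (Axial-Transformer, Axial-DeepLab) concrete, here is a minimal, self-contained PyTorch sketch; it is an illustration under my own naming, not code from any listed repository. Full 2D self-attention on an HxW feature map costs O((HW)^2); attending along the height axis and then the width axis reduces this to O(HW(H+W)).

```python
import torch
import torch.nn.functional as F

def axial_attention(x, wq, wk, wv, axis):
    """Single-head self-attention along one spatial axis of a (B, H, W, C) map."""
    q, k, v = x @ wq, x @ wk, x @ wv
    if axis == 1:  # height pass: fold the width dimension into the batch
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # -> (B, W, H, C)
    scores = torch.einsum('bnic,bnjc->bnij', q, k) / q.shape[-1] ** 0.5
    out = torch.einsum('bnij,bnjc->bnic', F.softmax(scores, dim=-1), v)
    return out.transpose(1, 2) if axis == 1 else out      # back to (B, H, W, C)

x = torch.randn(2, 16, 16, 64)                            # (B, H, W, C) feature map
wq, wk, wv = (0.02 * torch.randn(64, 64) for _ in range(3))
y = axial_attention(axial_attention(x, wq, wk, wv, axis=1), wq, wk, wv, axis=2)
print(y.shape)                                            # torch.Size([2, 16, 16, 64])
```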

[Back to Overview]

Attention for NLP

  • T-DMCA: "Generating Wikipedia by Summarizing Long Sequences", ICLR, 2018 (Google). [Paper]
  • LSRA: "Lite Transformer with Long-Short Range Attention", ICLR, 2020 (MIT). [Paper][PyTorch]
  • ETC: "ETC: Encoding Long and Structured Inputs in Transformers", EMNLP, 2020 (Google). [Paper][Tensorflow]
  • BlockBERT: "Blockwise Self-Attention for Long Document Understanding", EMNLP Findings, 2020 (Facebook). [Paper][GitHub]
  • Clustered-Attention: "Fast Transformers with Clustered Attention", NeurIPS, 2020 (Idiap). [Paper][PyTorch][Website]
  • BigBird: "Big Bird: Transformers for Longer Sequences", NeurIPS, 2020 (Google). [Paper][Tensorflow]
  • Longformer: "Longformer: The Long-Document Transformer", arXiv, 2020 (AI2). [Paper][PyTorch] (see the sliding-window sketch after this list)
  • Linformer: "Linformer: Self-Attention with Linear Complexity", arXiv, 2020 (Facebook). [Paper][PyTorch (Unofficial)]
  • Nystromformer: "Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention", AAAI, 2021 (UW-Madison). [Paper][PyTorch]
  • RFA: "Random Feature Attention", ICLR, 2021 (DeepMind). [Paper]
  • Performer: "Rethinking Attention with Performers", ICLR, 2021 (Google). [Paper][Code][Blog]
  • DeLight: "DeLighT: Deep and Light-weight Transformer", ICLR, 2021 (UW). [Paper]
  • Synthesizer: "Synthesizer: Rethinking Self-Attention for Transformer Models", ICML, 2021 (Google). [Paper][Tensorflow][PyTorch (leaderj1001)]
  • Poolingformer: "Poolingformer: Long Document Modeling with Pooling Attention", ICML, 2021 (Microsoft). [Paper]
  • Hi-Transformer: "Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling", ACL, 2021 (Tsinghua). [Paper]
  • Smart-Bird: "Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer", arXiv, 2021 (Tsinghua). [Paper]
  • Fastformer: "Fastformer: Additive Attention is All You Need", arXiv, 2021 (Tsinghua). [Paper]
  • ∞-former: "∞-former: Infinite Memory Transformer", arXiv, 2021 (Instituto de Telecomunicações, Portugal). [Paper]
  • cosFormer: "cosFormer: Rethinking Softmax In Attention", ICLR, 2022 (SenseTime). [Paper][PyTorch (davidsvy)]
  • MGK: "Improving Transformers with Probabilistic Attention Keys", ICML, 2022 (UCLA). [Paper]
  • FNet: "FNet: Mixing Tokens with Fourier Transforms", NAACL, 2022 (Google). [Paper][JAX]
  • RetNet: "Retentive Network: A Successor to Transformer for Large Language Models", arXiv, 2023 (Microsoft). [Paper][PyTorch (in construction)]
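
Several of the long-document models above (Longformer, BigBird, ETC) rely on a sliding-window pattern in which each token attends only to neighbors within a fixed radius, so cost grows linearly with sequence length. The sketch below is illustrative only: names and shapes are my own, and for clarity it still materializes the dense score matrix, which real implementations avoid by computing only the band.

```python
import torch

def sliding_window_mask(n, window):
    """Boolean (n, n) band: token i may attend to token j iff |i - j| <= window."""
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= window

def banded_attention(q, k, v, window):
    # q, k, v: (B, N, C). Efficient kernels compute only the O(N * window)
    # band; this toy version builds the full (N, N) matrix and masks it.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~sliding_window_mask(q.shape[1], window), float('-inf'))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 128, 32)
print(banded_attention(q, k, v, window=8).shape)  # torch.Size([1, 128, 32])
```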

[Back to Overview]

Attention for Both

  • Sparse-Transformer: "Generating Long Sequences with Sparse Transformers", arXiv, 2019 (OpenAI). [Paper][Tensorflow][Blog]
  • Reformer: "Reformer: The Efficient Transformer", ICLR, 2020 (Google). [Paper][Tensorflow][Blog]
  • Sinkhorn-Transformer: "Sparse Sinkhorn Attention", ICML, 2020 (Google). [Paper][PyTorch (Unofficial)]
  • Linear-Transformer: "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", ICML, 2020 (Idiap). [Paper][PyTorch][Website] (see the linear-attention sketch after this list)
  • SMYRF: "SMYRF: Efficient Attention using Asymmetric Clustering", NeurIPS, 2020 (UT Austin + Google). [Paper][PyTorch]
  • Routing-Transformer: "Efficient Content-Based Sparse Attention with Routing Transformers", TACL, 2021 (Google). [Paper][Tensorflow][PyTorch (Unofficial)][Slides]
  • LRA: "Long Range Arena: A Benchmark for Efficient Transformers", ICLR, 2021 (Google). [Paper][Tensorflow]
  • OmniNet: "OmniNet: Omnidirectional Representations from Transformers", ICML, 2021 (Google). [Paper]
  • Evolving-Attention: "Evolving Attention with Residual Convolutions", ICML, 2021 (Peking + Microsoft). [Paper]
  • H-Transformer-1D: "H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences", ACL, 2021 (Google). [Paper]
  • Combiner: "Combiner: Full Attention Transformer with Sparse Computation Cost", NeurIPS, 2021 (Google). [Paper]
  • Centroid-Transformer: "Centroid Transformers: Learning to Abstract with Attention", arXiv, 2021 (UT Austin). [Paper]
  • AFT: "An Attention Free Transformer", arXiv, 2021 (Apple). [Paper]
  • Luna: "Luna: Linear Unified Nested Attention", arXiv, 2021 (USC + CMU + Facebook). [Paper]
  • Transformer-LS: "Long-Short Transformer: Efficient Transformers for Language and Vision", arXiv, 2021 (NVIDIA). [Paper]
  • PoNet: "PoNet: Pooling Network for Efficient Token Mixing in Long Sequences", ICLR, 2022 (Alibaba). [Paper]
  • Paramixer: "Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention", CVPR, 2022 (Norwegian University of Science and Technology). [Paper]
  • FNet: "FNet: Mixing Tokens with Fourier Transforms", NAACL, 2022 (Google). [Paper][JAX]
  • ContextPool: "Efficient Representation Learning via Adaptive Context Pooling", ICML, 2022 (Apple). [Paper]
  • LARA: "Linear Complexity Randomized Self-attention Mechanism", ICML, 2022 (Bytedance). [Paper]
  • Flowformer: "Flowformer: Linearizing Transformers with Conservation Flows", ICML, 2022 (Tsinghua University). [Paper][PyTorch]
  • MRA: "Multi Resolution Analysis (MRA) for Approximate Self-Attention", ICML, 2022 (University of Wisconsin, Madison). [Paper][PyTorch]
  • EcoFormer: "EcoFormer: Energy-Saving Attention with Linear Complexity", NeurIPS, 2022 (Monash University). [Paper][PyTorch]
  • SBM-Transformer: "Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost", NeurIPS, 2022 (LG). [Paper][PyTorch]
  • ?: "Horizontal and Vertical Attention in Transformers", arXiv, 2022 (University of Technology Sydney). [Paper]
  • MRL: "MRL: Learning to Mix with Attention and Convolutions", arXiv, 2022 (Sony). [Paper]
  • RSA: "Encoding Recurrence into Transformers", ICLR, 2023 (HKU). [Paper]
  • EVA: "Efficient Attention via Control Variates", ICLR, 2023 (HKU). [Paper]
  • STTABT: "Sparse Token Transformer with Attention Back Tracking", ICLR, 2023 (KAIST). [Paper]
  • Mega: "Mega: Moving Average Equipped Gated Attention", ICLR, 2023 (Meta). [Paper][PyTorch]
  • SeTformer: "SeTformer is What You Need for Vision and Language", AAAI, 2024 (East China Normal University). [Paper]
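
A recurring theme in this list is linearizing attention by replacing the softmax kernel with a feature map. Below is a minimal non-causal sketch of the elu(x) + 1 formulation from the Linear-Transformer entry above ("Transformers are RNNs"): attention becomes phi(Q)(phi(K)^T V) normalized by phi(Q)(phi(K)^T 1), which is linear in sequence length. The function name is illustrative, and multi-head and causal details are omitted; in the causal case the same sums become prefix sums, which is what makes the model equivalent to an RNN at inference time.

```python
# Minimal sketch of kernelized linear attention with phi(x) = elu(x) + 1.
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, tokens, dim)
    q = F.elu(q) + 1  # positive feature map phi
    k = F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)  # phi(K)^T V: (B, d, e), computed once
    # Normalizer phi(Q) (phi(K)^T 1), i.e. row sums of the implicit attention matrix
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)


q = k = v = torch.randn(2, 1024, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 1024, 64])
```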

[Back to Overview]

Attention for Others

  • Informer: "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting", AAAI, 2021 (Beihang University). [Paper][PyTorch]
  • Attention-Rank-Collapse: "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth", ICML, 2021 (Google + EPFL). [Paper][PyTorch] (see the toy demo after this list)
  • NPT: "Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning", arXiv, 2021 (Oxford). [Paper]
  • FEDformer: "FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting", ICML, 2022 (Alibaba). [Paper][PyTorch]
  • ?: "Generalizable Memory-driven Transformer for Multivariate Long Sequence Time-series Forecasting", arXiv, 2022 (University of Technology Sydney). [Paper]
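
The Attention-Rank-Collapse entry above proves that a pure self-attention network (no skip connections or MLPs) loses rank doubly exponentially with depth. Below is a toy numeric demonstration of the effect, illustrative only and not the paper's construction: the printed residual should fall toward zero within a few layers as all token rows collapse toward a common point.

```python
# Toy demo: stacking pure self-attention drives token representations
# toward a rank-1 matrix (all rows identical).
import torch

torch.manual_seed(0)
x = torch.randn(32, 64)  # 32 tokens, 64 dims

for layer in range(12):
    attn = torch.softmax(x @ x.T / 64 ** 0.5, dim=-1)  # row-stochastic attention
    x = attn @ x                                       # pure attention update
    # Frobenius distance to the rank-1 matrix whose rows are all the mean row
    residual = torch.linalg.matrix_norm(x - x.mean(dim=0, keepdim=True))
    print(f"layer {layer + 1:2d}: residual to rank-1 = {residual.item():.2e}")
```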

[Back to Overview]