For a categorized collection of survey papers from past years, see here ↘️ CV-Surveys (under construction)
- EventPS: Real-Time Photometric Stereo Using an Event Camera
- pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
- Mip-Splatting: Alias-free 3D Gaussian Splatting
⭐code
🏠project - BioCLIP: A Vision Foundation Model for the Tree of Life
⭐code
- SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency
- Image Processing GNN: Breaking Rigidity in Super-Resolution
- Objects as Volumes: A Stochastic Geometry View of Opaque Solids
- Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
- UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
⭐code - GPT4Point: A Unified Framework for Point-Language Understanding and Generation
- AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond
- Sharingan: A Transformer Architecture for Multi-Person Gaze Following (gaze following)
- From Feature to Gaze: A Generalizable Replacement of Linear Layer for Gaze Estimation
- What Sketch Explainability Really Means for Downstream Tasks
- SketchINR: A First Look into Sketches as Implicit Neural Representations
- Open Vocabulary Semantic Scene Sketch Understanding (sketch understanding)
- CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention
- MoDE: CLIP Data Experts via Clustering (clustering)
- Fine-Grained Bipartite Concept Factorization for Clustering
- Multi-view clustering
- Investigating and Mitigating the Side Effects of Noisy Views for Self-Supervised Clustering Algorithms in Practical Multi-View Scenarios
- Learn from View Correlation: An Anchor Enhancement Strategy for Multi-view Clustering
- Differentiable Information Bottleneck for Deterministic Multi-view Clustering
- Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
⭐code
- ScanFormer: Referring Expression Comprehension by Iteratively Scanning
- Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
⭐code (zero-shot referring expression comprehension) - Revisiting Counterfactual Problems in Referring Expression Comprehension
- Dexterous Grasp Transformer
- Mean-Shift Feature Transformer
- MLP Can Be A Good Transformer Learner
⭐code - Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers
- Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers
- Dual-scale Transformer for Large-scale Single-Pixel Imaging
- DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets
- Towards Understanding and Improving Adversarial Robustness of Vision Transformers
- RMT: Retentive Networks Meet Vision Transformers
⭐code - You Only Need Less Attention at Each Stage in Vision Transformers
- MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers
🏠project - Instance-Aware Group Quantization for Vision Transformers
🏠project - Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers
⭐code - RepViT: Revisiting Mobile CNN From ViT Perspective
⭐code - Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer
- Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers
👍summary - Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
- On the Faithfulness of Vision Transformer Explanations
- Learning Correlation Structures for Vision Transformers
- Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach
⭐code - Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression
- Point Transformer V3: Simpler, Faster, Stronger
⭐code - A General and Efficient Training for Transformer via Token Expansion
⭐code - HEAL-SWIN: A Vision Transformer On The Sphere
⭐code - SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
- TransNeXt: Robust Foveal Visual Perception for Vision Transformers
⭐code - Making Vision Transformers Truly Shift-Equivariant
- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
⭐code - Random Entangled Tokens for Adversarially Robust Vision Transformer
- Time-Efficient Light-Field Acquisition Using Coded Aperture and Events
🏠project - Continuous Pose for Monocular Cameras in Neural Implicit Representation
⭐code - PanoPose: Self-supervised Relative Pose Estimation for Panoramic Images
🏠project - Unbiased Estimator for Distorted Conics in Camera Calibration
- Camera pose
- Snapshot compressive imaging
- ManiFPT: Defining and Analyzing Fingerprints of Generative Models
- Flexible Biometrics Recognition: Bridging the Multimodality Gap through Attention Alignment and Prompt Tuning (biometrics)
- Person identification
- Z*: Zero-shot Style Transfer via Attention Reweighting
- MoST: Motion Style Transformer Between Diverse Action Contents
⭐code - ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation
⭐code
🏠project - Arbitrary Motion Style Transfer with Multi-condition Motion Latent Diffusion Model
- Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
🏠project - Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network
👍Balancing efficiency and quality: NUAA proposes Puff-Net, a new style transfer algorithm - Zero-shot text-driven motion transfer
- Test-Time Linear Out-of-Distribution Detection
- Segment Every Out-of-Distribution Object
- Label-Efficient Group Robustness via Out-of-Distribution Concept Curation
- Enhancing the Power of OOD Detection via Sample-Aware Model Selection (OOD)
- Discriminability-Driven Channel Selection for Out-of-Distribution Detection
- CORES: Convolutional Response-based Score for Out-of-distribution Detection
- Learning Transferable Negative Prompts for Out-of-Distribution Detection
⭐code - A Noisy Elephant in the Room: Is Your Out-of-Distribution Detector Robust to Label Noise?
⭐code - Improving Out-of-Distribution Generalization in Graphs via Hierarchical Semantic Environments
- Anomaly detection
- Datasets
- Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset
- 4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations
- DiLiGenRT: A Photometric Stereo Dataset with Quantified Roughness and Translucency
- MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation
- LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs
- 360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries
- Towards Automatic Power Battery Detection: New Challenge, Benchmark Dataset and Baseline
- MSU-4S - The Michigan State University Four Seasons Dataset
- DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields
- Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline
- LiDAR-Net: A Real-scanned 3D Point Cloud Dataset for Indoor Scenes
- Advancing Saliency Ranking with Human Fixations: Dataset, Models and Benchmarks
- MAGICK: A Large-scale Captioned Dataset from Matting Generated Images using Chroma Keying
- HardMo: A Large-Scale Hardcase Dataset for Motion Capture
- The STVchrono Dataset: Towards Continuous Change Recognition in Time
- Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding
- LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising
- On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm
- Towards Modern Image Manipulation Localization: A Large-Scale Dataset and Novel Methods
- Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation
- FineSports: A Multi-person Hierarchical Sports Video Dataset for Fine-grained Action Understanding (fine-grained action understanding)
- MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
🏠project - Traffic Scene Parsing through the TSP6K Dataset
- Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset
- RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos
🏠project
🌻dataset (RGB-D object dataset) - eTraM: Event-based Traffic Monitoring Dataset
⭐code
🏠project (traffic monitoring dataset) - Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network
🌻dataset - JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups
🏠project - TULIP: Multi-camera 3D Precision Assessment of Parkinson's Disease
- JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments
- OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion
🏠project - SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos
- RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method
🏠project
👍summary - MatSynth: A Modern PBR Materials Dataset
🏠project - RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception
⭐code - Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection
- EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
⭐code - MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception
🌻dataset - HouseCat6D -- A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios
- HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative
🌻dataset - DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision
🌻dataset - EFHQ: Multi-purpose ExtremePose-Face-HQ dataset
⭐code
🏠project (dataset) - LUWA Dataset: Learning Lithic Use-Wear Analysis on Microscopic Images
- MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors
⭐code - FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions
🏠project - TUMTraf V2X Cooperative Perception Dataset
🏠project - MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures
🌻dataset
- Benchmarks
- When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach
- THRONE: A Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models
- M3-UDA: A New Benchmark for Unsupervised Domain Adaptive Fetal Cardiac Structure Detection
⭐code - DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos
- SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
- MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding
- RoDLA: Benchmarking the Robustness of Document Layout Analysis Models
⭐code - GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
- Advancing Saliency Ranking with Human Fixations: Dataset, Models and Benchmarks
- ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
⭐code - UVEB: A Large-scale Benchmark and Baseline Towards Real-World Underwater Video Enhancement
- PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling
🏠project - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
⭐code - Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly
- VBench: Comprehensive Benchmark Suite for Video Generative Models
⭐code
🏠project - MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark
- CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs
🏠project - How to Train Neural Field Representations: A Comprehensive Study and Benchmark
- OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
- Weakly supervised learning
- Partial-label learning
- Semi-supervised learning
- Targeted Representation Alignment for Open-World Semi-Supervised Learning
- SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder
- CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning
- BEM: Balanced and Entropy-based Mix for Long-Tailed Semi-Supervised Learning
- Positive-unlabeled learning
- Positive-Unlabeled Learning by Latent Group-Aware Meta Disambiguation (positive-unlabeled learning is an important branch of semi-supervised learning)
- Self-supervised learning
- Self-supervised Representation Learning from Arbitrary Scenarios
- Self-supervised Debiasing Using Low Rank Regularization
- Self-Supervised Dual Contouring
- Neural Modes: Self-supervised Learning of Nonlinear Modal Subspaces
- SD2Event: Self-supervised Learning of Dynamic Detectors and Contextual Descriptors for Event Cameras
- An Asymmetric Augmented Self-Supervised Learning Method for Unsupervised Fine-Grained Image Hashing
- CNC-Net: Self-Supervised Learning for CNC Machining Operations
- Unsupervised learning
- Efficient Multitask Dense Predictor via Binarization (dense prediction)
- Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models
- Exploiting Diffusion Prior for Generalizable Dense Prediction
🏠project - ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
⭐code
👍Baidu proposes ViT-CoMer, a new vision backbone that sets a new SOTA on dense prediction tasks - Multi-Task Dense Prediction via Mixture of Low-Rank Experts
⭐code
- Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection
⭐code - Anomaly detection
- Supervised Anomaly Detection for Complex Industrial Images
- Prompt-enhanced Multiple Instance Learning for Weakly Supervised Anomaly Detection (weakly supervised anomaly detection)
- Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping
- Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation
- Long-Tailed Anomaly Detection with Learnable Class Names
🏠project - RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection
⭐code - Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts
⭐code - PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection
⭐code
- Film removal
- Benchmarks/datasets
- Towards Accurate and Robust Architectures via Neural Architecture Search
- Boosting Order-Preserving and Transferability for Neural Architecture Search: a Joint Architecture Refined Search and Fine-tuning Approach
- Building Optimal Neural Architectures using Interpretable Knowledge
⭐code - AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search
- SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model
- Insights from the Use of Previously Unseen Neural Architecture Search Datasets
- FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer
⭐code
- Equivariant Multi-Modality Image Fusion (image fusion)
- Task-Customized Mixture of Adapters for General Image Fusion
⭐code - Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion
⭐code - Revisiting Spatial-Frequency Information Integration from a Hierarchical Perspective for Panchromatic and Multi-Spectral Image Fusion
- Neural Spline Fields for Burst Image Fusion and Layer Separation
🏠project - Infrared and visible image fusion
- Language-only Training of Zero-shot Composed Image Retrieval
⭐code - Evaluating Transferability in Retrieval Tasks: An Approach Using MMD and Kernel Methods
- Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
- On Train-Test Class Overlap and Detection for Image Retrieval
⭐code - D3still: Decoupled Differential Distillation for Asymmetric Image Retrieval
- Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection
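- Cross-domain retrieval
- Video retrieval
- Cross-modal retrieval
- Text-video retrieval
- Image-text retrieval
- Video-text retrieval
- Composed image retrieval
- Fine-grained image retrieval
- Sketch-based retrieval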
- GNN
- Domain Separation Graph Neural Networks for Saliency Object Ranking
- GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs
- FC-GNN: Recovering Reliable and Accurate Correspondences from Interferences
- DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching
- GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds (graph generative network)
- GCN
- Leveraging Predicate and Triplet Learning for Scene Graph Generation
- OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
- CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning
- Multi-Level Neural Scene Graphs for Dynamic Urban Environments
⭐code - HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
⭐code
⭐code - DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation
⭐code
🏠project - From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
- EGTR: Extracting Graph from Transformer for Scene Graph Generation
⭐code - LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
- Programmable Motion Generation for Open-Set Motion Control Tasks
- Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
- AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents
- Towards Variable and Coordinated Holistic Co-Speech Motion Generation
⭐code - Generating Human Motion in 3D Scenes from Text Descriptions
- NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
🏠project (human motion synthesis) - OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers
🏠project - WANDR: Intention-guided Human Motion Generation
📺video - MAS: Multi-view Ancestral Sampling for 3D Motion Generation Using 2D Diffusion
🏠project - Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
- Multimodal Sense-Informed Forecasting of 3D Human Motions
- Motion retrieval
- Animal motion
- Human motion prediction
- MoML: Online Meta Adaptation for 3D Human Motion Prediction
- MoST: Multi-Modality Scene Tokenization for Motion Prediction
- Rethinking Human Motion Prediction with Symplectic Integral
- Human Motion Prediction Under Unexpected Perturbation
- Continual Learning for Motion Prediction Model via Meta-Representation Learning and Optimal Memory Buffer Retention Strategy
- Human motion estimation
- Human motion reconstruction
- GRAM: Global Reasoning for Multi-Page VQA
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
🏠project - Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
- How to Configure Good In-Context Sequence for Visual Question Answering
⭐code - Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models
- Question Aware Vision Transformer for Multimodal Reasoning
- OpenEQA: Embodied Question Answering in the Era of Foundation Models
- Video-QA
- Grounded Question-Answering in Long Egocentric Videos
⭐code - Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels
- Language-aware Visual Semantic Distillation for Video Question Answering
- MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
- Can I Trust Your Answer? Visually Grounded Video Question Answering
⭐code - Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering
- Chart question answering
- Text-based visual question answering
- Scene text recognition
- OTE: Exploring Accurate Scene Text Recognition Using One Token
- An Empirical Study of Scaling Law for Scene Text Recognition
⭐code (scene text recognition) - Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer
⭐code - Kernel Adaptive Convolution for Scene Text Detection via Distance Map Prediction
- Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing
- ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting
⭐code
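- Scene text image synthesis
- Scene text understanding
- Chemical structure recognition
- Document chromaticity detection
- Text detection
- Document understanding
- Font generation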
- Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle
🏠project - Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking
🏠project - 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
⭐code
🏠project - Text- and image-guided 4D scene generation
- 4D view synthesis
- Language-to-4D modeling
- Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding
- PanoContext-Former: Panoramic Total Scene Understanding with a Transformer
- DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving
- OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies
⭐code - A Category Agnostic Model for Visual Rearrangment
👍VILP - 360+x: A Panoptic Multi-modal Scene Understanding Dataset
⭐code - Open-vocabulary scene understanding
- 3D scene understanding
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting
🏠project - SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field
- GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding
- GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding
- RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding
🏠project - GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding
⭐code - SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes
- Exploring Pose-Aware Human-Object Interaction via Hybrid Learning
- Bilateral Adaptation for Human-Object Interaction Detection with Occlusion-Robustness
- Scaling Up Dynamic Human-Scene Interaction Modeling
⭐code
🏠project - ReGenNet: Towards Human Action-Reaction Synthesis
⭐code - DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
⭐code (interaction) - HOI-M^3: Capture Multiple Humans and Objects Interaction within Contextual Environment
- GenZI: Zero-Shot 3D Human-Scene Interaction Generation
- Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection
- Human motion tracking
- Novel motion synthesis
- Hand interaction
- InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion
⭐code - HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data
- Physics-Aware Hand-Object Interaction Denoising
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video
⭐code
🏠project (hand-object interaction) - GEARS: Local Geometry-aware Hand-object Interaction Synthesis
- TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding
🏠project - Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction
⭐code - G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis
⭐code - MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision
- HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild
- Human-object interaction
- Discovering Syntactic Interaction Clues for Human-Object Interaction Detection
- Open-World Human-Object Interaction Detection via Multi-modal Prompts
- LEMON: Learning 3D Human-Object Interaction Relation from 2D Images
⭐code
🏠project - Disentangled Pre-training for Human-Object Interaction Detection
⭐code - GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation, Demonstration, and Imitation
🏠project - Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition
🏠project - Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation
🏠project - 3D human-object interaction
- Human-human interaction
- GARField: Group Anything with Radiance Fields
- IReNe: Instant Recoloring of Neural Radiance Fields
- PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF
- LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes
- SIGNeRF: Scene Integrated Generation for Neural Radiance Fields
- NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation
- SpecNeRF: Gaussian Directional Encoding for Specular Reflections
- PaReNeRF: Toward Fast Large-scale Dynamic NeRF with Patch-based ReferenceNeRF
- Global and Hierarchical Geometry Consistency Priors for Few-shot NeRFs in Indoor Scenes
👍summary - NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs
- Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling
⭐code - Accelerating Neural Field Training via Soft Mining
- Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling
🏠project - How Far Can We Compress Instant-NGP-Based NeRF?
⭐code
🏠project - BANF: Band-Limited Neural Fields for Levels of Detail Reconstruction
⭐code
🏠project - Tactile-Augmented Radiance Fields
⭐code
🏠project - NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild
🏠project - L0-Sampler: An L0 Model Guided Volume Sampling for NeRF
🏠project (NeRF) - HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses
- Entangled View-Epipolar Information Aggregation for Generalizable Neural Radiance Fields
⭐code - NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation
- MuRF: Multi-Baseline Radiance Fields
🏠project
🏠project - InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields
- NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors
⭐code
🏠project - Neural Fields as Distributions: Signal Processing Beyond Euclidean Space
🏠project - CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs
⭐code - DaReNeRF: Direction-aware Representation for Dynamic Scenes
- Geometry Transfer for Stylizing Radiance Fields
🏠project - S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes
⭐code - SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream
⭐code - Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes
⭐code - Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates
⭐code - LAENeRF: Local Appearance Editing for Neural Radiance Fields
⭐code
🏠project - Single View Refractive Index Tomography with Neural Fields
- ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models
- TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video
- NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation
⭐code - Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency
⭐code - Grounding and Enhancing Grid-based Models for Neural Fields
🏠project - Mitigating Motion Blur in Neural Radiance Fields with Events and Frames
- OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos
- Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects
⭐code - Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields
- Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields
🏠project - Dynamic LiDAR Re-simulation using Compositional Neural Fields
🏠project - SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields
🏠project - ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization
- NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows
- Novel view synthesis
- ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image
- Unifying Correspondence, Pose and NeRF for Generalized Pose-Free Novel View Synthesis
- NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis
- 3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis
- G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images
- MultiDiff: Consistent Novel View Synthesis from a Single Image
- Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis
- DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
- Generalizable Novel-View Synthesis using a Stereo Camera
🏠project - DART: Implicit Doppler Tomography for Radar Novel View Synthesis
🏠project - XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold
- Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis
⭐code
🏠project - NViST: In the Wild New View Synthesis from a Single Image with Transformers
⭐code
🏠project - ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models
🏠project - SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
⭐code
🏠project - Neural Visibility Field for Uncertainty-Driven Active Mapping
🏠project - EscherNet: A Generative Model for Scalable View Synthesis
⭐code
🏠project - GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
⭐code
🏠project (novel view) - DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization
⭐code
🏠project - LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis
⭐code
🏠project - Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
⭐code
🏠project - CoPoNeRF: Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs
⭐code
🏠project - EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
⭐code
🏠project - Free3D: Consistent Novel View Synthesis without 3D Representation
⭐code
🏠project - Novel View Synthesis with View-Dependent Effects from a Single Image
🏠project
- Rendering
- NeRF Director: Revisiting View Selection in Neural Volume Rendering
- Multiplane Prior Guided Few-Shot Aerial Scene Rendering (rendering)
- Differentiable Point-based Inverse Rendering (inverse rendering)
- Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance (rendering)
- Perceptual Assessment and Optimization of HDR Image Rendering
- Global Latent Neural Rendering
- Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields
⭐code
🏠project - GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering
🏠project - Real-time Acquisition and Reconstruction of Dynamic Volumes with Neural Structured Illumination
🏠project
📺video
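👍Using neural structured illumination, Zhejiang University achieves real-time acquisition and reconstruction of dynamic 3D phenomena - Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields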
🏠project - Dr.Bokeh: DiffeRentiable Occlusion-aware Bokeh Rendering
🏠project - HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting
👍HiFi4G: high-fidelity human performance rendering via compact Gaussians - ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering
🏠project - SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild
🏠project (neural rendering) - LTM: Lightweight Textured Mesh Extraction and Refinement of Large Unbounded Scenes for Efficient Storage and Real-time Rendering
- HashPoint: Accelerated Point Searching and Sampling for Neural Rendering
🏠project - HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces
🏠project - DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling
⭐code
🏠project - Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras
🏠project - ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis
🏠project
- Multi-view inverse rendering
- Object reconstruction
- Describing Differences in Image Sets with Natural Language
- Entity recognition
- Prompt learning
- BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP
- Active Prompt Learning in Vision Language Models
⭐code - Domain Prompt Learning with Quaternion Networks
- On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do We Really Need Prompt Learning?
- ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection
- Foundation models
- MuGE: Multiple Granularity Edge Detection
- RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses
⭐code
- Fusing Personal and Environmental Cues for Identification and Segmentation of First-Person Camera Wearers in Third-Person Views
- Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception
- Pedestrian detection
- Crowd counting
- Pedestrian attribute detection
- Re-identification
- SEAS: ShapE-Aligned Supervision for Person Re-Identification
- Learning Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification
⭐code - View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network
⭐code - CA-Jaccard: Camera-aware Jaccard Distance for Person Re-identification
- Attribute-Guided Pedestrian Retrieval: Bridging Person Re-ID with Internal Attribute Variability
- All in One Framework for Multimodal Re-identification in the Wild
- A Pedestrian is Worth One Prompt: Towards Language Guidance Person Re-Identification
- Distribution-aware Knowledge Prototyping for Non-exemplar Lifelong Person Re-identification
- Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions
⭐code - Radar-based Re-ID
- Visible-infrared person re-identification
- Text-to-image re-identification
- Gait recognition
- MC
- KD
- Small Scale Data-Free Knowledge Distillation
- KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling
- Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation
- Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities
- C2KD: Bridging the Modality Gap for Cross-Modal Knowledge Distillation
- CrossKD: Cross-Head Knowledge Distillation for Object Detection
- CLIP-KD: An Empirical Study of CLIP Model Distillation
⭐code - Aligning Logits Generatively for Principled Black-Box Knowledge Distillation
- FreeKD: Knowledge Distillation via Semantic Frequency Prompt
- Logit Standardization in Knowledge Distillation
- $V_kD$: Improving Knowledge Distillation using Orthogonal Projections
⭐code - Scale Decoupled Distillation
⭐code - NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation
⭐code - De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts
- PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
⭐code
🏠project
👍Chinese-language walkthrough
- Pruning
- Device-Wise Federated Network Pruning
- FedMef: Towards Memory-efficient Federated Dynamic Pruning
- OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning
- BilevelPruning: Unified Dynamic and Static Channel Pruning for Convolutional Neural Networks
- Resource-Efficient Transformer Pruning for Finetuning of Large Models
- Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch
- Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning
🏠project - Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
- MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
- Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment
- MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
⭐code
- Quantization
- PTQ4SAM: Post-Training Quantization for Segment Anything
- Reg-PTQ: Regression-specialized Post-training Quantization for Fully Quantized Object Detector
- Data-Free Quantization via Pseudo-label Filtering
- JointSQ: Joint Sparsification-Quantization for Distributed Learning
- Retraining-Free Model Quantization via One-Shot Weight-Coupling Learning
- Epistemic Uncertainty Quantification For Pre-Trained Neural Networks
- Enhancing Post-training Quantization Calibration through Contrastive Learning
- Towards Accurate Post-training Quantization for Diffusion Models (quantization)
- Are Conventional SNNs Really Efficient? A Perspective from Network Quantization
- Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization
⭐code - Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
⭐code - Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery
🏠project - S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data
- Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans
🏠project - WildlifeMapper: Aerial Image Analysis for Multi-Species Detection and Identification
⭐code - Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels
⭐code - Remote sensing
- GeoChat: Grounded Large Vision-Language Model for Remote Sensing
- SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
- 3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions
⭐code - Poly Kernel Inception Network for Remote Sensing Detection
- Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening
⭐code
- Aerial image segmentation
- Reference-based super-resolution
- UAV-based object detection
- Cross-view localization
- A Vision Check-up for Language Models
- The Neglected Tails in Vision-Language Models
- Beyond Average: Individualized Visual Scanpath Prediction
- ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models
- Language Models as Black-Box Optimizers for Vision-Language Models
- Distilling Vision-Language Models on Millions of Videos
- SonicVisionLM: Playing Sound with Vision Language Models
- Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model
- Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models
- MMA: Multi-Modal Adapter for Vision-Language Models
- Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment
- Building Vision-Language Models on Solid Foundations with Masked Distillation
- TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model
⭐code - On Scaling Up a Multilingual Vision and Language Model
- CogAgent: A Visual Language Model for GUI Agents
⭐code - Towards Better Vision-Inspired Vision-Language Models
- SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
- Sequential Modeling Enables Scalable Learning for Large Vision Models
🏠project (large vision models) - Seeing the Unseen: Visual Common Sense for Semantic Placement
- Efficient Vision-Language Pre-training by Cluster Masking
⭐code
🏠project - VILA: On Pre-training for Visual Language Models
- EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
⭐code
🏠project - SPIN: Simultaneous Perception Interaction and Navigation
- MAFA: Managing False Negatives for Vision-Language Pre-training
- Visual In-Context Prompting
⭐code - Semantics-aware Motion Retargeting with Vision-Language Models
- DePT: Decoupled Prompt Tuning
⭐code - Osprey: Pixel Understanding with Visual Instruction Tuning
⭐code - FairCLIP: Harnessing Fairness in Vision-Language Learning
🏠project - Efficient Test-Time Adaptation of Vision-Language Models
⭐code - BioCLIP: A Vision Foundation Model for the Tree of Life
⭐code - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
⭐code - Anchor-based Robust Finetuning of Vision-Language Models
- Multi-Modal Hallucination Control by Visual Information Grounding
- Do Vision and Language Encoders Represent the World Similarly?
- Dual-View Visual Contextualization for Web Navigation
- Any-Shift Prompting for Generalization over Distributions
- Non-autoregressive Sequence-to-Sequence Vision-Language Models
- One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
⭐code - SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
⭐code - RegionGPT: Towards Region Understanding Vision Language Model
- Enhancing Vision-Language Pre-training with Rich Supervisions
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
⭐code - Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
⭐code - Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
⭐code (vision-language compositional understanding) - FFF: Fixing Flawed Foundations in Contrastive Pre-Training Results in Very Strong Vision-Language Models
- Improved Baselines with Visual Instruction Tuning
🏠project - Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
- Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
⭐code - A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
⭐code - Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
⭐code - SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining (vision-language)
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
- Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
⭐code - Iterated Learning Improves Compositionality in Large Vision-Language Models
- ViTamin: Designing Scalable Vision Models in the Vision-Language Era
⭐code - Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners
⭐code - Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models
🏠project - Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
🏠project - HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
⭐code - Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models
- Learning Vision from Models Rivals Learning Vision from Data
⭐code - Probing the 3D Awareness of Visual Foundation Models
⭐code - LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning
🏠project - Visual understanding
- LLM
- PixelLM: Pixel Reasoning with Large Multimodal Model
🏠project - OneLLM: One Framework to Align All Modalities with Language
- Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld
- Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
- Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
- See Say and Segment: Teaching LMMs to Overcome False Premises
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
- Driving Everywhere with Large Language Model Policy Adaptation
🏠project - Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
🏠project - Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
🏠project - V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
⭐code - Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- Pixel Aligned Language Models
🏠project - SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection
- OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
⭐code (multimodal large language models) - Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs (LLMs)
- LISA: Reasoning Segmentation via Large Language Model
⭐code - Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model
- Compositional Chain-of-Thought Prompting for Large Multimodal Models
⭐code - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
🏠project - Honeybee: Locality-enhanced Projector for Multimodal LLM
⭐code (LLM) - HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
⭐code - SEED-Bench: Benchmarking Multimodal Large Language Models
⭐code - PerceptionGPT: Effectively Fusing Visual Perception into LLM
- UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
- ModaVerse: Efficiently Transforming Modalities with LLMs
- VCoder: Versatile Vision Encoders for Multimodal Large Language Models
⭐code
🏠project - mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
- MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
🏠project (large language models) - RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
⭐code - DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
⭐code
👍Abstract - Prompt Highlighter: Interactive Control for Multi-Modal LLMs
⭐code
🏠project - Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
🏠project - General Object Foundation Model for Images and Videos at Scale
⭐code
🏠project
👍GLEE: HUST and ByteDance jointly build a general-purpose object perception foundation model - Link-Context Learning for Multimodal LLMs
⭐code (LLMs) - Cloud-Device Collaborative Learning for Multimodal Large Language Models
- LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model
⭐code
👍Results at a glance | CVPR 2024 fine-grained visual perception multimodal large models Pink and LocLLM - Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
⭐code
👍Results at a glance | CVPR 2024 fine-grained visual perception multimodal large models Pink and LocLLM - LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
⭐code
🏠project (MLLMs) - GSVA: Generalized Segmentation via Multimodal Large Language Models
- VLN
- Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation
⭐code
👍VILP - Volumetric Environment Representation for Vision-Language Navigation
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation
- Vision-and-Language Navigation via Causal Learning
⭐code (vision-and-language navigation)
- Video-language
- VidLA: Video-Language Alignment at Scale
- SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling
- VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens
- VideoLLM-online: Online Video Large Language Model for Streaming Video
🏠project
- Visual Grounding
- Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
- MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
- Viewpoint-Aware Visual Grounding in 3D Scenes
- Improved Visual Grounding through Self-Consistent Explanations
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
🏠project - Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency (visual grounding)
- Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding
- Multi-Attribute Interactions Matter for 3D Visual Grounding
- Investigating Compositional Challenges in Vision-Language Models for Visual Grounding
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
- Multimodal models
- GLaMM: Pixel Grounding Large Multimodal Model
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
⭐code - What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
🏠project - Multi-modal Learning for Geospatial Vegetation Forecasting
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception
- MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
- TRINS: Towards Multimodal Language Models that Can Read
- Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
- Vision foundation models
- Multi-view understanding
- Visual localization
- CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion
⭐code - WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights
- Image steganography
- Intellectual property protection
- Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models
- MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection
⭐code - CPR: Retrieval Augmented Generation for Copyright Protection
- VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models
⭐code - Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models
- IP protection
- 3D Feature Tracking via Event Camera
- Projecting Trackable Thermal Patterns for Dynamic Computer Vision
- ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe
- DPHMs: Diffusion Parametric Head Models for Depth-based Tracking
🏠project - NetTrack: Tracking Highly Dynamic Objects with a Net
⭐code - RTracker: Recoverable Tracking via PN Tree Structured Memory
- Context-Aware Integration of Language and Visual References for Natural Language Tracking
- CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras
- SpatialTracker: Tracking Any 2D Pixels in 3D Space
⭐code - Learning Tracking Representations from Single Point Annotations
- Visual object tracking
- DiffusionTrack: Point Set Diffusion Model for Visual Object Tracking
- OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning
- HIPTrack: Visual Tracking with Historical Prompts
⭐code - Single-Model and Any-Modality for Video Object Tracking
⭐code - SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking
⭐code
- Multi-object tracking
- Multi-Object Tracking in the Dark
- Towards Generalizable Multi-Object Tracking
- ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association
- DeconfuseTrack: Dealing with Confusion for Multi-Object Tracking
- Delving into the Trajectory Long-tail Distribution for Muti-object Tracking
⭐code - Self-Supervised Multi-Object Tracking with Path Consistency
- DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
⭐code
🏠project - iKUN: Speak to Trackers without Retraining
⭐code
- Point tracking
- Molecular Data Programming: Towards Molecule Pseudo-labeling with Systematic Weak Supervision
👍Abstract - Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization
🏠project - Circuit Design and Efficient Simulation of Quantum Inner Product and Empirical Studies of Its Effect on Near-Term Hybrid Quantum-Classic Machine Learning
- Adversarial
- Infrared Adversarial Car Stickers
- Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples
- PAD: Patch-Agnostic Defense against Adversarial Patch Attacks
⭐code - Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training
- MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model (adversarial perturbation)
- Towards Transferable Targeted 3D Adversarial Attack in the Physical World
- Deep-TROJ: An Inference Stage Trojan Insertion Algorithm through Efficient Weight Replacement Attack (attack)
- Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models
- Re-thinking Data Availability Attacks Against Deep Neural Networks
- SlowFormer: Adversarial Attack on Compute and Energy Consumption of Efficient Vision Transformers
- NAPGuard: Towards Detecting Naturalistic Adversarial Patches
- Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training
- Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers (backdoor attack)
- Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World
- Backdoor Defense via Test-Time Detecting and Repairing
- Nearest Is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks
- Semantic-Aware Multi-Label Adversarial Attacks (adversarial attack)
- Strong Transferable Adversarial Attacks via Ensembled Asymptotically Normal Distribution Learning
- Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement (adversarial attack)
- On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
- Incremental Residual Concept Bottleneck Models
- Revisiting Adversarial Training at Scale
⭐code - Language-Driven Anchors for Zero-Shot Adversarial Robustness (zero-shot adversarial)
- Transferable Structural Sparse Adversarial Attack Via Exact Group Sparsity Training
- Learning to Transform Dynamically for Better Adversarial Transferability
- Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo Replay
- Boosting Adversarial Transferability by Block Shuffle and Rotation
⭐code (adversarial transferability) - MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models
- Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness
👍VILP - Adversarial Doodles: Interpretable and Human-drawable Attacks Provide Describable Insights
- PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor
- Revisiting Adversarial Training under Long-Tailed Distributions
⭐code - Towards Fairness-Aware Adversarial Learning
- Dispel Darkness for Better Fusion: A Controllable Visual Enhancer based on Cross-modal Conditional Adversarial Learning
- Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement
- Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM
- Boosting Adversarial Training via Fisher-Rao Norm-based Regularization
⭐code - A Stealthy Wrongdoer: Feature-Oriented Reconstruction Attack against Split Learning (attack)
- Backdoor attacks
- Continual learning
- RCL: Reliable Continual Learning for Unified Failure Detection
- Consistent Prompting for Rehearsal-Free Continual Learning
- Improving Plasticity in Online Continual Learning via Collaborative Learning
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
⭐code - Enhancing Visual Continual Learning with Language-Guided Supervision
- Convolutional Prompting meets Language Models for Continual Learning
- Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning
- Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation
- InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning
- Learning Equi-angular Representations for Online Continual Learning
- BrainWash: A Poisoning Attack to Forget in Continual Learning
- Adaptive VIO: Deep Visual-Inertial Odometry with Online Continual Learning (continual learning)
- Traceable Federated Continual Learning
- Interactive Continual Learning: Fast and Slow Thinking
- Incremental learning
- Class-incremental learning
- Dual-Consistency Model Inversion for Non-Exemplar Class Incremental Learning
- Class Incremental Learning with Multi-Teacher Distillation
- Dual-Enhanced Coreset Selection with Class-wise Collaboration for Online Blurry Class Incremental Learning
- FCS: Feature Calibration and Separation for Non-Exemplar Class Incremental Learning
- OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning
- Long-Tail Class Incremental Learning via Independent Sub-prototype Construction
- Gradient Reweighting: Towards Imbalanced Class-Incremental Learning
- DYSON: Dynamic Feature Space Self-Organization for Online Task-Free Class Incremental Learning
- NICE: Neurogenesis Inspired Contextual Encoding for Replay-free Class Incremental Learning
⭐code - Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning
⭐code - Text-Enhanced Data-free Approach for Federated Class-Incremental Learning
⭐code - Generative Multi-modal Models are Good Class-Incremental Learners
⭐code - Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning
⭐code
- Multi-task
- Masked AutoDecoder is Effective Multi-Task Vision Generalist
- OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning
- DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data
⭐code - FedHCA2: Towards Hetero-Client Federated Multi-Task Learning
⭐code - MTLoRA: A Low-Rank Adaptation Approach for Efficient Multi-Task Learning
- Joint-Task Regularization for Partially Labeled Multi-Task Learning
- Task-Conditioned Adaptation of Visual Features in Multi-Task Policy Learning
- Multi-label learning
- Multi-view learning
- Meta-learning
- Federated learning
- An Aggregation-Free Federated Learning for Tackling Data Heterogeneity
- Decentralized Directed Collaboration for Personalized Federated Learning
- Rethinking the Representation in Federated Unsupervised Learning with Non-IID Data
- Byzantine-robust Decentralized Federated Learning via Dual-domain Clustering and Trust Bootstrapping
- FLHetBench: Benchmarking Device and State Heterogeneity in Federated Learning
- Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space
- Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning
- Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices
- FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-Tuning
- Fair Federated Learning under Domain Skew with Local Consistency and Domain Diversity
⭐code - Global and Local Prompts Cooperation via Optimal Transport for Federated Learning
- PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees
⭐code - Relaxed Contrastive Learning for Federated Learning
- DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning
- FedAS: Bridging Inconsistency in Personalized Federated Learning
⭐code - Leak and Learn: An Attacker's Cookbook to Train Using Leaked Data from Federated Learning
- Data Valuation and Detections in Federated Learning
⭐code - An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning
⭐code - Adaptive Hyper-graph Aggregation for Modality-Agnostic Federated Learning
- FedUV: Uniformity and Variance for Heterogeneous Federated Learning
- FedSOL: Stabilized Orthogonal Learning with Proximal Restrictions in Federated Learning
- Communication-Efficient Federated Learning with Accelerated Client Gradient
- Reinforcement learning
- Improving Unsupervised Hierarchical Representation with Reinforcement Learning
- AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning (reinforcement learning)
- Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
- POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning
- DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning
- Learning to Control Camera Exposure via Reinforcement Learning
🏠project - Regularized Parameter Uncertainty for Improving Generalization in Reinforcement Learning
- Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World
🏠project
- Multimodal machine learning
- Transfer learning
- Model Inversion Robustness: Can Transfer Learning Help?
- Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning
- Structured Model Probing: Empowering Efficient Transfer Learning by Structured Regularization
- UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory
⭐code - Initialization Matters for Adversarial Transfer Learning
- Contrastive learning
- Improving Graph Contrastive Learning via Adaptive Positive Sampling
- MaskCLR: Attention-Guided Contrastive Learning for Robust Action Representation Learning
- BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
- Universal Novelty Detection Through Adaptive Contrastive Learning
- NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models
🏠project
- Imitation learning
- In-context learning
- Weakly supervised learning
- Affordance learning
- Hearing Anything Anywhere
🏠project - Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
- AV-RIR: Audio-Visual Room Impulse Response Estimation
📺video - DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
⭐code - Weakly-Supervised Audio-Visual Video Parsing with Prototype-based Pseudo-Labeling
- Audio-visual dialog
- Audio-visual navigation
- Audio-visual segmentation
- Audio-Visual Segmentation via Unlabeled Frame Exploitation
- Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos
- Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation
- Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition
⭐code - Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
⭐code
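- Speech recognition
- Speech localization
- Audio-visual speech representation learning
- Text-driven speech localization
- Music synthesis from image and language prompts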
- Binaural audio generation and localization
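- Video-audio synchronization
- Audio-visual representation learning
- Speaker detection
- Audio description
- Audio-visual speech translation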
- AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
- Preserving Fairness Generalization in Deepfake Detection
⭐code - Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection
⭐code - Exploiting Style Latent Flows for Generalizing Deepfake Video Detection
- LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection
- Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection
- Contrastive Learning for DeepFake Classification and Localization via Multi-Label Ranking
- Image tampering detection
- DiffForensics: Leveraging Diffusion Prior to Image Forgery Detection and Localization (forged image detection)
- EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection
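⭐code (versatile image watermarking for tamper localization and copyright protection) - UnionFormer: Unified-Learning Transformer with Multi-View Representation for Image Manipulation Detection and Localization (image manipulation detection and localization)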
- CORE-MPI: Consistency Object Removal with Embedding MultiPlane Image
- Synthetic image detection
- Transductive Zero-Shot and Few-Shot CLIP
⭐code - DG
- Disentangled Prompt Representation for Domain Generalization
- A2XP: Towards Private Domain Generalization
⭐code
🏠project - PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization
- Towards Generalizing to Unseen Domains with Few Labels
- Rethinking the Evaluation Protocol of Domain Generalization
- Rethinking Multi-domain Generalization with A General Learning Objective
- Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization
⭐code - Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization
- Efficiently Assemble Normalization Layers and Regularization for Federated Domain Generalization
- DA
- Parameter Efficient Self-Supervised Geospatial Domain Adaptation
- Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation
- Discriminative Pattern Calibration Mechanism for Source-Free Domain Adaptation
- Understanding and Improving Source-free Domain Adaptation from a Theoretical Perspective
- A Versatile Framework for Continual Test-Time Domain Adaptation: Balancing Discriminability and Generalizability
- Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation
- Unveiling the Unknown: Unleashing the Power of Unknown to Known in Open-Set Source-Free Domain Adaptation
- Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
- Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer
- LEAD: Learning Decomposition for Source-free Universal Domain Adaptation
⭐code - Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation
⭐code - Source-Free Domain Adaptation with Frozen Multimodal Foundation Model
⭐code - Universal Semi-Supervised Domain Adaptation by Mitigating Common-Class Bias
- Unified Language-driven Zero-shot Domain Adaptation
🏠project
- FSL
- Adversarially Robust Few-shot Learning via Parameter Co-distillation of Similarity and Class Concept Learners
- Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning
- Simple Semantic-Aided Few-Shot Learning
⭐code - DeIL: Direct-and-Inverse CLIP for Open-World Few-Shot Learning
- AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning
👍abstract - Discriminative Sample-Guided and Parameter-Efficient Feature Space Adaptation for Cross-Domain Few-Shot Learning
⭐code - Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning
- Few-shot Learner Parameterization by Diffusion Time-steps
⭐code
- ZSL
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
- Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning
👍 Improving generative zero-shot learning with a visual-augmented dynamic semantic prototype method - Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning
- Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning
⭐code - Improving Generalized Zero-Shot Learning by Exploring the Diverse Semantics from External Class Names
- Efficient Meshflow and Optical Flow Estimation from Event Cameras
- UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model
⭐code - FlowTrack: Revisiting Optical Flow for Long-Range Dense Tracking
- FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models
- ADFactory: An Effective Framework for Generalizing Optical Flow with NeRF
- Dense Optical Tracking: Connecting the Dots
⭐code
🏠project (optical flow) - MemFlow: Optical Flow Estimation and Prediction with Memory
⭐code - OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation
- Scene Flow
- 3D Scene Flow Estimation
- 3D-LFM: Lifting Foundation Model
🏠project - Efficient Solution of Point-Line Absolute Pose
⭐code - Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval
- DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses
⭐code - Dynamic Support Information Mining for Category-Agnostic Pose Estimation
- From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation
- Object Pose Estimation
- 6DoF
- HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation
- Towards Co-Evaluation of Cameras HDR and Algorithms for Industrial-Grade 6DoF Pose Estimation
- Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)
- SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation
⭐code - FAR: Flexible Accurate and Robust 6DoF Relative Camera Pose Estimation
⭐code
🏠project - 6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation
- MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images
- Open-Vocabulary Object 6D Pose Estimation
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
🏠project - GenFlow: Generalizable Recurrent Flow for 6D Pose Refinement of Novel Objects
- Open-vocabulary object 6D pose estimation
🏠project - SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation
⭐code - A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization
- Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation
- MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation
- Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge
- Re-Identification
- Counting
- Instance Tracking in 3D Scenes from Egocentric Videos
- VPR
- Navigation
- Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation
👍VILP - Detours for Navigating Instructional Videos (instructional video navigation)
- MemoNav: Working Memory Model for Visual Navigation
- DiaLoc: An Iterative Approach to Embodied Dialog Localization
- F$^3$Loc: Fusion and Filtering for Floorplan Localization
- An Interactive Navigation Method with Effect-oriented Affordance (interactive navigation)
- SLAM
- SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System
⭐code - SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments
- SNI-SLAM: Semantic Neural Implicit SLAM
- Gaussian Splatting SLAM
⭐code
🏠project - SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
🏠project - NARUTO: Neural Active Reconstruction from Uncertain Target Observations
- Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM
- Implicit Event-RGBD Neural SLAM
⭐code
🏠project - Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization
- IBD-SLAM: Learning Image-Based Depth Fusion for Generalizable SLAM
- Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras
- Loopy-SLAM: Dense Neural SLAM with Loop Closures
🏠project - GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting
🏠project
- Robotics
- Retrieval-Augmented Embodied Agents
- ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
- SUGAR: Pre-training 3D Visual Representations for Robotics
🏠project - Learning to navigate efficiently and precisely in real environments
- Language-driven Grasp Detection
- CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation
🏠project - Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation
⭐code - Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation
⭐code
🏠project - Rapid Motor Adaptation for Robotic Manipulator Arms (robotic manipulator arms)
- Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts
- Avatar (Virtual Modeling)
- SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting
⭐code
📺video - MonoNPHM: Dynamic Head Reconstruction from Monocular Videos
- Relightable and Animatable Neural Avatar from Sparse-View Video
🏠project - Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes
⭐code - Artist-Friendly Relightable and Animatable Neural Heads
- GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image
⭐code - DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars
- EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars
- Stratified Avatar Generation from Sparse Observations(https://zerg-overmind.github.io/)
📺video - Real-Time Simulated Avatar from Head-Mounted Sensors
🏠project - Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling
⭐code
🏠project - NECA: Neural Customizable Human Avatar
⭐code - Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework
⭐code - GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians
⭐code
🏠project - Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians
⭐code
🏠project - GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning
🏠project - UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures
⭐code
🏠project - GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
🏠project - 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting
🏠project (3D animation) - AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing (3D human avatar generation)
- Human Gaussian Splatting: Real-time Rendering of Animatable Avatars
- GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh
⭐code - DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation
🏠project - PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
🏠project - FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding
🏠project - MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
🏠project (human image animation) - Relightable Gaussian Codec Avatars
🏠project - IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing
🏠project
- Hair Modeling
- Virtual Try-On
- Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On
- CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model
- M&M VTO: Multi-Garment Virtual Try-On and Editing
🏠project - StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
⭐code - PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns
🏠project
- Grasping
- Cartoon Characters
- Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs
⭐code - SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
- Autonomous Driving
- Accurate Training Data for Occupancy Map Prediction in Automated Driving Using Evidence Theory
- VLP: Vision Language Planning for Autonomous Driving
- Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles
- DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes
- Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous and Instruction-guided Driving
- DualAD: Disentangling the Dynamic and Static World for End-to-End Driving
🏠project - UniPAD: A Universal Pre-training Paradigm for Autonomous Driving
⭐code - Generalized Predictive Model for Autonomous Driving
⭐code - Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications
- ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles
- Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models
- LMDrive: Closed-Loop End-to-End Driving with Large Language Models
⭐code
🏠project - Feedback-Guided Autonomous Driving
- PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving
- Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving
⭐code - On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving
- Visual Point Cloud Forecasting enables Scalable Autonomous Driving
⭐code - Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving
⭐code - CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow
- Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving
- NeuRAD: Neural Rendering for Autonomous Driving
⭐code
🏠project - Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving
⭐code
🏠project - Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents
⭐code
🏠project - 3D LiDAR Mapping in Dynamic Environments using a 4D Implicit Neural Representation
⭐code - PACER+: On-Demand Pedestrian Animation Controller in Driving Scenarios
⭐code
📺video - Bootstrapping Autonomous Driving Radars with Self-Supervised Learning
⭐code - SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving (defogging for autonomous driving)
- Trajectory Prediction
- Pose-Transformed Equivariant Network for 3D Point Trajectory Prediction
- Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving
- CaDeT: a Causal Disentanglement Approach for Robust Trajectory Prediction in Autonomous Driving
- Higher-order Relational Reasoning for Pedestrian Trajectory Prediction
- Density-Adaptive Model Based on Motif Matrix for Multi-Agent Trajectory Prediction
- GigaTraj: Predicting Long-term Trajectories of Hundreds of Pedestrians in Gigapixel Complex Scenes
- ERMVP: Communication-Efficient and Collaboration-Robust Multi-Vehicle Perception in Challenging Environments
- HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention
⭐code
👍VILP - Adapting to Length Shift: FlexiLength Network for Trajectory Prediction
- OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising
⭐code - SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction (pedestrian trajectory prediction)
- T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory
⭐code - Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations
- SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction
⭐code - Producing and Leveraging Online Map Uncertainty in Trajectory Prediction
- SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model
⭐code - Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
⭐code - Quantifying Uncertainty in Motion Prediction with Variational Bayesian Mixture
⭐code
- Lane Detection
- In-Vehicle Gaze Estimation
- 3D Occupancy Prediction
- COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction
⭐code - SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
⭐code - StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams via Neural Ordinary Differential Equation
- Vehicle Re-Identification
- Day-Night Cross-domain Vehicle Re-identification
- Single-View Scene Point Cloud Human Grasp Generation
- LTA-PCS: Learnable Task-Agnostic Point Cloud Sampling
- StraightPCF: Straight Point Cloud Filtering
- CurveCloudNet: Processing Point Clouds with 1D Structure
- Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange
- Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform
- Multiway Point Cloud Mosaicking with Diffusion and Global Optimization
📺video - Point Cloud Pre-training with Diffusion Models
- PBWR: Parametric-Building-Wireframe Reconstruction from Aerial LiDAR Point Clouds
- Unsupervised Occupancy Learning from Sparse Point Cloud
- TULIP: Transformer for Upsampling of LiDAR Point Clouds
- Draw Step by Step Like Human: Reconstructing CAD Construction Sequences from Point Clouds via Multimodal Diffusion (CAD reconstruction from point clouds)
- Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis
⭐code - Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis
⭐code - Unsupervised Template-assisted Point Cloud Shape Correspondence Network
- GeoAuxNet: Towards Universal 3D Representation Learning for Multi-sensor Point Clouds
- Object Dynamics Modeling with Hierarchical Point Cloud-based Representations
- KPConvX: Modernizing Kernel Point Convolution with Kernel Attention
- Point Cloud Registration
- Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension
- Inlier Confidence Calibration for Point Cloud Registration
- ColorPCR: Color Point Cloud Registration with Multi-Stage Geometric-Color Fusion
- Dynamic Cues-Assisted Transformer for Robust Point Cloud Registration
- Learning Instance-Aware Correspondences for Robust Multi-Instance Point Cloud Registration in Cluttered Scenes
- 3D Point Clouds
- Hide in Thicket: Generating Imperceptible and Rational Adversarial Perturbations on 3D Point Clouds
⭐code - Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds
⭐code
👍abstract - Point2CAD: Reverse Engineering CAD Models from 3D Point Clouds
- Text2Loc: 3D Point Cloud Localization from Natural Language
🏠project - 3DInAction: Understanding Human Actions in 3D Point Clouds
⭐code - Coupled Laplacian Eigenmaps for Locally-Aware 3D Rigid Point Cloud Matching
⭐code
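- Point Cloud Recognition
- Point Cloud Upsampling
- Point Cloud Segmentation
- Point Cloud Analysis
- Point Cloud Understanding
- Point Cloud Generation
- Point Cloud Denoising
- Point Cloud Classification
- Point Cloud Quality Assessment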
- Semantic Line Combination Detector
- Language-conditioned Detection Transformer
- Unsupervised Salient Instance Detection
- Neural Exposure Fusion for High-Dynamic Range Object Detection
- LEOD: Label-Efficient Object Detection for Event Cameras
- SFOD: Spiking Fusion Object Detector
⭐code - Exploring Orthogonality in Open World Object Detection
⭐code - What How and When Should Object Detectors Update in Continually Changing Test Domains?
- Depth-Aware Concealed Crop Detection in Dense Agricultural Scenes
- SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection
- Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement
⭐code - Theoretically Achieving Continuous Representation of Oriented Bounding Boxes
⭐code - RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features
- Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation
⭐code - CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation
🏠project - DETRs Beat YOLOs on Real-time Object Detection
🏠project - Hyperbolic Learning with Synthetic Captions for Open-World Detection
- Overload: Latency Attacks on Object Detection for Edge Devices
- YolOOD: Utilizing Object Detection Concepts for Multi-Label Out-of-Distribution Detection
- Active Domain Adaptation with False Negative Prediction for Object Detection
- RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation
⭐code - Active Object Detection with Knowledge Aggregation and Distillation from Large Models
- GLOW: Global Layout Aware Attacks on Object Detection
- Plug and Play Active Learning for Object Detection
⭐code - InstaGen: Enhancing Object Detection by Training on Synthetic Dataset
🏠project - Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition
- Generating Enhanced Negatives for Training Language-Based Object Detectors
- SAR Object Detection
- 3D Object Detection
- Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection
- PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection
- Prompt3D: Random Prompt Assisted Weakly-Supervised 3D Object Detection
- CaKDP: Category-aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection
- Weakly Supervised Monocular 3D Detection with a Single-View Image
- Weak-to-Strong 3D Object Detection with X-Ray Distillation
- GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection
- BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
- Towards Robust 3D Object Detection with LiDAR and 4D Radar Fusion in Various Weather Conditions
- HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes
- Commonsense Prototype for Outdoor Unsupervised 3D Object Detection
⭐code
👍abstract - Multi-View Attentive Contextualization for Multi-View 3D Object Detection
- BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection
- An Empirical Study of the Generalization Ability of Lidar 3D Object Detectors to Unseen Domains
- SeaBird: Segmentation in Bird’s View with Dice Loss Improves Monocular 3D Detection of Large Objects
⭐code - HINTED: Hard Instance Enhanced Detector with Mixed-Density Feature Fusion for Sparsely-Supervised 3D Object Detection
👍abstract - 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features
🏠project - UniMODE: Unified Monocular 3D Object Detection
- Learning Occupancy for Monocular 3D Object Detection
⭐code - CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoor Object Detection from Multi-view Images
⭐code - VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection
⭐code - Improving Distant 3D Object Detection Using 2D Box Supervision
- SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection
⭐code - Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors
⭐code - IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
⭐code - RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection
⭐code - Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection
- SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects
⭐code - MonoCD: Monocular 3D Object Detection with Complementary Depths
⭐code
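- Small Object Detection
- Salient Object Detection
- Oriented Object Detection
- Few-Shot Object Detection
- Domain-Generalized Object Detection
- Domain-Adaptive Object Detection
- Open-Ended Object Detection
- Semi-Supervised Object Detection
- End-to-End Object Detection
- Open-Vocabulary Object Detection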
- Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection
- The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding
🏠project - Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection
- OVMR: Open-Vocabulary Recognition with Multi-Modal References (open-vocabulary recognition)
- SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection
- YOLO-World: Real-Time Open-Vocabulary Object Detection
⭐code - Retrieval-Augmented Open-Vocabulary Object Detection
⭐code - Taming Self-Training for Open-Vocabulary Object Detection
⭐code - Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
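- Video Camouflaged Object Detection
- Event-Based Object Detection
- Co-Salient Object Detection
- Open-Set Recognition
- Object Recognition
- Object Discovery
- Object Localization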
- STMixer: A One-Stage Sparse Action Detector
- Adapting Short-Term Transformers for Action Detection in Untrimmed Videos
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence
⭐code - Selective, Interpretable, and Motion Consistent Privacy Attribute Obfuscation for Action Recognition
⭐code
🏠project - X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization
⭐code - LLMs are Good Action Recognizers
- Action Detection via an Image Diffusion Process
- Language Model Guided Interpretable Video Action Reasoning
⭐code - SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
🏠project - TIM: A Time Interval Machine for Audio-Visual Action Recognition
⭐code - VicTR: Video-conditioned Text Representations for Activity Recognition
- Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes
🏠project - Narrative Action Evaluation with Prompt-Guided Multimodal Interaction
⭐code - Align Before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition
- Modality-Collaborative Test-Time Adaptation for Action Recognition
- CPR-Coach: Recognizing Composite Error Actions based on Single-class Training
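- Skeleton-Based Action Recognition
- Event-Based Action Recognition
- Zero-Shot Action Recognition
- Fine-Grained Action Recognition
- Action Localization
- Temporal Action Detection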
- Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions
⭐code - TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression
- End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
⭐code - Low-power, Continuous Remote Behavioral Localization with Event Cameras
🏠project - Dual DETRs for Multi-Label Temporal Action Detection
- Action Quality Assessment
- Group Activity Recognition
- Human Action Understanding
- Action Anticipation
- Behavior Localization
- CLOAF: CoLlisiOn-Aware Human Flow
- Meta-Point Learning and Refining for Category-Agnostic Pose Estimation
- SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering
⭐code - GALA: Generating Animatable Layered Assets from a Single Scan
⭐code
🏠project - ShapeMatcher: Self-Supervised Joint Shape Canonicalization Segmentation Retrieval and Deformation (self-supervised joint shape canonicalization, segmentation, retrieval, and deformation)
- Hands
- Authentic Hand Avatar from a Phone Scan via Universal Hand Model
- URHand: Universal Relightable Hands
🏠project - OHTA: One-shot Hand Avatar via Data-driven Implicit Priors
⭐code
🏠project - BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics
⭐code
🏠project - Reconstructing Hands in 3D with Transformers
- 3D Hand Pose Estimation
- Hand Mesh Reconstruction
- Hand Mesh Recovery
- Hand Pose Tracking
- Hand Texture Reconstruction
- Gesture Synthesis
- Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation
🏠project - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis
🏠project - DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
🏠project - EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
🏠project
- Human Body
- LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment
- AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation
- LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging
- Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors
- Fast Adaptation for Human Pose Estimation via Meta-Optimization
- RAM-Avatar: Real-time Photo-Realistic Avatar from Monocular Videos with Full-body Control
⭐code - SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation
⭐code - Multi-Person Pose Estimation
- 3D Human Body
- Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi
- Cross-view and Cross-pose Completion for 3D Human Understanding
- TexVocab: Texture Vocabulary-conditioned Human Avatars
🏠project - MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models
- ChatPose: Chatting about 3D Human Pose
🏠project - SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation
⭐code - Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting
- FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations
🏠project - FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models
- PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization
- Score-Guided Diffusion for 3D Human Recovery
⭐code - A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation
- KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation
⭐code - Multiple View Geometry Transformers for 3D Human Pose Estimation
⭐code - Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling
- Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning
⭐code - EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams
🏠project - 3D Human Pose Perception from Egocentric Stereo Videos
⭐code
🏠project - Forecasting of 3D Whole-body Human Poses with Grasping Objects (3D whole-body human pose)
- BodyMAP -- Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed
⭐code
🏠project - MeshPose: Unifying DensePose and 3D Body Mesh Reconstruction
- Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation
⭐code
👍 Making video pose Transformers fast: Peking University proposes HoT, an efficient 3D human pose estimation framework - Optimizing Diffusion Noise Can Serve As Universal Motion Priors
⭐code
🏠project - En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data
⭐code
🏠project
- Human Mesh Recovery/Reconstruction
- DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery
- Instance-aware Contrastive Learning for Occluded Human Mesh Reconstruction
- PostureHMR: Posture Transformation for 3D Human Mesh Recovery
- Probabilistic Human Mesh Estimation with Hypothesis Scoring
- KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation
⭐code - Semantic Human Mesh Reconstruction with Textures
🏠project - TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation
🏠project - SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes (human meshes)
- R-Cyclic Diffuser: Reductive and Cyclic Latent Diffusion for 3D Clothed Human Digitalization
- DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion
🏠project - Synergistic Global-space Camera and Human Reconstruction from Videos
- SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion
- SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction
- Motion Capture
- ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning
🏠project - Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement
- Capturing Closely Interacted Two-Person Motions with Reaction Priors
🏠project - Loose Inertial Poser: Motion Capture with IMU-attached Loose-Wear Jacket
⭐code - Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera
⭐code
🏠project
- 3D Human Generation
- HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation
⭐code
🏠project - FlowMDM: Seamless Human Motion Composition with Blended Positional Encodings
⭐code
🏠project (human motion synthesis) - HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion
🏠project - Gaussian Shell Maps for Efficient 3D Human Generation
- Speech-Driven Human Animation
- Text-Prompted Human Animation
- Sign Language Translation
- 3D Pose Transfer
- Human Reconstruction
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild
- Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption
⭐code - Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer
⭐code - HiLo: Detailed and Robust 3D Clothed Human Reconstruction with High-and Low-Frequency Information of Parametric Models
- ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D Image
- Diffusion-FOF: Single-View Clothed Human Reconstruction via Diffusion-Based Fourier Occupancy Field
- VS: Reconstructing Clothed 3D Human from Single Image via Vertex Shift
⭐code - WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion
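🏠project (3D motion reconstruction)
- Category-Agnostic Pose Estimation
- Human Dynamics Estimation from Video
- Human Pose Regression
- 3D Human Models
- Human Generation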
- FairRAG: Fair Human Generation via Fair Retrieval Augmentation
- HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting
🏠project (text-driven 3D human generation) - Joint2Human: High-Quality 3D Human Generation via Compact Spherical Embedding of 3D Joints
🏠project - MoMask: Generative Masked Modeling of 3D Human Motions
🏠project (3D human motion)
- Human motion understanding
- Human shape
- Dance generation
- DisCo: Disentangled Control for Realistic Human Dance Generation
🏠project - POPDG: Popular 3D Dance Generation with PopDanceSet
- DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance
⭐code - Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives
🏠project - Bidirectional Autoregressive Diffusion Model for Dance Generation
- Learning from One Continuous Video Stream
- Deep Video Inverse Tone Mapping Based on Temporal Clues
- VTimeLLM: Empower LLM to Grasp Video Moments
- Combining Frame and GOP Embeddings for Neural Video Representation
- Learning to Predict Activity Progress by Self-Supervised Video Alignment
- CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
- vid-TLDR: Training Free Token Merging for Light-weight Video Transformer
⭐code - Video2Game: Real-time Interactive Realistic and Browser-Compatible Environment from a Single Video
⭐code - Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement
- Understanding Video Transformers via Universal Concept Discovery
- Video Recognition in Portrait Mode
🏠project - VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams
🏠project - Just Add π! Pose Induced Video Transformers for Understanding Activities of Daily Living
⭐code - A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
- Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens
🏠project - Towards HDR and HFR Video from Rolling-Mixed-Bit Spikings
- Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models
- Sleep monitoring
- Video understanding
- Compositional Video Understanding with Spatiotemporal Structure-based Transformers
- Action Scene Graphs for Long-Form Understanding of Egocentric Videos
- HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
- A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives
🏠project - Koala: Key Frame-Conditioned Long Video-LLM
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
⭐code - Abductive Ego-View Accident Video Understanding for Safe Driving Perception
🏠project - OmniVid: A Generative Framework for Universal Video Understanding
⭐code - A Unified Framework for Human-centric Point Cloud Video Understanding
- Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection
- MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
🏠project - TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
⭐code - Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
⭐code
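- Video summarization
- Video reconstruction
- Video representation
- Video interpretation
- Movie description
- Video surveillance
- Video prediction
- Video stabilization
- Video recognition
- Video dialogue
- Video relighting
- Video harmonization
- Video frame interpolation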
- Video Frame Interpolation via Direct Synthesis with the Event-based Reference
- IQ-VFI: Implicit Quadratic Motion Estimation for Video Frame Interpolation
- EVS-assisted Joint Deblurring Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling
- TTA-EVF: Test-Time Adaptation for Event-based Video Frame Interpolation via Reliable Pixel and Sample Estimation
- Sparse Global Matching for Video Frame Interpolation with Large Motion
⭐code - Perception-Oriented Video Frame Interpolation via Asymmetric Blending
⭐code
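👍 A new breakthrough in visual quality for video frame interpolation! Shanghai Jiao Tong University proposes PerVFI, a new paradigm for video frame interpolation - SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation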
🏠project
- Video subject swapping
- Video anomaly detection
- Open-Vocabulary Video Anomaly Detection
- Multi-Scale Video Anomaly Detection by Multi-Grained Spatio-Temporal Representation Learning
- Harnessing Large Language Models for Training-free Video Anomaly Detection
⭐code - Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline
⭐code - Prompt-Enhanced Multiple Instance Learning for Weakly Supervised Video Anomaly Detection
- MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection
- PREGO: Online Mistake Detection in PRocedural EGOcentric Videos
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors
⭐code - Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection
- GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?
🏠project
- Video scene detection
- Video mirror detection
- Automatic movie trailer generation
- Video-based conversational music recommendation
- Video Paragraph Grounding
- Video Grounding
- SnAG: Scalable and Accurate Video Grounding
⭐code - Context-Guided Spatio-Temporal Video Grounding
⭐code - Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
- What When and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
- Rapid 3D Model Generation with Intuitive 3D Input
- Instantaneous Perception of Moving Objects in 3D
- NEAT: Distilling 3D Wireframes from Neural Attraction Fields
⭐code - Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training
- LowRankOcc: Tensor Decomposition and Low-Rank Recovery for Vision-based 3D Semantic Occupancy Prediction
- TexOct: Generating Textures of 3D Models with Octree-based Diffusion
- Unsupervised 3D Structure Inference from Category-Specific Image Collections
- Garment Recovery with Shape and Deformation Priors
- ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
⭐code - CAGE: Controllable Articulation GEneration
⭐code
🏠project (3D) - Sparse Views, Near Light: A Practical Paradigm for Uncalibrated Point-Light Photometric Stereo
- Dispersed Structured Light for Hyperspectral 3D Imaging
- G-FARS: Gradient-Field-based Auto-Regressive Sampling for 3D Part Grouping
⭐code - Wonder3D: Single Image to 3D using Cross-Domain Diffusion
🏠project - UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence
⭐code (garment manipulation) - GoMVS: Geometrically Consistent Cost Aggregation for Multi-View Stereo
⭐code
⭐code - EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Priors
🏠project - MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation
🏠project - Digital Life Project: Autonomous 3D Characters with Social Intelligence
🏠project - Image Sculpting: Precise Object Editing with 3D Geometry Control
🏠project - TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations
- Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception
- GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
⭐code
🏠project - SHAP-EDITOR: Instruction-Guided Latent 3D Editing in Seconds
- ESR-NeRF: Emissive Source Reconstruction Using LDR Multi-view Images
- Differentiable Display Photometric Stereo
- ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion
⭐code
🏠project - Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps
- REACTO: Reconstructing Articulated Objects from a Single Video
⭐code - Low-Latency Neural Stereo Streaming
- Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes
- Spectrum AUC Difference (SAUCD): Human-aligned 3D Shape Evaluation
🏠project - Wired Perspectives: Multi-View Wire Art Embraces Generative AI
⭐code
🏠project - Memory-based Adapters for Online 3D Scene Perception
⭐code - FastMAC: Stochastic Spectral Sampling of Correspondence Graph
⭐code - One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion
⭐code
🏠project - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
🏠project - CityDreamer: Compositional Generative Model of Unbounded 3D Cities
🏠project - EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
⭐code - Mosaic-SDF for 3D Generative Models
🏠project - Federated Online Adaptation for Deep Stereo
🏠project - ControlRoom3D: Room Generation using Semantic Proxy Rooms
- 3D vision
- 3D reconstruction
- 3D Neural Edge Reconstruction
- 3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surfaces
🏠project
📺video - PanoRecon: Real-Time Panoptic 3D Reconstruction from Monocular Video
- NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images
- NTO3D: Neural Target Object 3D Reconstruction with Segment Anything
- pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
- ReconFusion: 3D Reconstruction with Diffusion Priors
- VGGSfM: Visual Geometry Grounded Deep Structure From Motion
⭐code
🏠project - Slice3D: Multi-Slice Occlusion-Revealing Single View 3D Reconstruction
🏠project - GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction
- Coherence As Texture - Passive Textureless 3D Reconstruction by Self-interference
⭐code - Structure-Aware Sparse-View X-ray 3D Reconstruction
⭐code
👍 How to give NeRF X-ray vision? - Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers
🏠project - Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning
⭐code
🏠project (multi-view reconstruction) - PlatoNeRF: 3D Reconstruction in Plato’s Cave via Single-View Two-Bounce Lidar
🏠project - WonderJourney: Going from Anywhere to Everywhere
🏠project - Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments
⭐code
🏠project - DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans
- IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images
⭐code - Splatter Image: Ultra-Fast Single-View 3D Reconstruction
⭐code
🏠project - PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar
⭐code
🏠project - MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections
⭐code - ZeroShape: Regression-based Zero-shot Shape Reconstruction
⭐code
🏠project - DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction
- G3DR: Generative 3D Reconstruction in ImageNet
⭐code
🏠project - 3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surface
⭐code - Bayesian Diffusion Models for 3D Shape Reconstruction
- RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction
- ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining
🏠project (360° view reconstruction)
- Surface reconstruction
- SuperNormal: Neural Surface Reconstruction via Multi-View Normal Integration
- MVCPS-NeuS: Multi-view Constrained Photometric Stereo for Neural Surface Reconstruction
- MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video
⭐code
🏠project - UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and UnFavOrable Data Sets
⭐code
⭐code
- 3D mesh reconstruction
- 3D shape
- GPLD3D: Latent Diffusion of 3D Shape Generative Models by Enforcing Geometric and Physical Priors
- TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding
⭐code - Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes
🏠project - ShapeWalk: Compositional Shape Editing Through Language-Guided Chains
⭐code
🏠project - Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation
- Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
🏠project - FSC: Few-point Shape Completion
- 3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation
⭐code
🏠project (3D shape) - Category-Level Multi-Part Multi-Joint 3D Shape Assembly
- Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation
- Stereo Matching
- Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching
⭐code - LoS: Local Structure-Guided Stereo Matching
- Robust Synthetic-to-Real Transfer for Stereo Matching
- Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching
- Neural Markov Random Field for Stereo Matching
⭐code - Reusable Architecture Growth for Continual Stereo Matching
- MoCha-Stereo: Motif Channel Attention Network for Stereo Matching
⭐code
🏠project - Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching
- Surface normal estimation
- Feature matching
- 3D retrieval
- Depth completion
- Flexible Depth Completion for Sparse and Varying Point Densities
- Improving Depth Completion via Depth Feature Upsampling
- Test-Time Adaptation for Depth Completion
- Bilateral Propagation Network for Depth Completion
- DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions
- Tri-Perspective View Decomposition for Geometry-Aware Depth Completion
⭐code
- Depth estimation
- Cross-spectral Gated-RGB Stereo Depth Estimation
- Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation
- Depth Prompting for Sensor-Agnostic Depth Estimation
- Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion
⭐code - On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
🏠project - Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation
⭐code - PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
- Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
🏠project - Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion
- ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
⭐code - From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior
- UniDepth: Universal Monocular Metric Depth Estimation
⭐code - WorDepth: Variational Language Prior for Monocular Depth Estimation
- SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing
- Snapshot Lidar: Fourier Embedding of Amplitude and Phase for Single-Image Depth Reconstruction
- Panoramic localization
- 3D keypoint detection
- Layout reconstruction
- CAD reconstruction
- Shape matching
- 3DGS
- COLMAP-Free 3D Gaussian Splatting
⭐code
🏠project - Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields
- GS-IR: 3D Gaussian Splatting for Inverse Rendering
- FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization
🏠project - Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
🏠project - GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models
🏠project - Mip-Splatting: Alias-free 3D Gaussian Splatting
⭐code
🏠project - CoGS: Controllable Gaussian Splatting
⭐code
🏠project - LangSplat: 3D Language Gaussian Splatting
⭐code
🏠project - Compact 3D Gaussian Representation for Radiance Field
🏠project - 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos
🏠project
- HUGS: Human Gaussian Splats
⭐code
🏠project - Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering (3DGS)
- GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces
- Scene reconstruction
- Gated Fields: Learning Scene Reconstruction from Gated Videos
- Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses
- SuperPrimitive: Scene Reconstruction at a Primitive Level
🏠project - Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction
⭐code - Behind the Veil: Enhanced Indoor 3D Scene Reconstruction with Occluded Surfaces Completion
- OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees
- VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction
⭐code - Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction
⭐code
🏠project
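👍 CVPR 2024 full-score paper: Zhejiang University proposes a new method for high-quality monocular dynamic reconstruction based on deformable 3D Gaussians - Polarization Wavefront Lidar: Learning Large Scene Reconstruction from Polarized Wavefronts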
🏠project
- 3D scene synthesis
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
🏠project - DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation
⭐code - BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation
🏠project (3D scene generation) - Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion
- Text-driven 3D scene generation
- 3D scene graphs
- 3D scene editing
- GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions
🏠project - Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training
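👍 Precise 3D scene editing from text or image prompts: Meitu, the Institute of Information Engineering (CAS), Beihang University, and Sun Yat-sen University jointly propose the 3D editing method CustomNeRF - PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI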
⭐code - Neural 3D Strokes: Creating Stylized 3D Scenes with Vectorized 3D Strokes
🏠project (3D scenes) - PAPR in Motion: Seamless Point-level 3D Scene Interpolation (3D scene interpolation)
- ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing
- Semantic matching
- Indoor lighting estimation
- 3D garment generation
- 3D shape matching
- Brain Decodes Deep Nets
- Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology
⭐code - MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling
- Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning
⭐code - MindBridge: A Cross-Subject Brain Decoding Framework
⭐code
⭐code - MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant
- Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images
⭐code - PairAug: What Can Augmented Image-Text Pairs Do for Radiology?
⭐code - Tumor Micro-environment Interactions Guided Graph Learning for Survival Analysis of Human Cancers from Whole-slide Pathological Images
- C^2RV: Cross-Regional and Cross-View Learning for Sparse-View CBCT Reconstruction
⭐code - VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis
- Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts
- CT
- Whole slide image classification
- Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction
- Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis
⭐code - Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification
🏠project - ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification
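- Tumor synthesis
- Pathology detection
- Genetic testing
- Cancer detection
- Medical image registration
- Medical image classification
- Medical image segmentation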
- Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation
- One-Prompt to Segment All Medical Images
- Diversified and Personalized Multi-rater Medical Image Segmentation
⭐code - Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding
⭐code - Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation
- Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation
- MAPSeg: Unified Unsupervised Domain Adaptation for Heterogeneous Medical Image Segmentation Based on 3D Masked Autoencoding and Pseudo-Labeling
- EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation
- Tyche: Stochastic In-Context Learning for Medical Image Segmentation
- Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention
🏠project - Clustering Propagation for Universal Medical Image Segmentation
- Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling (unsupervised semantic segmentation)
- MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation
⭐code (echocardiography video segmentation) - Bi-level Learning of Task-Specific Decoders for Joint Registration and One-Shot Medical Image Segmentation
- Constructing and Exploring Intermediate Domains in Mixed Domain Semi-supervised Medical Image Segmentation
⭐code - Incremental Nuclei Segmentation from Histopathological Images via Future-class Awareness and Compatibility-inspired Distillation
⭐code (nuclei segmentation) - PH-Net: Semi-Supervised Breast Lesion Segmentation via Patch-wise Hardness
⭐code (semi-supervised breast lesion segmentation) - PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation (panoramic renal pathology segmentation)
- Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation
⭐code
- X-ray
- MRI
- Anomaly detection
- Brain activity
- Survival prediction
- Computational pathology
- Histopathology
- SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology
- CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment
- Prompting Vision Foundation Models for Pathology Image Analysis
- Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
🏠project
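- Medical super-resolution
- 3D medical imaging
- Radiology report generation
- Radiology report retrieval
- Medical foundation models
- Tumor segmentation
- Gene expression prediction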
- Unsupervised Gaze Representation Learning from Multi-view Face Images
- ToonerGAN: Reinforcing GANs for Obfuscating Automated Facial Indexing
- PairDETR: Joint Detection and Association of Human Bodies and Faces
- Neural Implicit Morphing of Face Images
🏠project - SimAC: A Simple Anti-Customization Method for Protecting Face Privacy against Text-to-Image Synthesis of Diffusion Models
- Anatomically Constrained Implicit Face Models
- Face2Diffusion for Fast and Editable Face Personalization
⭐code
⭐code - Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection
⭐code - Self-Supervised Facial Representation Learning with Facial Region Awareness
- Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction
- VOODOO 3D: Volumetric Portrait Disentanglement For One-Shot 3D Head Reenactment
- Face editing
- Facial expression
- Face recognition
- OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition (depression recognition)
- Privacy-Preserving Face Recognition Using Trainable Feature Subtraction
⭐code - KeyPoint Relative Position Encoding for Face Recognition
- LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition
- Validating Privacy-Preserving Face Recognition under a Minimum Assumption
- Face synthesis
- Deformable One-shot Face Stylization via DINO Semantic Guidance
⭐code - Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
- DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation
- LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example
⭐code
🏠project - Text-Guided 3D Face Synthesis - From Generation to Editing
- Text-conditional Attribute Alignment across Latent Spaces for 3D Controllable Face Image Synthesis
- UV-IDM: Identity-Conditioned Latent Diffusion Model for Face UV-Texture Generation
- Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation
- Face reconstruction
- High-Quality Facial Geometry and Appearance Capture at Home
⭐code
🏠project - Monocular Identity-Conditioned Facial Reflectance Reconstruction
⭐code
👍 3D digital human reconstruction, editing, and animation - 3D Face Reconstruction with the Geometric Guidance of Facial Part Segmentation
⭐code - 3D-Aware Face Editing via Warping-Guided Latent Direction Learning
🏠project
👍 3D digital human reconstruction, editing, and animation
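- Face retouching
- Face reenactment
- Face restoration
- Face de-identification
- Face makeup
- Facial landmarks
- Face attribute classification
- Face anti-spoofing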
- One-Class Face Anti-spoofing via Spoof Cue Map-Guided Feature Learning
- Rethinking Generalizable Face Anti-spoofing via Hierarchical Prototype-guided Distribution Refinement in Hyperbolic Space
- CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing
- Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing
⭐code - Gradient Alignment for Cross-Domain Face Anti-Spoofing
⭐code - Test-Time Domain Generalization for Face Anti-Spoofing
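- Facial action units
- Face image quality
- Portrait editing
- Hair reconstruction
- 3D face
- 4D head avatar synthesis
- Head avatar reconstruction
- Talking head synthesis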
- Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis
- Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
- CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation
- SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis
⭐code
🏠project - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
⭐code - FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
⭐code
🏠project - FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
- Defense against face editing abuse
- 3D head avatars
- Makeup transfer
- Face re-identification
- Age estimation
- Emotion recognition
- L-MAGIC: Language Model Assisted Generation of Images with Coherence
- CapsFusion: Rethinking Image-Text Data at Scale
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
- Scaling Laws of Synthetic Images for Model Training ... for Now
- An Edit Friendly DDPM Noise Space: Inversion and Manipulations
⭐code
🏠project - CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation
⭐code
🏠project - CapHuman: Capture Your Moments in Parallel Universes
⭐code
🏠project - Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles
🏠project - IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation
- TexTile: A Differentiable Metric for Texture Tileability
🏠project - SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
⭐code
🏠project - MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
⭐code - Text-Image Alignment for Diffusion-Based Perception
- AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
- FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
⭐code - It's All About Your Sketch: Democratising Sketch Control in Diffusion Models
⭐code - Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling
- ProMark: Proactive Diffusion Watermarking for Causal Attribution
- DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception
- GAN
- StyLitGAN: Image-Based Relighting via Latent Control
⭐code
🏠project - StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN
- What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs
🏠project - Diversity-aware Channel Pruning for StyleGAN Compression
⭐code - Adversarial Score Distillation: When Score Distillation Meets GAN
⭐code
🏠project
- Diffusion
- Fixed Point Diffusion Models
🏠project - Diffusion Models Without Attention
- Image Neural Field Diffusion Models
- Functional Diffusion
🏠project - Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
- Learned Representation-Guided Diffusion Models for Large-Image Generation
- ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models
- Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
- LightIt: Illumination Modeling and Control for Diffusion Models
- Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner
- MMA-Diffusion: MultiModal Attack on Diffusion Models
- CommonCanvas: Open Diffusion Models Trained on Creative-Commons Images
- Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?
- Self-correcting LLM-controlled Diffusion Models
- Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
- SODA: Bottleneck Diffusion Models for Representation Learning
- PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
- Don't Drop Your Samples! Coherence-Aware Training Benefits Conditional Diffusion
🏠project - Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture
- DiffLoc: Diffusion Model for Outdoor LiDAR Localization
👍 abstract - EasyDrag: Efficient Point-based Manipulation on Diffusion Models
- Distilling ODE Solvers of Diffusion Models into Smaller Steps
- Cache Me if You Can: Accelerating Diffusion Models through Block Caching
🏠project - Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples
- AAMDM: Accelerated Auto-regressive Motion Diffusion Model
- DeepCache: Accelerating Diffusion Models for Free
🏠project - Diffusion Model Alignment Using Direct Preference Optimization
- Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models
- Analyzing and Improving the Training Dynamics of Diffusion Models
- Residual Learning in Diffusion Models
- FreeU: Free Lunch in Diffusion U-Net
⭐code
🏠project - VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
⭐code
🏠project - Diff-BGM: A Diffusion Model for Video Background Music Generation
- Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models
- Shadow Generation for Composite Image Using Diffusion Model
⭐code - Alchemist: Parametric Control of Material Properties with Diffusion Models
- Orthogonal Adaptation for Modular Customization of Diffusion Models
🏠project (diffusion models) - Observation-Guided Diffusion Probabilistic Models
⭐code - TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
⭐code - Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models
- SPAD: Spatially Aware Multi-View Diffusers
🏠project - Structure-Guided Adversarial Training of Diffusion Models
- One-step Diffusion with Distribution Matching Distillation
🏠project - Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
⭐code - Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models
⭐code
🏠project - X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
🏠project - Readout Guidance: Learning Control from Diffusion Features
🏠project - PointInfinity: Resolution-Invariant Point Diffusion Models
🏠project - Unsupervised Keypoints from Pretrained Diffusion Models
⭐code - Amodal Completion via Progressive Mixed Context Diffusion
⭐code
🏠project - SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution
🏠project - DREAM: Diffusion Rectification and Estimation-Adaptive Models
- Towards Memorization-Free Diffusion Models
- Efficient Dataset Distillation via Minimax Diffusion
⭐code - MatFuse: Controllable Material Generation with Diffusion Models
⭐code
🏠project - Accelerating Diffusion Sampling with Optimized Time Steps
- Boosting Diffusion Models with Moving Average Sampling in Frequency Domain
- One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications
⭐code
🏠project - Balancing Act: Distribution-Guided Debiasing in Diffusion Models
⭐code - Shadow Generation for Composite Image Using Diffusion model
⭐code - MACE: Mass Concept Erasure in Diffusion Models
⭐code - DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
🏠project
🏠project
⭐code - Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models
⭐code
🏠project - DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
⭐code
🏠project - SVGDreamer: Text Guided SVG Generation with Diffusion Model
⭐code
🏠project
👍SVGDreamer: Beihang University & HKU release a new text-guided differentiable rendering method for vector graphics - Relation Rectification in Diffusion Model
⭐code
🏠project
- Fixed Point Diffusion Models
- Image Synthesis/Generation
- Image Synthesis
- One-Shot Structure-Aware Stylized Image Synthesis
- AnyScene: Customized Image Synthesis with Composited Foreground
- Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance
- ViewFusion: Towards Multi-View Consistency via Interpolated Denoising
⭐code - PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
- Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks
- Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis
- Unmixing Before Fusion: A Generalized Paradigm for Multi-Source-based Hyperspectral Image Synthesis
- Unlocking Pre-trained Image Backbones for Semantic Image Synthesis
- Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis
- Scene-Text Image Synthesis
- Image Generation
- ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation
🏠project - SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation
- AnyDoor: Zero-shot Object-level Image Customization
🏠project - Taming Stable Diffusion for Text to 360 Panorama Image Generation
⭐code
⭐code - Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations
- Generative Image Dynamics
🏠project - Clockwork Diffusion: Efficient Generation With Model-Step Distillation
- UniGS: Unified Representation for Image Generation and Segmentation
⭐code (image generation) - Exact Fusion via Feature Distribution Matching for Few-shot Image Generation
- FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition
⭐code
🏠project - Adversarial Text to Continuous Image Generation
- Style Aligned Image Generation via Shared Attention
🏠project - CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
- Instruct-Imagen: Image Generation with Multi-modal Instruction
- InstanceDiffusion: Instance-level Control for Image Generation
⭐code
🏠project - DemoFusion: Democratising High-Resolution Image Generation With No $$$
⭐code
🏠project - ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
⭐code
🏠project
⭐code - When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation
⭐code
🏠project - Correcting Diffusion Generation through Resampling
⭐code - Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder
- Condition-Aware Neural Network for Controlled Image Generation
- A Unified and Interpretable Emotion Representation and Expression Generation
⭐code - Rethinking FID: Towards a Better Evaluation Metric for Image Generation
- Subject-Driven Image Generation
- Text-to-Image
- Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis
- Learning Multi-Dimensional Human Preference for Text-to-Image Generation
- Customization Assistant for Text-to-Image Generation
- TokenCompose: Text-to-Image Diffusion with Token-level Supervision
- FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
- Personalized Residuals for Concept-Driven Text-to-Image Generation
🏠project - Rich Human Feedback for Text-to-Image Generation
- MarkovGen: Structured Prediction for Efficient Text-to-Image Generation
- Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models
- SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
- MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
🏠project - Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation
🏠project - Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models
- DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
🏠project - UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
- Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
- Countering Personalized Text-to-Image Generation with Influence Watermarks
- Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
🏠project - Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting
⭐code - InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
- FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion
- Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
🏠project - LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model
⭐code
🏠project - HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
🏠project - PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
🏠project - On the Scalability of Diffusion-based Text-to-Image Generation
- Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
🏠project - EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
⭐code - Grounded Text-to-Image Synthesis with Attention Refocusing
🏠project - OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
⭐code - Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
⭐code - CONFORM: Contrast is All You Need for High-Fidelity Text-to-Image Diffusion Models
- InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization
⭐code - Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models
- Cross Initialization for Face Personalization of Text-to-Image Models (text-to-image; also titled: Cross Initialization for Personalized Text-to-Image Generation)
- CosmicMan: A Text-to-Image Foundation Model for Humans
⭐code - Dynamic Prompt Optimizing for Text-to-Image Generation
⭐code - WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
- Attention Calibration for Disentangled Text-to-Image Personalization
- RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
⭐code - InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
⭐code
🏠project - Learning Continuous 3D Words for Text-to-Image Generation
⭐code
🏠project - NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging
⭐code - HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
🏠project - Discriminative Probing and Tuning for Text-to-Image Generation
⭐code
🏠project - Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization
- ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
⭐code
🏠project - FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models
⭐code - MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning
- Subject-to-Image
- High-fidelity Person-centric Subject-to-Image Synthesis
⭐code
- Image Synthesis
- Video Synthesis/Generation
- Video Generation
- InstructVideo: Instructing Video Diffusion Models with Human Feedback
⭐code
🏠project - Make Pixels Dance: High-Dynamic Video Generation
- GenTron: Diffusion Transformers for Image and Video Generation
- Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
- Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models
🏠project - DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation
- VideoBooth: Diffusion-based Video Generation with Image Prompts
🏠project - Hierarchical Patch Diffusion Models for High-Resolution Video Generation
⭐code - On the Content Bias in Fréchet Video Distance
⭐code - 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
- SimDA: Simple Diffusion Adapter for Efficient Video Generation
⭐code
🏠project - GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos (video generation)
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
🏠project
⭐code - Vlogger: Make Your Dream A Vlog
⭐code
🏠project - LAMP: Learn A Motion Pattern for Few-Shot Video Generation
🏠project - EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
🏠project - Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
⭐code - BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
⭐code
🏠project (video synthesis) - DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
🏠project - PEEKABOO: Interactive Video Generation via Masked-Diffusion
🏠project
- Text-to-Video
- Grid Diffusion Models for Text-to-Video Generation
- Breathing Life Into Sketches Using Text-to-Video Priors
⭐code
🏠project - Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
- TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models
🏠project - Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
🏠project - A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
🏠project - TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
⭐code - Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
🏠project - VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
🏠project - MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
- Image-to-Video
- Video-to-Video
- Video Generation
- Texture Generation/Synthesis
- Text-to-Texture Synthesis
- Texture Synthesis
- Text-to-3D
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data
⭐code
🏠project - PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion
- Text-to-3D using Gaussian Splatting
⭐code
🏠project - DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling
- RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D
- Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior
- Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior
⭐code - LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching
⭐code - Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion
🏠project - Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior
🏠project (text-to-3D) - Taming Mode Collapse in Score Distillation for Text-to-3D Generation
🏠project - Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors
🏠project - DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior
⭐code - VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation
⭐code - GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
⭐code
🏠project - Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences
- DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors
🏠project - HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation
- Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
🏠project - HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D
🏠project
- Image-to-3D
- Text-to-4D
- 3D Generation
- DreamComposer: Controllable 3D Object Generation via Multi-View Conditions
🏠project - XCube (X3): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies
🏠project - CAD: Photorealistic 3D Generation via Adversarial Distillation
⭐code
🏠project - Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models
⭐code (3D content) - Interactive3D: Create What You Want by Interactive 3D Generation
- Semantic Scene Generation
- Scene Completion
- Unleashing Network Potentials for Semantic Scene Completion
- Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation
⭐code - Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
⭐code (3D semantic) - PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness
- Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion
- Image-to-Image Translation
- Image Detection
- Image Editing
- Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
⭐code - Emu Edit: Precise Image Editing via Recognition and Generation Tasks
- An Edit Friendly DDPM Noise Space: Inversion and Manipulations
- Distraction is All You Need: Memory-Efficient Image Immunization against Diffusion-Based Image Editing
- DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing
⭐code - DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
⭐code
🏠project - UniHuman: A Unified Model For Editing Human Images in the Wild
⭐code - Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing
🏠project - Inversion-Free Image Editing with Language-Guided Diffusion Models
⭐code
🏠project - TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing
⭐code - Edit One for All: Interactive Batch Image Editing
⭐code
🏠project - SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
🏠project - On Exact Inversion of DPM-Solvers
⭐code
🏠project - Doubly Abductive Counterfactual Inference for Text-based Image Editing
⭐code (text-based image editing) - Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
- ZONE: Zero-Shot Instruction-Guided Local Editing
- HIVE: Harnessing Human Feedback for Instructional Visual Editing
- FreeDrag: Feature Dragging for Reliable Point-based Image Editing
- The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
⭐code - Text-Driven Image Editing via Learnable Regions
⭐code
🏠project - LEDITS++: Limitless Image Editing using Text-to-Image Models
⭐code
🏠project - SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
⭐code
🏠project - Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing
⭐code
🏠project - PAIR-Diffusion: A Comprehensive Multimodal Object-Level Image Editor
⭐code
🏠project - Referring Image Editing: Object-level Image Editing via Referring Expressions
- Prompt Augmentation for Self-supervised Text-guided Image Manipulation
- Named Entity Driven Zero-Shot Image Manipulation
- Layout Generation
- Constrained Layout Generation with Factor Graphs
- SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control
- MaskPLAN: Masked Generative Layout Planning from Partial Input
⭐code - Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
⭐code
🏠project - Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout Generation
🏠project
- Handwritten Mathematical Expressions
- NeRF-to-NeRF
- GenN2N: Generative NeRF2NeRF Translation (NeRF-to-NeRF)
- Camouflaged Image Generation
- Scene Generation
- Interactive Editing
- Video Editing
- CCEdit: Creative and Controllable Video Editing via Diffusion Models
🏠project
📺video - MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
⭐code
🏠project - A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
🏠project - Video-P2P: Video Editing with Cross-attention Control
🏠project - VidToMe: Video Token Merging for Zero-Shot Video Editing
🏠project - Video Interpolation with Diffusion Models
⭐code - MotionEditor: Editing Video Motion via Content-Aware Diffusion
- CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-driven Video Editing
⭐code - DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing
🏠project
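- Comic Generation
- Text-Driven 3D Stylization
- Image Warping
- Image Reconstruction
- Image Stitching
- Pose-Guided Human Image Synthesis
- Text-Guided Human Image Synthesis
- Text-Image Alignment
- Text-Based Image Tone Adjustment
- Image Vectorization
- Text-to-Vector
- Vector Fonts
- Vector Graphics Synthesis
- QR Code Generation
- Background Replacement
- Deghosting
- Shadow Removal
- Deblurring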
- Unsupervised Blind Image Deblurring Based on Self-Enhancement
- Latency Correction for Event-guided Deblurring and Frame Interpolation
- LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network
- ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation
- Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains
⭐code - AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring
⭐code
⭐code - A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning
⭐code
- Dehazing
- Denoising
- Real-World Mobile Image Denoising Dataset with Efficient Baselines
- GenesisTex: Adapting Image Denoising Diffusion to Texture Space
- Robust Image Denoising through Adversarial Frequency Mixup
- Exploring Efficient Asymmetric Blind-Spots for Self-Supervised Denoising in Real-World Scenarios
- Masked and Shuffled Blind Spot Denoising for Real-World Images
- LAN: Learning to Adapt Noise for Image Denoising
- Unmixing Diffusion for Self-Supervised Hyperspectral Image Denoising
- Stable Neighbor Denoising for Source-free Domain Adaptive Segmentation
- Transfer CLIP for Generalizable Image Denoising
- Residual Denoising Diffusion Models
⭐code - Equivariant plug-and-play image reconstruction
⭐code - Patch2Self2: Self-supervised Denoising on Coresets via Matrix Sketching
- Hyper-MD: Mesh Denoising with Customized Parameters Aware of Noise Intensity and Geometric Characteristics
- Zero-Shot Illumination-Guided Joint Denoising and Adaptive Enhancement for Low-Light Images
👍Chinese summary
- Deraining
- Reflection Removal
- Photo Retouching
- Image Enhancement
- Color Shift Estimation-and-Correction for Image Enhancement
- FlowIE: Efficient Image Enhancement via Rectified Flow
- Fourier Priors-Guided Diffusion for Zero-Shot Joint Low-Light Enhancement and Deblurring
- Specularity Factorization for Low-Light Enhancement
- Zero-Reference Low-Light Enhancement via Physical Quadruple Priors
⭐code - Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach
⭐code - Empowering Resampling Operation for Ultra-High-Definition Image Enhancement with Model-Aware Guidance
- Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving
- Image Restoration
- Learning Diffusion Texture Priors for Image Restoration
- CoDe: An Explicit Content Decoupling Framework for Image Restoration
- Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration
- Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
- Look-Up Table Compression for Efficient Image Restoration
- HIR-Diff: Unsupervised Hyperspectral Image Restoration Via Improved Diffusion Models
- DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
- Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance
⭐code - Deep Equilibrium Diffusion Restoration with Parallel Sampling
⭐code - Distilling Semantic Priors from SAM to Efficient Image Restoration Models
- Boosting Image Restoration via Priors from Pre-trained Models
- Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration
- Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model
⭐code - Restoration by Generation with Constrained Priors
🏠project - Multimodal Prompt Perceiver: Empower Adaptiveness Generalizability and Fidelity for All-in-One Image Restoration
- Improving Image Restoration through Removing Degradations in Textual Representations
⭐code
- Image Inpainting
- Brush2Prompt: Contextual Prompt Generator for Object Inpainting
- Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting
- NeRFiller: Completing Scenes via Generative 3D Inpainting
- MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior (3D inpainting)
- Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
⭐code
- Image Super-Completion
- Image Quality
- Blind Image Quality Assessment Based on Geometric Order Learning
- Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization
- Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment
- TextCraftor: Your Text Encoder Can be Image Quality Controller
- Boosting Image Quality Assessment through Efficient Transformer Adaptation with Local Feature Enhancement
- Adverse Weather Removal
- Atmospheric Turbulence Removal
- Image Portrait Relighting
- Image Downscaling
- Image Rectification
- Image Colorization
- Motion (De)blurring
- Motion Blur Decomposition with Cross-shutter Guidance
- Spike-guided Motion Deblurring with Unknown Modal Spatiotemporal Alignment
⭐code - Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization
- Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring
- Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring
⭐code
- Video Inpainting
- Video Dehazing
- Video De-rendering
- Video Deblurring
- Frequency-aware Event-based Video Deblurring for Real-World Motion Blur
- Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring
⭐code
🏠project - FMA-Net: Flow Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring
⭐code
🏠project - DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video
🏠project
- Video Enhancement
- Video Quality Assessment
- PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild
- Learned Scanpaths Aid Blind Panoramic Video Quality Assessment
- Modular Blind Video Quality Assessment
- KVQ: Kwai Video Quality Assessment for Short-form Videos
- CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement
⭐code
- Nighttime Color Constancy
- Illumination Estimation
- Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
- Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
⭐code
🏠project - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
⭐code - MeaCap: Memory-Augmented Zero-shot Image Captioning
⭐code - Sieve: Multimodal Dataset Pruning using Image Captioning Models
- EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension
- Video Description/Captioning
- Streaming Dense Video Captioning
⭐code
⭐code - Video ReCap: Recursive Captioning of Hour-Long Videos
⭐code
🏠project
🌻dataset - Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- VideoCon: Robust Video-Language Alignment via Contrast Captions
⭐code
🏠project - Retrieval-Augmented Egocentric Video Captioning
- Dense Captioning
- Illustrated Instruction Generation
- Video Compression
- Image Compression
- Towards Backward-Compatible Continual Learning of Image Compression
⭐code - Generative Latent Coding for Ultra-Low Bitrate Image Compression
- Dual Prior Unfolding for Snapshot Compressive Imaging
- Enhancing Quality of Compressed Images by Mitigating Enhancement Bias Towards Compression Domain
- SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image
⭐code - JDEC: JPEG Decoding via Enhanced Continuous Cosine Coefficients (JPEG decoding)
- Learned Lossless Image Compression based on Bit Plane Slicing
- Image Processing GNN: Breaking Rigidity in Super-Resolution
- Learning Large-Factor EM Image Super-Resolution with Generative Priors
- Super-Resolution Reconstruction from Bayer-Pattern Spike Streams
- Continuous Optical Zooming: A Benchmark for Arbitrary-Scale Image Super-Resolution in Real World
- Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary
⭐code - Learning Coupled Dictionaries from Unpaired Data for Image Super-Resolution
- SinSR: Diffusion-Based Image Super-Resolution in a Single Step
⭐code - CAMixerSR: Only Details Need More "Attention"
- Text-guided Explorable Image Super-resolution
- CFAT: Unleashing Triangular Windows for Image Super-resolution
- SeD: Semantic-Aware Discriminator for Image Super-Resolution
- Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts
- Boosting Flow-based Generative Super-Resolution Models via Learned Prior
⭐code - Beyond Image Super-Resolution for Image Recognition with Task-Driven Perceptual Loss
⭐code - AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution
⭐code - Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer
- DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF (super-resolution)
- Neural Super-Resolution for Real-time Rendering with Radiance Demodulation
- Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution
⭐code - Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning
- CoSeR: Bridging Image and Language for Cognitive Super-Resolution
⭐code
🏠project - Navigating Beyond Dropout: An Intriguing Solution towards Generalizable Image Super Resolution
- Bilateral Event Mining and Complementary for Event Stream Super-Resolution
- Blind Image Super-Resolution
- Real-World Super-Resolution
- Universal Robustness via Median Randomized Smoothing for Real-World Super-Resolution
- VSR
- Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution
- Enhancing Video Super-Resolution via Implicit Resampling-based Alignment
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution
🏠project - Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention
⭐code
- Text Image Super-Resolution
- Fair-VPT: Fair Visual Prompt Tuning for Image Classification
- Logarithmic Lenses: Exploring Log RGB Data for Image Classification
- SLICE: Stabilized LIME for Consistent Explanations for Image Classification
- Classes Are Not Equal: An Empirical Study on Image Recognition Fairness
- MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes
- SURE: SUrvey REcipes for building reliable and robust deep networks
⭐code - A Bayesian Approach to OOD Robustness in Image Classification
- Fourier-basis Functions to Bridge Augmentation Gap: Rethinking Frequency Augmentation in Image Classification
- Hyperspherical Classification with Dynamic Label-to-Prototype Assignment
⭐code - Discover and Mitigate Multiple Biased Subgroups in Image Classifiers
⭐code - Deep Imbalanced Regression via Hierarchical Classification Adjustment
- Large Language Models are Good Prompt Learners for Low-Shot Image Classification
⭐code - Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
- Bayesian Exploration of Pre-trained Models for Low-shot Image Classification
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification
⭐code - In-distribution Public Data Synthesis with Diffusion Models for Differentially Private Image Classification
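- Domain-Generalized Image Classification
- Long-Tailed Recognition
- Few-Shot Image Classification
- Zero-Shot Classification
- Fine-Grained
- Open-Set Classification
- Few-Shot Recognition
- GCD (Generalized Category Discovery)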
- Matching Anything by Segmenting Anything
⭐code - Unsupervised Universal Image Segmentation
- MESA: Matching Everything by Segmenting Anything
- MRFS: Mutually Reinforcing Image Fusion and Segmentation
- RobustSAM: Segment Anything Robustly on Degraded Images
- Hierarchical Histogram Threshold Segmentation - Auto-terminating High-detail Oversegmentation
- Multi-Space Alignments Towards Universal LiDAR Segmentation
- CoralSCOP: Segment any COral Image on this Planet (segmentation)
- SANeRF-HQ: Segment Anything for NeRF in High Quality
🏠project - ASAM: Boosting Segment Anything Model with Adversarial Tuning
- ODIN: A Single Model for 2D and 3D Segmentation
⭐code - FocSAM: Delving Deeply into Focused Objects in Segmenting Anything
👍Summary - EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
- Universal Segmentation at Arbitrary Granularity with Language Instruction (universal segmentation)
- Segment and Caption Anything
🏠project - COCONut: Modernizing COCO Segmentation
⭐code - Multi-view Aggregation Network for Dichotomous Image Segmentation
⭐code - OMG-Seg: Is One Model Good Enough For All Segmentation?
🏠project - Unsegment Anything by Simulating Deformation
- BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model
⭐code - VRP-SAM: SAM with Visual Reference Prompt
- PEM: Prototype-based Efficient MaskFormer for Image Segmentation
- Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM
⭐code - CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
🏠project - Benchmarking Segmentation Models with Mask-Preserved Attribute Editing
⭐code - CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers
- Continual Segmentation with Disentangled Objectness Learning and Class Recognition
⭐code - Kandinsky Conformal Prediction: Efficient Calibration of Image Segmentation Algorithms
⭐code - Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
⭐code
🏠project - Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model
- A Simple Recipe for Language-guided Domain Generalized Segmentation
🏠project - Rethinking Interactive Image Segmentation with Low Latency High Quality and Diverse Prompts
⭐code - Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation
⭐code
👍Does the Segment Anything Model (SAM) generalize poorly? A domain adaptation strategy fixes it - Open-Vocabulary Segmentation
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation
- USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation
- Open-Vocabulary Segmentation with Semantic-Assisted Calibration
⭐code - OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation
- Video Segmentation
- UniVS: Unified and Universal Video Segmentation with Prompts as Queries
⭐code - Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence
🏠project (video segmentation) - Learning to Segment Referred Objects from Narrated Egocentric Videos
- Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
⭐code
- Semantic Segmentation
- Open Set Domain Adaptation for Semantic Segmentation
- ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention
- MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation
- TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation
- ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
- HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation
- Contextrast: Contextual Contrastive Learning for Semantic Segmentation
- SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation
⭐code - Frequency-Adaptive Dilated Convolution for Semantic Segmentation
⭐code - GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation
- Improving Bird's Eye View Semantic Segmentation by Task Decomposition
⭐code - UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather
- Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
- Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball
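- 3D semantic segmentation
- Point cloud semantic segmentation
- Unsupervised semantic segmentation
- Few-shot semantic segmentation
- Zero-shot semantic segmentation
- Semi-supervised semantic segmentation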
- Training Vision Transformers for Semi-Supervised Semantic Segmentation
- Density-Guided Semi-Supervised 3D Semantic Segmentation with Dual-Space Hardness Sampling
- AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation
⭐code - CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation
⭐code - Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation
⭐code - RankMatch: Exploring the Better Consistency Regularization for Semi-supervised Semantic Segmentation
- Weakly supervised semantic segmentation
- Class Tokens Infusion for Weakly Supervised Semantic Segmentation
- Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation
- DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation
⭐code - Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation
⭐code - Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation
⭐code - PSDPM: Prototype-based Secondary Discriminative Pixels Mining for Weakly Supervised Semantic Segmentation
- From SAM to CAMs: Exploring Segment Anything Model for Weakly Supervised Semantic Segmentation
- Domain generalized semantic segmentation
- Collaborating Foundation Models for Domain Generalized Semantic Segmentation
⭐code - Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning
⭐code - Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
⭐code
- Text-supervised semantic segmentation
- Open-world semantic segmentation
- Open-vocabulary semantic segmentation
- Open-Vocabulary 3D Semantic Segmentation with Foundation Models
- Open-Vocabulary Semantic Segmentation with Image Embedding Balancing
- CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
- Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation
- SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
⭐code - Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
⭐code
- Panoptic segmentation
- Semantics Distortion and Style Matter: Towards Source-free UDA for Panoramic Segmentation
- ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning
⭐code - PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
⭐code - Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations
⭐code
- Instance segmentation
- Extreme Point Supervised Instance Segmentation
- Mudslide: A Universal Nuclear Instance Segmentation Method
- Semantic-aware SAM for Point-Prompted Instance Segmentation
- SAI3D: Segment Any Instance in 3D Scenes
🏠project - DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data
- FISBe: A Real-World Benchmark Dataset for Instance Segmentation of Long-Range Thin Filamentous Structures
⭐code - Teeth-SEG: An Efficient Instance Segmentation Framework for Orthodontic Treatment based on Multi-Scale Aggregation and Anthropic Prior Knowledge
- Open-vocabulary instance segmentation
- 3D instance segmentation
- BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation
⭐code - Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance
- Edge-Aware 3D Instance Segmentation Network with Intelligent Semantic Prior
- UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes
- Scene segmentation
- Action segmentation
- Progress-Aware Online Action Segmentation for Egocentric Procedural Task Videos
- Coherent Temporal Synthesis for Incremental Action Segmentation
- Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment
- Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation
- FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Action Segmentation
⭐code
- Referring image segmentation
- Referring expression segmentation
- VOS
- Point-VOS: Pointing Up Video Object Segmentation
🏠project - Dual Prototype Attention for Unsupervised Video Object Segmentation
⭐code - Depth-aware Test-Time Training for Zero-shot Video Object Segmentation
⭐code - Putting the Object Back into Video Object Segmentation
🏠project - Event-assisted Low-Light Video Object Segmentation
- Guided Slot Attention for Unsupervised Video Object Segmentation
- LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation
⭐code - RMem: Restricted Memory Banks Improve Video Object Segmentation
- VSS
- VIS
- Matting
- Few-shot segmentation
- Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation
- Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation
⭐code - Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach
- Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation
- LLaFS: When Large Language Models Meet Few-Shot Segmentation
- Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining
⭐code - Addressing Background Context Bias in Few-Shot Segmentation through Iterative Modulation
- Zero-shot segmentation
- Crack segmentation
- Interactive segmentation
- Amodal segmentation
- 3D segmentation
-
Learning Degradation-Independent Representations for Camera ISP Pipelines
-
Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion
-
Event-based Visible and Infrared Fusion via Multi-task Collaboration
-
DemoCaricature: Democratising Caricature Generation with a Rough Sketch
-
PolarRec: Improving Radio Interferometric Data Reconstruction Using Polar Coordinates
-
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
-
Fully Exploiting Every Real Sample: SuperPixel Sample Gradient Model Stealing
-
Beyond Seen Primitive Concepts and Attribute-Object Compositional Learning
-
Rotation-Agnostic Image Representation Learning for Digital Pathology
-
Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences
-
Multiview Aerial Visual RECognition (MAVREC): Can Multi-view Improve Aerial Visual Perception?
-
Generative Proxemics: A Prior for 3D Social Interaction from Images
-
Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning
-
General Point Model Pretraining with Autoencoding and Autoregressive
-
Estimating Extreme 3D Image Rotations using Cascaded Attention
-
ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks
-
HIT: Estimating Internal Human Implicit Tissues from the Body Surface
-
CAPE: CAM as a Probabilistic Ensemble for Enhanced DNN Interpretation
-
Fooling Polarization-Based Vision using Locally Controllable Polarizing Projection
-
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos
-
ReCoRe: Regularized Contrastive Representation Learning of World Model
-
Self-Calibrating Vicinal Risk Minimisation for Model Calibration
-
From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration
-
GLACE: Global Local Accelerated Coordinate Encoding
⭐code
🏠project -
Objects as Volumes: A Stochastic Geometry View of Opaque Solids
-
Efficient Model Stealing Defense with Noise Transition Matrix
-
OpenStreetView-5M: The Many Roads to Global Visual Geolocation
-
WaveMo: Learning Wavefront Modulations to See Through Scattering
-
All Rivers Run to the Sea: Private Learning with Asymmetric Flows
-
HDQMF: Holographic Feature Decomposition Using Quantum Algorithms
-
Cross-dimension Affinity Distillation for 3D EM Neuron Segmentation
-
READ: Retrieval-Enhanced Asymmetric Diffusion for Motion Planning
-
AssistGUI: Task-Oriented PC Graphical User Interface Automation
-
Towards Robust Learning to Optimize with Theoretical Guarantees
-
One-dimensional Adapter to Rule Them All: Concepts Diffusion Models and Erasing Applications
-
Online Task-Free Continual Generative and Discriminative Learning via Dynamic Cluster Memory
-
Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households
-
The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes
-
Segment Any Event Streams via Weighted Adaptation of Pivotal Tokens
-
Versatile Navigation Under Partial Observability via Value-guided Diffusion Policy
-
DiffusionLight: Light Probes for Free by Painting a Chrome Ball
-
Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement
-
Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing
-
QUADify: Extracting Meshes with Pixel-level Details and Materials from Images
-
Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities
-
E-GPS: Explainable Geometry Problem Solving via Top-Down Solver and Bottom-Up Generator
-
Zero-Shot Structure-Preserving Diffusion Model for High Dynamic Range Tone Mapping
-
Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata
-
Leveraging Camera Triplets for Efficient and Accurate Structure-from-Motion
-
Partial-to-Partial Shape Matching with Geometric Consistency
-
Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding
-
Permutation Equivariance of Transformers and Its Applications
-
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
-
Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation
-
Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence
-
Spatio-Temporal Turbulence Mitigation: A Translational Perspective
-
Finsler-Laplace-Beltrami Operators with Application to Shape Analysis
-
LASA: Instance Reconstruction from Real Scans using A Large-scale Aligned Shape Annotation Dataset
-
Learning to Navigate Efficiently and Precisely in Real Environments
-
Real-Time Exposure Correction via Collaborative Transformations and Adaptive Sampling
-
Robust Self-calibration of Focal Lengths from the Fundamental Matrix
-
PromptCoT: Align Prompt Distribution via Adapted Chain-of-Thought
-
Flow-Guided Online Stereo Rectification for Wide Baseline Stereo
-
Don't Drop Your Samples! Coherence-Aware Training Benefits Conditional Diffusion
-
Improving Spectral Snapshot Reconstruction with Spectral-Spatial Rectification
-
Multimodal Representation Learning by Alternating Unimodal Adaptation (multimodal)
-
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
-
Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering (multimodal)
-
Efficient Hyperparameter Optimization with Adaptive Fidelity Identification
-
Multi-modal learning for geospatial vegetation forecasting
⭐code -
AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One
⭐code -
Practical Measurements of Translucent Materials with Inter-Pixel Translucency Prior
-
Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation
⭐code -
Batch Normalization Alleviates the Spectral Bias in Coordinate Networks
-
Affine Equivariant Networks Based on Differential Invariants
-
NC-TTT: A Noise Contrastive Approach for Test-Time Training
-
Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation
-
Revisiting Global Translation Estimation with Feature Tracks
🏠project -
Your Transferability Barrier is Fragile: Free-Lunch for Transferring the Non-Transferable Learning
-
A Theory of Joint Light and Heat Transport for Lambertian Scenes
-
A Physics-informed Low-rank Deep Neural Network for Blind and Universal Lens Aberration Correction
-
MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation
-
Tuning Stable Rank Shrinkage: Aiming at the Overlooked Structural Risk in Fine-tuning
-
Small Steps and Level Sets: Fitting Neural Surface Models with Point Guidance
-
Navigate Beyond Shortcuts: Debiased Learning Through the Lens of Neural Collapse
-
Latent Modulated Function for Computational Optimal Continuous Image Representation
-
Learning with Structural Labels for Learning with Noisy Labels
-
An N-Point Linear Solver for Line and Motion Estimation with Event Cameras
-
Not All Classes Stand on Same Embeddings: Calibrating a Semantic Distance with Metric Tensor
-
Hybrid Proposal Refiner: Revisiting DETR Series from the Faster R-CNN Perspective
-
Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion
-
Adversarial Distillation Based on Slack Matching and Attribution Region Alignment
-
In Search of a Data Transformation That Accelerates Neural Field Training
-
SIRA: Scalable Inter-frame Relation and Association for Radar Perception
-
Learning Discriminative Dynamics with Label Corruption for Noisy Label Detection
-
Efficient Detection of Long Consistent Cycles and its Application to Distributed Synchronization
-
Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now
-
Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions
-
Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation
-
Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology
-
Task-Driven Wavelets using Constrained Empirical Risk Minimization
-
TurboSL: Dense Accurate and Fast 3D by Neural Inverse Structured Light
-
Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation
-
Robust Noisy Correspondence Learning with Equivariant Similarity Consistency
-
Scaling Laws for Data Filtering-- Data Curation cannot be Compute Agnostic
⭐code -
Learning to Rank Patches for Unbiased Image Redundancy Reduction
-
AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings
-
Differentiable Neural Surface Refinement for Modeling Transparent Objects
-
Communication-Efficient Collaborative Perception via Information Filling with Codebook
-
As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors
🏠project -
Fun with Flags: Robust Principal Directions via Flag Manifolds
-
Steerers: A Framework for Rotation Equivariant Keypoint Descriptors
⭐code -
GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields
-
PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors
-
Aligning and Prompting Everything All at Once for Universal Visual Perception
👍abstract -
Backpropagation-free Network for 3D Test-Time Adaptation
⭐code -
Accept the Modality Gap: An Exploration in the Hyperbolic Space
-
Discontinuity-preserving Normal Integration with Auxiliary Edges
📺video -
1-Lipschitz Layers Compared: Memory Speed and Certifiable Robustness
⭐code -
Generating Non-Stationary Textures using Self-Rectification
⭐code -
Degrees of Freedom Matter: Inferring Dynamics from Point Trajectories
🏠project -
Multi-agent Collaborative Perception via Motion-aware Robust Communication Network
-
Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning
-
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
🏠project -
DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting
⭐code (precipitation nowcasting) -
Intrinsic Image Diffusion for Indoor Single-view Material Estimation
🏠project (indoor single-view material estimation) -
UniPTS: A Unified Framework for Proficient Post-Training Sparsity
👍abstract -
De-Diffusion Makes Text a Strong Cross-Modal Interface
🏠project -
ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations
⭐code -
Learning Structure-from-Motion with Graph Attention Networks
-
CrossMAE: Cross-Modality Masked Autoencoders for Region-Aware Audio-Visual Pre-Training
-
Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features
-
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
-
Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problems
-
PELA: Learning Parameter-Efficient Models with Low-Rank Approximation
⭐code -
Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
-
Discovering and Mitigating Visual Biases through Keyword Explanation
-
PLGSLAM: Progressive Neural Scene Representation with Local to Global Bundle Adjustment
-
Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching
⭐code (data condensation) -
L2B: Learning to Bootstrap Robust Models for Combating Label Noise
⭐code -
Revisiting Sampson Approximations for Geometric Estimation Problems
-
Real-Time Neural BRDF with Spherically Distributed Primitives
-
Uncertainty Visualization via Low-Dimensional Posterior Projections
-
Eclipse: Disambiguating Illumination and Materials using Unintended Shadows
🏠project -
Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
⭐code -
Overcoming Generic Knowledge Loss with Selective Parameter Update
-
Hierarchical Correlation Clustering and Tree Preserving Embedding
-
GLID: Pre-training a Generalist Encoder-Decoder Vision Model
-
SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement
-
PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization
👍abstract -
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
⭐code -
Holodeck: Language Guided Generation of 3D Embodied AI Environments
-
Unified Entropy Optimization for Open-Set Test-Time Adaptation
-
IIRP-Net: Iterative Inference Residual Pyramid Network for Enhanced Image Registration (image registration)
-
H-ViT: A Hierarchical Vision Transformer for Deformable Image Registration
-
Enhancing Multimodal Cooperation via Sample-level Modality Valuation
⭐code -
Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships
-
Ink Dot-Oriented Differentiable Optimization for Neural Image Halftoning
-
SVDTree: Semantic Voxel Diffusion for Single Image Tree Reconstruction
⭐code (single-image tree reconstruction) -
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
-
MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation
-
Gradient-based Parameter Selection for Efficient Fine-Tuning
-
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
⭐code
🏠project -
Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology
-
Epistemic Uncertainty Quantification For Pre-trained Neural Network
-
A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion
-
Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation
⭐code -
DeMatch: Deep Decomposition of Motion Field for Two-View Correspondence Learning
⭐code -
Explaining CLIP's Performance Disparities on Data from Blind/Low Vision Users
-
Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models
-
CURSOR: Scalable Mixed-Order Hypergraph Matching with CUR Decomposition
-
Bayesian Differentiable Physics for Cloth Digitalization
⭐code -
DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models
-
Physical Property Understanding from Language-Embedded Feature Fields
⭐code -
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
-
NeISF: Neural Incident Stokes Field for Geometry and Material Estimation
-
Correspondence-Free Non-Rigid Point Set Registration Using Unsupervised Clustering Analysis
-
Robust Depth Enhancement via Polarization Prompt Fusion Tuning
⭐code -
Dual-Scale Transformer for Large-Scale Single-Pixel Imaging
⭐code -
Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo
⭐code -
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
⭐code -
AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation
⭐code -
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction
🏠project -
From Activation to Initialization: Scaling Insights for Optimizing Neural Fields
-
Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation
⭐code -
UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion
⭐code -
PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks
-
PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding
-
AdaShift: Learning Discriminative Self-Gated Neural Feature Activation With an Adaptive Shift Factor
-
Sparse Views, Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo
-
MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction
⭐code -
Scalable 3D Registration via Truncated Entry-wise Absolute Residuals
-
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
⭐code
🏠project -
MedBN: Robust Test-Time Adaptation against Malicious Test Samples
-
Material Palette: Extraction of Materials from a Single Image
⭐code
🏠project -
Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks
-
Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training
⭐code -
Riemannian Multinomial Logistics Regression for SPD Neural Networks
⭐code -
A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network
⭐code -
Estimating Noisy Class Posterior with Part-level Labels for Noisy Label Learning
⭐code -
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object
⭐code -
Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery
-
Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis
-
PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution
-
Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture
⭐code -
LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels
⭐code -
EarthLoc: Astronaut Photography Localization by Indexing Earth from Space
⭐code -
SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks
-
UnO: Unsupervised Occupancy Fields for Perception and Forecasting
-
ParamISP: Learned Forward and Inverse ISPs using Camera Parameters
-
PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos
-
Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos
🏠project -
Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner
-
Learning Object State Changes in Videos: An Open-World Perspective
🏠project -
U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation
-
Improved Implicit Neural Representation with Fourier Reparameterized Training
-
AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis
-
Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing
-
Misalignment-Robust Frequency Distribution Loss for Image Transformation
⭐code -
Boosting Neural Representations for Videos with a Conditional Decoder
-
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
-
Unsupervised Feature Learning with Emergent Data-Driven Prototypicality
-
LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking
⭐code
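👍The LORS algorithm: a low-rank residual structure for parameter-efficient network stacking, with fewer parameters, lower cost, and less memory -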
HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction
-
Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image
-
Desigen: A Pipeline for Controllable Design Template Generation
⭐code -
S2MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering
⭐code -
Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer
-
Neural Refinement for Absolute Pose Regression with Feature Synthesis
⭐code
🏠project -
Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion
-
Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It
⭐code -
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
🏠project -
Unsupervised Deep Unrolling Networks for Phase Unwrapping (phase unwrapping)
-
Compressed sensing
-
Data augmentation