Depth estimation research papers

25.05.2021: added depth completion section

Table of Contents

  1. Monocular depth estimation
  2. Stereo depth estimation
  3. Multi-view depth estimation
  4. Depth completion
  5. Others

Monocular depth estimation

FastDepth: Fast Monocular Depth Estimation on Embedded Systems (ICRA 2019)

Enhancing self-supervised monocular depth estimation with traditional visual odometry (3DV 2019)
  • use visual odometry (VO) to obtain sparse 3D points
    • reproject the 3D points onto both L/R camera planes to get a sparse disparity map
    • deploy two VO methods for training, exploiting stereo and monocular sequences respectively
    • use ORB-SLAM2 for stereo VO (correct scale) and Zenuity's pipeline for monocular VO (needs scale recovery)
  • sparsity-invariant autoencoder (also see the paper Sparsity Invariant CNNs)
    • densifies the sparse disparity map (SD) into denser priors (DD) for further estimation
    • the final prediction d is the sum of DD' (from the skip module) and D (from the depth estimator)
  • self-supervised loss
    • stereo images are used only for training (symmetric scheme); inference uses monocular input
    • appearance matching loss, disparity smoothness loss, left-right consistency loss
    • occlusion loss: minimize the sum of all disparities
    • inner loss: enforce DD to be consistent with SD (L1 here)
    • outer loss: to preserve the information from VO, enforce the final prediction d to be consistent with SD

3D Scene Reconstruction with Multi-layer Depth and Epipolar Transformers (ICCV 2019)

Object-Driven Multi-Layer Scene Decomposition From a Single Image (ICCV 2019)

Spatial Correspondence with Generative Adversarial Network: Learning Depth from Monocular Videos (ICCV 2019)

Self-Supervised Monocular Depth Hints

Digging Into Self-Supervised Monocular Depth Estimation

Stereo depth estimation

On Building an Accurate Stereo Matching System on Graphics Hardware

A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms (2002)

  • Traditional stereo methods generally perform 4 steps: matching cost computation; cost aggregation; disparity computation / optimization; disparity refinement.

(DispNet)A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation

  • follows the architecture of FlowNet
  • DispNetCorr
    • the two images are processed separately up to conv2, then the resulting features are correlated horizontally (1D)
    • computes a dot product, yielding a single-channel correlation map for each disparity level (see the sketch below)
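
A minimal PyTorch sketch of such a 1D correlation layer (names and the mean normalization are mine; the original implementation may differ in details):

```python
import torch

def correlation_1d(left_feat, right_feat, max_disp):
    """Horizontal (1D) correlation as in DispNetC (sketch).

    left_feat, right_feat: (B, C, H, W) feature maps.
    Returns: (B, max_disp, H, W), one correlation map per disparity level.
    """
    B, C, H, W = left_feat.shape
    corr = left_feat.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            corr[:, d] = (left_feat * right_feat).mean(dim=1)
        else:
            # a pixel at x in the left image matches x - d in the right image
            corr[:, d, :, d:] = (left_feat[:, :, :, d:] *
                                 right_feat[:, :, :, :-d]).mean(dim=1)
    return corr
```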

GC-Net: End-to-End Learning of Geometry and Context for Deep Stereo Regression (ICCV 2017)

  • cost volume
    • rather than simply concatenating left and right features, concatenate them at each disparity level, giving a volume of size (H, W, maxD+1, F)
    • using a distance metric would restrict the network to learning only relative representations between features; absolute feature representations could not be carried through to the cost volume
  • use 3D convolutions to regularize the cost volume over the height×width×disparity dimensions, producing a final regularized cost volume of size H×W×D
  • Differentiable soft argmin
    • a traditional argmin is discrete (no sub-pixel estimates) and not differentiable
    • convert the cost volume to a probability volume: negate the costs, then apply a softmax over disparities
    • take the sum over disparity values, each weighted by its normalized probability
    • relies on the network's regularization to produce a probability distribution that is predominantly unimodal (see the sketch below)
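
A minimal PyTorch sketch of the soft argmin, i.e. the expected disparity under the softmax of the negated costs:

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost_volume):
    """Differentiable soft argmin over the disparity dimension (sketch).

    cost_volume: (B, D, H, W) matching costs (lower = better match).
    Returns: (B, H, W) sub-pixel disparity estimates.
    """
    prob = F.softmax(-cost_volume, dim=1)   # negate costs, then softmax
    disp_values = torch.arange(cost_volume.size(1),
                               dtype=cost_volume.dtype,
                               device=cost_volume.device).view(1, -1, 1, 1)
    return (prob * disp_values).sum(dim=1)  # probability-weighted sum
```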

PSMNet: Pyramid Stereo Matching Network(CVPR 2018)

  • Spatial Pyramid Pooling (SPP) module
    • aims to incorporate context information by learning the relationship between an object and its sub-regions
  • 4D cost volume
    • concatenate left and right SPP feature maps at each disparity level, giving (H, W, D, F) (see the sketch below)
  • Stacked hourglass architecture for cost volume regularization
    • repeated top-down/bottom-up processing with intermediate supervision
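
A PyTorch sketch of this concatenation-based cost volume (function name is mine; PSMNet builds it at 1/4 resolution):

```python
import torch

def concat_cost_volume(left_feat, right_feat, max_disp):
    """4D cost volume: concatenate left/right features at every disparity.

    left_feat, right_feat: (B, C, H, W)
    Returns: (B, 2C, max_disp, H, W)
    """
    B, C, H, W = left_feat.shape
    volume = left_feat.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = left_feat
            volume[:, C:, d] = right_feat
        else:
            volume[:, :C, d, :, d:] = left_feat[:, :, :, d:]
            volume[:, C:, d, :, d:] = right_feat[:, :, :, :-d]
    return volume
```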

Code details notes: I trained on Scene Flow for 10 epochs with batch size 4 (a pair of 256×512 images consumes about 4 GB of GPU memory); training took 24 hours. I also tried fine-tuning on KITTI 2015 for 300 epochs with batch size 4; with 160 training pairs, each epoch has 40 iterations, and this took 4.44 hours.

Learning for Disparity Estimation through Feature Constancy (CVPR 2018)
  • iResNet (iterative residual prediction network) incorporates all steps into a single network
  • uses feature consistency to identify the correctness of the initial disparity, then refines it
  • the refined disparity map is treated as a new initial map; this is repeated until the improvement is small
  • implemented in Caffe (https://github.com/leonzfa/iResNet)
Zoom and Learn: Generalizing Deep Stereo Matching to Novel Domains (CVPR 2018)
  • proposes a self-adaptation method to generalize a pre-trained deep stereo model to novel scenes
  • Scale diversity
    • passing an up-sampled stereo pair and then down-sampling the result yields more high-frequency details, but performance does not keep improving as the up-sampling rate increases
    • up-sampling enables matching with sub-pixel accuracy, so more details are taken into account
    • meanwhile, a finer-scale input means a smaller receptive field, which leads to a lack of non-local information
  • graph Laplacian regularization
    • an adaptive metric used as the smoothness term
  • iterative regularization
    • given a pre-trained model and a set combining synthetic data with GT and real pairs without GT
    • create 'GT' for the real pairs by zooming; minimize the difference between the current prediction and the fine-grained (zoomed) prediction
  • daily scenes from smartphones
    • 1900 pairs for training, 320 for validation, 320 for testing; collected at 768×768 resolution
    • the disparities are small
StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction (ECCV 2018)
  • low-resolution cost volume
    • a low resolution (1/8 or 1/16) leads to a bigger receptive field and compact feature vectors
    • most of the time is spent at higher resolutions, while most of the performance gain comes from lower resolutions
  • edge-aware hierarchical refinement
    • upsample the disparity bilinearly and concatenate it with color
    • the output is a 1D residual added to the previous coarse prediction
  • real-time (60 fps)
ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems (ECCV 2018)
  • active stereo: a texture pattern is projected into the scene with an infrared (IR) projector, and the cameras are augmented to perceive both IR and visible spectra
  • plain photometric loss is poor
    • brighter pixels appear closer (passive stereo does not suffer from this, as intensity and disparity are uncorrelated there)
    • brighter pixels are likely to have bigger residuals than dark pixels
  • Weighted Local Contrast Normalization (LCN), sketched after this list
    • removes the dependency between intensity and disparity and gives better residuals in occluded regions
    • compute the local mean and std in a 9×9 patch to normalize the intensity
    • before re-weighting, it suffers in low-texture regions (a small std can amplify the residual)
    • re-weight using the std estimated on the reference image
  • adaptive support weight cost aggregation
    • a traditional adaptive support scheme, effective but slow
    • only integrated during training, with a 32×32 window
  • invalidation network
    • a left-right consistency check yields an occlusion mask, with regularization on the number of valid pixels
    • the invalidation network also produces this mask, which makes inference faster
  • dataset
    • real dataset: collected with an Intel RealSense D435 camera (10000/100 train/test); the camera has an IR light source
    • synthetic dataset: rendered with Blender (10000/1200)
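
A rough PyTorch sketch of the local contrast normalization step; the re-weighting then multiplies the photometric residual by the std estimated on the reference image (constants here are assumptions):

```python
import torch
import torch.nn.functional as F

def local_contrast_norm(img, patch=9, eps=1e-3):
    """Normalize intensity by the local mean/std in a patch x patch window.

    img: (B, 1, H, W) IR intensity.
    Returns: (normalized image, local std used for re-weighting).
    """
    pad = patch // 2
    kernel = torch.ones(1, 1, patch, patch, device=img.device) / patch ** 2
    mean = F.conv2d(F.pad(img, [pad] * 4, mode='reflect'), kernel)
    sq_mean = F.conv2d(F.pad(img ** 2, [pad] * 4, mode='reflect'), kernel)
    std = (sq_mean - mean ** 2).clamp(min=0).sqrt()
    return (img - mean) / (std + eps), std
```
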
GwcNet: Group-wise Correlation Stereo Network(CVPR 2019)
  • Construct the cost volume by group-wise correlation (see the sketch below)
    • full correlation (DispNetC) loses information because it produces only a single-channel correlation map per disparity level
    • divide the channels into multiple groups, splitting the features along the channel dimension
    • the i-th left group is cross-correlated with the i-th right group over all disparity levels (inner product, as in DispNetC)
    • pack the correlations into a matching cost volume of size (D/4, H/4, W/4, N_g)
  • Modified stacked 3D hourglass networks
    • adding extra output modules and extra losses leads to better features at lower layers
    • remove the residual connections between output modules
    • 1×1×1 3D convs are added to the connections within each hourglass
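
A PyTorch sketch of the group-wise correlation volume (names are mine; the mean inner product per group follows the paper's formulation):

```python
import torch

def gwc_volume(left_feat, right_feat, max_disp, num_groups):
    """Group-wise correlation cost volume (sketch).

    left_feat, right_feat: (B, C, H, W), C divisible by num_groups.
    Returns: (B, num_groups, max_disp, H, W)
    """
    B, C, H, W = left_feat.shape
    cpg = C // num_groups                      # channels per group
    volume = left_feat.new_zeros(B, num_groups, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            prod = left_feat * right_feat
        else:
            prod = left_feat.new_zeros(B, C, H, W)
            prod[:, :, :, d:] = (left_feat[:, :, :, d:] *
                                 right_feat[:, :, :, :-d])
        # mean inner product within each channel group
        volume[:, :, d] = prod.view(B, num_groups, cpg, H, W).mean(dim=2)
    return volume
```
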
Multi-Level Context Ultra-Aggregation for Stereo Matching (CVPR 2019)
  • Formulates two aggregation schemes (DenseNet, DLA) with Higher-Order RNNs
    • DenseNets cannot merge features across scales and resolutions
    • the fusion in DLA refers only to intra-level combination
  • Intra-level combination (divide into two groups according to feature map size (1/2 or 1/4) and fuse features within each group)
    • use 1×1 convs to match dimensions; integrate by element-wise summation with pre-activation
  • Inter-level combination
    • an independent child module: first average pooling to halve the size (to 1/4), then the same architecture as the first group (1/2)
    • obtains large receptive fields at shallow stages
  • EMCUA
    • first train a model where MCUA is applied to the matching cost computation in PSMNet (the 2D CNNs after SPP??)
    • then train EMCUA, where a residual module is added at the end of MCUA
GA-Net: Guided Aggregation Net for End-to-end Stereo Matching (CVPR 2019)
  • Semi-global guided aggregation (SGA) layer
    • aims to handle occluded regions and large textureless/reflective regions
    • a differentiable approximation of semi-global matching (SGM), which aggregates matching costs iteratively in four directions
    • replaces the min selection with a weighted sum; the internal min becomes a max (the aim is to maximize the probability at the ground-truth depth rather than minimize the matching cost); keeps only the best from one direction
    • SGA layers are much faster and more effective than 3D convolutions
  • Local guided aggregation (LGA) layer
    • aims to refine thin structures and object edges, which are easily blurred by down-sampling and up-sampling
    • compared to a traditional cost filter, it aggregates with a K×K×3 weight matrix in a K×K local region for each pixel
Bridging Stereo Matching and Optical Flow via Spatiotemporal Correspondence (CVPR 2019)
  • learns joint representations for highly related tasks, unsupervised, given stereo videos
    • a single shared network for both flow estimation and stereo matching
  • forward-backward consistency check to find occluded regions for optical flow
  • 2-Warp consistency loss
    • warp the image twice, by both optical flow and stereo disparity
    • trained in an unsupervised setting: no ground-truth optical flow, disparity maps, or camera poses

Code notes: appears to rely on CUDA 9.2; after downloading the CUDA 9.2 toolkit and setting export LD_LIBRARY_PATH="/usr/local/cuda-9.2/lib64:$LD_LIBRARY_PATH" and export PATH="/usr/local/cuda-9.2/bin:$PATH", it works.

StereoDRNet: Dilated Residual Stereo Net (CVPR 2019)
  • 3D dilated convolution in cost filtering
    • combines information fetched from varying receptive fields
  • Disparity refinement
    • warp the right image to the left view via the left disparity D_l (photometric consistency)
    • warp the right disparity D_r to the left view via the left disparity D_l (geometric consistency)
    • use the error maps as part of the input to the refinement network rather than in the loss function
  • Vortex pooling works better than SPP

Guided Stereo Matching (CVPR 2019) [demo code only]

  • uses external sparse (< 5%) depth cues
    • to simulate the cues, pixels are randomly sampled from the ground-truth disparity maps for both training and testing
  • feature enhancement (see the sketch below)
    • given a disparity cue k, enhance the k-th channel of the correlation layer output or the k-th slice of the 4D volume
    • to avoid replacing lots of values with zeros, use a Gaussian modulation function
    • the Gaussian modulation is applied after concatenating L/R features (2F, D, H, W)
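
A PyTorch sketch of the Gaussian modulation idea; k (peak gain) and c (width) are hyper-parameters, and the exact formula is my reading of the paper:

```python
import torch

def gaussian_modulation(volume, hint_disp, valid, k=10.0, c=1.0):
    """Amplify the cost volume around hinted disparities (sketch).

    volume: (B, F, D, H, W) feature/cost volume.
    hint_disp, valid: (B, H, W) sparse disparity hints and their mask.
    """
    D = volume.size(2)
    d = torch.arange(D, device=volume.device,
                     dtype=volume.dtype).view(1, 1, D, 1, 1)
    center = hint_disp.unsqueeze(1).unsqueeze(2)       # (B, 1, 1, H, W)
    v = valid.float().unsqueeze(1).unsqueeze(2)
    gauss = (1 - v) + v * k * torch.exp(-(d - center) ** 2 / (2 * c ** 2))
    return volume * gauss  # pixels without a hint are left untouched
```
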
Real-time self-adaptive deep stereo (CVPR 2019) [code]
  • a fast modular architecture
    • at the lowest resolution (F6), forward left and right features into a correlation layer (as in DispNetC); decoder D6 produces the disparity map at the lowest resolution
    • upsample D6's output to level 5 and use it to warp the right features to the left before computing the correlation
    • each decoder D_k then refines and corrects the up-scaled disparity prediction
    • correlation scores computed between the original left features and the aligned right features guide the refinement
  • modular adaptation
    • the model is always in training mode, continuously fine-tuning to the sensed environment
    • layers at the same resolution are grouped into a single module
    • modules are optimized independently: compute the loss on prediction y_i and execute a shorter backprop only across module i
  • Reward/punishment selection (sampling)
    • at deployment, sample a portion of the network (a module from [1, p]) to optimize for each incoming pair
    • keep a histogram with p bins and apply a softmax to obtain the sampling distribution
    • to update the histogram, compute a noisy expected loss L_exp from the previous losses (L_{t-1}, L_{t-2})
    • adjust the histogram according to L_exp - L_t (> 0 means the adaptation was effective)
    • the loss is a photometric consistency loss, a combination of L1 and SSIM (sketched below)
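
A sketch of the commonly used L1 + SSIM photometric loss (alpha = 0.85 is the weighting typical in this line of work; the paper's exact variant may differ):

```python
import torch
import torch.nn.functional as F

def photometric_loss(pred, target, alpha=0.85):
    """Photometric consistency: weighted SSIM + L1 (sketch).

    pred, target: (B, 3, H, W) images in [0, 1].
    """
    l1 = (pred - target).abs().mean(1, keepdim=True)

    # simplified SSIM with 3x3 average pooling
    mu_x, mu_y = F.avg_pool2d(pred, 3, 1, 1), F.avg_pool2d(target, 3, 1, 1)
    sig_x = F.avg_pool2d(pred ** 2, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(pred * target, 3, 1, 1) - mu_x * mu_y
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + C1) * (2 * sig_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (sig_x + sig_y + C2))
    ssim_term = ((1 - ssim) / 2).clamp(0, 1).mean(1, keepdim=True)

    return (alpha * ssim_term + (1 - alpha) * l1).mean()
```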

DeepPruner: Learning Efficient Stereo Matching via Differentiable PatchMatch(ICCV 19)

SegStereo: Exploiting Semantic Information for Disparity Estimation (ECCV18)
  • Model specification
    • the shallow part of a ResNet-50 extracts image features
    • PSPNet-50 as the segmentation net
    • the weights of the shallow part and the segmentation network are fixed during training
    • the disparity encoder behind the hybrid volume contains 12 residual blocks
  • given the shared representation, the segmentation network computes semantic features for left and right respectively
  • concatenate the transformed left features (to preserve details), the correlated features, and the left semantic features into a hybrid volume
  • warped semantic consistency via semantic loss regularization
    • warp the right semantic features to the left based on the predicted disparity map, guided by the left segmentation GT
    • propagates the softmax loss back to the disparity branch through the warping
  • the framework is applicable to both supervised and unsupervised training
    • the unsupervised loss introduces a mask indicator to suppress outliers, thresholding the photometric difference
    • Charbonnier function for the spatial smoothness penalty
Semantic Stereo Matching with Pyramid Cost Volumes (ICCV 19)
  • pyramid cost volumes for both semantic and spatial info
    • unlike PSMNet, which uses a single cost volume with multi-scale features, it constructs multi-level cost volumes directly (btw, the figure for the spatial cost volume via spatial pooling is clear) (however, shouldn't this lead to much higher complexity?)
    • the semantic cost volume follows PSPNet
      • a single volume: upsample the feature maps to the same size and concatenate
  • 3D multi-cost aggregation with hourglasses and a 3D feature fusion module (FFM)
    • each spatial cost volume first passes through an hourglass, then is upsampled for the following fusion
    • the 4D spatial cost volumes are fused recursively from low to high level
    • the FFM employs an SE-block structure
  • boundary loss
    • disparity discontinuities always lie on semantic boundaries
    • compute intensity gradients for the GT segmentation labels and the predicted disparity (align the edges)
  • two-step training: first train the segmentation subnet, then jointly train the whole network
    • for Scene Flow, object-level segmentation GT is available and is transformed into segmentation labels
    • for KITTI 2015/12, the semantic segmentation is first trained on KITTI15 (GT available for left images)
Real-Time Semantic Stereo Matching
  • both segmentation and disparity are fully computed only at the lowest resolution and progressively refined through the higher-resolution residual stages (residual disparity); this also applies in the final refinement
    • by building the cost volume at the lowest resolution, dmax = 12 is enough (corresponding to 192 at full resolution)
  • Synergy disparity refinement
    • previous work (SegStereo) concatenates the two embeddings into a hybrid volume
    • instead, perform a cascade of residual concatenations between semantic class probabilities and disparity volumes
  • since GT instance segmentation exists only in SceneFlow, the network is initialized on CityScapes (disparity maps obtained via SGM, noisy)
AMNet: Deep Atrous Multiscale Stereo Disparity Estimation Networks
  • use an AM module after the D-ResNet backbone to form the feature extractor (similar purpose to SPP)
    • a set of 3×3 dilated convolutions with increasing dilation factors (1,2,2,4,4,...,k/2,k/2,k); two 1×1 convs with dilation factor one are added at the end
    • dilated convs provide denser features than pooling
    • increase the receptive field and obtain denser multiscale context without losing spatial resolution
  • Extended cost volume (ECV) aggregation (see the sketch below)
    • unlike others that use a single volume, it concatenates three different cost volumes (encoding several distance metrics); the final size is H×W×(D+1)×4C
    • disparity-level feature concatenation
      • simply concatenate L/R features like GC-Net and PSMNet, giving a volume of size H×W×(D+1)×2C
    • disparity-level feature distance
      • compute the point-wise absolute difference between L/R features at all disparity levels, giving a volume of size H×W×(D+1)×C
    • disparity-level depthwise correlation
      • compute the scalar product (like DispNetC) between L/R patches, giving a volume of size H×W×(D+1)×1
      • to make the size comparable (the channel dimension would otherwise be 1), implement depthwise correlation, i.e., compute the patch correlation per channel (t is 0 in practice, so the patch is actually 1×1 and it is just a product of two numbers per channel), finally giving a volume of size H×W×(D+1)×C
  • stacked AM modules aggregate the ECV (3D convs here, due to the size of the ECV)
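
A PyTorch sketch of the extended cost volume, stacking the three per-disparity volumes described above (concatenation, absolute difference, depthwise product):

```python
import torch

def extended_cost_volume(left, right, max_disp):
    """ECV (sketch): 2C (concat) + C (|L - R|) + C (L * R) = 4C channels.

    left, right: (B, C, H, W). Returns: (B, 4C, max_disp, H, W)
    """
    B, C, H, W = left.shape
    vol = left.new_zeros(B, 4 * C, max_disp, H, W)
    for d in range(max_disp):
        l = left[:, :, :, d:]
        r = right[:, :, :, :W - d]
        vol[:, :C, d, :, d:] = l                        # concat: left part
        vol[:, C:2 * C, d, :, d:] = r                   # concat: right part
        vol[:, 2 * C:3 * C, d, :, d:] = (l - r).abs()   # feature distance
        vol[:, 3 * C:, d, :, d:] = l * r                # depthwise correlation
    return vol
```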

On the Over-Smoothing Problem of CNN Based Disparity Estimation (ICCV 2019)

AANet: Adaptive Aggregation Network for Efficient Stereo Matching (CVPR 2020)

  • multi-scale cost volumes (1/3, 1/6, 1/12): constructed the DispNetC way; all scale cost volumes are aggregated simultaneously (rather than fused)
  • Adaptive Intra-Scale Aggregation (ISA): a sparse-point-based representation (for each pixel, adaptively sample a set of sparse points and aggregate their costs), implemented via deformable convolution, as sketched below
  • Adaptive Cross-Scale Aggregation
  • six stacked AAModules; the first three use regular 2D convs for ISA, the last three use deformable convs
  • two refinement modules proposed in StereoDRNet hierarchically upsample the prediction
  • pseudo labels generated by GA-Net supervise pixels without GT
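
A sketch of how such deformable aggregation might look using torchvision's DeformConv2d; the channel sizes and the offset-prediction layer are assumptions, not AANet's exact module:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class IntraScaleAggregation(nn.Module):
    """Sparse-point cost aggregation via deformable convolution (sketch)."""

    def __init__(self, channels=32, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # 2 offsets (dx, dy) per sampling point, predicted from the costs
        self.offset = nn.Conv2d(channels, 2 * kernel_size ** 2,
                                kernel_size, padding=pad)
        self.aggregate = DeformConv2d(channels, channels,
                                      kernel_size, padding=pad)

    def forward(self, cost):
        return self.aggregate(cost, self.offset(cost))
```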

Multi-view depth estimation

DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion (CVPR 2021) [code]

  • uses a ConvLSTM and a hidden-state propagation scheme to achieve temporal consistency
  • introduces inverse warping while propagating the hidden state, making the learning problem easier for the ConvLSTM cell

Neural RGB→D Sensing: Depth and Uncertainty from a Video Camera (CVPR 2019) [code]

  • a D-Net learns the depth probability volume (DPV)
    • pre-define dmin, dmax and the neighbour window size for the DPV
    • warp features from neighbouring frames to the reference frame and compute a cost volume (L1/L2)
    • confidence maps can be obtained from the DPV
  • a Bayesian filter integrates the DPV over time (see the sketch below)
    • warp the current DPV to 'predict' the DPV at t+1
    • to prevent wrong information from propagating while still integrating correct information, a K-Net adaptively changes the weight of the 'prediction'
  • R-Net
    • upsamples and refines the DPV to the original resolution (1/4 before)
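
A heavily simplified sketch of the temporal fusion: the warped prediction is damped by a per-pixel gate (produced by K-Net in the paper) before being combined with the current measurement:

```python
import torch
import torch.nn.functional as F

def fuse_dpv(pred_log_dpv, meas_log_dpv, gate):
    """Damped Bayesian-filter-style update of a depth probability volume.

    pred_log_dpv: (B, D, H, W) log-DPV warped from the previous frame.
    meas_log_dpv: (B, D, H, W) log-DPV measured at the current frame.
    gate: (B, 1, H, W) weight in [0, 1]; 0 discards the prediction.
    """
    fused = meas_log_dpv + gate * pred_log_dpv
    return F.log_softmax(fused, dim=1)  # renormalize over depth hypotheses
```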

Point-Based Multi-View Stereo Network

Exploiting temporal consistency for real-time video depth estimation

Depth completion

Confidence Propagation through CNNs for Guided Sparse Depth Regression

Sparse-to-dense: Depth prediction from sparse depth samples and a single image (ICRA 2018)

  • during training, the input sparse depth is sampled randomly from the ground truth (see the sketch below); the network takes both a sparse set of depth samples and an RGB image as input
  • shows results for an application: dense mapping from visual odometry features
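
A sketch of the training-time input construction (the sample count is arbitrary):

```python
import torch

def sample_sparse_depth(gt_depth, num_samples=200):
    """Randomly sample a sparse depth map from ground truth (sketch).

    gt_depth: (H, W) tensor, 0 at invalid pixels.
    """
    valid = (gt_depth > 0).nonzero(as_tuple=False)            # (N, 2)
    pick = valid[torch.randperm(valid.size(0))[:num_samples]]
    sparse = torch.zeros_like(gt_depth)
    sparse[pick[:, 0], pick[:, 1]] = gt_depth[pick[:, 0], pick[:, 1]]
    return sparse
```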

Estimating depth from RGB and sparse sensing (ECCV 2018)

  • introduces a new way of parameterizing sparse depth inputs, rather than feeding the sparse depth in directly (see the sketch below)
    • a nearest-neighbour (NN) fill of the sparse depth map, S1: densification can then be regarded as residual prediction w.r.t. S1, which makes learning easier
    • the Euclidean Distance Transform of the sample mask, S2, which provides a prior on the residual magnitudes
  • the sparse depth map S1 is included at every DenseNet module
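
A sketch of the (S1, S2) parameterization using SciPy's Euclidean distance transform:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def parameterize_sparse_depth(sparse_depth):
    """Compute the NN fill S1 and the distance transform S2 (sketch).

    sparse_depth: (H, W) array with 0 where no sample is available.
    """
    invalid = sparse_depth == 0
    # distance to the nearest valid sample, plus that sample's index
    dist, idx = distance_transform_edt(invalid, return_indices=True)
    s1 = sparse_depth[idx[0], idx[1]]   # nearest-neighbour fill
    s2 = dist                           # EDT of the sample mask
    return s1, s2
```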

Depth Estimation via Affinity Learned with Convolutional Spatial Propagation Network (ECCV 2018)

Plug-and-Play: Improve Depth Prediction via Sparse Data Propagation (ICRA 2019)

  • requires no additional training: given a pre-trained depth prediction model, it updates an intermediate feature map
  • the intermediate feature map is updated iteratively based on gradients computed from the sparse points (see the sketch below)
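
A sketch of the iterative update via backprop from the sparse points; model_head stands for the assumed tail of the pre-trained network (features to depth), and the step count / learning rate are placeholders:

```python
import torch

def plug_and_play_update(model_head, feat, sparse_depth, mask,
                         steps=5, lr=0.01):
    """Refine an intermediate feature map so that the decoded depth
    agrees with the sparse measurements (sketch)."""
    feat = feat.detach().requires_grad_(True)
    for _ in range(steps):
        depth = model_head(feat)
        loss = ((depth - sparse_depth).abs() * mask).sum() / mask.sum()
        grad, = torch.autograd.grad(loss, feat)
        feat = (feat - lr * grad).detach().requires_grad_(True)
    return feat
```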

LiStereo: Generate Dense Depth Maps from LIDAR and Stereo Imagery

DELTAS: Depth Estimation by Learning Triangulation And densification of Sparse points

  • instead of using plane sweep, first detect interest points and compute their descriptors
    • a SuperPoint-like network with detector and descriptor heads
  • then learn to match and triangulate a small set of interest points
    • search only along the epipolar line
    • use bilinear sampling to obtain the descriptors at the desired points
  • finally, densify this sparse set of 3D points

Sparse-to-Dense Depth Completion Revisited: Sampling Strategy and Graph Construction

Learning Steering Kernels for Guided Depth Completion

Learning Joint 2D-3D Representations for Depth Completion (ICCV 2019)

3D LiDAR and Stereo Fusion using Stereo Matching Network with Conditional Cost Volume Normalization

Others

Robust Consistent Video Depth Estimation

Unsupervised Learning of Multi-Frame Optical Flow with Occlusions (ECCV 2018)
  • three-frame temporal window
    • considers both past and future frames
  • occlusion estimation
    • considers three occlusion cases: visible in all frames, occluded only in the past frame, occluded only in the future frame
    • a softmax forces the occlusion variables at each pixel to sum to 1
    • used to weight the contributions of the future and past estimates
  • two separate cost volumes
    • one for the past and one for the future, to detect occlusions
    • the two cost volumes are stacked as input for all the separate decoders
  • two flow decoders
    • encourage constant velocity as a soft constraint (see the sketch below)
    • under the constant-velocity assumption, the future and past flows should be equal in length but opposite in direction
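
A one-line sketch of the constant-velocity soft constraint (my own formulation: the past flow is pushed toward the negated future flow):

```python
import torch

def constant_velocity_loss(flow_future, flow_past):
    """Penalize deviation from w_past = -w_future (sketch).

    flow_future, flow_past: (B, 2, H, W)
    """
    return (flow_future + flow_past).abs().mean()
```
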
SENSE: a Shared Encoder Network for Scene-flow Estimation
  • a shared encoder for 4 related tasks: optical flow, disparity from stereo, occlusion, and semantic segmentation
    • inputs: two stereo image pairs (no camera pose needed)
    • built on top of PWC-Net; the encoder extracts features at different hierarchies; the number of pyramid levels is reduced from 6 to 5
    • decoder for disparity
      • a Pyramid Pooling Module (PPM) aggregates the learned features
      • an added hourglass takes the twice up-sampled disparity, the feature map, and the warped feature map to predict a residual disparity
    • decoder for segmentation: uses UPerNet
    • occlusion: sibling branches added to the flow/disparity decoders perform pixel-wise binary classification
  • semi-supervised
    • distillation loss
      • pseudo GT for occlusion/segmentation provided by models pre-trained on other data
    • self-supervision loss
      • corresponding pixels should have photometric consistency and semantic consistency (similar posterior probabilities)
      • regularization terms are added for the occlusion estimates
  • rigidity-based warped disparity refinement
    • select static pixels by removing vehicle/pedestrian/cyclist/sky pixels at the semantic level
    • estimate the rigid flow induced by camera motion
      • the rigid flow derived from the estimated transformation (motion) should be consistent with the estimated flow in the background region
    • estimate the warped second-frame rigid disparity
      • use the estimated transformation to get the warped disparity of the 2nd frame from the 1st frame
      • then use the estimated forward flow to compute the warped disparity of the 2nd frame (I suspect Eq. 18 in Appendix B??)
SURGE: Surface Regularized Geometry Estimation from a Single Image

Single-image depth estimation may depend on appearance information alone, so surface geometry should help a lot here.

  • a four-stream CNN predicts depths, surface normals, and likelihoods of planar regions and planar boundaries
  • a DCRF integrates the 4 predictions
    • the variables being optimized are the depths and normals
PlaneNet: Piece-wise Planar Reconstruction from a Single RGB Image
  • piece-wise planar depth-map reconstruction requires a structured geometry representation
  • directly produces a set of plane parameters and probabilistic plane segmentation masks
    • plane parameters: predict a fixed number (K) of planar surfaces per scene; depth can be inferred from the parameters
      • since the number of planes is unknown, the corresponding probabilistic segmentation masks are allowed to be 0
      • an order-agnostic loss function based on the Chamfer distance
    • non-planar depthmap: treated as the (K+1)-th surface
    • segmentation masks: probabilistic segmentation masks
      • a DCRF module is jointly trained

Hierarchical Discrete Distribution Decomposition for Match Density Estimation

Learning Affinity via Spatial Propagation Networks (SPN)

Deep Layer Aggregation

Rethinking Atrous Convolution for Semantic Image Segmentation
