Status: Currently an OpenGL ES 2.0 compatible shader pipeline is used w/ platform-specific optimized GPU->CPU input via the extended + hunterized `ogles_gpgpu` module [link] to facilitate real-time face detection followed by face landmark and eye model regression stages. This supports a multi-resolution Aggregated Channel Feature (ACF) texture/pyramid input for fast multi-scale gradient boosting object detection in the `drishti::acf::Detector` module via Piotr's MATLAB implementation [link]. As reported in [link], this object detection approach is still reasonably close to the state of the art for fairly unconstrained face detection (with the exception of large CNN approaches that are not so amenable to real-time mobile device processing) -- good enough in any case. Since face appearance is distinctive even at fairly low resolution, and the ACF features effectively add 4x4 binning, the input textures from the GPU are fairly low resolution, which helps minimize the GPU->CPU overhead. The pyramid levels are also packed into a single texture to minimize wasted space [link]. On iOS, texture retrieval via iOS texture caches is very efficient, and multiple textures can be retrieved in parallel, so transfer overhead is fairly minimal [TODO: provide sample times on iPhones]. The OpenGL `GraphicBuffer` extensions are somewhat efficient, but do not seem to support parallel use, so there seems to be more GPU->CPU overhead on Android devices [TODO: sample times on Android].
The current `drishti::hci::FaceFinder` module introduces a one-frame latency so that the required ACF pyramid texture computation can be kept busy, while a designated CPU thread retrieves and runs the reasonably lightweight multi-scale detection search. This processing uses sparse pixel lookups in combination with gradient boosting w/ level 2 trees as weak learners, and such processing is expected to be very slow on mobile GPUs. Most likely this will remain on the CPU to keep processing friendly for mobile devices. It might be worth adding some RenderScript/OpenCL/Metal unit tests to confirm this. There is a paper that investigates some clever approaches for running tree-based gradient boosting on modern desktop GPUs w/ CUDA or OpenCL [LINK], but it is probably a stretch to envision this running on current mobile device GPUs.
Implementing ACF pyramids in pure OpenGL ES 2.0 shaders involves some compromise, mostly due to the 32-bit (4x8-bit channel) restriction on output textures. For the ACF features (histograms, etc.), this seems sufficient. Since there is no direct mapping from the ACF C++ code to OpenGL shaders, some (perhaps significant) numerical differences are expected. The current implementation provides a decent approximation, but the deviation is a little larger than I would like, although not too far off [#219]. Most likely some additional focused shader tuning will be required to make this 'close enough'. Packing float values into the 32-bit output is one option, although I don't think it will be required for the detection application. Alternatively, I believe higher precision could be achieved with platform-specific GPU processing (OpenCL/RenderScript/Metal) [TODO: confirm], at the cost of higher transfer overhead -- again, I don't think this is worth it.
The low-resolution ACF detection results (after NMS filtering) are scaled to higher resolution, and a face ROI is cropped from a grayscale texture to support fast landmark regression (again on the CPU) using a PCA variation of the Kazemi landmark estimation (dlib, dest, etc.). Ideally this regression should leverage the richer/denser features in the ACF output, possibly using only single pixel tests, rather than the current normalized pixel differences. I expect this will be more accurate and faster, and will reduce the need for additional higher-resolution raw/grayscale GPU->CPU texture downloads. Although the ACF features will violate the classic/traditional pose-invariant properties provided by pixel differences, the face pose is relatively constrained, and ACF features should provide a net win. PCA shape-space regression, half-precision floating-point storage, and generic compression are all used to get the models down to a reasonable size (w/ a target of 2 or 3 MB). A WIP branch is currently exploring customized compression schemes that provide very large savings, based on redundancy in the concatenated leaf node values. Initial looks at time-sequence audio codecs, both lossless FLAC and lossy Ogg Vorbis, look very encouraging [TODO: sample sizes]. The regression provides a rough initial landmark estimate for inner face features: eyes + eyebrows + nose. Those results are then used to initialize a final eye model fitting step: iris, pupil, eyelid contour, eyelid crease. That step itself proceeds in a coarse-to-fine model fitting sequence, with an initial global model being fit at low resolution, which is used to provide an initial position and occlusion mask for a final iris ellipse estimate.
The landmark-based head pose [LINK: eos], in combination with the eye/iris models, will be used to feed a final CNN gaze estimate [LINKS] once sufficient training data is available [TODO/LINK: data collection process and online "research" datasets].
Miscellaneous:
SIMD (NEON/SSE) and GPU implementations of the ACF pyramids are available, but some of the low-level routines would need to be implemented for a pure C++ version [ACF w/o SIMD #111] -- in practice this is probably not needed.