Applications in human-centered scene analysis often rely on AI processes that provide 3D data of human bodies, and they are limited by the accuracy and reliability of the detection. Safety applications in particular require an almost perfect detection rate. The presented approach provides a confidence measure for the likelihood that detected human bodies are real persons. It measures the consistency of the estimated 3D pose information of body joints with prior knowledge about physiologically possible spatial sizes and proportions. To this end, a detailed analysis was carried out, which led to the development of an error metric that allows the quantitative evaluation of single limbs and, in summary, of the complete body. For a given dataset, an error threshold has been derived that verifies 97% of persons correctly and can be used to identify false detections, so-called ghosts. Additionally, the 3D data of single joints could be rated successfully. The results are usable for relabeling and retraining the underlying 2D and 3D pose estimators and provide a quantitative and comparable verification method, which significantly improves the reliable 3D recognition of real persons and thereby broadens the possibilities of high-standard applications of 3D human-centered technologies.
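The consistency check described above can be sketched as follows. This is a minimal illustration, not the paper's actual metric: the joint indices and plausible limb-length ranges below are hypothetical placeholder values, and the whole-body score is a simple mean of per-limb errors.

```python
import numpy as np

# Hypothetical limb definitions: pairs of joint indices and plausible
# length ranges in metres (illustrative values, not from the paper).
LIMBS = {
    "upper_arm": ((5, 7), (0.24, 0.36)),
    "forearm":   ((7, 9), (0.20, 0.32)),
    "thigh":     ((11, 13), (0.35, 0.55)),
    "shin":      ((13, 15), (0.33, 0.50)),
}

def limb_error(joints_3d, limbs=LIMBS):
    """Per-limb consistency error: 0 if the measured length lies inside
    the plausible range, otherwise the relative deviation from the
    nearest range bound."""
    errors = {}
    for name, ((a, b), (lo, hi)) in limbs.items():
        length = float(np.linalg.norm(joints_3d[a] - joints_3d[b]))
        if lo <= length <= hi:
            errors[name] = 0.0
        elif length < lo:
            errors[name] = (lo - length) / lo
        else:
            errors[name] = (length - hi) / hi
    return errors

def body_error(joints_3d, limbs=LIMBS):
    """Whole-body score: mean of per-limb errors.  A detection would be
    accepted as a real person if this score stays below a threshold
    tuned on a validation set; high scores flag "ghosts"."""
    errs = limb_error(joints_3d, limbs)
    return sum(errs.values()) / len(errs)
```

A pose with anatomically plausible limb lengths yields a score of zero, while a ghost detection with implausible proportions accumulates a positive error that can be thresholded.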
We introduce a new image dataset for object detection and 6D pose estimation, named Extra FAT. The dataset consists of 825K photorealistic RGB images with annotations of groundtruth location and rotation for both the virtual camera and the objects. A registered pixel-level object segmentation mask is also provided for object detection and segmentation tasks. The dataset includes 110 different 3D object models. The object models were rendered in five scenes with diverse illumination, reflection, and occlusion conditions.
For tracking multiple targets in a scene, the most common approach is to represent each target by a bounding box and track the whole box as a single entity. However, in the case of humans, the body undergoes complex articulation and occlusion that severely deteriorate tracking performance. In this paper, we argue that instead of tracking the whole body of a target, better tracking results can be achieved by focusing on a relatively rigid body organ. Based on this assumption, we followed the tracking-by-detection paradigm and generated target hypotheses from only the spatial locations of heads in every frame. After the heads are localized, a constant-velocity motion model is used for the temporal evolution of the targets in the visual scene. For associating the targets in consecutive frames, combinatorial optimization is used to match corresponding targets in a greedy fashion. Qualitative results are evaluated on four challenging video surveillance datasets, and promising results have been achieved.
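The two steps described above — constant-velocity prediction of head locations and greedy frame-to-frame association — can be sketched as follows. This is an illustrative simplification under assumed data structures (a track as a position/velocity pair, detections as 2D head coordinates); the gating distance is a placeholder, not a value from the paper.

```python
import numpy as np

def predict(tracks):
    """Constant-velocity motion model: propagate each track's head
    position by its last estimated velocity."""
    return {tid: pos + vel for tid, (pos, vel) in tracks.items()}

def greedy_associate(predictions, detections, gate=50.0):
    """Greedy association: repeatedly commit the globally closest
    (track, detection) pair, subject to a gating distance (pixels,
    illustrative value), until no admissible pair remains."""
    pairs = sorted(
        ((np.linalg.norm(p - d), tid, j)
         for tid, p in predictions.items()
         for j, d in enumerate(detections)),
        key=lambda t: t[0],
    )
    matches, used_t, used_d = {}, set(), set()
    for dist, tid, j in pairs:
        if dist <= gate and tid not in used_t and j not in used_d:
            matches[tid] = j
            used_t.add(tid)
            used_d.add(j)
    return matches
```

The greedy scheme trades the optimality of an assignment solver (e.g. Hungarian matching) for speed, which is often acceptable when head detections are well separated.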
We present a modified M-estimation based method for fast global 3D point cloud registration, which rapidly converges to an optimal solution while matching or exceeding the accuracy of existing global registration methods. The key idea of our work is to introduce weighted-median based M-estimation for re-weighted least squares, deployed in a graduated fashion, which takes into account the error distribution of the residuals to achieve rapid convergence to an optimal solution. Experimental results on synthetic and real data sets show the significantly improved convergence of our method, with accuracy comparable to state-of-the-art global registration methods.
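The core idea — iteratively re-weighted least squares whose robust scale is derived from the median of the residuals and annealed in a graduated fashion — can be illustrated with a deliberately reduced example. This sketch estimates only a translation between pre-corresponded point sets with Geman–McClure weights; the full registration method, its weighting function, and its annealing schedule are not specified here and these choices are assumptions.

```python
import numpy as np

def robust_translation(src, dst, iters=20, anneal=0.8):
    """Graduated M-estimation sketch (translation only, known
    correspondences): iteratively re-weighted least squares with
    Geman-McClure weights, where the scale mu is initialised from the
    median residual and shrunk each iteration so gross outliers are
    progressively suppressed."""
    t = np.zeros(src.shape[1])
    residuals = np.linalg.norm(dst - src, axis=1)
    mu = max(np.median(residuals) ** 2, 1e-9)  # median-based initial scale
    for _ in range(iters):
        residuals = np.linalg.norm(dst - (src + t), axis=1)
        w = (mu / (mu + residuals ** 2)) ** 2         # Geman-McClure weights
        t = np.average(dst - src, axis=0, weights=w)  # weighted LS update
        mu = max(mu * anneal, 1e-9)                   # graduated annealing
    return t
```

Starting with a large scale makes the objective nearly convex (all residuals weighted similarly), and shrinking it gradually sharpens the estimator around the inlier set, which is the convergence mechanism the abstract refers to.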