IS&T | Library

End-to-end evaluation of practical video analytics systems for face detection and recognition

Abstract

Automated extraction of intersection topologies from aerial and street-level images is relevant for Smart City traffic-control and safety applications. The intersection topology is expressed in the number of approach lanes, the crossing (conflict) area, and the availability of painted striping for guidance and road delineation. Segmentation of road surface and other basic information can be obtained with a score of 80% or higher, but the segmentation and modeling of intersections is much more complex, due to multiple lanes in various directions and occlusion of the painted striping. This paper addresses this complicated problem by proposing a dualistic channel model featuring direct segmentation and involving domain knowledge. These channels develop specific features, such as drive lines and lane information based on painted striping, which are filtered and then fused to determine an intersection-topology model. The algorithms and models are evaluated with two datasets: a large mixture of highway and urban intersections, and a smaller dataset with intersections only. Experiments measuring the GEO metric show that the proposed late-fusion system increases the recall score by 47 percentage points. This recall gain is consistent when using either aerial imagery or a mixture of aerial and street-level orthographic image data. The obtained recall for intersections is much lower than for highway data because of the complexity, occlusions by trees, and the small number of annotated intersections. Future work should aim at consolidating this model improvement at a higher recall level with more annotated data on intersections.

Digital Library: EI

Published Online: January 2024

Article

Video analytics systems
Computer vision
Driving-specific
Face detection and recognition
Video compression
Dataset imbalance
End-to-end performance
Task interdependencies

Praneet Singh, Edward J. Delp, Amy R. Reibman

DOI

10.2352/EI.2023.35.16.AVM-111

Volume 35

Issue 16

Evaluation of image quality metrics designed for DRI tasks with automotive cameras

Abstract

Practical video analytics systems that are deployed in bandwidth-constrained environments like autonomous vehicles perform computer vision tasks such as face detection and recognition. In an end-to-end face analytics system, inputs are first compressed using popular video codecs like HEVC and then passed on to modules that perform face detection, alignment, and recognition sequentially. Previously, the modules of these systems have been evaluated independently using task-specific imbalanced datasets that can distort performance estimates. In this paper, we perform a thorough end-to-end evaluation of a face analytics system using a driving-specific dataset, which enables meaningful interpretations. We demonstrate how independent task evaluations and dataset imbalances can overestimate system performance. We propose strategies to balance the evaluation dataset and to make its annotations consistent across multiple analytics tasks and scenarios. We then evaluate the end-to-end system performance sequentially to account for task interdependencies. Our experiments show that our approach provides a true estimate of the end-to-end performance for critical real-world systems.

Digital Library: EI

Published Online: January 2023

Article

Contrast Detection Probability (CDP)
Contrast Signal to Noise Ratio (CSNR)
Frequency of Correct Resolution (FCR)
Automotive camera
Computer vision
DXOMARK
Image quality evaluation

Valentine Klein, Theophanis Eleftheriou, Yiqi LI, Emilie Baudin, Claudio Greco, Laurent Chanas, Frédéric Guichard

DOI

10.2352/EI.2023.35.8.IQSP-309

Volume 35

Issue 8

Efficient high-dynamic-range depth map processing with reduced precision neural net accelerator

Abstract

Driving assistance is increasingly used in new car models. Most driving assistance systems are based on automotive cameras and computer vision. Computer vision, regardless of the underlying algorithms and technology, requires images of good quality, where "good" is defined according to the task. This notion of good image quality is still to be defined in the case of computer vision, as its criteria differ greatly from those of human vision: humans have a better contrast detection ability than image chains. The aim of this article is to compare three different metrics designed for detection of objects with computer vision: the Contrast Detection Probability (CDP) [1, 2, 3, 4], the Contrast Signal to Noise Ratio (CSNR) [5] and the Frequency of Correct Resolution (FCR) [6]. For this purpose, the computer vision task of reading the characters on a license plate is used as a benchmark. The objective is to check the correlation between each objective metric and the ability of a neural network to perform this task. Thus, a protocol to test these metrics and compare them to the output of the neural network has been designed, and the pros and cons of each of these three metrics have been noted.

Digital Library: EI

Published Online: January 2023

Article

Computer vision
Deep learning
Convolutional neural network
Vision processors
Neural net accelerators
Depth sensing

Peter van Beek, Chyuan-Tyng Wu, Avi Kalderon

Pages 126-1 - 126-5, January 2022, © Society for Imaging Science and Technology 2022

DOI

10.2352/EI.2022.34.16.AVM-126

Volume 34

Issue 16

Abstract

Depth sensing technology has become important in a number of consumer, robotics, and automated driving applications. However, the depth maps generated by such technologies today still suffer from limited resolution, sparse measurements, and noise, and require significant post-processing. Depth map data often has higher dynamic range than common 8-bit image data and may be represented as 16-bit values. Deep convolutional neural nets can be used to perform denoising, interpolation and completion of depth maps; however, in practical applications there is a need to enable efficient low-power inference with 8-bit precision. In this paper, we explore methods to process high-dynamic-range depth data using neural net inference engines with 8-bit precision. We propose a simple technique that attempts to retain signal-to-noise ratio in the post-processed data as much as possible and can be applied in combination with most convolutional network models. Our initial results using depth data from a consumer camera device show promise, achieving inference results with 8-bit precision that have similar quality to floating-point processing.

Digital Library: EI

Published Online: January 2022

Boosting computer vision performance by enhancing camera ISP

Image signal processors (ISP)
Computer vision
Deep learning
Convolutional neural networks (CNN)
Object detection
Face recognition
Stereo disparity estimation

Peter van Beek, Chyuan-Tyng (Roger) Wu, Baishali Chaudhury, Thomas R. Gardos

Pages 174-1 - 174-8, January 2021, © Society for Imaging Science and Technology 2021

DOI

10.2352/ISSN.2470-1173.2021.17.AVM-174

Volume 33

Issue 17

Traditional image signal processors (ISPs) are primarily designed and optimized to improve the image quality perceived by humans. However, optimal perceptual image quality does not always translate into optimal performance for computer vision applications. In [1], Wu et al. proposed a set of methods, termed VisionISP, to enhance and optimize the ISP for computer vision purposes. The blocks in VisionISP are simple, content-aware, and trainable using existing machine learning methods. VisionISP significantly reduces the data transmission and power consumption requirements by reducing image bit-depth and resolution, while mitigating the loss of relevant information. In this paper, we show that VisionISP boosts the performance of subsequent computer vision algorithms in the context of multiple tasks, including object detection, face recognition, and stereo disparity estimation. The results demonstrate the benefits of VisionISP for a variety of computer vision applications, CNN model sizes, and benchmark datasets.

Digital Library: EI

Published Online: January 2021

Evaluation of semi-frozen semi-fixed neural network for efficient computer vision inference

Computer vision
Convolutional neural network
Pedestrian detection
Semantic segmentation
Facial landmark detection
Neural network hardware
Efficient deep learning

Chyuan-Tyng Wu, Peter van Beek, Phillip Schmidt, Joao Peralta Moreira, Thomas R. Gardos

Pages 213-1 - 213-7, January 2021, © Society for Imaging Science and Technology 2021

DOI

10.2352/ISSN.2470-1173.2021.17.AVM-213

Volume 33

Issue 17

Deep neural networks have been utilized in an increasing number of computer vision tasks, demonstrating superior performance. Much research has been focused on making deep networks more suitable for efficient hardware implementation, for low-power and low-latency real-time applications. In [1], Isikdogan et al. introduced a deep neural network design that provides an effective trade-off between flexibility and hardware efficiency. The proposed solution consists of fixed-topology hardware blocks, with partially frozen/partially trainable weights, that can be configured into a full network. Initial results in a few computer vision tasks were presented in [1]. In this paper, we further evaluate this network design by applying it to several additional computer vision use cases and comparing it to other hardware-friendly networks. The experimental results presented here show that the proposed semi-fixed semi-frozen design achieves competitive performance on a variety of benchmarks, while maintaining very high hardware efficiency.

Digital Library: EI

Published Online: January 2021

Industrial defect detection by comparison with reference 3D CAD model

Industrial defect detection
CAD
Computer vision
Deep Learning

Deangeli G. Neves, Guilherme A. S. Megeto, Augusto C. Valente, Qian Lin

DOI

10.2352/ISSN.2470-1173.2021.8.IMAWM-281

Volume 33

Issue 8

In this work, we propose a method that detects and segments manufacturing defects in objects using only RGB images. The method can be divided into three integrated modules: object detection, pose estimation and defect segmentation. The first two modules are deep learning-based approaches and were trained exclusively with synthetic data generated with a 3D rendering engine. The first module, the object detector, is based on the Mask R-CNN method and provides the classification and segmentation of the object of interest as its output. The second module, the pose estimator, uses the category of the object and the coordinates of the detection as input to estimate the pose with 6 degrees of freedom using an autoencoder-based approach. Thereafter, it is possible to render the reference 3D CAD model with the estimated pose over the detected object and compare the real object with its virtual model. The third and last module uses only image processing techniques, such as morphology operations and dense alignment, to compare the segmentation of the detected object from the first module with the mask of the rendered object from the second module. The output is an image with the shape defects highlighted. We evaluate our method on a custom test set with the intersection over union metric, and our results indicate that the method is robust to small imprecisions in each module.
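The intersection over union metric mentioned in this abstract can be illustrated with a minimal sketch. It assumes two equally sized binary masks stored as nested lists of 0/1; the function name `iou` and the sample masks are hypothetical and are not taken from the paper, whose exact evaluation procedure is not described here.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks.

    Masks are nested lists of 0/1 values (an assumption for this sketch).
    """
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1  # pixel set in both masks
            if a or b:
                union += 1         # pixel set in at least one mask
    # Two empty masks are treated as a perfect match (a convention, not a rule).
    return intersection / union if union else 1.0


# Hypothetical masks: a segmented object and a rendered reference mask.
detected = [[1, 1, 0],
            [1, 1, 0]]
rendered = [[0, 1, 1],
            [0, 1, 1]]
score = iou(detected, rendered)  # 2 shared pixels out of 6 covered pixels
```

A score of 1.0 means the two masks coincide exactly, and lower values indicate a larger mismatch between the compared shapes.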

Digital Library: EI

Published Online: January 2021

IRIACV Conference Overview and Papers Program

Intelligent robots
Industrial inspection
Computer vision
Sensing and imaging techniques
Sensor fusion

DOI

10.2352/ISSN.2470-1173.2021.6.IRIACV-A06

Volume 33

Issue 6

Digital Library: EI

Published Online: January 2021

Detection and Characterization of Rumble Strips in Roadway Video Logs

Computer vision
Autonomy
Transportation Systems
Deep Learning
Segmentation

Deniz Aykac, Thomas Karnowski, Regina Ferrell, James S. Goddard

DOI

10.2352/ISSN.2470-1173.2020.6.IRIACV-050

Volume 32

Issue 6

State departments of transportation often maintain extensive “video logs” of their roadways that include signs and lane markings, as well as non-image-based information such as grade and curvature. In this work, we use the Roadway Information Database (RID), developed for the Second Strategic Highway Research Program, as a surrogate for a video log to design and test algorithms to detect rumble strips in the roadway images. Rumble strips are grooved patterns at the lane extremities designed to produce an audible cue to drivers who are in danger of lane departure. The RID contains 6,203,576 images of roads in six locations across the United States with extensive ground truth information and measurements, but the rumble strip measurements (length and spacing) were not recorded. We use an image correction process along with automated feature extraction and convolutional neural networks to detect rumble strip locations and measure their length and pitch. Based on independent measurements, we estimate our true positive rate to be 93% and our false positive rate to be 10%, with errors in length and spacing on the order of 0.09 meters RMS and 0.04 meters RMS, respectively. Our results illustrate the feasibility of this approach to add value to video logs after initial capture, as well as to identify potential methods for autonomous navigation.
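The rate and RMS error figures quoted in this abstract follow standard definitions, sketched below. The function names and all input values are hypothetical and chosen only to illustrate the formulas; they are not the paper's data.

```python
import math


def detection_rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate) from confusion counts."""
    tpr = tp / (tp + fn)  # share of actual positives that were detected
    fpr = fp / (fp + tn)  # share of actual negatives that were flagged
    return tpr, fpr


def rms_error(estimates, references):
    """Root-mean-square error between estimates and reference measurements."""
    squared = [(e - r) ** 2 for e, r in zip(estimates, references)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical counts, not taken from the paper.
tpr, fpr = detection_rates(tp=93, fn=7, fp=10, tn=90)

# Hypothetical length measurements in meters.
length_rms = rms_error([0.3, 0.5], [0.2, 0.6])
```

Both rates are fractions between 0 and 1, and the RMS error is in the same unit as the measurements, here meters.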

Digital Library: EI

Published Online: January 2020

Automatic shadow detection using hyperspectral data for terrain classification

Hyperspectral vision
Scene Understanding
Autonomous driving
Computer vision
Machine learning
Shadow detection
Image processing

Christian Winkens, Veronika Adams, Dietrich Paulus

DOI

10.2352/ISSN.2470-1173.2019.15.AVM-031

Volume 31

Issue 15