Automated extraction of intersection topologies from aerial and street-level images is relevant for Smart City traffic-control and safety applications. The intersection topology is expressed in the number of approach lanes, the crossing (conflict) area, and the availability of painted striping for guidance and road delineation. Segmentation of road surface and other basic information can be obtained with a score of 80% or higher, but the segmentation and modeling of intersections is much more complex, due to multiple lanes in various directions and occlusion of the painted striping. This paper addresses this complicated problem by proposing a dualistic channel model featuring direct segmentation and involving domain knowledge. These channels develop specific features such as drive lines and lane information based on painted striping, which are filtered and then fused to determine an intersection-topology model. The algorithms and models are evaluated with two datasets: a large mixture of highway and urban intersections and a smaller dataset with intersections only. Experiments measuring the GEO metric show that the proposed late-fusion system increases the recall score by 47 percentage points. This recall gain is consistent when using either aerial imagery or a mixture of aerial and street-level orthographic image data. The obtained recall for intersections is much lower than for highway data because of the complexity, occlusions by trees, and the small number of annotated intersections. Future work should aim at consolidating this model improvement at a higher recall level with more annotated data on intersections.
Practical video analytics systems that are deployed in bandwidth-constrained environments like autonomous vehicles perform computer vision tasks such as face detection and recognition. In an end-to-end face analytics system, inputs are first compressed using popular video codecs like HEVC and then passed on to modules that perform face detection, alignment, and recognition sequentially. Previously, the modules of these systems have been evaluated independently using task-specific imbalanced datasets that can distort performance estimates. In this paper, we perform a thorough end-to-end evaluation of a face analytics system using a driving-specific dataset, which enables meaningful interpretations. We demonstrate how independent task evaluations and dataset imbalances can overestimate system performance. We propose strategies to balance the evaluation dataset and to make its annotations consistent across multiple analytics tasks and scenarios. We then evaluate the end-to-end system performance sequentially to account for task interdependencies. Our experiments show that our approach provides a true estimate of the end-to-end performance for critical real-world systems.
Driving assistance is increasingly used in new car models. Most driving assistance systems are based on automotive cameras and computer vision. Computer vision, regardless of the underlying algorithms and technology, requires images of good quality, where "good" is defined according to the task. This notion of good image quality is still to be defined for computer vision, as its criteria differ markedly from those of human vision: humans have a better contrast detection ability than image chains. The aim of this article is to compare three different metrics designed for detection of objects with computer vision: the Contrast Detection Probability (CDP) [1, 2, 3, 4], the Contrast Signal to Noise Ratio (CSNR) [5] and the Frequency of Correct Resolution (FCR) [6]. For this purpose, the computer vision task of reading the characters on a license plate is used as a benchmark. The objective is to check the correlation between each objective metric and the ability of a neural network to perform this task. To this end, a protocol to test these metrics and compare them to the output of the neural network has been designed, and the pros and cons of each of the three metrics have been noted.
Depth sensing technology has become important in a number of consumer, robotics, and automated driving applications. However, the depth maps generated by such technologies today still suffer from limited resolution, sparse measurements, and noise, and require significant post-processing. Depth map data often has higher dynamic range than common 8-bit image data and may be represented as 16-bit values. Deep convolutional neural nets can be used to perform denoising, interpolation and completion of depth maps; however, in practical applications there is a need to enable efficient low-power inference with 8-bit precision. In this paper, we explore methods to process high-dynamic-range depth data using neural net inference engines with 8-bit precision. We propose a simple technique that attempts to retain signal-to-noise ratio in the post-processed data as much as possible and can be applied in combination with most convolutional network models. Our initial results using depth data from a consumer camera device show promise, achieving inference results with 8-bit precision that have similar quality to floating-point processing.
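The abstract above does not spell out the proposed technique, so the following is only an illustrative sketch of the underlying problem: fitting 16-bit depth values into 8-bit inputs. It contrasts naive truncation, which discards the low-order bits, with splitting each value into two 8-bit channels, which is lossless. The byte-split and all function names are assumptions made for illustration, not the method described in the paper.

```python
def naive_to_8bit(depth16):
    """Keep only the high byte of each 16-bit depth value (lossy)."""
    return [v >> 8 for v in depth16]


def split_to_bytes(depth16):
    """Split each 16-bit value into a high-byte and a low-byte channel."""
    high = [v >> 8 for v in depth16]
    low = [v & 0xFF for v in depth16]
    return high, low


def merge_bytes(high, low):
    """Recombine the two 8-bit channels into the original 16-bit values."""
    return [(h << 8) | l for h, l in zip(high, low)]


depth = [0, 300, 1000, 65535]
coarse = naive_to_8bit(depth)          # fine depth differences collapse
high, low = split_to_bytes(depth)
restored = merge_bytes(high, low)      # identical to the input
```

With naive truncation, any two depth values that share the same high byte become indistinguishable, which is one way fine depth detail can be lost before inference even starts.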
Traditional image signal processors (ISPs) are primarily designed and optimized to improve the image quality perceived by humans. However, optimal perceptual image quality does not always translate into optimal performance for computer vision applications. In [1], Wu et al. proposed a set of methods, termed VisionISP, to enhance and optimize the ISP for computer vision purposes. The blocks in VisionISP are simple, content-aware, and trainable using existing machine learning methods. VisionISP significantly reduces the data transmission and power consumption requirements by reducing image bit-depth and resolution, while mitigating the loss of relevant information. In this paper, we show that VisionISP boosts the performance of subsequent computer vision algorithms in the context of multiple tasks, including object detection, face recognition, and stereo disparity estimation. The results demonstrate the benefits of VisionISP for a variety of computer vision applications, CNN model sizes, and benchmark datasets.
Deep neural networks have been utilized in an increasing number of computer vision tasks, demonstrating superior performance. Much research has been focused on making deep networks more suitable for efficient hardware implementation, for low-power and low-latency real-time applications. In [1], Isikdogan et al. introduced a deep neural network design that provides an effective trade-off between flexibility and hardware efficiency. The proposed solution consists of fixed-topology hardware blocks, with partially frozen/partially trainable weights, that can be configured into a full network. Initial results in a few computer vision tasks were presented in [1]. In this paper, we further evaluate this network design by applying it to several additional computer vision use cases and comparing it to other hardware-friendly networks. The experimental results presented here show that the proposed semi-fixed semi-frozen design achieves competitive performance on a variety of benchmarks, while maintaining very high hardware efficiency.
In this work, we propose a method that detects and segments manufacturing defects in objects using only RGB images. The method can be divided into three integrated modules: object detection, pose estimation and defect segmentation. The first two modules are deep learning-based approaches and were trained exclusively with synthetic data generated with a 3D rendering engine. The first module, the object detector, is based on the Mask R-CNN method and provides the classification and segmentation of the object of interest as its output. The second module, the pose estimator, uses the category of the object and the coordinates of the detection as input to estimate the pose with 6 degrees of freedom using an autoencoder-based approach. Thereafter, it is possible to render the reference 3D CAD model with the estimated pose over the detected object and compare the real object with its virtual model. The third and last step uses only image processing techniques, such as morphology operations and dense alignment, to compare the segmentation of the detected object from the first step with the mask of the rendered object from the second step. The output is an image with the shape defects highlighted. We evaluate our method on a custom test set with the intersection over union metric, and our results indicate that the method is robust to small imprecisions from each module.
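The intersection over union metric mentioned above can be sketched for binary masks as follows. This is a minimal illustration, assuming masks are given as equally sized lists of rows with truthy values for object pixels; the function name `mask_iou` and the convention of returning 1.0 for two empty masks are choices made here, not details from the paper.

```python
def mask_iou(pred, truth):
    """Intersection over union of two equally sized binary masks."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                inter += 1   # pixel set in both masks
            if p or t:
                union += 1   # pixel set in at least one mask
    # Two empty masks agree perfectly; this convention is an assumption.
    return inter / union if union else 1.0


predicted_defect = [[1, 1, 0],
                    [0, 1, 0]]
annotated_defect = [[1, 0, 0],
                    [0, 1, 1]]
score = mask_iou(predicted_defect, annotated_defect)
```

In this small example two pixels are shared and four are set in at least one mask, so the score is 0.5.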
State departments of transportation often maintain extensive “video logs” of their roadways that include signs and lane markings, as well as non-image-based information such as grade and curvature. In this work we use the Roadway Information Database (RID), developed for the Second Strategic Highway Research Program, as a surrogate for a video log to design and test algorithms to detect rumble strips in the roadway images. Rumble strips are grooved patterns at the lane extremities designed to produce an audible cue to drivers who are in danger of lane departure. The RID contains 6,203,576 images of roads in six locations across the United States with extensive ground truth information and measurements, but the rumble strip measurements (length and spacing) were not recorded. We use an image correction process along with automated feature extraction and convolutional neural networks to detect rumble strip locations and measure their length and pitch. Based on independent measurements, we estimate our true positive rate to be 93% and false positive rate to be 10%, with errors in length and spacing on the order of 0.09 meters RMS and 0.04 meters RMS, respectively. Our results illustrate the feasibility of this approach to add value to video logs after initial capture as well as identify potential methods for autonomous navigation.
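The reported quantities (true positive rate, false positive rate, RMS error) can be computed as in the sketch below. It assumes the usual definitions, TPR = TP / (TP + FN) and FPR = FP / (FP + TN); the abstract does not state which definitions were used, and the function names and sample values are hypothetical.

```python
import math


def detection_rates(predicted, actual):
    """Return (true positive rate, false positive rate) for paired labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def rms_error(estimates, references):
    """Root-mean-square difference between estimates and reference values."""
    squared = [(e - r) ** 2 for e, r in zip(estimates, references)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical per-image detections (1 = rumble strip present).
tpr, fpr = detection_rates([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
# Hypothetical strip lengths in meters: estimated versus independently measured.
length_rms = rms_error([1.0, 2.0], [1.0, 4.0])
```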
Hyperspectral image classification has received more attention from researchers in recent years. Hyperspectral imaging systems utilize sensors that acquire data mostly from the visible through the near-infrared wavelength ranges and capture tens up to hundreds of spectral bands. The detailed spectral information increases the possibility of accurately classifying materials. Unfortunately, conventional spectral camera sensors use spatial or spectral scanning during acquisition, which is only suitable for static scenes such as earth observation. In dynamic scenarios, such as autonomous driving applications, the acquisition of the entire hyperspectral cube in one step is mandatory. To allow hyperspectral classification and enhance terrain drivability analysis for autonomous driving, we investigate the eligibility of novel mosaic-snapshot-based hyperspectral cameras. These cameras capture an entire hyperspectral cube without requiring moving parts or line scanning. The sensor is mounted on a vehicle in a driving scenario in rough terrain with dynamic scenes. The captured hyperspectral data is used for terrain classification utilizing machine learning techniques. A major problem, however, is the presence of shadows in captured scenes, which degrades the classification results. We present and test methods to automatically detect shadows by taking advantage of the near-infrared (NIR) part of the spectrum to build shadow maps. By utilizing these shadow maps, a classifier may be able to produce better results and avoid misclassifications due to shadows. The approaches are tested on our new hand-labeled hyperspectral dataset, acquired by driving through suburban areas with several hyperspectral snapshot-mosaic cameras.
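As a rough illustration of what a shadow map is, the sketch below marks a pixel as shadowed when its NIR intensity falls below a fixed threshold. The thresholding rule, the function name `shadow_map`, and the default threshold of 0.2 are assumptions made purely for illustration; the abstract does not describe how the shadow maps are actually computed.

```python
def shadow_map(nir_band, threshold=0.2):
    """Binary map: 1 where the NIR value is below the threshold, else 0.

    nir_band is a list of rows of NIR intensities, assumed normalized to
    the range 0..1. The fixed threshold is a hypothetical choice.
    """
    return [[1 if value < threshold else 0 for value in row]
            for row in nir_band]


# Hypothetical 2x2 NIR patch with two dark (possibly shadowed) pixels.
nir_patch = [[0.05, 0.50],
             [0.30, 0.10]]
shadows = shadow_map(nir_patch)
```

A map of this kind could then be passed to a classifier as an additional input or used to mask out unreliable pixels.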