Auto-valet parking is a key emerging function for Advanced Driver Assistance Systems (ADAS), enhancing the traditional surround-view system by providing more autonomy during parking scenarios. An auto-valet parking system is typically built from multiple hardware components, e.g. ISP, microcontrollers, FPGAs, GPU, Ethernet/PCIe switch, etc. Texas Instruments' new Jacinto 7 platform is one of the industry's most highly integrated SoCs, replacing these components with a single TDA4VMID chip. The TDA4VMID SoC can concurrently run analytics (traditional computer vision as well as deep learning) and sophisticated 3D surround view, making it a cost-effective and power-optimized solution. TDA4VMID is a truly heterogeneous architecture, and it can be programmed using an efficient and easy-to-use OpenVX-based middleware framework to distribute software components across cores. This paper explains the typical analytics and 3D surround-view functions of an auto-valet parking system, along with their data flow and mapping to the multiple cores of the TDA4VMID SoC. An auto-valet parking system can be realized on the TDA4VMID SoC with the complete processing offloaded from the host Arm core to the rest of the SoC cores, providing customers ample headroom for future-proofing as well as the ability to add customer-specific differentiation.
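As a rough illustration of the core-mapping idea described above, the Python sketch below assigns auto-valet processing stages to TDA4VM-class compute cores. The stage names and the particular placements are illustrative assumptions, not the mapping from the paper.

    # Illustrative sketch only: a possible assignment of auto-valet processing
    # stages to heterogeneous cores on a TDA4VM-class SoC. Stage names and
    # placements are assumptions for illustration, not the paper's mapping.
    PIPELINE_TO_CORE = {
        "capture_and_isp":        "VPAC (hardware ISP)",
        "fisheye_rectification":  "VPAC LDC (lens distortion correction)",
        "3d_surround_view":       "GPU / C66x DSP",
        "deep_learning_detect":   "C7x DSP + MMA accelerator",
        "object_tracking":        "C66x DSP",
        "parking_path_control":   "Cortex-R5F MCU island",
        "display_composition":    "DSS (display subsystem)",
    }

    def report(mapping):
        # Print the stage-to-core placement, emphasizing that the host
        # Cortex-A72 is left free for customer application code.
        for stage, core in mapping.items():
            print(f"{stage:24s} -> {core}")
        print("host Cortex-A72          -> free for customer application code")

    if __name__ == "__main__":
        report(PIPELINE_TO_CORE)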
Perspective transform (or homography) is a commonly used algorithm in ADAS and automated driving systems. It appears in multiple use-cases, e.g. viewpoint change, fisheye lens distortion correction, chromatic aberration correction, and stereo image pair rectification. The algorithm needs high external DRAM bandwidth due to its inherent scaling, which results in non-aligned two-dimensional memory burst accesses and hence in large degradation of system performance and latency. In this paper, we propose a novel perspective transform engine that reduces external DRAM bandwidth to alleviate this problem. The proposed solution slices the input video frame into multiple regions, with the block size tuned for each region. The paper also gives an algorithm for finding the optimal region boundaries and the corresponding block size for each region. The proposed solution enables an average bandwidth reduction of 67% compared to a traditional implementation and achieves clock rates of up to 720 MHz with an output throughput of 1 cycle/pixel in a 16 nm FinFET process node.
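The software sketch below (an illustration in Python/NumPy, not the proposed hardware engine) performs a block-based, inverse-mapped perspective warp and measures the source-pixel footprint of each output block; that footprint is what drives the non-aligned DRAM burst traffic the paper targets, and it varies with local scaling, which is why tuning the block size per region helps. The function names and the nearest-neighbour sampling are simplifying assumptions.

    # Illustrative sketch: inverse-mapped perspective warp computed block by block.
    import numpy as np

    def warp_block(src, H_inv, x0, y0, bw, bh):
        """Inverse-map one bw x bh output block through the homography H_inv."""
        ys, xs = np.mgrid[y0:y0 + bh, x0:x0 + bw]
        ones = np.ones_like(xs)
        pts = np.stack([xs, ys, ones], axis=-1).reshape(-1, 3).T   # 3 x N homogeneous coords
        mapped = H_inv @ pts
        u = mapped[0] / mapped[2]
        v = mapped[1] / mapped[2]
        # Nearest-neighbour fetch for brevity; a real engine would interpolate.
        u = np.clip(np.round(u).astype(int), 0, src.shape[1] - 1)
        v = np.clip(np.round(v).astype(int), 0, src.shape[0] - 1)
        block = src[v, u].reshape(bh, bw)
        # Bounding box of the source pixels touched by this block: a proxy for
        # the DRAM traffic the block generates.
        footprint = (u.max() - u.min() + 1) * (v.max() - v.min() + 1)
        return block, footprint

    if __name__ == "__main__":
        src = np.arange(480 * 640, dtype=np.float32).reshape(480, 640)
        H = np.array([[1.2, 0.1, 10.0],
                      [0.0, 1.1,  5.0],
                      [1e-4, 0.0, 1.0]])   # made-up example homography
        H_inv = np.linalg.inv(H)
        _, fp = warp_block(src, H_inv, x0=0, y0=0, bw=32, bh=32)
        print("source footprint for one 32x32 output block:", fp, "pixels")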
To avoid the manual collection of the huge amount of labeled image data needed for training autonomous driving models, this paper proposes a novel automatic method for collecting annotated image data for autonomous driving through a translation network that transforms simulation CG images into real-world images. The translation network is designed in an end-to-end structure that contains two encoder-decoder networks. The front part of the translation network represents the structure of the original simulation CG image as a semantic segmentation; the rear part then translates the segmentation into a real-world image by applying a conditional GAN (cGAN). After training, the translation network has learned a mapping from simulation CG pixels to real-world image pixels. To confirm the validity of the proposed system, we conducted three experiments, evaluating the MSE of the steering angle and vehicle speed. The first experiment demonstrates that L1+cGAN performs best among all loss functions tried in the translation network. The second experiment, conducted under different learning policies, shows that the ResNet architecture works best. The third experiment demonstrates that a model trained with the real-world images generated by the translation network still performs well in the real world. All the experimental results demonstrate the validity of the proposed method.
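As a point of reference for the L1+cGAN objective evaluated in the first experiment, the sketch below writes the combined generator loss in PyTorch. The loss-function choices and the weighting factor follow the common pix2pix-style formulation and are assumptions, not the authors' exact settings.

    # Illustrative sketch (assumed PyTorch, not the authors' code): L1 + cGAN generator loss.
    import torch
    import torch.nn as nn

    adv_loss = nn.BCEWithLogitsLoss()   # adversarial realism term
    l1_loss  = nn.L1Loss()              # pixel-wise reconstruction term
    lambda_l1 = 100.0                   # assumed weighting, pix2pix convention

    def generator_loss(disc_out_fake, fake_img, real_img):
        # The generator tries to make the discriminator label its outputs as real...
        g_adv = adv_loss(disc_out_fake, torch.ones_like(disc_out_fake))
        # ...while staying close to the target real-world image in L1.
        g_l1 = l1_loss(fake_img, real_img)
        return g_adv + lambda_l1 * g_l1

The L1 term keeps the translated image structurally faithful to the target, while the cGAN term pushes it toward the statistics of real-world images, which matches the combination the abstract reports as performing best.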
Imitation learning is widely used in autonomous driving to train networks that predict steering commands from frames, using annotated data collected by an expert driver. Assuming that frames taken from a front-facing camera completely mimic the driver's eyes raises the question of how the eyes, and the attention mechanisms of the complex human visual system, actually perceive the scene. This paper proposes incorporating eye-gaze information together with the frames into an end-to-end deep neural network for the lane-following task. The proposed novel architecture, GG-Net, is composed of a spatial transformer network (STN) and a multitask network that predicts the steering angle as well as the gaze map for the input frame. The experimental results show a 36% improvement in steering-angle prediction accuracy over the baseline, with an inference time of 0.015 seconds per frame (66 fps) on an NVIDIA K80 GPU, enabling the proposed model to operate in real time. We argue that incorporating gaze maps enhances the model's ability to generalize to unseen environments. Additionally, a novel course-to-steering-angle conversion algorithm is proposed, together with a supporting mathematical proof.
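To make the multitask setup concrete, the sketch below combines steering-angle regression with gaze-map prediction in a single PyTorch objective. The specific losses, output shapes, and weighting factor are assumptions for illustration; the abstract does not state them.

    # Illustrative sketch (assumed PyTorch, not the GG-Net release): joint
    # steering-angle and gaze-map objective.
    import torch
    import torch.nn as nn

    steer_loss = nn.MSELoss()   # scalar steering-angle regression
    gaze_loss  = nn.BCELoss()   # per-pixel gaze map, assuming sigmoid outputs in [0, 1]
    alpha = 0.5                 # assumed task-weighting factor

    def multitask_loss(pred_angle, true_angle, pred_gaze, true_gaze):
        return steer_loss(pred_angle, true_angle) + alpha * gaze_loss(pred_gaze, true_gaze)

    if __name__ == "__main__":
        angle_pred, angle_true = torch.randn(8, 1), torch.randn(8, 1)
        gaze_pred, gaze_true = torch.rand(8, 1, 36, 64), torch.rand(8, 1, 36, 64)
        print(multitask_loss(angle_pred, angle_true, gaze_pred, gaze_true))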
This paper reports the main conclusions of a field observation of vehicle-pedestrian interactions at urban crosswalks, describing the types, sequences, spatial distributions and probabilities of occurrence of vehicle and pedestrian behaviors. This study was motivated by the fact that in the near future, with the introduction of autonomous vehicles (AVs), human drivers will become mere passengers, no longer able to participate in traffic interactions. To recreate the necessary interactions, there is a strong need for new communication abilities that allow AVs to express their status and intentions, especially to pedestrians, who constitute the most vulnerable road users. As pedestrians rely heavily on the actual behavioral mechanism to interact with vehicles, it seems preferable to take this mechanism into account in the design of new communication functions. In this study, based on more than one hundred video-recorded vehicle-pedestrian interaction scenes at urban crosswalks, eight scenarios were classified according to the different behavioral sequences. Based on the measured position of pedestrians relative to the vehicle at the time of the significant behaviors, quantitative analysis shows that distinct patterns exist for pedestrian gaze behavior and vehicle slowing-down behavior as a function of Vehicle-to-Pedestrian (V2P) distance and angle.
Full driving automation imposes performance requirements on camera and computer vision systems that are as yet unmet, since these systems must replace the visual system of a human driver in any conditions. So far, the individual components of an automotive camera have mostly been optimized independently, or without taking into account the effect on computer vision applications. We propose an end-to-end optimization of the imaging system in software, from the generation of radiometric input data, through physically based camera component models, to the output of a computer vision system. Specifically, we present an optimization framework which extends the ISETCam and ISET3d toolboxes to create synthetic spectral data of high dynamic range, and which models a state-of-the-art automotive camera in more detail. It includes a state-of-the-art object detection system as a benchmark application. We highlight in which ways the framework approximates the physical image formation process. As a result, we provide guidelines for optimization experiments involving modification of the model parameters, and show how these apply to a first experiment on high dynamic range imaging.
Traditional image signal processors (ISPs) are primarily designed and optimized to improve the image quality perceived by humans. However, optimal perceptual image quality does not always translate into optimal performance for computer vision applications. In [1], Wu et al. proposed a set of methods, termed VisionISP, to enhance and optimize the ISP for computer vision purposes. The blocks in VisionISP are simple, content-aware, and trainable using existing machine learning methods. VisionISP significantly reduces the data transmission and power consumption requirements by reducing image bit-depth and resolution, while mitigating the loss of relevant information. In this paper, we show that VisionISP boosts the performance of subsequent computer vision algorithms in the context of multiple tasks, including object detection, face recognition, and stereo disparity estimation. The results demonstrate the benefits of VisionISP for a variety of computer vision applications, CNN model sizes, and benchmark datasets.
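As a simple illustration of the kind of data reduction described above (not the VisionISP implementation itself), the Python/NumPy sketch below applies a tone curve before quantizing to a lower bit depth and then halves the resolution with a box filter; in VisionISP the corresponding front-end blocks are content-aware and trainable.

    # Illustrative sketch: reduce bit depth and resolution ahead of a vision network.
    import numpy as np

    def reduce_bitdepth(img16, out_bits=8, gamma=0.45):
        # Tone curve before quantization helps preserve shadow detail.
        x = img16.astype(np.float32) / 65535.0
        x = np.power(x, gamma)
        return np.round(x * (2 ** out_bits - 1)).astype(np.uint8)

    def downscale2x(img):
        # 2x2 box average (assumes even height and width); a learned,
        # content-aware scaler would replace this in a trainable front end.
        x = img.astype(np.float32)
        pooled = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 4.0
        return pooled.astype(img.dtype)

    if __name__ == "__main__":
        raw = (np.random.rand(480, 640) * 65535).astype(np.uint16)
        small = downscale2x(reduce_bitdepth(raw))
        print(small.shape, small.dtype)   # (240, 320) uint8: 1/8 of the original data volume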
The demand for object tracking (OT) applications has been increasing over the past few decades in many areas of interest, including security, surveillance, intelligence gathering, and reconnaissance. Lately, newly defined requirements for unmanned vehicles have heightened interest in OT. Advancements in machine learning, data analytics, and AI/deep learning have improved the recognition and tracking of objects of interest; however, continuous tracking is currently a problem of interest in many research projects. In our past research [1], we proposed a system that implements the means to continuously track an object and predict its trajectory based on its previous pathway, even when the object is partially or fully concealed for a period of time. The second phase of this system proposed developing common knowledge among a mesh of fixed cameras, akin to a real-time panorama. This paper discusses the method for registering the cameras' views to a common frame of reference, so that the object's location is known to all participants in the network.
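The sketch below illustrates the underlying idea of a common frame of reference as a generic rigid-body transform in Python/NumPy (not the paper's specific registration method): each camera maps its local observation of the object through its own assumed extrinsic pose, so all nodes in the mesh share one coordinate system. The poses and points are made-up example values.

    # Illustrative sketch: map an object position seen in a camera's local frame
    # into a shared world frame using that camera's extrinsics (R, t).
    import numpy as np

    def to_common_frame(p_cam, R, t):
        """p_cam: 3-vector in the camera frame; R, t: camera-to-world rotation/translation."""
        return R @ np.asarray(p_cam, dtype=float) + np.asarray(t, dtype=float)

    if __name__ == "__main__":
        # Camera A sits at the world origin with identity orientation.
        R_a, t_a = np.eye(3), np.zeros(3)
        # Camera B is placed 10 m away and rotated 90 degrees about the z axis.
        R_b = np.array([[0.0, -1.0, 0.0],
                        [1.0,  0.0, 0.0],
                        [0.0,  0.0, 1.0]])
        t_b = np.array([10.0, 0.0, 0.0])
        # Both cameras observe the same object in their own frames; after mapping,
        # the world coordinates agree, so the whole mesh shares one object location.
        print(to_common_frame([5.0, 2.0, 0.0], R_a, t_a))   # [5. 2. 0.]
        print(to_common_frame([2.0, 5.0, 0.0], R_b, t_b))   # [5. 2. 0.]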
Autonomous driving plays a crucial role in preventing accidents, and modern vehicles are equipped with multimodal sensor systems and AI-driven perception and sensor fusion. These features are, however, not stable over a vehicle's lifetime due to various forms of degradation. This introduces an inherent, yet unaddressed risk: once vehicles are in the field, their individual exposure to environmental effects leads to unpredictable behavior. The goal of this paper is to raise awareness of automotive sensor degradation. Various effects exist which, in combination, may have a severe impact on AI-based processing and ultimately on the customer domain. Failure mode and effects analysis (FMEA)-type approaches are used to structure a complete coverage of relevant automotive degradation effects. The sensors include cameras, RADARs, LiDARs and other modalities, both outside and in-cabin. Sensor robustness alone is a well-known topic which is addressed by DV/PV. However, this is not sufficient, and various degradations are examined which go significantly beyond currently tested environmental stress scenarios. In addition, the combination of sensor degradation and its impact on AI processing is identified as a validation gap. An outlook on future analysis and on ways to detect relevant sensor degradations is also presented.