In this paper, we present a deep-learning approach that unifies handwriting and scene-text detection in images. Specifically, we adopt adversarial domain generalization to improve text detection across different domains and extend the conventional dice loss to provide extra training guidance. Furthermore, we build a new benchmark dataset that comprehensively captures various handwritten and scene text scenarios in images. Our extensive experimental results demonstrate the effectiveness of our approach in generalizing detection across both handwriting and scene text.
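As a point of reference, the conventional (soft) dice loss that the paper extends can be written in a few lines of PyTorch. The sketch below shows only that baseline, since the abstract does not spell out the specific extension used for the extra training guidance.

```python
# Minimal soft dice loss sketch (the conventional form the paper extends).
# The paper's specific extension for extra training guidance is not shown here.
import torch

def soft_dice_loss(pred, target, eps=1e-6):
    """pred, target: tensors of shape (N, H, W) with values in [0, 1]."""
    inter = (pred * target).sum(dim=(1, 2))
    union = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * inter + eps) / (union + eps)
    return 1.0 - dice.mean()

# Example usage with random text score maps
pred = torch.rand(2, 64, 64)
target = (torch.rand(2, 64, 64) > 0.5).float()
print(soft_dice_loss(pred, target))
```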
Optimizing exposure time for low-light scenarios involves a trade-off between motion blur and signal-to-noise ratio. A method for determining the optimum exposure time for a given application has not been described in the literature. This paper presents the design of a simulation of motion blur and exposure time from the perspective of a real-world camera. The model incorporates characteristics of real-world cameras, including light level (quanta), shot noise, and lens distortion. In our simulation, an image quality target chart, the Siemens Star chart, is used, and the simulation outputs a blurred image as if captured by a camera with a set exposure time and a set movement speed. The resulting image is then processed in Imatest, where image quality readings are extracted and, consequently, the relationship between exposure time, motion blur, and the image quality metrics can be evaluated.
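A minimal sketch of how such a simulation can be set up, assuming a simple linear-motion blur model and Poisson shot noise; the bar-pattern stand-in for the Siemens Star chart, the photon scaling, and the function names below are illustrative assumptions rather than the paper's exact simulation.

```python
# Hedged sketch: simulating motion blur and shot noise for a given exposure time.
import numpy as np

def simulate_exposure(scene, exposure_s, speed_px_per_s, photons_at_full_scale=1000):
    """Average shifted copies of `scene` over the exposure (linear motion blur),
    then apply Poisson shot noise scaled by the photon count."""
    n_steps = max(1, int(round(speed_px_per_s * exposure_s)))
    acc = np.zeros_like(scene, dtype=np.float64)
    for i in range(n_steps):
        acc += np.roll(scene, shift=i, axis=1)  # horizontal motion during exposure
    blurred = acc / n_steps
    photons = np.random.poisson(blurred * photons_at_full_scale * exposure_s)
    return photons / (photons_at_full_scale * exposure_s)  # back to a ~[0, 1] range

# Toy vertical bar pattern standing in for the Siemens Star target
scene = (np.indices((128, 128))[1] // 8 % 2).astype(np.float64)
out = simulate_exposure(scene, exposure_s=0.02, speed_px_per_s=400)
```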
The goal of our work is to design an automotive platform for AD/ADAS data acquisition and to demonstrate its application to behavior analysis of vulnerable road users. We present a novel data capture platform mounted on a Mercedes GLC vehicle. The car is equipped with an array of sensors and recording hardware, including multiple RGB cameras, Lidar, GPS, and an IMU. For subsequent research on human behavior analysis in traffic scenes, we have conducted two kinds of data recordings. Firstly, we have designed a range of artificial test cases, which we recorded on a safety-regulated proving ground with stunt persons to capture rare events in traffic scenes in a predictable and structured way. Secondly, we have recorded data on public streets of Vienna, Austria, showing unconstrained pedestrian behavior in an urban setting, while also considering European General Data Protection Regulation (GDPR) requirements. We describe the overall framework, including data acquisition and ground truth annotation, and demonstrate its applicability for the implementation and evaluation of selected deep learning models for pedestrian behavior prediction.
We have developed an assistive technology for people with vision disabilities, specifically central field loss (CFL) and low contrast sensitivity (LCS). Our technology includes a pair of holographic AR glasses with enhanced image magnification and contrast, for example, highlighting objects and detecting signs and words. In contrast to prevailing AR technologies, which project either mixed-reality objects or virtual objects onto the glasses, our solution fuses real-time sensory information and enhances images from reality. The AR glasses technology has two advantages. First, it is relatively "fail-safe": if the battery dies or the processor crashes, the glasses can still function because they are transparent. Second, the AR glasses can be transformed into a VR or AR simulator by overlaying virtual objects, such as pedestrians or vehicles, onto the glasses for simulation. The real-time visual enhancement and alert information are overlaid on the transparent glasses. The visual enhancement modules include zooming, Fourier filters, contrast enhancement, and contour overlay. Our preliminary tests with low-vision patients show that the AR glasses indeed improved patients' vision and mobility, for example, from 20/80 to 20/25 or 20/30.
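For illustration, two of the listed enhancement modules, contrast enhancement and contour overlay, can be sketched with standard OpenCV calls; the parameters and the stand-in camera frame below are assumptions and do not reflect the glasses' actual on-device pipeline.

```python
# Hedged sketch of two of the listed modules using OpenCV.
import cv2
import numpy as np

def enhance_contrast(gray):
    """CLAHE-based local contrast enhancement on a grayscale frame."""
    clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
    return clahe.apply(gray)

def contour_overlay(gray, low=50, high=150):
    """Overlay Canny edges in green on top of the input frame."""
    edges = cv2.Canny(gray, low, high)
    color = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
    color[edges > 0] = (0, 255, 0)
    return color

frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)  # stand-in camera frame
overlaid = contour_overlay(enhance_contrast(frame))
```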
This paper presents AInBody, a novel deep learning-based body shape measurement solution. We have devised a user-centered design that automatically tracks the progress of the body by integrating various methods, including human parsing, instance segmentation, and image matting. Our system guides a user's pose when taking photos by displaying the outline of the user's latest picture, divides the human body into several parts, and compares before-and-after photos at the body-part level. The parsing performance is improved through an ensemble approach and a denoising phase in our main module, the Advanced Human Parser. In evaluation, the proposed method is 0.1% to 4.8% better in average precision than the next best-performing model in 3 out of 5 parts, and 1.4% and 2.4% superior in mAP and mean IoU, respectively. Furthermore, inference with our framework takes approximately three seconds per HD image, demonstrating that our structure can be applied to real-time applications.
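A minimal sketch of the general idea behind an ensemble-plus-denoising parsing step, assuming per-pixel majority voting over several label maps followed by removal of small connected components; the Advanced Human Parser itself is not specified at this level of detail, so the functions below are purely illustrative.

```python
# Hedged sketch: majority-vote ensemble over parsing outputs plus a small-region
# "denoising" pass. Not the paper's actual Advanced Human Parser.
import numpy as np
from scipy import ndimage

def ensemble_parsing(label_maps, num_classes):
    """label_maps: list of (H, W) integer part-label maps from different models."""
    stack = np.stack(label_maps)                              # (M, H, W)
    counts = np.zeros((num_classes,) + stack.shape[1:], dtype=np.int32)
    for c in range(num_classes):
        counts[c] = (stack == c).sum(axis=0)                  # votes per class
    return counts.argmax(axis=0)

def remove_small_regions(labels, min_size=100):
    """Reassign connected components smaller than min_size to background (0)."""
    cleaned = labels.copy()
    for c in np.unique(labels):
        if c == 0:
            continue
        comp, n = ndimage.label(labels == c)
        for i in range(1, n + 1):
            mask = comp == i
            if mask.sum() < min_size:
                cleaned[mask] = 0
    return cleaned

maps = [np.random.randint(0, 6, (64, 64)) for _ in range(3)]  # toy parser outputs
fused = remove_small_regions(ensemble_parsing(maps, num_classes=6), min_size=20)
```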
Object detection using aerial drone imagery has received a great deal of attention in recent years. While visible light images are adequate for detecting objects in most scenarios, thermal cameras can extend the capabilities of object detection to night-time or occluded objects. As such, RGB and Infrared (IR) fusion methods for object detection are useful and important. One of the biggest challenges in applying deep learning methods to RGB/IR object detection is the lack of available training data for drone IR imagery, especially at night. In this paper, we develop several strategies for creating synthetic IR images using the AIRSim simulation engine and CycleGAN. Furthermore, we utilize an illumination-aware fusion framework to fuse RGB and IR images for object detection on the ground. We characterize and test our methods for both simulated and actual data. Our solution is implemented on an NVIDIA Jetson Xavier running on an actual drone, requiring about 28 milliseconds of processing per RGB/IR image pair.
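One simple way to realize illumination-aware fusion is to blend per-detection confidences from the RGB and IR branches with a weight derived from scene brightness; the gating function and values in the sketch below are assumptions, not the paper's exact framework.

```python
# Hedged sketch of illumination-aware score fusion for RGB/IR detections.
import numpy as np

def illumination_weight(rgb_image):
    """Map mean brightness (0-255) to a weight in [0, 1]; bright scenes favor RGB."""
    return float(np.clip(rgb_image.mean() / 255.0, 0.0, 1.0))

def fuse_scores(rgb_conf, ir_conf, w):
    """Blend per-detection confidences; w -> 1 trusts RGB, w -> 0 trusts IR."""
    return w * rgb_conf + (1.0 - w) * ir_conf

rgb = np.random.randint(0, 40, (480, 640, 3), dtype=np.uint8)   # dark night frame
w = illumination_weight(rgb)
print(fuse_scores(rgb_conf=0.3, ir_conf=0.8, w=w))               # leans on the IR branch
```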
Skeleton-based action recognition plays a critical role in computer vision research, and its applications have been widely deployed in many areas. Currently, benefiting from graph convolutional networks (GCNs), the performance of this task has improved dramatically thanks to the powerful ability of GCNs to model non-Euclidean data. However, most of these works are designed for clean skeleton data, whereas in reality such data is usually noisy, since it is typically obtained from a depth camera or even estimated from an RGB camera rather than recorded by a high-quality but extremely costly Motion Capture (MoCap) [1] system. Under this circumstance, we propose a novel GCN framework with adversarial training to deal with noisy skeleton data. Guided by the clean data at the semantic level, a reliable graph embedding can be extracted for noisy skeleton data. In addition, a discriminator is introduced so that the feature representation can be further improved, since it is learned in an adversarial fashion. We empirically evaluate the proposed framework on two of the largest current skeleton-based action recognition datasets. Comparison results show the superiority of our method over state-of-the-art methods under noisy settings.
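The adversarial part of such a framework can be sketched as a discriminator that separates clean from noisy embeddings while the noisy-branch encoder learns to fool it; the encoders below are placeholder MLPs standing in for the paper's GCN backbones, and the dimensions are arbitrary.

```python
# Hedged sketch of adversarial training between clean and noisy graph embeddings.
import torch
import torch.nn as nn

embed_dim = 128
encoder_noisy = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, embed_dim))
discriminator = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

clean_emb = torch.randn(32, embed_dim)            # embeddings from the clean branch
noisy_emb = encoder_noisy(torch.randn(32, 300))   # embeddings from noisy skeletons

# Discriminator step: push clean embeddings toward 1 and noisy ones toward 0
d_loss = bce(discriminator(clean_emb), torch.ones(32, 1)) + \
         bce(discriminator(noisy_emb.detach()), torch.zeros(32, 1))

# Encoder (generator) step: make noisy embeddings indistinguishable from clean ones
g_loss = bce(discriminator(noisy_emb), torch.ones(32, 1))
```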
A novel acceleration strategy is presented for the computer vision and machine learning fields from both algorithmic and hardware-implementation perspectives. With our approach, complex mathematical operations such as multiplication can be greatly simplified. As a result, the accelerated machine learning method requires no more than ADD operations, which tremendously reduces processing time, hardware complexity, and power consumption. The applicability is illustrated with a machine learning example of HOG+SVM, where the accelerated version achieves comparable accuracy on real datasets of human figures and digits.
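One common multiplication-free substitution replaces the product a*b with sign(a)*sign(b)*(|a|+|b|), which needs only sign logic and additions; whether this is the paper's exact operator is not stated in the abstract, so the sketch below is illustrative only.

```python
# Hedged sketch of a multiplication-free surrogate for the dot products used in,
# e.g., an SVM decision function. Not necessarily the paper's exact operator.
import numpy as np

def mf_product(a, b):
    """Surrogate for elementwise a*b using only sign logic and addition."""
    return np.sign(a) * np.sign(b) * (np.abs(a) + np.abs(b))

def mf_dot(x, w):
    """Surrogate for the dot product in a linear decision function."""
    return mf_product(x, w).sum()

x = np.array([0.5, -1.0, 2.0])
w = np.array([1.5, 0.25, -0.75])
print(np.dot(x, w), mf_dot(x, w))  # compare the true and surrogate scores
```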
Camera-based advanced driver-assistance systems (ADAS) require the mapping from image coordinates into world coordinates to be known. The process of computing that mapping is geometric calibration. This paper provides a series of tests that may be used to assess the goodness of the geometric calibration and compare model forms: 1. Image Coordinate System Test: Validation that different teams are using the same image coordinates. 2. Reprojection Test: Validation of a camera’s calibration by forward projecting targets through the model onto the image plane. 3. Projection Test: Validation of a camera’s calibration by inverse projecting points through the model out into the world. 4. Triangulation Test: Validation of a multi-camera system’s ability to locate a point in 3D. The potential configurations for these tests are driven by automotive use cases. These tests enable comparison and tuning of different calibration models for an as-built camera.
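A reprojection-style check (test 2) can be sketched with OpenCV by forward-projecting known 3D targets through a pinhole model and measuring the pixel error against their detected image locations; the intrinsics, distortion, pose, and detections below are placeholders, not values from any actual calibration.

```python
# Hedged sketch of a reprojection test with placeholder camera parameters.
import numpy as np
import cv2

K = np.array([[800.0, 0, 320.0],
              [0, 800.0, 240.0],
              [0, 0, 1.0]])                     # placeholder intrinsics
dist = np.zeros(5)                              # placeholder distortion coefficients
rvec = np.zeros(3)
tvec = np.array([0.0, 0.0, 5.0])                # targets 5 m in front of the camera

object_pts = np.array([[0.0, 0.0, 0.0],
                       [0.5, 0.0, 0.0],
                       [0.0, 0.5, 0.0]])        # known 3D target points (meters)
detected_pts = np.array([[320.0, 240.0],
                         [400.5, 240.2],
                         [319.8, 320.4]])       # stand-in detected image points (px)

projected, _ = cv2.projectPoints(object_pts, rvec, tvec, K, dist)
errors = np.linalg.norm(projected.reshape(-1, 2) - detected_pts, axis=1)
print("per-target reprojection error (px):", errors)
```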
With the growing demand for robust object detection algorithms in self-driving systems, it is important to consider the varying lighting and weather conditions in which cars operate all year round. The goal of our work is to gain a deeper understanding of meaningful strategies for selecting and merging training data from currently available databases and self-annotated videos in the context of automotive night scenes. We retrain an existing convolutional neural network (YOLOv3) to study the influence of different training dataset combinations on the final object detection results in nighttime and low-visibility traffic scenes. Our evaluation shows that a suitable selection of training data from the GTSRD, VIPER, and BDD databases in conjunction with self-recorded night scenes can achieve an mAP of 63.5% for ten object classes, an improvement of 16.7% compared to the performance of the original YOLOv3 network on the same test set.