This paper presents a new training model for orientation-invariant object detection in aerial images by extending RetinaNet, a deep-learning-based single-stage detector built on feature pyramid networks and focal loss for dense object detection. Unlike R3Det, which applies feature refinement to handle rotated objects, we propose further improvements to cope with the densely arranged objects and class imbalance in aerial imaging, in three respects: 1) all training images are traversed in each iteration, instead of only one image per iteration, in order to cover all possibilities; 2) the learning rate is reduced if the loss does not decrease; and 3) the learning rate is reduced if the loss does not change. The proposed method was calibrated and validated through comprehensive performance evaluation and benchmarking. The experimental results demonstrate a significant improvement over the R3Det approach on the same data set. In addition to the well-known public data set DOTA used for benchmarking, a new data set is also established with attention to the balance between the training set and testing set. The loss curves, which decrease smoothly without jitter or overfitting, also illustrate the advantages of the proposed new model.
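The two learning-rate rules above are stated without implementation detail; a minimal sketch of such a loss-driven schedule, which cuts the rate both when the loss fails to decrease and when it plateaus (the class name, decay factor, and thresholds are our own assumptions, not the paper's), could look like:

```python
class PlateauLRScheduler:
    """Reduce the learning rate when the training loss stops improving.

    Mirrors the two rules described above: the rate is cut both when the
    loss fails to decrease and when it stays (nearly) unchanged.
    """

    def __init__(self, lr=1e-3, factor=0.5, patience=3, min_delta=1e-4):
        self.lr = lr
        self.factor = factor        # multiplicative decay applied to lr
        self.patience = patience    # epochs to wait before reducing
        self.min_delta = min_delta  # a change smaller than this counts as "no change"
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best - self.min_delta:   # genuine improvement
            self.best = loss
            self.bad_epochs = 0
        else:                                   # loss not reduced, or unchanged
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

A ready-made equivalent exists in common frameworks (e.g. a reduce-on-plateau scheduler), so this sketch only illustrates the rule, not the paper's exact mechanism.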
A reliable method to estimate population sizes of wild turkeys (Meleagris gallopavo) using unmanned aerial vehicles and thermal video imaging data collected at several field sites in Texas is described. Automating the data processing of airborne survey videos provides a fast and reproducible way to count wild turkeys for wildlife management and conservation. A deep learning semantic segmentation pipeline is developed to detect and count roosting Rio Grande wild turkeys (M. g. intermedia), which appear as small, faint objects in drone-based thermal IR videos. The proposed approach relies on Mask R-CNN, a deep semantic segmentation architecture, followed by a post-processing data association and filtering (DAF) step for counting the number of roosting birds. DAF eliminates false positives such as rocks and other small bright objects, whose detections are typically noisy across temporally adjacent video frames and can therefore be filtered out using appearance association and distance-based gating across time. Transfer learning was used to train the Mask R-CNN network by initializing it with ImageNet weights. Drone-based thermal IR videos are extremely challenging due to the complexity of the natural environment, including weather effects, occlusion of birds, terrain, trees, complex tree shapes, rocks, water, and thermal inversion. The transect videos were collected at night at several times and altitudes to optimize data collection opportunities without disturbing the roosting turkeys. Preliminary performance evaluation using 280 video frames is promising.
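The distance-based gating across time can be illustrated with a small sketch (function name, gate radius, and persistence threshold are our own illustrative choices; the paper's DAF step also uses appearance association, which is omitted here): detections are chained across consecutive frames within a gate radius, and only tracks that persist are counted.

```python
import math

def gate_and_count(frames, gate_radius=5.0, min_persistence=3):
    """Distance-based gating across time: keep only detections that can be
    chained across at least `min_persistence` consecutive frames, then count
    the surviving tracks.  `frames` is a list of lists of (x, y) centroids,
    one inner list per video frame.
    """
    tracks = []  # each track: {"pos": (x, y), "length": n, "last_frame": i}
    for i, detections in enumerate(frames):
        for (x, y) in detections:
            best = None
            for t in tracks:
                if t["last_frame"] != i - 1:      # only extend active tracks
                    continue
                d = math.hypot(x - t["pos"][0], y - t["pos"][1])
                if d <= gate_radius and (best is None or d < best[1]):
                    best = (t, d)
            if best is not None:                  # extend the nearest track
                t = best[0]
                t["pos"], t["length"], t["last_frame"] = (x, y), t["length"] + 1, i
            else:                                 # start a new track
                tracks.append({"pos": (x, y), "length": 1, "last_frame": i})
    return sum(1 for t in tracks if t["length"] >= min_persistence)
```

A transient bright rock that flickers in a single frame forms a length-1 track and is discarded, while a roosting bird detected across many frames survives the persistence filter.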
In this paper, we demonstrate the use of a Conditional Generative Adversarial Network (cGAN) framework for producing high-fidelity, multispectral aerial imagery using low-fidelity imagery of the same kind as input. The motivation is that it is easier, faster, and often less costly to produce low-fidelity images than high-fidelity images using the various available techniques, such as physics-driven synthetic image generation models. Once the cGAN network is trained and tuned in a supervised manner on a data set of paired low- and high-quality aerial images, it can then be used to enhance new, lower-quality baseline images of similar type to produce more realistic, high-fidelity multispectral image data. This approach can potentially save significant time and effort compared to traditional approaches of producing multispectral images.
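The abstract does not state the training objective; paired low-to-high-fidelity cGAN translation is commonly trained with an adversarial term plus an L1 reconstruction term against the paired target (as in pix2pix-style models). A sketch of such a generator objective, under that assumption, is:

```python
import numpy as np

def generator_loss(d_fake, fake_hi, real_hi, lam=100.0):
    """Pix2pix-style cGAN generator objective (an assumption; the paper does
    not state its exact loss): fool the discriminator on the generated
    high-fidelity image while staying close to the paired ground truth.

    d_fake  : discriminator probabilities for generated images, shape (N,)
    fake_hi : generated high-fidelity images,   shape (N, H, W, C)
    real_hi : paired ground-truth images,       shape (N, H, W, C)
    """
    eps = 1e-7
    adversarial = -np.mean(np.log(d_fake + eps))  # push D(fake) toward 1
    l1 = np.mean(np.abs(fake_hi - real_hi))       # paired supervision term
    return adversarial + lam * l1
```

The L1 weight `lam` is the conventional pix2pix default, not a value from the paper.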
Change detection in image pairs has traditionally been a binary process, reporting either “Change” or “No Change.” In this paper, we present LambdaNet, a novel deep architecture for performing pixel-level directional change detection based on a four-class classification scheme. LambdaNet successfully incorporates the notion of “directional change” and identifies differences between two images as “Additive Change” when a new object appears, “Subtractive Change” when an object is removed, “Exchange” when different objects are present in the same location, and “No Change.” To obtain pixel-annotated change maps for training, we generated directional change class labels for the Change Detection 2014 dataset. Our tests illustrate that LambdaNet would be suitable for situations where the type of change is unstructured, such as change detection scenarios in satellite imagery.
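The four-class scheme can be sketched directly from a pair of per-pixel label images (the class codes and function name below are our own convention for illustration; the paper's label-generation procedure for Change Detection 2014 may differ):

```python
import numpy as np

# Class codes (our own convention; the paper's encoding is not specified).
NO_CHANGE, ADDITIVE, SUBTRACTIVE, EXCHANGE = 0, 1, 2, 3

def directional_change_map(before, after):
    """Build a four-class directional change map from two per-pixel label
    images (0 = background, >0 = object class id)."""
    before = np.asarray(before)
    after = np.asarray(after)
    change = np.full(before.shape, NO_CHANGE, dtype=np.uint8)
    change[(before == 0) & (after > 0)] = ADDITIVE      # a new object appears
    change[(before > 0) & (after == 0)] = SUBTRACTIVE   # an object is removed
    change[(before > 0) & (after > 0) & (before != after)] = EXCHANGE
    return change
```

Pixels carrying the same object label in both images fall through to “No Change.”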
In this paper, we propose a new method for printed mottle defect grading. By training on data scanned from printed images, our deep learning method based on a Convolutional Neural Network (CNN) can classify various images with different mottle defect levels. Unlike traditional methods that rely on hand-crafted image features, our method utilizes a CNN, for the first time, to extract the features automatically without manual feature design. Different data augmentation methods such as rotation, flip, zoom, and shift are also applied to the original dataset. The final network is trained by transfer learning, using the ResNet-34 network pretrained on the ImageNet dataset followed by fully connected layers. The experimental results show that our approach leads to a 13.16% error rate on the T dataset, which is a dataset with a single image content, and a 20.73% error rate on a combined dataset with different contents.
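The rotation/flip/zoom/shift augmentations can be illustrated with a simplified nearest-neighbour sketch (the probabilities, shift range, and zoom factor are our own choices; the paper's augmentation parameters are not given):

```python
import numpy as np

def augment(img, rng):
    """Randomly apply the rotation / flip / zoom / shift augmentations
    mentioned above to one image of shape (H, W) or (H, W, C).
    A simplified nearest-neighbour sketch with illustrative parameters."""
    out = np.rot90(img, k=rng.integers(0, 4))      # random 90-degree rotation
    if rng.random() < 0.5:
        out = np.fliplr(out)                       # horizontal flip
    shift = rng.integers(-2, 3)
    out = np.roll(out, shift, axis=(0, 1))         # wrap-around shift
    if rng.random() < 0.5:                         # ~1.3x centre zoom
        h, w = out.shape[:2]
        ch, cw = max(1, int(h / 1.3)), max(1, int(w / 1.3))
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = out[top:top + ch, left:left + cw]
        rows = np.arange(h) * ch // h
        cols = np.arange(w) * cw // w
        out = crop[rows][:, cols]                  # nearest-neighbour resize
    return out
```

In practice the same effect is usually obtained with a framework's built-in augmentation transforms; the sketch only makes the four operations concrete.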
Facial landmark localization plays a critical role in many face analysis tasks. In this paper, we present a novel local-global aggregate network (LGA-Net) for robust facial landmark localization of faces in the wild. The network consists of two convolutional neural network levels which aggregate local and global information for better prediction accuracy and robustness. Experimental results show our method overcomes typical problems of cascaded networks and outperforms state-of-the-art methods on the 300-W [1] benchmark.
The evolving algorithms for 2D facial landmark detection empower people to recognize faces, analyze facial expressions, etc. However, existing methods still suffer from unstable facial landmarks when applied to videos. Since previous research shows that the instability of facial landmarks is caused by inconsistent labeling quality across public datasets, we want a better understanding of the influence of annotation noise in them. In this paper, we make the following contributions: 1) we propose two metrics that quantitatively measure the stability of detected facial landmarks; 2) we model the annotation noise in an existing public dataset; and 3) we investigate the influence of different types of noise on training face alignment neural networks, and propose corresponding solutions. Our results demonstrate improvements in both the accuracy and the stability of detected facial landmarks.
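The abstract does not define its two stability metrics; one natural instance of such a metric, given here only as an assumed illustration, is the mean frame-to-frame displacement of each landmark normalised by a reference length such as the inter-ocular distance:

```python
import numpy as np

def temporal_jitter(landmarks, norm=1.0):
    """One possible stability metric (the paper's exact metrics are not
    given here): mean frame-to-frame displacement of each landmark,
    normalised by `norm` (e.g. the inter-ocular distance).

    landmarks : array of shape (T, K, 2) - K (x, y) points over T frames.
    """
    lm = np.asarray(landmarks, dtype=float)
    step = np.diff(lm, axis=0)            # (T-1, K, 2) per-frame displacements
    dist = np.linalg.norm(step, axis=-1)  # (T-1, K) per-landmark motion
    return float(dist.mean() / norm)
```

A perfectly stable detector scores 0; larger values indicate more jitter between consecutive frames.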
There is a surging need across the world for protection against gun violence. We have identified three main areas that are challenging for research that tries to curb gun violence: temporal localization of gunshots, gun type prediction, and gun source (shooter) detection. Our task is gun source detection via muzzle head detection, where the muzzle head is the round opening at the firing end of the gun. We aim to locate the muzzle head of the gun visually in the video and to identify who fired the shot. In our formulation, we turn the problem of muzzle head detection into two sub-problems: human (shooter) detection and gun smoke detection. Our assumption is that the muzzle head typically lies between the gun smoke caused by the shot and the shooter. In our experiments, we obtain promising results both in bounding the shooter and in detecting the gun smoke, and we successfully locate the muzzle head from these two detections.
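The geometric assumption that the muzzle head lies between the smoke and the shooter can be made concrete with a small sketch (the interpolation weight `alpha` and the function names are our own illustrative choices, not the paper's method):

```python
def box_center(box):
    """Centre of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def estimate_muzzle(shooter_box, smoke_box, alpha=0.75):
    """Place the muzzle head on the segment from the shooter centre to the
    smoke centre.  `alpha` (our own choice) biases the point toward the
    smoke, reflecting the assumption that the muzzle lies between shooter
    and smoke, closer to the smoke cloud."""
    sx, sy = box_center(shooter_box)
    kx, ky = box_center(smoke_box)
    return (sx + alpha * (kx - sx), sy + alpha * (ky - sy))
```

Given a shooter box and a smoke box from the two detectors, this yields a single candidate muzzle point per frame.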
Image aesthetic assessment has always been regarded as a challenging task because of the variability of subjective preference. Besides, the assessment of a photo is also related to its style, semantic content, etc. Conventionally, the estimation of aesthetic score and style for an image are treated as separate problems. In this paper, we explore the inter-relatedness between aesthetics and image style, and design a neural network that can jointly categorize images by style and give an aesthetic score distribution. To this end, we propose a multi-task network (MTNet) with an aesthetic column serving as a score predictor and a style column serving as a style classifier. The angular-softmax loss is applied in training the primary style classifiers to maximize the margin among classes in single-label training data; a semi-supervised method is applied to improve the network’s generalization ability iteratively. We combine the regression loss and classification loss when training the aesthetic score predictor. Experiments on the AVA dataset show the superiority of our network in both image attribute classification and aesthetic ranking tasks.
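One way to combine the classification and regression terms for a predicted score distribution, given here as an assumed sketch rather than the paper's exact formulation, is cross-entropy between predicted and target distributions plus a squared error on their mean scores:

```python
import numpy as np

def combined_aesthetic_loss(pred_dist, target_dist, weight=0.5):
    """One way (our assumption) to combine the classification and regression
    terms mentioned above for a predicted score distribution over bins
    1..len(pred_dist): cross-entropy between the distributions, plus squared
    error between their mean scores.  `weight` balances the two terms."""
    pred = np.asarray(pred_dist, dtype=float)
    target = np.asarray(target_dist, dtype=float)
    bins = np.arange(1, len(pred) + 1)
    eps = 1e-7
    ce = -np.sum(target * np.log(pred + eps))   # classification term
    mse = (bins @ pred - bins @ target) ** 2    # regression on the mean score
    return float(ce + weight * mse)
```

With a perfect prediction both terms vanish; a distribution with the right mean but the wrong shape is still penalised by the cross-entropy term.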