Computed Tomography (CT) is a non-invasive imaging technique that reconstructs cross-sectional images of scenes from a series of projections acquired at different angles. In applications such as airport security luggage screening, the presence of dense metal clutter causes beam hardening and streaking in the resulting conventionally formed images. These artifacts can lead to object splitting and intensity shading that make subsequent labeling and identification inaccurate. Conventional approaches to metal artifact reduction (MAR) have post-processed the artifact-filled images or interpolated the metal regions of the sinogram projection data. In this work, we examine deep-learning-based methods that use a fully convolutional network (FCN) to correct the observed sinogram projection data directly, prior to reconstruction. In contrast to existing learning-based CT artifact reduction methods, we operate entirely in the sinogram domain and train the network over the entire sinogram (rather than over local image patches). Since the information in sinograms pertaining to objects is non-local, patch-based methods are not well matched to the nature of CT data. The use of an FCN also provides better computational scaling than historical perceptron-based approaches. Using a poly-energetic CT simulation, we demonstrate the potential of this new approach in mitigating metal artifacts in CT.
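To make the full-sinogram idea concrete, the following is a minimal sketch of a fully convolutional network that maps a metal-corrupted sinogram to a corrected one; the layer count, channel widths, and residual formulation are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch: an FCN applied to an entire sinogram (angles x detector bins),
# so no patch extraction is needed. Sizes are placeholder assumptions.
import torch
import torch.nn as nn

class SinogramFCN(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, sinogram):
        # Predict a correction over the whole sinogram and add it to the input.
        return sinogram + self.net(sinogram)

if __name__ == "__main__":
    model = SinogramFCN()
    corrupted = torch.randn(1, 1, 180, 256)  # batch x channel x angles x detector bins
    corrected = model(corrupted)
    print(corrected.shape)                   # torch.Size([1, 1, 180, 256])
```

Because the network is fully convolutional, the same weights can be applied to sinograms of different sizes, which is what allows training over whole sinograms rather than local patches.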
When enjoying video streaming services, users expect high video quality in various situations, including mobile phone connections with low bandwidth. Furthermore, users' interest in consuming new, large-size content, such as high-resolution/high-frame-rate material or 360-degree videos, is growing as well. To deal with such challenges, modern encoders adaptively reduce the size of the transmitted data. This in turn requires automated video quality monitoring solutions to ensure sufficient quality of the delivered material. We present a no-reference video quality model, i.e., a model that does not require the original reference material, which makes it convenient for application in the field. Our approach uses a pretrained classification DNN in combination with hierarchical sub-image creation, several state-of-the-art features, and a random forest model. Furthermore, the model can process UHD content and is trained on a large ground-truth dataset that was generated using a state-of-the-art full-reference model. The proposed model achieved high quality-prediction accuracy, comparable to that of a number of full-reference metrics. Our model is thus a proof of concept for successful no-reference video quality estimation.
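The general recipe (pretrained classification DNN as a feature extractor over sub-images, feeding a random forest) can be illustrated with a small sketch; the specific backbone, sub-image hierarchy, extra features, and training labels used in the paper are not reproduced here, so everything below is a hedged stand-in.

```python
# Sketch only: ResNet-18 features averaged over a grid of sub-images, regressed
# to a quality score with a random forest. The grid, backbone, and labels are assumptions.
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights
from sklearn.ensemble import RandomForestRegressor

weights = ResNet18_Weights.DEFAULT
backbone = nn.Sequential(*list(resnet18(weights=weights).children())[:-1]).eval()
preprocess = weights.transforms()

def frame_features(frame_rgb: np.ndarray, grid: int = 2) -> np.ndarray:
    """Average deep features over a grid of sub-images (hierarchical crops)."""
    h, w, _ = frame_rgb.shape
    feats = []
    with torch.no_grad():
        for i in range(grid):
            for j in range(grid):
                crop = frame_rgb[i * h // grid:(i + 1) * h // grid,
                                 j * w // grid:(j + 1) * w // grid]
                x = preprocess(torch.from_numpy(crop).permute(2, 0, 1))
                feats.append(backbone(x.unsqueeze(0)).flatten().numpy())
    return np.mean(feats, axis=0)

# Train a random forest on (features, quality score) pairs; in the paper the
# ground-truth scores come from a full-reference model, here they are random placeholders.
X = np.stack([frame_features(np.random.randint(0, 255, (360, 640, 3), dtype=np.uint8))
              for _ in range(8)])
y = np.random.uniform(1, 5, size=8)
rf = RandomForestRegressor(n_estimators=100).fit(X, y)
print(rf.predict(X[:2]))
```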
Abstraction in art often reflects human perception: areas of an artwork that hold the observer's gaze longest are generally rendered in more detail, while peripheral areas are abstracted, just as they are abstracted by the human visual system. The authors' artistic abstraction tool, Salience Stylize, uses deep learning to predict the areas of an image that the observer's gaze will be drawn to, which tells the system which areas to keep most detailed and which to abstract most. The planar abstraction is performed by a Random Forest Regressor that splits the image into large planes and adds more detailed planes as it progresses, just as an artist starts with tonally limited masses and iterates to add fine detail; the rendering is then completed with the authors' stroke engine. The authors evaluated the aesthetic appeal and the effectiveness of the detail placement in the artwork produced by Salience Stylize through two user studies with 30 subjects.
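One way to picture saliency-guided planar abstraction is the sketch below: a shallow random forest that regresses colour from pixel coordinates yields large flat planes, a deeper one yields finer planes, and a saliency map decides where the finer planes are kept. The saliency map here is a random placeholder (the paper predicts it with deep learning), and the blending scheme is an assumption, not the authors' pipeline.

```python
# Sketch: coarse vs. fine planar approximations blended by a (placeholder) saliency map.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def planar_abstraction(image, saliency, coarse_depth=4, fine_depth=10):
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.column_stack([ys.ravel(), xs.ravel()])
    colours = image.reshape(-1, 3).astype(float)

    # Shallow trees -> few, large colour planes; deeper trees -> finer planes.
    coarse = RandomForestRegressor(n_estimators=8, max_depth=coarse_depth).fit(coords, colours)
    fine = RandomForestRegressor(n_estimators=8, max_depth=fine_depth).fit(coords, colours)
    coarse_img = coarse.predict(coords).reshape(h, w, 3)
    fine_img = fine.predict(coords).reshape(h, w, 3)

    # Keep fine planes where the observer's gaze is predicted to dwell.
    alpha = saliency[..., None]
    return alpha * fine_img + (1 - alpha) * coarse_img

image = np.random.randint(0, 255, (64, 64, 3))
saliency = np.clip(np.random.rand(64, 64), 0, 1)  # stand-in for a learned saliency map
print(planar_abstraction(image, saliency).shape)   # (64, 64, 3)
```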
We propose a deep learning method that, given a rough 3D model or scanned 3D data, retrieves the most similar well-designed 3D model the system has seen before. The retrieved model can either be used directly or serve as a reference to be redesigned for various purposes. Our network consists of three sub-networks: the first operates on object images (2D projections), while the other two operate on voxel representations of the 3D object. At the last stage, we combine the results of all three sub-networks to obtain the object classification. Furthermore, we use the second-to-last layer as a feature map for feature matching and return a list of the top N most similar well-designed 3D models.
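The retrieval step itself can be summarized with a short sketch: embeddings taken from the second-to-last layer are compared by cosine similarity and the top-N closest library models are returned. The embedding dimension, similarity measure, and database layout below are assumptions for illustration.

```python
# Sketch of top-N retrieval over penultimate-layer features.
import numpy as np

def top_n_similar(query_embedding: np.ndarray, database: np.ndarray, n: int = 5):
    """database: (num_models, dim) penultimate-layer features of the design library."""
    q = query_embedding / np.linalg.norm(query_embedding)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity to every stored model
    ranking = np.argsort(-scores)[:n]   # indices of the N closest models
    return ranking, scores[ranking]

database = np.random.randn(1000, 256)   # placeholder features for 1000 well-designed models
query = np.random.randn(256)            # feature of the rough/scanned input
indices, sims = top_n_similar(query, database, n=5)
print(indices, sims)
```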
Unmanned Aerial Vehicles (UAVs) are gaining popularity in a wide range of civilian and military applications. This emerging interest is pushing the development of effective collision avoidance systems, which are especially crucial in crowded airspace. Because of the cost and weight limitations associated with UAV payloads, optical sensors, essentially simple digital cameras, are widely used in UAV collision avoidance systems. This requires moving-object detection and tracking algorithms that operate on video and can be run on board efficiently. In this paper, we present a new approach to detect and track UAVs from a single camera mounted on a different UAV. We first estimate background motion via a perspective transformation model and then identify moving-object candidates in the background-subtracted image using a deep learning classifier trained on manually labeled datasets. For each moving-object candidate, we find spatio-temporal traits through optical flow matching and then prune candidates based on their motion patterns relative to the background. A Kalman filter is applied to the pruned moving objects to improve temporal consistency among the candidate detections. The algorithm was validated on video datasets taken from a UAV. Results demonstrate that our algorithm can effectively detect and track small UAVs with limited computing resources.
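The background-motion compensation step can be illustrated as follows: sparse Lucas-Kanade tracks estimate a perspective (homography) model of the dominant camera motion, the previous frame is warped onto the current one, and the residual difference exposes moving-object candidates. The feature counts and thresholds are assumptions, and the deep learning classifier, optical-flow pruning, and Kalman tracking stages are omitted.

```python
# Sketch of perspective-model background subtraction between two grayscale frames.
import cv2
import numpy as np

def background_subtract(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    # Track corner features from the previous frame into the current frame.
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400, qualityLevel=0.01, minDistance=7)
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    good_prev = pts_prev[status.ravel() == 1]
    good_curr = pts_curr[status.ravel() == 1]

    # Fit a perspective transformation describing the dominant (background) motion.
    H, _ = cv2.findHomography(good_prev, good_curr, cv2.RANSAC, 3.0)

    # Warp the previous frame onto the current one; residual blobs are moving-object candidates.
    warped = cv2.warpPerspective(prev_gray, H, (curr_gray.shape[1], curr_gray.shape[0]))
    diff = cv2.absdiff(curr_gray, warped)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    return mask

prev = np.random.randint(0, 255, (480, 640), dtype=np.uint8)
curr = np.roll(prev, 2, axis=1)  # synthetic camera motion for demonstration
print(background_subtract(prev, curr).sum())
```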
In recent years, deep learning methods have been shown to be effective for image classification, localization, and detection. Convolutional Neural Networks (CNNs), which extract information from images, are the main element of modern machine learning and computer vision methods and can be used for logo detection and recognition. Logo detection consists of locating and recognizing commercial brand logos within an image; such methods are useful in areas like online brand management and ad placement. The performance of these methods depends closely on the quantity and quality of the data, typically image/label pairs, used to train the CNNs. Collecting these image/label pairs, commonly referred to as ground truth, can be expensive and time consuming. Multiple techniques try to solve this problem, either by transforming the available data using data augmentation methods or by creating new images from scratch or from other images using image synthesis methods. In this paper, we investigate the latter approach. We segment background images, extract depth information, and then blend logo images accordingly in order to create new realistic-looking images. This approach allows us to create an arbitrarily large number of images with minimal manual labeling effort. The synthetic images can later be used to train CNNs for logo detection and recognition.
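The core synthesis idea, pasting a logo into a background image and obtaining the label for free, can be sketched as below. The fixed paste position and scale are stand-ins for the paper's segmentation- and depth-driven placement, and the nearest-neighbour resize is a simplification.

```python
# Sketch: alpha-blend an RGBA logo into an RGB background and emit its bounding box.
import numpy as np

def paste_logo(background: np.ndarray, logo_rgba: np.ndarray, top: int, left: int, scale: float = 1.0):
    h = int(logo_rgba.shape[0] * scale)
    w = int(logo_rgba.shape[1] * scale)
    # Nearest-neighbour resize; a real pipeline would interpolate properly and
    # derive `scale` from the estimated depth at the paste location.
    rows = (np.arange(h) / scale).astype(int)
    cols = (np.arange(w) / scale).astype(int)
    logo = logo_rgba[rows][:, cols]

    out = background.astype(float).copy()
    alpha = logo[..., 3:4] / 255.0
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * logo[..., :3] + (1 - alpha) * region

    bbox = (left, top, left + w, top + h)  # ground-truth label obtained without manual annotation
    return out.astype(np.uint8), bbox

bg = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
logo = np.random.randint(0, 255, (64, 64, 4), dtype=np.uint8)
image, bbox = paste_logo(bg, logo, top=100, left=200, scale=0.5)
print(image.shape, bbox)
```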
High pressure die casting (HPDC) has been developed since the late nineteenth century for a breadth of manufacturing applications. The process forces molten metal into molds under high pressure and temperature, governed by a complex array of parameters and variables that are challenging to observe. We used a set of thermal cameras to capture imagery of the die, which serves as the mold, during its cooling period between part productions. These data were used to train a convolutional neural network to assess the quality of the part just produced based on the thermal characteristics of the die surface. The system achieved 90% accuracy when distinguishing between parts that met quality standards and parts that did not.
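As a rough illustration of the classification setup (not the authors' network), a small CNN can map a single-channel thermal frame of the die surface to a pass/fail decision; the layer sizes and input resolution below are assumptions.

```python
# Sketch: a compact CNN classifying thermal die imagery as pass / fail.
import torch
import torch.nn as nn

class DieQualityCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)  # meets quality standards vs. does not

    def forward(self, thermal_image):
        x = self.features(thermal_image).flatten(1)
        return self.classifier(x)

model = DieQualityCNN()
frames = torch.randn(4, 1, 128, 128)  # batch of single-channel thermal frames
print(model(frames).shape)            # torch.Size([4, 2])
```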
Modern digital cameras have a limited dynamic range, which makes them unable to capture the full range of illumination in natural scenes. Since this prevents them from accurately photographing visible detail, researchers have spent the last two decades developing algorithms for high dynamic range (HDR) imaging, which can capture a wider range of illumination and therefore allow us to reconstruct richer images of natural scenes. The most practical of these methods are stack-based approaches, which take a set of images at different exposure levels and then merge them to form the final HDR result. However, these algorithms produce ghost-like artifacts when the scene has motion or the camera is not perfectly static. In this paper, we present an overview of state-of-the-art deghosting algorithms for stack-based HDR imaging and discuss some of the tradeoffs of each.
Convolutional neural networks (CNNs) have advanced the field of computer vision in recent years and enable groundbreaking, fast, automatic results in various scenarios. However, how CNN training behaves when only scarce data are available has not yet been examined in detail. Transfer learning is a technique that helps overcome training data shortages by adapting trained models to a different but related target task. We investigate the transfer learning performance of pre-trained CNN models on training datasets of varying size for binary classification problems, which resemble the discrimination between relevant and irrelevant content within a restricted context. This often plays a role in data triage applications such as screening seized storage devices for evidence. Our evaluation shows that even with a small number of training examples, the models can achieve promising performance of up to 96% accuracy. We apply these transferred models to data triage by using their softmax outputs to rank unseen images according to their assigned probability of relevance. This provides a tremendous advantage in many application scenarios where large, unordered datasets have to be screened for certain content.
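The ranking step can be shown with a minimal sketch: a binary classifier (here a trivial placeholder standing in for a fine-tuned, transferred CNN) produces softmax probabilities, and unseen images are sorted by their probability of belonging to the relevant class.

```python
# Sketch: rank unseen images by softmax probability of relevance.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))  # stand-in for a fine-tuned CNN
model.eval()

def rank_by_relevance(images: torch.Tensor):
    """Rank images by their softmax probability of belonging to the 'relevant' class."""
    with torch.no_grad():
        logits = model(images)
        relevance = F.softmax(logits, dim=1)[:, 1]  # assume class 1 = relevant
    order = torch.argsort(relevance, descending=True)
    return order, relevance[order]

batch = torch.randn(10, 3, 64, 64)  # unseen images, e.g. from a seized device
order, scores = rank_by_relevance(batch)
print(order.tolist(), scores.tolist())
```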
The analysis of complex structured data such as video has been a long-standing challenge for computer vision algorithms. Innovative deep learning architectures such as Convolutional Neural Networks (CNNs), however, are demonstrating remarkable performance in challenging image and video understanding tasks. In this work we propose an architecture for the automated detection of scored points during tennis matches. We explore two CNN-based approaches for the analysis of video streams of broadcast tennis matches. We first explore the two-stream approach, which extracts features related either to pixel intensity values, via the analysis of grayscale frames, or to motion information, encoded via optical flow. We then explore the use of a higher-order 3D CNN that simultaneously encodes both spatial and temporal correlations. Furthermore, we explore the late fusion of the individual streams in order to extract and encode both structural and motion spatio-temporal dynamics. We validate the merits of the proposed scheme using a novel manually annotated dataset created from publicly available videos.
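The late-fusion idea can be illustrated with a small sketch: an appearance stream over grayscale frames and a motion stream over stacked optical-flow fields each produce class probabilities, which are then averaged. The sub-networks below are placeholders; the paper's actual stream architectures and the 3D-CNN variant are not reproduced.

```python
# Sketch: late fusion of spatial (grayscale) and temporal (optical-flow) streams.
import torch
import torch.nn as nn
import torch.nn.functional as F

def small_cnn(in_channels: int, num_classes: int = 2) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(), nn.Linear(16, num_classes),
    )

spatial_stream = small_cnn(in_channels=1)   # grayscale frame
temporal_stream = small_cnn(in_channels=2)  # stacked optical-flow (x, y) fields

def late_fusion(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Average the per-stream class probabilities (point scored / no point)."""
    p_spatial = F.softmax(spatial_stream(frame), dim=1)
    p_temporal = F.softmax(temporal_stream(flow), dim=1)
    return (p_spatial + p_temporal) / 2

frame = torch.randn(1, 1, 112, 112)
flow = torch.randn(1, 2, 112, 112)
print(late_fusion(frame, flow))
```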