Image Processing: Algorithms and Systems continues the tradition of its predecessor conference, Nonlinear Image Processing and Pattern Analysis, in exploring new image processing algorithms. Specifically, the conference aims to highlight the importance of the interaction between transform-, model-, and learning-based approaches for creating effective algorithms and building modern imaging systems for new and emerging applications. It also echoes the growing call for integrating theoretical research on image processing algorithms with the more applied research on image processing systems.
Recently, various video inpainting models have been released. Video inpainting is used to naturally erase an unwanted object from a video. However, these models typically require frames extracted from the video along with corresponding masks, and in most cases this data is prepared manually. We propose a novel End-to-End Video Object Removal framework with Cropping Interested Region and Video Quality Assessment (ORCA). ORCA is built in an end-to-end fashion by combining detection, segmentation, and inpainting modules. A distinguishing characteristic of the proposed framework is a cropping step applied before the inpainting step. In addition, we propose our own video quality assessment metric, since ORCA uses two models for inpainting; the new metric indicates which of the two models produces the higher-quality result. Experimental results show the superior performance of the proposed methods.
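As a rough illustration of the crop-before-inpaint idea described in this abstract (not the ORCA implementation itself), the following Python sketch chains a placeholder detector, a cropping step, a box-shaped stand-in for segmentation, and OpenCV's classical inpainting in place of a learned inpainting model; the video path, bounding box, and helper names are hypothetical.

    import cv2
    import numpy as np

    def detect_object(frame):
        # Placeholder detector: in a real pipeline this would be a learned
        # detection module. Returns a hypothetical (x, y, w, h) box.
        return (100, 80, 60, 60)

    def remove_object(frame, bbox, margin=20):
        x, y, w, h = bbox
        H, W = frame.shape[:2]
        # Crop a region of interest around the detection before inpainting,
        # mirroring the crop-before-inpaint step.
        x0, y0 = max(0, x - margin), max(0, y - margin)
        x1, y1 = min(W, x + w + margin), min(H, y + h + margin)
        roi = frame[y0:y1, x0:x1]
        # Placeholder "segmentation": mask the whole detected box inside the crop.
        mask = np.zeros(roi.shape[:2], dtype=np.uint8)
        mask[y - y0:y - y0 + h, x - x0:x - x0 + w] = 255
        # Classical inpainting as a stand-in for a learned video inpainting model.
        filled = cv2.inpaint(roi, mask, 3, cv2.INPAINT_TELEA)
        out = frame.copy()
        out[y0:y1, x0:x1] = filled
        return out

    cap = cv2.VideoCapture("input.mp4")  # hypothetical input video
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        bbox = detect_object(frame)
        cleaned = remove_object(frame, bbox)
    cap.release()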
Anomalous behavior detection is a challenging research area within computer vision. One such behavior is the throwing action in traffic flow, which is one of the unique requirements of our Smart City project to enhance public safety. This paper proposes a solution for throwing action detection in surveillance videos using deep learning. At present, datasets for throwing actions are not publicly available. To address the use case of our Smart City project, we first generate the novel public 'Throwing Action' dataset, consisting of 271 videos of throwing actions performed by traffic participants, such as pedestrians, bicyclists, and car drivers, and 130 normal videos without throwing actions. Second, we compare the performance of different feature extractors for our anomaly detection method on the UCF-Crime and Throwing-Action datasets. Finally, we improve the performance of the anomaly detection algorithm by applying the Adam optimizer instead of Adadelta, and we propose a mean normal loss function that yields better anomaly detection performance. The experimental results reach an area under the ROC curve of 86.10 on the Throwing-Action dataset and 80.13 on the combined UCF-Crime+Throwing dataset.
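The abstract does not give the exact form of the mean normal loss, so the following PyTorch sketch is only a plausible interpretation: a standard MIL-style ranking objective over clip-level anomaly scores, plus a term that pushes the mean score of normal segments toward zero, trained with Adam as the paper suggests. The scoring head, feature dimension, and segment counts are assumptions.

    import torch

    def ranking_loss_with_mean_normal(scores_anom, scores_norm, lam=1.0):
        # scores_anom / scores_norm: per-segment anomaly scores of one anomalous
        # and one normal video. The mean-normal term is our guess at the paper's
        # "mean normal loss"; the exact formulation may differ.
        hinge = torch.clamp(1.0 - scores_anom.max() + scores_norm.max(), min=0.0)
        return hinge + lam * scores_norm.mean()

    # Hypothetical scoring head over pre-extracted clip features.
    model = torch.nn.Sequential(
        torch.nn.Linear(2048, 512), torch.nn.ReLU(),
        torch.nn.Linear(512, 1), torch.nn.Sigmoid(),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam instead of Adadelta

    feats_anom = torch.randn(32, 2048)   # 32 segments of an anomalous video
    feats_norm = torch.randn(32, 2048)   # 32 segments of a normal video
    loss = ranking_loss_with_mean_normal(
        model(feats_anom).squeeze(-1), model(feats_norm).squeeze(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()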
In the considered hybrid diffractive imaging system, a refractive lens is combined with a multilevel phase mask (MPM) serving as a diffractive optical element (DOE) for achromatic extended-depth-of-field (EDoF) imaging. This paper proposes a fully differentiable image formation model that uses neural network techniques to maximize imaging quality by jointly optimizing the MPM, the digital image reconstruction algorithm, the refractive lens parameters (aperture size, focal length), and the distance between the MPM and the sensor. In the first stage, model-based numerical simulations and end-to-end joint optimization of the imaging pipeline are used. In the second stage of the design, a spatial light modulator (SLM) is employed to implement the MPM optimized in the first stage, and the image processing is optimized experimentally using a learning-based approach. The third stage targets joint optimization of the SLM phase pattern and the image reconstruction algorithm in a hardware-in-the-loop (HIL) setup, which compensates for the mismatch between numerical modeling and the physical reality of the optics and sensor. A comparative analysis of the imaging accuracy and quality obtained with the optimized optical parameters is presented. It is demonstrated experimentally, for the first time to the best of our knowledge, that wavefront phase modulation can provide imaging quality superior to that of some commercial multi-lens cameras.
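To make the idea of a fully differentiable image formation model concrete, here is a minimal PyTorch sketch, assuming a simplified scalar Fourier-optics approximation: a learnable phase profile defines a PSF, the PSF blurs an image, a Wiener-style filter with a learnable regularizer restores it, and both are updated end-to-end. Grid size, initialization, and the reconstruction are illustrative assumptions, not the authors' model.

    import torch

    N = 64
    phase = (0.1 * torch.randn(N, N)).requires_grad_()   # learnable MPM phase (toy)
    alpha = torch.tensor(0.01, requires_grad=True)       # learnable Wiener regularizer
    image = torch.rand(N, N)                              # stand-in scene

    opt = torch.optim.Adam([phase, alpha], lr=1e-2)
    for step in range(100):
        pupil = torch.exp(1j * phase)                     # unit-amplitude aperture
        psf = torch.fft.fftshift(torch.fft.fft2(pupil)).abs() ** 2
        psf = psf / psf.sum()
        otf = torch.fft.fft2(torch.fft.ifftshift(psf))
        blurred = torch.fft.ifft2(torch.fft.fft2(image) * otf).real
        # Simple Wiener-style reconstruction with a learnable regularization weight.
        restored = torch.fft.ifft2(
            torch.fft.fft2(blurred) * torch.conj(otf)
            / (otf.abs() ** 2 + alpha.abs() + 1e-3)
        ).real
        loss = torch.mean((restored - image) ** 2)        # end-to-end imaging loss
        opt.zero_grad()
        loss.backward()
        opt.step()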
Deep learning, which has been very successful in recent years, requires a large amount of data. Active learning has been widely studied and used for decades to reduce annotation costs and now attracts considerable attention in deep learning. Many real-world deep learning applications use active learning to select the informative data to be annotated. In this paper, we first investigate laboratory settings for active learning. We show significant gaps between the results from different laboratory settings and describe a practical laboratory setting that reasonably reflects active learning use cases in real-world applications. Then, we introduce the problem setting of blind imbalanced domains. Any dataset includes multiple domains, e.g., individuals with different social attributes in handwritten character recognition. Major domains have many samples in the training set, whereas minor domains have few; however, we must accurately infer both major and minor domains in the test phase. We experimentally compare different active learning methods for blind imbalanced domains in our practical laboratory setting. We show that a simple active learning method using the softmax margin and a model training method using distance-based sampling with center loss, both operating in the deep feature space, perform well.
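For readers unfamiliar with margin-based selection, the following short Python sketch shows the standard softmax-margin criterion on an unlabeled pool: samples whose top two class probabilities are closest are queried first. The pool size, class count, and function name are illustrative; the paper's exact variant and the center-loss sampling step are not reproduced here.

    import torch

    def select_by_softmax_margin(logits, k):
        # Smallest gap between the top-2 softmax probabilities = most uncertain.
        probs = torch.softmax(logits, dim=1)
        top2 = probs.topk(2, dim=1).values
        margin = top2[:, 0] - top2[:, 1]
        return torch.argsort(margin)[:k]      # indices of pool samples to annotate

    pool_logits = torch.randn(1000, 26)       # hypothetical unlabeled pool, 26 classes
    to_label = select_by_softmax_margin(pool_logits, k=100)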
Scientific user facilities present a unique set of challenges for image processing due to the large volume of data generated from experiments and simulations. Furthermore, developing and implementing algorithms for real-time processing and analysis while correcting for artifacts or distortions in images remains a complex task, given the computational requirements of the processing algorithms. In a collaborative effort across multiple Department of Energy national laboratories, the "MLExchange" project is focused on addressing these challenges. MLExchange is a machine learning framework that deploys interactive web interfaces to enhance and accelerate data analysis. The platform allows users to easily upload, visualize, label, and train networks. The resulting models can be deployed on real data, and both results and models can be shared with other scientists. The MLExchange web-based application for image segmentation allows for training, testing, and evaluating multiple machine learning models on hand-labeled tomography data. This environment provides users with an intuitive interface for segmenting images using a variety of machine learning algorithms and deep-learning neural networks. Additionally, these tools have the potential to overcome limitations of traditional image segmentation techniques, particularly for complex and low-contrast images.
Emotions play an important role in our lives as responses to our interactions with others, our decisions, and so on. Among various emotional signals, facial expression is one of the most powerful and natural means for humans to convey their emotions and intentions, and it has the advantage that the information can be obtained easily using only a camera, so facial-expression-based emotion research is being actively conducted. Facial expression recognition (FER) has been studied by classifying expressions into seven basic emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral. Before the advent of deep learning, handcrafted feature extractors and simple classifiers such as SVM and AdaBoost were used to recognize facial emotion. With deep learning, it is now possible to recognize facial expressions without handcrafted feature extractors. Despite the excellent performance achieved in FER research, it remains a challenging task due to external factors such as occlusion, illumination, and pose, as well as similarity between different facial expressions. In this paper, we propose a method that trains a combination of a ResNet [1] and a Visual Transformer [2], called FViT, and uses Histogram of Oriented Gradients (HOG) [3] data to address the similarity problem between facial expressions.
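As a minimal sketch of the HOG side of such a pipeline (our illustration, not the FViT code), the snippet below computes a HOG descriptor for a face crop with scikit-image; such gradient-orientation features can then be supplied to the network alongside the raw image to help separate visually similar expressions. The crop size and HOG parameters are assumptions.

    import numpy as np
    from skimage.feature import hog

    face = np.random.rand(96, 96)                   # stand-in grayscale face crop
    hog_feat = hog(face, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2), feature_vector=True)
    print(hog_feat.shape)                           # fixed-length descriptor for this crop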
Understanding facial expressions is key to a better understanding of human nature. In this contribution, we propose an end-to-end pipeline that takes color images as input and produces a semantic graph that numerically encodes facial emotions. The approach leverages low-level geometric details of the face, which are numerical representations of facial muscle activation patterns, to build this emotional understanding. We show that our method recovers social expectations of what characterizes facial emotions.
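The abstract does not specify how the semantic graph is built, so the following Python sketch only illustrates the general idea of encoding low-level geometric details as a graph: pairwise distances between 2D facial landmarks (from any detector) become edge weights. The landmark names, coordinates, and edge set are hypothetical.

    import numpy as np
    import networkx as nx

    landmarks = {
        "mouth_left": (0.30, 0.70), "mouth_right": (0.70, 0.70),
        "brow_left": (0.30, 0.30), "brow_right": (0.70, 0.30),
        "eye_left": (0.35, 0.40), "eye_right": (0.65, 0.40),
    }
    edges = [("mouth_left", "mouth_right"), ("brow_left", "eye_left"),
             ("brow_right", "eye_right")]

    G = nx.Graph()
    for a, b in edges:
        d = float(np.linalg.norm(np.subtract(landmarks[a], landmarks[b])))
        G.add_edge(a, b, weight=d)   # distances act as crude muscle-activation cues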
Scale invariance and high miss-detection rates for small objects are among the challenging issues for object detection and often lead to inaccurate results. This research aims to provide an accurate detection model for crowd counting by focusing on human head detection in natural scenes from the publicly available Casablanca, Hollywood-Heads, and Scut-head datasets. In this study, we fine-tuned YOLOv5, a deep convolutional neural network (CNN)-based object detection architecture, and then evaluated the model using the mean average precision (mAP) score, precision, and recall. A transfer learning approach is used for fine-tuning the architecture. Training on one dataset and testing the model on another leads to inaccurate results because the datasets contain different types of heads. Another main contribution of our research is combining the three datasets into a single dataset that includes every kind of head: medium, large, and small. The experimental results show that this YOLOv5 architecture yields significant improvements in small head detection in crowded scenes compared to other baseline approaches, such as the Faster R-CNN and the VGG-16-based SSD MultiBox detector.
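A sketch of how the combined-dataset fine-tuning could be set up is given below, assuming the standard YOLOv5 repository workflow: a data configuration listing all three datasets is written in Python, and training starts from pretrained weights (transfer learning). All paths, the split layout, and the hyperparameters shown are hypothetical.

    import yaml

    # Hypothetical merged data config for the three head datasets.
    data = {
        "train": ["casablanca/images/train", "hollywood_heads/images/train",
                  "scut_head/images/train"],
        "val": ["casablanca/images/val", "hollywood_heads/images/val",
                "scut_head/images/val"],
        "nc": 1,                    # single class: head
        "names": ["head"],
    }
    with open("heads_combined.yaml", "w") as f:
        yaml.safe_dump(data, f)

    # Fine-tuning from pretrained weights with the yolov5 repository:
    #   python train.py --img 640 --batch 16 --epochs 100 \
    #       --data heads_combined.yaml --weights yolov5s.pt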
Image classification is extensively used in various applications such as satellite imagery, autonomous driving, smartphones, and healthcare. Most of the images used to train classification models can be considered ideal, i.e., without any degradation due to corruption of pixels in the camera sensor, sudden shake blur, or compression of images in a specific format. In this paper, we propose ILIAC, a novel CNN-based architecture for the classification of degraded images based on intermediate-layer knowledge distillation and the cutout data augmentation approach. Our approach achieves 1.1% and 0.4% mean accuracy improvements over the current state-of-the-art approach across all degradation levels of JPEG and AWGN, respectively. Furthermore, the ILIAC method is computationally efficient, with about half the model parameters and GFLOPs count of the previous state-of-the-art approach. Additionally, we demonstrate that a larger teacher network is not necessarily needed in knowledge distillation to improve the performance and generalization of a smaller student network for the classification of degraded images.
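The two ingredients named above can be sketched in a few lines of PyTorch (our illustration, not the ILIAC code): a cutout-style augmentation and a distillation loss that matches an intermediate student feature map to the teacher's in addition to the softened logits. Feature shapes, loss weights, and the temperature are assumptions.

    import torch
    import torch.nn.functional as F

    def cutout(images, size=8):
        # Zero out a random square patch in each image (cutout augmentation).
        out = images.clone()
        _, _, H, W = out.shape
        for img in out:
            y = torch.randint(0, H - size + 1, (1,)).item()
            x = torch.randint(0, W - size + 1, (1,)).item()
            img[:, y:y + size, x:x + size] = 0.0
        return out

    def distill_loss(s_logits, t_logits, s_feat, t_feat, y, T=4.0, a=0.5, b=0.5):
        ce = F.cross_entropy(s_logits, y)
        kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                      F.softmax(t_logits / T, dim=1),
                      reduction="batchmean") * (T * T)
        feat = F.mse_loss(s_feat, t_feat)        # intermediate-layer matching
        return ce + a * kd + b * feat

    # Toy usage with random tensors standing in for real model outputs.
    x = cutout(torch.rand(4, 3, 32, 32))
    loss = distill_loss(torch.randn(4, 10), torch.randn(4, 10),
                        torch.randn(4, 64), torch.randn(4, 64),
                        torch.randint(0, 10, (4,)))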