Recent progress at the intersection of deep learning and imaging has created a new wave of interest in imaging and multimedia analytics topics, from social media sharing to augmented reality, from food and nutrition to health surveillance, and from remote sensing and agriculture to wildlife and environmental monitoring. Compared to many subjects in traditional imaging, these topics are more multi-disciplinary in nature. This conference will provide a forum for researchers and engineers from various related areas, both academic and industrial, to exchange ideas and share research results in this rapidly evolving field.
Generative Adversarial Networks (GANs) have been widely investigated for image synthesis because of their powerful representation learning ability. In this work, we explore StyleGAN and its application to synthetic food image generation. Despite the impressive performance of GANs for natural image generation, food images exhibit high intra-class diversity and inter-class similarity, resulting in overfitting and visual artifacts in the synthetic images. Therefore, we aim to explore the capability and improve the performance of GAN methods for food image generation. Specifically, we first choose StyleGAN3 as the baseline method to generate synthetic food images and analyze its performance. Then, we identify two issues that can cause performance degradation on food images during the training phase: (1) inter-class feature entanglement when training on multiple food classes and (2) loss of high-resolution detail during image downsampling. To address both issues, we propose to train on one food category at a time to avoid feature entanglement and to leverage image patches cropped from high-resolution datasets to retain fine details. We evaluate our method on the Food-101 dataset and show improved quality of the generated synthetic food images compared with the baseline. Finally, we demonstrate the great potential of improving the performance of downstream tasks, such as food image classification, by including high-quality synthetic training samples for data augmentation.
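A minimal sketch of the per-category, patch-based data preparation described above, assuming high-resolution images organized in one directory per food class (the directory layout, crop size, and output paths are illustrative assumptions, not the paper's exact pipeline):

```python
import random
from pathlib import Path
from PIL import Image

def sample_patches(class_dir, out_dir, patch_size=512, patches_per_image=4, seed=0):
    """Crop random patches from high-resolution images of a single food class,
    so a separate generator can be trained per category on fine detail."""
    random.seed(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(class_dir).glob("*.jpg")):
        img = Image.open(img_path).convert("RGB")
        w, h = img.size
        if w < patch_size or h < patch_size:
            continue  # skip images smaller than the patch size
        for i in range(patches_per_image):
            left = random.randint(0, w - patch_size)
            top = random.randint(0, h - patch_size)
            patch = img.crop((left, top, left + patch_size, top + patch_size))
            patch.save(out / f"{img_path.stem}_p{i}.png")

# e.g. one run per food category, keeping classes separate to avoid feature entanglement
# sample_patches("data/highres/pizza", "data/patches/pizza")
```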
Food image classification is the groundwork for image-based dietary assessment, which is the process of monitoring what kinds of food and how much energy are consumed using captured food or eating-scene images. Existing deep learning based methods learn the visual representation for food classification from human annotations of each food image. However, most food images captured in real life are obtained without labels, so human annotation is required to train deep learning based methods, which is not feasible for real-world deployment due to its high cost. To make use of the vast amount of unlabeled images, many existing works focus on unsupervised or self-supervised learning to learn the visual representation directly from unlabeled data. However, none of these existing works focuses on food images, which are more challenging than general objects due to their high inter-class similarity and intra-class variance. In this paper, we focus on two objectives: comparing existing models and developing an effective self-supervised learning model for food image classification. Specifically, we first compare the performance of existing state-of-the-art self-supervised learning models, including SimSiam, SimCLR, SwAV, BYOL, MoCo, and the Rotation pretext task, on food images. The experiments are conducted on the Food-101 dataset, which contains 101 different classes of food with 1,000 images per class. Next, we analyze the unique features of each model and compare their performance on food images to identify the key factors in each model that can help improve accuracy. Finally, we propose a new model for unsupervised visual representation learning on food images for the classification task.
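As one example of the ingredients being compared, here is a minimal sketch of the SimCLR-style NT-Xent contrastive loss over two augmented views of a batch (the temperature and batch handling are illustrative, not the exact settings used in the comparison):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss for two augmented views of the same batch.
    z1, z2: (N, d) projections; matching rows are the positive pairs."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                      # (2N, d)
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    n = z1.size(0)
    # the positive of row i is row i + n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# e.g. loss = nt_xent_loss(projector(backbone(view1)), projector(backbone(view2)))
```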
In this paper, we present a method for agglomerative clustering of characters in a video. Given an edited video featuring people, we seek to identify each person with the character they represent. The proposed method is based on agglomerative clustering of deep face features using first-neighbour relations. First, the heads and faces of each person are detected and tracked in each shot of the video. Then, we create a feature vector for each tracked person in a shot. Finally, we compare the feature vectors and use first-neighbour relations to group them into distinct characters. The main contribution of this work is a person re-identification framework based on an agglomerative clustering method, applied to edited videos with large scene variations.
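A minimal sketch of the first-neighbour grouping step, assuming one face feature vector per tracked person; this FINCH-style linking illustrates the idea and is not necessarily the authors' exact implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def first_neighbor_clusters(features):
    """Link every track feature to its nearest neighbour; the connected
    components of that graph form the character clusters."""
    f = np.asarray(features, dtype=np.float64)
    f /= np.linalg.norm(f, axis=1, keepdims=True)
    sim = f @ f.T                                   # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)                  # a track is not its own neighbour
    nn = sim.argmax(axis=1)                         # first (nearest) neighbour of each track
    n = f.shape[0]
    rows = np.arange(n)
    adj = csr_matrix((np.ones(n), (rows, nn)), shape=(n, n))
    adj = adj + adj.T                               # symmetrize the first-neighbour links
    _, labels = connected_components(adj, directed=False)
    return labels                                   # cluster id per tracked person

# e.g. labels = first_neighbor_clusters(track_face_features)
```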
Real-time video super-resolution (VSR) has been considered a promising solution for improving video quality in video conferencing and media playback, applications that require low latency and short inference time. Although state-of-the-art VSR methods with well-designed architectures have been proposed, many of them cannot be transformed into real-time VSR models because of their vast computational complexity and memory occupation. In this work, we propose a lightweight recurrent network for this task, in which the motion compensation offset is estimated by an optical flow estimation network, features extracted from the previous high-resolution output are aligned to the current target frame, and a hidden space is utilized to propagate long-term information. We show that the proposed method is efficient for real-time video super-resolution. We also carefully study the effect of including an optical flow estimation module in a lightweight recurrent VSR model and compare two ways of training the models. We further compare four different motion estimation networks that have been used in lightweight VSR approaches and demonstrate the importance of reducing information loss in motion estimation.
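A minimal sketch of the recurrent propagation step, assuming a dense flow field from some optical flow estimator and a hidden state carried over from the previous frame (module names, channel counts, and the fusion scheme are placeholders, not the paper's architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(x, flow):
    """Warp a feature map x (N, C, H, W) with a dense flow field flow (N, 2, H, W)."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()          # (2, H, W): x then y
    pos = base.unsqueeze(0) + flow                       # absolute sampling positions
    gx = 2.0 * pos[:, 0] / max(w - 1, 1) - 1.0           # normalize to [-1, 1]
    gy = 2.0 * pos[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=3)                  # (N, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)

class RecurrentStep(nn.Module):
    """One step: warp the hidden state to the current frame, fuse, upsample."""
    def __init__(self, hid=32, scale=4):
        super().__init__()
        self.scale = scale
        self.fuse = nn.Conv2d(3 + hid, hid, 3, padding=1)
        self.up = nn.Sequential(nn.Conv2d(hid, 3 * scale * scale, 3, padding=1),
                                nn.PixelShuffle(scale))

    def forward(self, lr_frame, hidden, flow):
        hidden = flow_warp(hidden, flow)                 # align long-term information
        hidden = torch.relu(self.fuse(torch.cat([lr_frame, hidden], dim=1)))
        sr = self.up(hidden) + F.interpolate(lr_frame, scale_factor=self.scale,
                                             mode="bilinear", align_corners=False)
        return sr, hidden
```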
Video conferencing usage increased dramatically during the pandemic and is expected to remain high in hybrid work. One of the key aspects of the video experience is background blur or background replacement, which relies on good-quality portrait segmentation in each frame. Software and hardware manufacturers have worked together to utilize depth sensors to improve the process. Existing solutions have incorporated the depth map into post-processing to generate a more natural blurring effect. In this paper, we propose to collect background features with the help of the depth map to improve the segmentation result obtained from the RGB image. Our results show significant improvements over RGB-only networks, and our method runs faster than model-based background feature collection approaches.
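A minimal sketch of one way a depth map could gate which pixels contribute to a running background feature model that then supports segmentation; the depth threshold, update rate, and the way the background estimate is consumed are illustrative assumptions, not the paper's design:

```python
import numpy as np

class DepthGatedBackground:
    """Keep a running RGB background estimate, updated only where depth says 'far'."""
    def __init__(self, depth_thresh_m=1.5, momentum=0.05):
        self.depth_thresh = depth_thresh_m
        self.momentum = momentum
        self.bg = None

    def update(self, frame_rgb, depth_m):
        frame = frame_rgb.astype(np.float32)
        if self.bg is None:
            self.bg = frame.copy()
        far = (depth_m > self.depth_thresh)[..., None]   # background pixels per the depth map
        self.bg = np.where(far,
                           (1 - self.momentum) * self.bg + self.momentum * frame,
                           self.bg)
        return self.bg  # e.g. concatenated with the RGB frame as extra segmentation input
```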
Hand hygiene is essential for food safety and for food handlers. Maintaining proper hand hygiene can improve food safety and promote public welfare. However, traditional methods of evaluating hygiene during the food handling process, such as visual auditing by human experts, can be costly and inefficient compared to a computer vision system. Because of the varying conditions and locations of real-world food processing sites, computer vision systems for recognizing handwashing actions can be susceptible to changes in lighting and environment. Therefore, we design a robust and generalizable video system, based on ResNet50, that includes a hand extraction method and a two-stream network for classifying handwashing actions. More specifically, our hand extraction method eliminates the background and helps the classifier focus on hand regions under changing lighting conditions and environments. Our results demonstrate that our system with the hand extraction method improves action recognition accuracy and generalizes better, achieving over 20% improvement in overall classification accuracy when evaluated on completely unseen data.
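A minimal sketch of a two-stream classifier of the kind described, assuming one stream takes background-removed RGB frames and the other takes stacked optical flow; the channel counts, number of classes, and logit-averaging fusion are placeholder choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TwoStreamHandwash(nn.Module):
    """Two ResNet50 streams: hand-masked RGB and stacked optical flow; average the logits."""
    def __init__(self, num_classes=7, flow_stack=10):
        super().__init__()
        self.rgb = resnet50(weights=None)
        self.rgb.fc = nn.Linear(self.rgb.fc.in_features, num_classes)
        self.flow = resnet50(weights=None)
        # the flow stream takes 2 * flow_stack channels (x/y flow for each stacked frame)
        self.flow.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7, stride=2,
                                    padding=3, bias=False)
        self.flow.fc = nn.Linear(self.flow.fc.in_features, num_classes)

    def forward(self, rgb_masked, flow_input):
        return (self.rgb(rgb_masked) + self.flow(flow_input)) / 2

# e.g. logits = model(hand_masked_rgb, stacked_flow)  # background zeroed by the hand mask
```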
Simulating the effects of skincare products on the face is a potential new mode of product self-promotion while helping consumers choose the right product. Furthermore, such simulations enable consumers to anticipate their skin condition and better manage their skin health. However, effective simulations are currently lacking. In this paper, we propose the first simulation model to reveal facial pore changes after using skincare products. Our simulation pipeline consists of two steps: training data establishment and facial pore simulation. To establish the training data, we collect face images with various pore quality indexes from short-term (8-week) clinical studies. People experience significant skin fluctuations (due to natural rhythms, external stressors, etc.) that introduce large perturbations, so we propose a sliding window mechanism to clean the data and select representative indexes that capture facial pore changes. The facial pore simulation stage consists of three modules: a UNet-based segmentation module to localize facial pores; a regression module to predict time-dependent warping hyperparameters; and a deformation module that takes the warping hyperparameters and pore segmentation labels as inputs to precisely deform the pores accordingly. The proposed simulation renders realistic facial pore changes. This work will pave the way for future research in facial skin simulation and skincare product development.
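A minimal sketch of a sliding-window cleanup of a noisy per-subject pore-index time series; the window length and the use of a median as the representative value are assumptions for illustration, not the paper's exact mechanism:

```python
import numpy as np

def sliding_window_representative(index_series, window=3):
    """Pick one representative pore-index value per window, damping short-term
    skin fluctuations before the values are used as targets for simulation."""
    x = np.asarray(index_series, dtype=float)
    reps = []
    for start in range(0, len(x) - window + 1):
        reps.append(np.median(x[start:start + window]))   # robust to single-visit outliers
    return np.array(reps)

# e.g. cleaned = sliding_window_representative(weekly_pore_index, window=3)
```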
We present a head-mounted holographic display system for thermographic image overlay, biometric sensing, and wireless telemetry. The system is lightweight and reconfigurable for multiple field applications, including object contour detection and enhancement, breathing rate detection, and telemetry over a mobile phone for peer-to-peer communication and an incident command dashboard. Because of the limited computing power of the embedded system, we developed lightweight image processing algorithms for edge detection and breathing rate detection, as well as an image compression codec. The system can be integrated into a helmet or personal protection equipment such as a face shield or goggles. It can be applied to firefighting, medical emergency response, and other first-response operations. Finally, we present a case study of "Cold Trailing" for forest fire prevention in the wild.
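A minimal sketch of one lightweight way breathing rate could be estimated from a thermal stream, assuming a mean-temperature signal extracted from a nose/mask region of interest; the ROI, sampling rate, and frequency band are illustrative assumptions, not the deployed algorithm:

```python
import numpy as np

def breathing_rate_bpm(roi_mean_temps, fps):
    """Estimate breaths per minute from a temperature time series via its dominant
    frequency in a plausible breathing band (0.1-0.7 Hz, i.e. 6-42 breaths/min)."""
    x = np.asarray(roi_mean_temps, dtype=float)
    x = x - x.mean()                                   # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    band = (freqs >= 0.1) & (freqs <= 0.7)
    if not band.any():
        return None                                    # clip too short for this band
    peak = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak

# e.g. rate = breathing_rate_bpm(mean_temp_per_frame, fps=9)  # assumed thermal frame rate
```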
The ability to identify individual cows quickly and readily in the barn would enable real-time monitoring of their behavior, health, eating habits, and more, all of which could save time, money, and effort. This work focuses on creating an eidetic recognition or re-identification (ReID) algorithm that learns to recognize individual cows from just a single training example per cow and with near-zero time to learn to identify a new cow, both features that existing cattle ReID systems lack. Our algorithm is designed to improve recognition robustness to deformations in cow bodies that occur when they are walking, turning, or seen slightly off-angle. Individual cows are first detected and localized using popular keypoint and mask detection techniques, then aligned to a fixed template, pixelated, binarized to reduce lighting effects, and serialized to obtain bit-vectors. Bit-vectors from cows at inference time are matched to those from training time using the Hamming distance. To improve results, we add modules to verify the validity of detected keypoints, interpolate missing keypoints, and combine predictions from multiple frames using a majority vote. The video-level accuracy is over 60% for a set of nearly 150 Holstein cows.
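A minimal sketch of the pixelate-binarize-match idea, assuming an already template-aligned grayscale crop of the cow's coat; the grid size and median thresholding are illustrative choices, not the paper's exact parameters:

```python
import cv2
import numpy as np

def to_bitvector(aligned_gray, grid=(64, 32)):
    """Pixelate an aligned grayscale crop, binarize against its median to reduce
    lighting effects, and flatten to a bit-vector."""
    small = cv2.resize(aligned_gray, grid, interpolation=cv2.INTER_AREA)
    bits = (small > np.median(small)).astype(np.uint8)
    return bits.flatten()

def rank_gallery(query_bits, gallery):
    """Return gallery cow ids sorted by Hamming distance to the query bit-vector."""
    dists = {cow_id: int(np.count_nonzero(query_bits != bits))
             for cow_id, bits in gallery.items()}
    return sorted(dists, key=dists.get)

# e.g. one enrollment image per cow builds the gallery; frame-level matches can then
# be combined over a video with a majority vote.
```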