Recent progress at the intersection of deep learning and imaging has created a new wave of interest in imaging and multimedia analytics topics, from social media sharing to augmented reality, from food and nutrition to health surveillance, and from remote sensing and agriculture to wildlife and environmental monitoring. Compared to many subjects in traditional imaging, these topics are more multi-disciplinary in nature. This conference will provide a forum for researchers and engineers from various related areas, both academic and industrial, to exchange ideas and share research results in this rapidly evolving field.
Generative Adversarial Networks (GANs) have been widely investigated for image synthesis because of their powerful representation learning ability. In this work, we explore StyleGAN and its application to synthetic food image generation. Despite the impressive performance of GANs for natural image generation, food images exhibit high intra-class diversity and inter-class similarity, resulting in overfitting and visual artifacts in the synthetic images. Therefore, we aim to explore the capability and improve the performance of GAN methods for food image generation. Specifically, we first choose StyleGAN3 as the baseline method to generate synthetic food images and analyze its performance. Then, we identify two issues that can cause performance degradation on food images during the training phase: (1) inter-class feature entanglement when training on multiple food classes and (2) loss of high-resolution detail during image downsampling. To address both issues, we propose to train on one food category at a time to avoid feature entanglement and to leverage image patches cropped from high-resolution datasets to retain fine details. We evaluate our method on the Food-101 dataset and show improved quality of the generated synthetic food images compared with the baseline. Finally, we demonstrate the great potential of improving the performance of downstream tasks, such as food image classification, by including high-quality synthetic training samples for data augmentation.
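A minimal sketch of the per-category, patch-based data preparation described above, assuming high-resolution images organized in one directory per food class (the directory layout, crop size, and output paths are illustrative assumptions, not the paper's exact pipeline):

```python
import random
from pathlib import Path
from PIL import Image

def sample_patches(class_dir, out_dir, patch_size=512, patches_per_image=4, seed=0):
    """Crop random patches from high-resolution images of a single food class,
    so a separate generator can be trained per category on fine detail."""
    random.seed(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(class_dir).glob("*.jpg")):
        img = Image.open(img_path).convert("RGB")
        w, h = img.size
        if w < patch_size or h < patch_size:
            continue  # skip images smaller than the patch size
        for i in range(patches_per_image):
            left = random.randint(0, w - patch_size)
            top = random.randint(0, h - patch_size)
            patch = img.crop((left, top, left + patch_size, top + patch_size))
            patch.save(out / f"{img_path.stem}_p{i}.png")

# e.g. one run per food category, keeping classes separate to avoid feature entanglement
# sample_patches("data/highres/pizza", "data/patches/pizza")
```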
Food image classification is the groundwork for image-based dietary assessment, which is the process of monitoring what kinds of food and how much energy are consumed using captured food or eating-scene images. Existing deep learning based methods learn the visual representation for food classification from human annotations of each food image. However, most food images captured in real life are obtained without labels, so human annotation is required to train deep learning based methods, which is not feasible for real-world deployment due to its high cost. To make use of the vast amount of unlabeled images, many existing works focus on unsupervised or self-supervised learning to learn the visual representation directly from unlabeled data. However, none of these existing works focuses on food images, which are more challenging than general objects due to their high inter-class similarity and intra-class variance. In this paper, we focus on two objectives: comparing existing models and developing an effective self-supervised learning model for food image classification. Specifically, we first compare the performance of existing state-of-the-art self-supervised learning models, including SimSiam, SimCLR, SwAV, BYOL, MoCo, and the Rotation pretext task, on food images. The experiments are conducted on the Food-101 dataset, which contains 101 different classes of food with 1,000 images per class. Next, we analyze the unique features of each model and compare their performance on food images to identify the key factors in each model that can help improve accuracy. Finally, we propose a new model for unsupervised visual representation learning on food images for the classification task.
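As one example of the ingredients being compared, here is a minimal sketch of the SimCLR-style NT-Xent contrastive loss over two augmented views of a batch (the temperature and batch handling are illustrative, not the exact settings used in the comparison):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss for two augmented views of the same batch.
    z1, z2: (N, d) projections; matching rows are the positive pairs."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                      # (2N, d)
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    n = z1.size(0)
    # the positive of row i is row i + n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# e.g. loss = nt_xent_loss(projector(backbone(view1)), projector(backbone(view2)))
```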
In this paper, we present a method for agglomerative clustering of characters in a video. Given an edited video featuring people, we seek to identify each person with the character they represent. The proposed method is based on agglomerative clustering of deep face features using first-neighbour relations. First, the heads and faces of each person are detected and tracked in each shot of the video. Then, we create a feature vector for each tracked person in a shot. Finally, we compare the feature vectors and use first-neighbour relations to group them into distinct characters. The main contribution of this work is a person re-identification framework based on an agglomerative clustering method, applied to edited videos with large scene variations.
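A minimal sketch of the first-neighbour grouping step, assuming one face feature vector per tracked person; this FINCH-style linking illustrates the idea and is not necessarily the authors' exact implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def first_neighbor_clusters(features):
    """Link every track feature to its nearest neighbour; the connected
    components of that graph form the character clusters."""
    f = np.asarray(features, dtype=np.float64)
    f /= np.linalg.norm(f, axis=1, keepdims=True)
    sim = f @ f.T                                   # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)                  # a track is not its own neighbour
    nn = sim.argmax(axis=1)                         # first (nearest) neighbour of each track
    n = f.shape[0]
    rows = np.arange(n)
    adj = csr_matrix((np.ones(n), (rows, nn)), shape=(n, n))
    adj = adj + adj.T                               # symmetrize the first-neighbour links
    _, labels = connected_components(adj, directed=False)
    return labels                                   # cluster id per tracked person

# e.g. labels = first_neighbor_clusters(track_face_features)
```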
Real-time video super-resolution (VSR) has been considered a promising solution for improving video quality in video conferencing and media playback, applications that require low latency and short inference time. Although state-of-the-art VSR methods with well-designed architectures have been proposed, many of them cannot be transformed into real-time VSR models because of their vast computational complexity and memory occupation. In this work, we propose a lightweight recurrent network for this task, in which the motion compensation offset is estimated by an optical flow estimation network, features extracted from the previous high-resolution output are aligned to the current target frame, and a hidden space is utilized to propagate long-term information. We show that the proposed method is efficient for real-time video super-resolution. We also carefully study the effect of including an optical flow estimation module in a lightweight recurrent VSR model and compare two ways of training the models. We further compare four different motion estimation networks that have been used in lightweight VSR approaches and demonstrate the importance of reducing information loss in motion estimation.
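A minimal sketch of the recurrent propagation step, assuming a dense flow field from some optical flow estimator and a hidden state carried over from the previous frame (module names, channel counts, and the fusion scheme are placeholders, not the paper's architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(x, flow):
    """Warp a feature map x (N, C, H, W) with a dense flow field flow (N, 2, H, W)."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()          # (2, H, W): x then y
    pos = base.unsqueeze(0) + flow                       # absolute sampling positions
    gx = 2.0 * pos[:, 0] / max(w - 1, 1) - 1.0           # normalize to [-1, 1]
    gy = 2.0 * pos[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=3)                  # (N, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)

class RecurrentStep(nn.Module):
    """One step: warp the hidden state to the current frame, fuse, upsample."""
    def __init__(self, hid=32, scale=4):
        super().__init__()
        self.scale = scale
        self.fuse = nn.Conv2d(3 + hid, hid, 3, padding=1)
        self.up = nn.Sequential(nn.Conv2d(hid, 3 * scale * scale, 3, padding=1),
                                nn.PixelShuffle(scale))

    def forward(self, lr_frame, hidden, flow):
        hidden = flow_warp(hidden, flow)                 # align long-term information
        hidden = torch.relu(self.fuse(torch.cat([lr_frame, hidden], dim=1)))
        sr = self.up(hidden) + F.interpolate(lr_frame, scale_factor=self.scale,
                                             mode="bilinear", align_corners=False)
        return sr, hidden
```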
Video conferencing usage increased dramatically during the pandemic and is expected to remain high in hybrid work. One of the key aspects of the video experience is background blur or background replacement, which relies on good-quality portrait segmentation in each frame. Software and hardware manufacturers have worked together to utilize depth sensors to improve the process. Existing solutions have incorporated the depth map into post-processing to generate a more natural blurring effect. In this paper, we propose to collect background features with the help of the depth map to improve the segmentation result obtained from the RGB image. Our results show significant improvements over RGB-only networks, and our method runs faster than model-based background feature collection approaches.
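A minimal sketch of one way a depth map could gate which pixels contribute to a running background feature model that then supports segmentation; the depth threshold, update rate, and the way the background estimate is consumed are illustrative assumptions, not the paper's design:

```python
import numpy as np

class DepthGatedBackground:
    """Keep a running RGB background estimate, updated only where depth says 'far'."""
    def __init__(self, depth_thresh_m=1.5, momentum=0.05):
        self.depth_thresh = depth_thresh_m
        self.momentum = momentum
        self.bg = None

    def update(self, frame_rgb, depth_m):
        frame = frame_rgb.astype(np.float32)
        if self.bg is None:
            self.bg = frame.copy()
        far = (depth_m > self.depth_thresh)[..., None]   # background pixels per the depth map
        self.bg = np.where(far,
                           (1 - self.momentum) * self.bg + self.momentum * frame,
                           self.bg)
        return self.bg  # e.g. concatenated with the RGB frame as extra segmentation input
```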
Hand hygiene is essential for food safety and for food handlers. Maintaining proper hand hygiene can improve food safety and promote public welfare. However, traditional methods of evaluating hygiene during the food handling process, such as visual auditing by human experts, can be costly and inefficient compared to a computer vision system. Because of the varying conditions and locations of real-world food processing sites, computer vision systems for recognizing handwashing actions can be susceptible to changes in lighting and environment. Therefore, we design a robust and generalizable video system, based on ResNet50, that includes a hand extraction method and a two-stream network for classifying handwashing actions. More specifically, our hand extraction method eliminates the background and helps the classifier focus on hand regions under changing lighting conditions and environments. Our results demonstrate that our system with the hand extraction method improves action recognition accuracy and generalizes better, achieving over 20% improvement in overall classification accuracy when evaluated on completely unseen data.
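A minimal sketch of a two-stream classifier of the kind described, assuming one stream takes background-removed RGB frames and the other takes stacked optical flow; the channel counts, number of classes, and logit-averaging fusion are placeholder choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TwoStreamHandwash(nn.Module):
    """Two ResNet50 streams: hand-masked RGB and stacked optical flow; average the logits."""
    def __init__(self, num_classes=7, flow_stack=10):
        super().__init__()
        self.rgb = resnet50(weights=None)
        self.rgb.fc = nn.Linear(self.rgb.fc.in_features, num_classes)
        self.flow = resnet50(weights=None)
        # the flow stream takes 2 * flow_stack channels (x/y flow for each stacked frame)
        self.flow.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7, stride=2,
                                    padding=3, bias=False)
        self.flow.fc = nn.Linear(self.flow.fc.in_features, num_classes)

    def forward(self, rgb_masked, flow_input):
        return (self.rgb(rgb_masked) + self.flow(flow_input)) / 2

# e.g. logits = model(hand_masked_rgb, stacked_flow)  # background zeroed by the hand mask
```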
Simulating the effects of skincare products on the face is a potential new mode of product self-promotion while helping consumers choose the right product. Furthermore, such simulations enable consumers to anticipate their skin condition and better manage their skin health. However, effective simulations are currently lacking. In this paper, we propose the first simulation model to reveal facial pore changes after using skincare products. Our simulation pipeline consists of two steps: training data establishment and facial pore simulation. To establish the training data, we collect face images with various pore quality indexes from short-term (8-week) clinical studies. People experience significant skin fluctuations (due to natural rhythms, external stressors, etc.) that introduce large perturbations, so we propose a sliding window mechanism to clean the data and select representative indexes that capture facial pore changes. The facial pore simulation stage consists of three modules: a UNet-based segmentation module to localize facial pores; a regression module to predict time-dependent warping hyperparameters; and a deformation module that takes the warping hyperparameters and pore segmentation labels as inputs to precisely deform the pores accordingly. The proposed simulation renders realistic facial pore changes. This work will pave the way for future research in facial skin simulation and skincare product development.
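A minimal sketch of a sliding-window cleanup of a noisy per-subject pore-index time series; the window length and the use of a median as the representative value are assumptions for illustration, not the paper's exact mechanism:

```python
import numpy as np

def sliding_window_representative(index_series, window=3):
    """Pick one representative pore-index value per window, damping short-term
    skin fluctuations before the values are used as targets for simulation."""
    x = np.asarray(index_series, dtype=float)
    reps = []
    for start in range(0, len(x) - window + 1):
        reps.append(np.median(x[start:start + window]))   # robust to single-visit outliers
    return np.array(reps)

# e.g. cleaned = sliding_window_representative(weekly_pore_index, window=3)
```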
We present a head-mounted holographic display system for thermographic image overlay, biometric sensing, and wireless telemetry. The system is lightweight and reconfigurable for multiple field applications, including object contour detection and enhancement, breathing rate detection, and telemetry over a mobile phone for peer-to-peer communication and an incident command dashboard. Because of the limited computing power of the embedded system, we developed lightweight image processing algorithms for edge detection and breathing rate detection, as well as an image compression codec. The system can be integrated into a helmet or personal protection equipment such as a face shield or goggles. It can be applied to firefighting, medical emergency response, and other first-response operations. Finally, we present a case study of "Cold Trailing" for forest fire prevention in the wild.
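A minimal sketch of one lightweight way breathing rate could be estimated from a thermal stream, assuming a mean-temperature signal extracted from a nose/mask region of interest; the ROI, sampling rate, and frequency band are illustrative assumptions, not the deployed algorithm:

```python
import numpy as np

def breathing_rate_bpm(roi_mean_temps, fps):
    """Estimate breaths per minute from a temperature time series via its dominant
    frequency in a plausible breathing band (0.1-0.7 Hz, i.e. 6-42 breaths/min)."""
    x = np.asarray(roi_mean_temps, dtype=float)
    x = x - x.mean()                                   # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    band = (freqs >= 0.1) & (freqs <= 0.7)
    if not band.any():
        return None                                    # clip too short for this band
    peak = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak

# e.g. rate = breathing_rate_bpm(mean_temp_per_frame, fps=9)  # assumed thermal frame rate
```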
The ability to identify individual cows quickly and readily in the barn would enable real-time monitoring of their behavior, health, eating habits, and more, all of which could save time, money, and effort. This work focuses on creating an eidetic recognition or re-identification (ReID) algorithm that learns to recognize individual cows from just a single training example per cow and with near-zero time to learn to identify a new cow, both features that existing cattle ReID systems lack. Our algorithm is designed to improve recognition robustness to deformations in cow bodies that occur when they are walking, turning, or seen slightly off-angle. Individual cows are first detected and localized using popular keypoint and mask detection techniques, then aligned to a fixed template, pixelated, binarized to reduce lighting effects, and serialized to obtain bit-vectors. Bit-vectors from cows at inference time are matched to those from training time using the Hamming distance. To improve results, we add modules to verify the validity of detected keypoints, interpolate missing keypoints, and combine predictions from multiple frames using a majority vote. The video-level accuracy is over 60% for a set of nearly 150 Holstein cows.
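A minimal sketch of the pixelate-binarize-match idea, assuming an already template-aligned grayscale crop of the cow's coat; the grid size and median thresholding are illustrative choices, not the paper's exact parameters:

```python
import cv2
import numpy as np

def to_bitvector(aligned_gray, grid=(64, 32)):
    """Pixelate an aligned grayscale crop, binarize against its median to reduce
    lighting effects, and flatten to a bit-vector."""
    small = cv2.resize(aligned_gray, grid, interpolation=cv2.INTER_AREA)
    bits = (small > np.median(small)).astype(np.uint8)
    return bits.flatten()

def rank_gallery(query_bits, gallery):
    """Return gallery cow ids sorted by Hamming distance to the query bit-vector."""
    dists = {cow_id: int(np.count_nonzero(query_bits != bits))
             for cow_id, bits in gallery.items()}
    return sorted(dists, key=dists.get)

# e.g. one enrollment image per cow builds the gallery; frame-level matches can then
# be combined over a video with a majority vote.
```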