We present the application of a Multimodal Large Language Model, specifically Gemini, to automating product image analysis for the retail industry. We demonstrate how Gemini's ability to generate text from mixed image-text prompts enables two key applications: 1) Product Attribute Extraction, where various attributes of a product in an image are extracted using open or closed vocabularies and used for any downstream analytics by retailers, and 2) Product Recognition, where a product in a user-provided image is identified and its corresponding product information is retrieved from a retailer's search index to be returned to the user. In both cases, Gemini acts as a powerful and easily customizable recognition engine, simplifying the processing pipeline for retailers' developer teams. Traditionally, these tasks required multiple models (object detection, OCR, attribute classification, embedding, etc.) working together, as well as extensive custom data collection and domain expertise. With Gemini, these tasks are streamlined to writing a set of prompts and straightforward logic to connect their outputs.
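To illustrate the attribute-extraction workflow, the sketch below shows one plausible way to prompt a multimodal model for open- and closed-vocabulary attributes using the google-generativeai Python SDK; the model name, attribute schema, and prompt wording are illustrative assumptions rather than the authors' exact implementation.

```python
# Hypothetical sketch of open/closed-vocabulary attribute extraction with the
# google-generativeai SDK; model name, attribute schema, and prompt wording
# are illustrative assumptions, not the authors' exact pipeline.
import json
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # assumed credential setup
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

ATTRIBUTES = {
    "category": ["shirt", "dress", "shoe"],  # closed vocabulary
    "color": "open",                         # open vocabulary
    "pattern": "open",
}

def extract_attributes(image_path: str) -> dict:
    image = PIL.Image.open(image_path)
    prompt = (
        "Extract the following attributes of the main product in the image "
        f"and answer as JSON: {json.dumps(ATTRIBUTES)}. "
        "For closed vocabularies, pick exactly one of the listed values."
    )
    response = model.generate_content([image, prompt])
    # Downstream analytics (or the retailer's search index) consume this dict.
    return json.loads(response.text)
```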
Few-shot learning is a prevalent problem that has attracted considerable attention in recent years, offering a powerful approach when training data are limited. Few-shot learning methods based on metric learning measure the similarity of feature embeddings between query samples and each class of support samples, so the design of the CNN-based feature extractor is the most crucial problem. Existing feature extractors are obtained by training standard convolutional networks (e.g., ResNet), which focus only on the information inside each image. However, the relations among samples may also help improve few-shot learning performance. This paper proposes a Convolutional Shared Dictionary Module (CSDM) that discovers the hidden structural information among samples for few-shot learning and reduces the dimension of sample features to remove redundant information. As a result, the learned dictionary adapts more easily to novel classes, and the reconstructed features are more discriminative. Moreover, the CSDM is a plug-and-play module that integrates the dictionary learning algorithm into the feature embedding. Experimental results on several benchmark datasets demonstrate the effectiveness of the proposed CSDM.
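As a rough illustration of a plug-and-play shared-dictionary layer, the PyTorch sketch below encodes pooled features against a learned dictionary with a single soft-thresholding step and returns the reconstructed features; the atom count, sparsity level, and one-step encoder are assumptions, not the paper's exact CSDM formulation.

```python
# Minimal sketch of a plug-and-play shared-dictionary module in the spirit of
# CSDM: features are projected onto a learned dictionary, soft-thresholded
# (one ISTA-style step), and reconstructed. Hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDictionaryModule(nn.Module):
    def __init__(self, feat_dim: int, n_atoms: int = 64, sparsity: float = 0.1):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(n_atoms, feat_dim) * 0.02)
        self.sparsity = sparsity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim) pooled backbone features shared by all classes
        d = F.normalize(self.dictionary, dim=1)        # unit-norm atoms
        codes = x @ d.t()                              # project onto atoms
        codes = torch.sign(codes) * F.relu(codes.abs() - self.sparsity)  # soft threshold
        return codes @ d                               # reconstructed features

# Usage: plug between the backbone and the metric-based few-shot classifier.
features = torch.randn(32, 512)
reconstructed = SharedDictionaryModule(512)(features)
```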
Deep learning models have advanced significantly, leading to substantial improvements in image captioning performance over the past decade. However, these improvements have come with increased model complexity and higher computational costs. Contemporary captioning models typically consist of three components: a pre-trained CNN encoder, a transformer encoder, and a decoder. Although research has extensively explored network pruning for captioning models, it has not specifically addressed pruning these three components individually. As a result, existing methods lack the generalizability required for models that deviate from the traditional configuration of image captioning systems. In this study, we introduce a pruning technique that optimizes each component of the captioning model individually, broadening its applicability to models that share similar components, such as encoder and decoder networks, even if their overall architectures differ from conventional captioning models. Additionally, we introduce a novel modification to decoder pruning via the cross-entropy loss, which significantly improves the performance of the image-captioning model. We train and validate our approach on the Flickr8k dataset and evaluate its performance using the CIDEr and ROUGE-L metrics.
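A minimal sketch of component-wise pruning is given below, assuming standard L1-magnitude unstructured pruning from torch.nn.utils.prune and illustrative per-component sparsity ratios; the paper's cross-entropy-based modification to decoder pruning is not reproduced here.

```python
# Sketch of component-wise magnitude pruning for a CNN-encoder /
# transformer-encoder / decoder captioning model. Sparsity ratios and the
# model attribute names (cnn_encoder, transformer_encoder, decoder) are
# illustrative assumptions, not the paper's tuned settings.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_component(component: nn.Module, amount: float) -> None:
    """Apply L1-magnitude unstructured pruning to every conv/linear layer."""
    for module in component.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the pruning permanent

# Each component can receive a different sparsity, since their sensitivity
# to pruning differs (hypothetical ratios shown):
# prune_component(model.cnn_encoder, amount=0.3)
# prune_component(model.transformer_encoder, amount=0.5)
# prune_component(model.decoder, amount=0.4)
```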
Breast cancer is the most common malignant tumor worldwide, and early diagnosis is crucial for effective treatment. Computer-aided diagnostic models based on deep learning have significantly improved the accuracy and efficiency of medical diagnosis. However, tumor edge features are critical for distinguishing benign from malignant lesions, and existing methods underutilize this edge information, which limits early diagnosis. To enhance the learning of breast lesion features, we propose the enhanced edge feature learning network (EEFL-Net) for mammogram classification. EEFL-Net strengthens pathology feature learning through a Sobel edge detection module and an edge detail enhancement module (EDEM). The Sobel edge detection module identifies and enhances key edge information; the resulting features then pass through the EDEM, which further refines them and enhances fine details, thereby improving classification. Experiments on two public datasets (INbreast and CBIS-DDSM) show that EEFL-Net outperforms previous advanced mammography image classification methods.
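The Sobel edge detection step can be sketched as a fixed-kernel depthwise convolution, as below; the kernels are the standard Sobel operators, while the residual fusion with the input features is an assumption for illustration and not the paper's exact module.

```python
# Minimal sketch of a Sobel edge module: fixed Gx/Gy kernels applied as a
# depthwise convolution, with an assumed residual fusion of the edge map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SobelEdge(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        kernel = torch.stack([gx, gy]).unsqueeze(1)            # (2, 1, 3, 3)
        self.register_buffer("kernel", kernel.repeat(channels, 1, 1, 1))
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Depthwise Sobel: each channel filtered by Gx and Gy independently.
        edges = F.conv2d(x, self.kernel, padding=1, groups=self.channels)
        gx, gy = edges[:, 0::2], edges[:, 1::2]
        magnitude = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)
        return x + magnitude    # assumed residual fusion of edge information
```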
Recently, X-ray prohibited item detection has been widely used for security inspection. In practical applications, items in luggage often overlap severely, leading to occlusion. In this paper, we address prohibited item detection under occlusion from the perspective of the compositional model. To this end, we propose a novel VotingNet for occluded prohibited item detection. VotingNet incorporates an Adaptive Hough Voting Module (AHVM), based on the generalized Hough transform, into widely used detectors. AHVM consists of an Attention Block (AB) and a Voting Block (VB). AB divides the voting area into multiple regions and leverages an extended Convolutional Block Attention Module (CBAM) to learn adaptive weights for inter-region and intra-region features. In this way, information from the unoccluded areas of prohibited items is fully exploited. VB collects votes from the feature maps of the different regions produced by AB. To further improve performance in the presence of occlusion, we combine AHVM with the original convolutional branches, taking full advantage of the robustness of the compositional model and the powerful representation capability of convolution. Experimental results on the OPIXray and PIDray datasets show the superiority of VotingNet when applied to widely used detectors, including representative anchor-based and anchor-free detectors.
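For reference, the sketch below implements the standard CBAM formulation (channel attention followed by spatial attention) that the Attention Block extends; the paper's per-region extension and the Hough voting mechanism itself are not reproduced here.

```python
# Standard CBAM sketch: channel attention from pooled descriptors, then
# spatial attention from channel-wise average/max maps. Reduction ratio and
# spatial kernel size follow common defaults and are assumptions here.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```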
This work addresses the challenge of identifying the provenance of illicit cultural artifacts, a task often hindered by the lack of specialized expertise among law enforcement and customs officials. To facilitate immediate assessments, we propose an improved deep learning model based on a pre-trained ResNet, fine-tuned for archaeological artifact recognition through transfer learning. Our model uniquely integrates multi-level feature extraction, capturing both textural and structural features of artifacts, and incorporates self-attention mechanisms to enhance contextual understanding. In addition, we developed two artifact datasets: one with mixed types of earthenware and one for coins, both categorized by the age and region of the artifacts. Evaluations of the proposed model on these datasets demonstrate improved recognition accuracy thanks to the enhanced feature representation.
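A minimal sketch of multi-level feature extraction from a pre-trained ResNet combined with self-attention over the pooled level descriptors is given below; the chosen layers, projection sizes, and attention configuration are assumptions for illustration, not the authors' exact architecture.

```python
# Sketch: textural (early) and structural (late) ResNet features, projected to
# a shared space and fused with self-attention before classification.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

class ArtifactClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)  # transfer learning
        self.features = create_feature_extractor(
            backbone, return_nodes={"layer2": "texture", "layer4": "structure"})
        self.proj = nn.ModuleDict({"texture": nn.Linear(512, 256),
                                   "structure": nn.Linear(2048, 256)})
        self.attn = nn.MultiheadAttention(embed_dim=256, num_heads=4,
                                          batch_first=True)
        self.head = nn.Linear(256, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.features(x)
        tokens = torch.stack(
            [self.proj[k](feats[k].mean(dim=(2, 3)))
             for k in ("texture", "structure")], dim=1)   # (batch, 2 levels, 256)
        fused, _ = self.attn(tokens, tokens, tokens)      # self-attention across levels
        return self.head(fused.mean(dim=1))
```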
In this paper, we propose a new solution for synthesizing frontal human images in video conferencing, aimed at enhancing immersive communication. Traditional methods such as center staging, gaze correction, and background replacement improve the user experience, but they do not fully address the issue of off-center camera placement. We introduce a system that utilizes two arbitrary cameras positioned on the top bezel of a display monitor to capture left and right images of the video participant. A facial landmark detection algorithm identifies key points on the participant's face, from which we estimate the head pose. A segmentation model is employed to remove the background, isolating the user. The core component of our method is a video frame interpolation technique that synthesizes a realistic frontal view of the participant by leveraging the two captured angles. This method not only enhances visual alignment between users but also maintains natural facial expressions and gaze direction, resulting in a more engaging and lifelike video conferencing experience.
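One plausible realization of the head-pose step estimates rotation from a few 2D facial landmarks and a generic 3D face model with OpenCV's solvePnP; the reference points, landmark selection, and camera intrinsics below are illustrative assumptions, as the exact pose estimator is not specified above.

```python
# Sketch of head-pose estimation from 2D landmarks with a generic 3D face
# model via cv2.solvePnP. Model points and intrinsics are rough assumptions.
import cv2
import numpy as np

# Generic 3D reference points (nose tip, chin, eye corners, mouth corners), mm.
MODEL_3D = np.array([(0.0, 0.0, 0.0), (0.0, -330.0, -65.0),
                     (-225.0, 170.0, -135.0), (225.0, 170.0, -135.0),
                     (-150.0, -150.0, -125.0), (150.0, -150.0, -125.0)])

def estimate_head_pose(landmarks_2d: np.ndarray, frame_size: tuple) -> np.ndarray:
    """landmarks_2d: (6, 2) pixel coordinates matching MODEL_3D order."""
    h, w = frame_size
    focal = w  # rough pinhole approximation of the focal length
    camera = np.array([[focal, 0, w / 2],
                       [0, focal, h / 2],
                       [0, 0, 1]], dtype=float)
    ok, rvec, tvec = cv2.solvePnP(MODEL_3D, landmarks_2d.astype(float), camera, None)
    rotation, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix
    # The estimated pose can then guide how the two side views are interpolated.
    return rotation
```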
With the widespread use of video conferencing technology for remote communication in the workplace, there is an increasing demand for face-to-face communication between participants. To address the difficulty of acquiring frontal face images, multiple RGB-D cameras have been used to capture and render the frontal faces of target subjects. However, the noise of depth cameras can lead to geometry and color errors in the reconstructed 3D surfaces. In this paper, we propose RGBD Routed Blending, a novel two-stage pipeline for video conferencing that fuses multiple noisy RGB-D images in 3D space and renders virtual color and depth output images from a new camera viewpoint. The first stage, geometry fusion, consists of an RGBD Routing Network followed by a Depth Integrating Network that fuses the RGB-D input images into a 3D volumetric geometry. This fused geometry is passed, together with the input color images, to the second stage, color blending, which renders a new color image from the target viewpoint. We quantitatively evaluate our method on two datasets, a synthetic dataset (DeformingThings4D) and a newly collected real dataset, and show that it outperforms state-of-the-art baseline methods in both geometry accuracy and color quality.
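A structural sketch of the two-stage data flow is shown below; routing_net, integration_net, and blending_net are hypothetical placeholders standing in for the RGBD Routing Network, Depth Integrating Network, and color blending stage, and only the overall data flow mirrors the description above.

```python
# Skeleton of the two-stage pipeline: stage 1 fuses N noisy RGB-D views into a
# volumetric geometry, stage 2 blends input colors into the target viewpoint.
# The three sub-networks are hypothetical placeholders, not the paper's models.
import torch
import torch.nn as nn

class RGBDRoutedBlending(nn.Module):
    def __init__(self, routing_net: nn.Module, integration_net: nn.Module,
                 blending_net: nn.Module):
        super().__init__()
        self.routing_net = routing_net          # per-pixel confidence/routing weights
        self.integration_net = integration_net  # fuses weighted depths into a volume
        self.blending_net = blending_net        # renders color from the target view

    def forward(self, colors: torch.Tensor, depths: torch.Tensor,
                target_pose: torch.Tensor):
        # colors: (B, N, 3, H, W), depths: (B, N, 1, H, W) from N cameras.
        # Stage 1: geometry fusion from the noisy RGB-D views.
        weights = self.routing_net(torch.cat([colors, depths], dim=2))
        volume = self.integration_net(depths, weights)   # 3D volumetric geometry
        # Stage 2: color blending from the target viewpoint.
        color_out, depth_out = self.blending_net(volume, colors, target_pose)
        return color_out, depth_out
```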