Keywords
Compositional model
dictionary learning
few-shot learning
Image Captioning
Network Pruning
Occlusion
Object detection
reconstructed feature
Security inspection
Transformer
Voting
X-ray images
Pages 265-1 - 265-7, © 2025 Society for Imaging Science and Technology
Volume 37
Issue 8
Abstract

Few-shot learning has attracted considerable attention in recent years and is a powerful approach when training data are limited. Few-shot learning methods based on metric learning measure the similarity of feature embeddings between query-set samples and each class of support-set samples, so the design of the CNN-based feature extractor is the crucial problem. Existing feature extractors are obtained by training standard convolutional networks (e.g., ResNet), which focus only on the information inside each image. However, the relations among samples may also help improve few-shot learning performance. This paper proposes a Convolutional Shared Dictionary Module (CSDM) that uncovers hidden structural information among samples for few-shot learning and reduces the dimension of sample features to remove redundant information. As a result, the learned dictionary adapts more easily to novel classes, and the reconstructed features are more discriminative. Moreover, the CSDM is a plug-and-play module that integrates the dictionary learning algorithm into the feature embedding. Experimental results on several benchmark datasets demonstrate the effectiveness of the proposed CSDM.
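The abstract does not include code, but the following is a minimal PyTorch sketch of how a plug-and-play shared-dictionary module might sit between a backbone and a metric-learning head. The class name, atom count, and one-step soft-thresholded coding are illustrative assumptions, not the authors' implementation; choosing num_atoms smaller than the feature dimension is what yields the compact, redundancy-reduced code space the abstract describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSharedDictionary(nn.Module):
    """Hypothetical sketch of a plug-and-play shared-dictionary module.
    Codes pooled backbone features against a learned dictionary and
    returns the reconstructed features."""

    def __init__(self, feat_dim: int, num_atoms: int, sparsity: float = 0.1):
        super().__init__()
        # Shared dictionary: num_atoms atoms, each of dimension feat_dim.
        self.dictionary = nn.Parameter(torch.randn(num_atoms, feat_dim))
        self.sparsity = sparsity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim) embeddings from, e.g., a ResNet backbone.
        atoms = F.normalize(self.dictionary, dim=1)   # unit-norm atoms
        codes = x @ atoms.t()                         # (batch, num_atoms)
        codes = F.softshrink(codes, self.sparsity)    # one ISTA-like sparsifying step
        return codes @ atoms                          # reconstructed features

# Usage: drop the module between the backbone and the metric-learning head.
feats = torch.randn(8, 512)                               # 8 pooled embeddings
recon = ConvSharedDictionary(512, num_atoms=64)(feats)    # (8, 512)
```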

Digital Library: EI
Published Online: February 2025
Pages 266-1 - 266-5, © 2025 Society for Imaging Science and Technology
Volume 37
Issue 8
Abstract

Deep learning models have advanced significantly, leading to substantial improvements in image captioning performance over the past decade. However, these improvements have come with increased model complexity and higher computational costs. Contemporary captioning models typically consist of three components: a pre-trained CNN encoder, a transformer encoder, and a decoder. Although research has extensively explored network pruning for captioning models, it has not specifically addressed pruning these three components individually. As a result, existing methods lack the generalizability required for models that deviate from the traditional configuration of image captioning systems. In this study, we introduce a pruning technique that optimizes each component of the captioning model individually, broadening its applicability to models that share similar components, such as encoder and decoder networks, even if their overall architectures differ from conventional captioning models. Additionally, we introduce a novel modification during decoder pruning based on the cross-entropy loss, which significantly improves the performance of the image-captioning model. We train and validate our approach on the Flickr8k dataset and evaluate its performance using the CIDEr and ROUGE-L metrics.
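As an illustration of what per-component pruning can look like in practice, the sketch below applies standard L1 magnitude pruning with a separate sparsity budget per component, using torch.nn.utils.prune. The helper name, the example modules, and the budgets are assumptions; the paper's cross-entropy-based decoder modification is not reproduced here.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_component(component: nn.Module, amount: float) -> None:
    """Apply L1 magnitude pruning to every Linear/Conv2d layer in one
    component, then make the resulting sparsity permanent."""
    for m in component.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            prune.l1_unstructured(m, name="weight", amount=amount)
            prune.remove(m, "weight")

# Illustrative budgets: prune the decoder more gently than the encoder.
cnn_encoder = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU())
decoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 100))
prune_component(cnn_encoder, amount=0.5)
prune_component(decoder, amount=0.2)
```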

Digital Library: EI
Published Online: February 2025
Pages 270-1 - 270-6, © 2025 Society for Imaging Science and Technology
Volume 37
Issue 8
Abstract

Recently, X-ray prohibited item detection has been widely used in security inspection. In practical applications, the items in luggage are often severely overlapped, leading to the problem of occlusion. In this paper, we address prohibited item detection under occlusion from the perspective of the compositional model. To this end, we propose VotingNet, a novel network for occluded prohibited item detection. VotingNet incorporates an Adaptive Hough Voting Module (AHVM), based on the generalized Hough transform, into widely used detectors. AHVM consists of an Attention Block (AB) and a Voting Block (VB). AB divides the voting area into multiple regions and leverages an extended Convolutional Block Attention Module (CBAM) to learn adaptive weights for inter-region and intra-region features; in this way, the information from unoccluded areas of the prohibited items is fully exploited. VB collects votes from the feature maps of the different regions given by AB. To improve performance in the presence of occlusion, we combine AHVM with the original convolutional branches, taking full advantage of the robustness of the compositional model and the powerful representational capability of convolution. Experimental results on the OPIXray and PIDray datasets show the superiority of VotingNet on widely used detectors (including representative anchor-based and anchor-free detectors).
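For intuition, here is a toy PyTorch sketch of region-divided voting: the feature map is split into a grid of regions, each region is weighted by a learned attention score, and the weighted votes are summed. The RegionVoting name, the grid size, and the pooling-based attention are illustrative assumptions that stand in for, rather than reproduce, AHVM's AB and VB.

```python
import torch
import torch.nn as nn

class RegionVoting(nn.Module):
    """Toy sketch (not the AHVM implementation): split the feature map
    into a grid of regions, weight each region with learned attention,
    and sum the weighted votes."""

    def __init__(self, channels: int, grid: int = 2):
        super().__init__()
        self.grid = grid
        # One attention logit per region, from globally pooled region features.
        self.attn = nn.Linear(channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        g = self.grid
        # Split into g*g non-overlapping regions.
        regions = x.unfold(2, h // g, h // g).unfold(3, w // g, w // g)
        regions = regions.contiguous().view(b, c, g * g, h // g, w // g)
        regions = regions.permute(0, 2, 1, 3, 4)        # (b, g*g, c, h/g, w/g)
        pooled = regions.mean(dim=(3, 4))               # (b, g*g, c)
        weights = torch.softmax(self.attn(pooled), 1)   # (b, g*g, 1)
        return (pooled * weights).sum(dim=1)            # weighted vote sum, (b, c)

x = torch.randn(2, 256, 16, 16)
print(RegionVoting(256)(x).shape)                        # torch.Size([2, 256])
```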

Digital Library: EI
Published Online: February 2025
