Back to articles
Proceedings Paper
Volume: 38 | Article ID: GENAI-175
Image
Vision-language Learning for Wireless Capsule Endoscopy: Diagnostic Captioning with CLIP
  DOI :  10.2352/EI.2026.38.12.GENAI-175  Published OnlineMarch 2026
Abstract
Abstract

Wireless Capsule Endoscopy (WCE) is a minimally invasive diagnostic tool for examining the gastrointestinal tract, but the interpretation of large amounts of WCE image data demands extensive manual efforts and expert knowledge. Deep learning offers a promising approach to automate WCE data analysis, but training robust models is hindered by the scarcity of large-scale, high-quality labeled data in the WCE domain. This study explores the use of Contrastive Language-Image Pre-training (CLIP), a vision-language model pre-trained on extensive image-text pairs, to address these challenges in deep learning for WCE. We focus on caption retrieval and pathology classification tasks, using the CAPTIV8 dataset, a multi-modal WCE dataset containing image-diagnostic text pairs. After customizing the dataset for deep learning tasks, we conducted experiments comparing CLIP with state-of-the-art vision models. The results demonstrated that CLIP performs better than vision-only models, particularly in small-sample regimes such as one-shot and few-shot setups. By replacing the original CLIP loss with a KL-divergence loss, we further enhanced the model’s ability to handle multiple positive pairs in a mini-batch during the training, to further attune learning for this specific medical domain.

Subject Areas :
Views 47
Downloads 13
 articleview.views 47
 articleview.downloads 13
  Cite this article 

Lu Xu, Anuja Vats, Marius Pedersen, Kiran Raja, "Vision-language Learning for Wireless Capsule Endoscopy: Diagnostic Captioning with CLIPin Electronic Imaging,  2026,  pp 175-1 - 175-9,  https://doi.org/10.2352/EI.2026.38.12.GENAI-175

 Copy citation
  Copyright statement 
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
ei
Electronic Imaging
2470-1173
2470-1173
Society for Imaging Science and Technology
IS&T 7003 Kilworth Lane, Springfield, VA 22151 USA