Proceedings Paper
Volume: 38 | Article ID: MOBMU-322
Capabilities of Image-to-text Transformation Models for Enabling Visually Impaired to Perceive Complex Imaging Visuals at Conferences and Scientific Journals
DOI: 10.2352/EI.2026.38.3.MOBMU-322  Published Online: March 2026
Abstract

Scientific figures (charts, composite panels, and data visualizations) are routinely inaccessible to visually impaired readers because screen readers cannot interpret visual content and published captions are often too brief or too domain-specific to convey what the figure shows. Vision-language models (VLMs) offer a potential route to automated, accessible image description at scale. In this study, we evaluate five open-source, instruction-tuned VLMs (BLIP-2, LLaVA-1.5-7B, Moondream2, Qwen2-VL-2B, and Idefics3-8B) on a dataset of 245 scientific figures drawn from 32 papers presented at Electronic Imaging 2025. Generated captions are scored against author-provided ground-truth captions using four complementary metrics: BLEU, ROUGE-L, Sentence-BERT cosine similarity (SBERT), and RefCLIPScore. Moondream2 achieves the highest performance across all semantic metrics (RefCLIPScore = 1.025, SBERT = 0.392) despite being one of the smallest models evaluated (~1.86B parameters), and it offers the best balance of quality and speed (8.7 s per image). The four metrics tell a consistent story: Moondream2 scores low on lexical match but high on semantic similarity and image alignment, which is the expected pattern when detailed visual descriptions are compared against brief author captions. These findings are broadly paralleled in an evaluation of VLM-generated captions performed by a small sample of publication authors. Besides highlighting the suitability of these VLMs for aiding visually impaired individuals, the explored approaches may also help familiarize authors and publishers of scientific articles with the needs of assistive technology and the rising expectations set by accessibility regulations.
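
To make the metric pattern described in the abstract concrete, the following is a minimal Python sketch of two of the four metrics, written from their commonly published definitions and not taken from the paper's evaluation code. It assumes whitespace tokenization, the F1 variant of ROUGE-L, and the usual RefCLIPScore form (harmonic mean of a CLIP image-caption score scaled by w = 2.5 and the best caption-reference cosine). The function names, the example captions, and the cosine values are hypothetical; in practice the cosines would come from CLIP or Sentence-BERT embeddings.

```python
import math
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as an F1 score over whitespace tokens (assumed variant)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, as used for comparing sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ref_clip_score(cos_caption_image: float,
                   cos_caption_refs: Sequence[float],
                   w: float = 2.5) -> float:
    """Harmonic mean of the scaled image-caption score and the best
    caption-reference cosine. The inputs are precomputed cosines."""
    clip_s = w * max(cos_caption_image, 0.0)
    ref_s = max(max(cos_caption_refs), 0.0)
    if clip_s == 0.0 or ref_s == 0.0:
        return 0.0
    return 2 * clip_s * ref_s / (clip_s + ref_s)


# Hypothetical example: a longer generated description vs. a brief caption.
generated = "a bar chart of model accuracy"
author_caption = "bar chart of accuracy"
print(round(rouge_l_f1(generated, author_caption), 3))
print(round(ref_clip_score(0.5, [0.9]), 3))
```

Under these assumptions, extra words in a generated description lower ROUGE-L precision even when every word of the author caption is matched, which is one reason lexical scores can stay low while embedding-based scores stay high. The scaling factor w also means a RefCLIPScore slightly above 1, such as the 1.025 reported here, is within the metric's range.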

  Cite this article 

Ruthra Bellan, Frank Wittig, Reiner Creutzburg, "Capabilities of Image-to-text Transformation Models for Enabling Visually Impaired to Perceive Complex Imaging Visuals at Conferences and Scientific Journals" in Electronic Imaging, 2026, pp 322-1 - 322-10, https://doi.org/10.2352/EI.2026.38.3.MOBMU-322

  Copyright statement 
Copyright © 2026 Society for Imaging Science and Technology
Electronic Imaging
2470-1173
Society for Imaging Science and Technology
IS&T 7003 Kilworth Lane, Springfield, VA 22151 USA