Capabilities of Image-to-text Transformation Models for Enabling Visually Impaired to Perceive Complex Imaging Visuals at Conferences and Scientific Journals

Ruthra Bellan; Frank Wittig; Reiner Creutzburg

doi:10.2352/EI.2026.38.3.MOBMU-322

Abstract

Scientific figures (charts, composite panels, and data visualizations) are routinely inaccessible to visually impaired readers because screen readers cannot interpret visual content, and published captions are often too brief or domain-specific to convey what a figure shows. Vision-language models (VLMs) offer a potential route to automated, accessible image description at scale. In this study, we evaluate five open-source, instruction-tuned VLMs (BLIP-2, LLaVA-1.5-7B, Moondream2, Qwen2-VL-2B, and Idefics3-8B) on a dataset of 245 scientific figures drawn from 32 papers presented at Electronic Imaging 2025. Generated captions are scored against author-provided ground-truth captions using four complementary metrics: BLEU, ROUGE-L, Sentence-BERT cosine similarity (SBERT), and RefCLIPScore. Moondream2 achieves the highest performance across all semantic metrics (RefCLIPScore = 1.025, SBERT = 0.392) despite being one of the smallest models evaluated (~1.86B parameters), and it offers the best balance of quality and speed (8.7 s per image). The four metrics tell a consistent story: Moondream2 scores low on lexical match but high on semantic similarity and image alignment, which is the expected pattern when detailed visual descriptions are compared against brief author captions. These findings are broadly paralleled in an evaluation of VLM-generated captions performed by a small sample of actual publication authors. Beyond indicating how suitable these VLMs are for assisting visually impaired readers, the explored approaches may also help familiarize authors and publishers of scientific articles with the needs of assistive technology and the rising expectations of accessibility regulations.
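
The low lexical scores noted in the abstract can be illustrated with a minimal ROUGE-L sketch. This is not the implementation used in the study: it assumes whitespace tokenization, lowercasing, and a balanced F1 variant of ROUGE-L, and the two example captions are hypothetical. It shows how a detailed description that fully covers a brief reference caption still receives a modest score, because precision falls as the description grows longer.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L precision, recall, and F1 (assumed variant: balanced F1)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(cand)  # share of the generated text found in the reference
    recall = lcs / len(ref)      # share of the reference recovered by the generated text
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical captions, for illustration only.
reference = "accuracy of each model"
candidate = ("a bar chart showing the accuracy of each model "
             "on the test set with error bars")
scores = rouge_l(candidate, reference)
print(scores)
```

Here the generated description contains every word of the reference in order, so recall is perfect, yet the 16-token description is four times longer than the 4-token reference and precision drops accordingly. Embedding-based metrics such as SBERT cosine similarity are less sensitive to this length mismatch, which is consistent with the pattern the abstract reports.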

Electronic Imaging

2470-1173

Society for Imaging Science and Technology

IS&T 7003 Kilworth Lane, Springfield, VA 22151 USA

MOBMU-322

Proceedings Paper

Ruthra Bellan

SRH University of Applied Sciences - Campus Berlin, Germany

Frank Wittig

SRH University of Applied Sciences - Campus Berlin, Germany

Reiner Creutzburg

SRH University of Applied Sciences - Campus Berlin, Germany

German University of Digital Science, Germany

Technische Hochschule Brandenburg, Germany

132026

MOBMU

Mobile Devices and Multimedia: Enabling Technologies, Algorithms, and Applications 2026

322-1

322-10

2026

Keywords: vision-language models; accessibility; scientific figures; automated captioning; RefCLIPScore; visual impairment; blind