
Scientific figures (charts, composite panels, and data visualizations) are routinely inaccessible to visually impaired readers because screen readers cannot interpret visual content and published captions are often too brief or too domain-specific to convey what a figure shows. Vision-language models (VLMs) offer a potential route to automated, accessible image description at scale. In this study, we evaluate five open-source, instruction-tuned VLMs (BLIP-2, LLaVA-1.5-7B, Moondream2, Qwen2-VL-2B, and Idefics3-8B) on a dataset of 245 scientific figures drawn from 32 papers presented at Electronic Imaging 2025. Generated captions are scored against author-provided ground-truth captions using four complementary metrics: BLEU, ROUGE-L, Sentence-BERT cosine similarity (SBERT), and RefCLIPScore. Moondream2 achieves the highest scores on all semantic metrics (RefCLIPScore = 1.025, SBERT = 0.392) despite being one of the smallest models evaluated (~1.86B parameters), and it offers the best balance of quality and speed (8.7 s per image). The four metrics tell a consistent story: Moondream2 scores low on lexical match but high on semantic similarity and image alignment, which is the expected pattern when detailed visual descriptions are compared against brief author captions. These findings are broadly paralleled in an evaluation of VLM-generated captions performed by a small sample of publication authors. Besides highlighting the suitability of these VLMs for aiding visually impaired readers, the explored approaches may also help authors and publishers of scientific articles become familiar with the needs of assistive technology and the rising expectations of accessibility regulations.
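To illustrate why a lexical metric and a semantic metric can diverge for the same caption, the following is a minimal Python sketch of ROUGE-L (as an LCS-based F-score over whitespace tokens) and of RefCLIPScore (as the harmonic mean of a scaled image-caption similarity and the best caption-reference similarity). It is not the evaluation code used in the study: the function names, the whitespace tokenization, the precomputed embedding vectors, and the scaling weight w = 2.5 are assumptions taken from the commonly cited definitions of these metrics. Under that definition the image term is scaled by w, so a RefCLIPScore above 1 is possible.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as an F1 score over lowercase whitespace tokens (assumed tokenization)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ref_clip_score(caption_emb, image_emb, reference_embs, w=2.5):
    """Harmonic mean of the scaled image-caption term and the best reference term.

    The embeddings are assumed to come from a CLIP-style encoder; w = 2.5 is
    the assumed scaling weight, not a value reported in the study.
    """
    clip_s = w * max(cosine(caption_emb, image_emb), 0.0)
    ref_s = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if clip_s == 0.0 or ref_s == 0.0:
        return 0.0
    return 2 * clip_s * ref_s / (clip_s + ref_s)
```

For example, scoring the detailed candidate "a bar chart of error rates" against the brief reference "error rates" gives full recall but a precision of only 2/6, so the ROUGE-L F1 is 0.5 even though the candidate contains the whole reference. This is the kind of lexical penalty a longer, more descriptive caption incurs.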

Vision-language pre-trained (VLP) models, such as CLIP, have exhibited remarkable performance in downstream tasks with excellent generalization capabilities. Meanwhile, textual and visual prompt learning have been widely adopted to enhance VLP model performance in downstream tasks. However, a challenging issue in visual prompt learning is its inferior performance on few-shot recognition tasks, namely its inability to capture class-specific information. We therefore propose a class-aware visual prompt learning method that enhances the perceptual abilities of VLP models with an independent class prompting module consisting of trainable prompts for each class. Because class-aware prompts tend to be inaccurate during training, we develop an intra-class compactness loss and an inter-class dispersion loss to enhance intra-class consistency. Finally, we introduce attention-based adapter layers to tackle the prompt selection issue. Extensive experiments demonstrate that our method achieves superior efficiency and effectiveness, surpassing previous visual prompting methods on a series of downstream datasets.
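The abstract does not give the form of the two auxiliary losses, so the following Python sketch shows only one plausible reading and should not be taken as the authors' formulation. It assumes that the compactness term is the mean squared Euclidean distance between each image feature and the prompt of its own class, that the dispersion term is the mean pairwise cosine similarity between prompts of different classes (lower means more dispersed), and that both are added to a task loss with hypothetical weights lam_intra and lam_inter. All function and parameter names are illustrative.

```python
import math


def _cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intra_class_compactness(features, labels, class_prompts):
    """Assumed form: mean squared distance from each feature to its own class prompt."""
    total = 0.0
    for feature, label in zip(features, labels):
        total += sum((a - b) ** 2 for a, b in zip(feature, class_prompts[label]))
    return total / len(features)


def inter_class_dispersion(class_prompts):
    """Assumed form: mean pairwise cosine similarity between different class prompts.

    Minimizing this value pushes the prompts of different classes apart.
    Requires at least two classes.
    """
    k = len(class_prompts)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(_cosine(class_prompts[i], class_prompts[j]) for i, j in pairs) / len(pairs)


def total_loss(task_loss, features, labels, class_prompts, lam_intra=1.0, lam_inter=1.0):
    """Task loss plus the two weighted auxiliary terms (weights are hypothetical)."""
    return (
        task_loss
        + lam_intra * intra_class_compactness(features, labels, class_prompts)
        + lam_inter * inter_class_dispersion(class_prompts)
    )
```

In this reading, features that sit exactly on their class prompt contribute zero compactness loss, and orthogonal class prompts contribute zero dispersion loss, so both terms vanish when classes are tight and well separated.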