
Historical paintings reflect the social, cultural, and religious contexts of their time. Vision–language models (VLMs) can now generate textual interpretations of such images, but it remains unclear what information the models rely on and how their outputs should be evaluated. This study examines the characteristics and validity of VLM-generated interpretations through two experiments. First, an art style classification task on the Pandora dataset shows that VLMs tend to place paintings in historically related styles, although they do not always distinguish strictly between them. Second, focusing on religious paintings, we measure the agreement between generated interpretations and museum descriptions with BERTScore under three input conditions: image only, image with metadata, and metadata only. Providing metadata improves scores, whereas visual input has limited impact. Moreover, evaluation outcomes depend strongly on the content of the reference texts. These findings suggest that VLM-based interpretation relies more on linguistic context than on visual information, and they highlight the limitations of using museum descriptions as evaluation references.
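
As a rough illustration of the metric used in the second experiment, the sketch below shows the greedy-matching idea behind BERTScore: each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and the two are combined into an F1 value. It is a simplified stand-in and not the study's implementation. Real BERTScore takes contextual embeddings from a pretrained language model and can apply idf weighting; here the function name `greedy_match_f1` and the two-dimensional toy vectors are hypothetical and chosen only to keep the example self-contained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_match_f1(cand_vecs, ref_vecs):
    """BERTScore-style F1 over token embeddings (hypothetical helper, no idf weighting)."""
    if not cand_vecs or not ref_vecs:
        return 0.0
    # Precision: every candidate token takes its best match in the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: every reference token takes its best match in the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


# Toy embeddings (assumed values, not data from the study).
reference = [[1.0, 0.0], [0.0, 1.0]]   # stands in for a museum description
candidate = [[1.0, 0.0]]               # stands in for a generated interpretation
score = greedy_match_f1(candidate, reference)
```

In a setup like the one described above, the interpretation generated under each of the three input conditions would be scored against the same museum description, so any difference in scores reflects the inputs given to the model and not the reference.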