A Quantitative Framework for Evaluating Color-name Understanding in Generative AI Models

Robin Jenkin; Vijayalaxmi M; Shailesh Pawale; Francis Fernandes; Ashutosh Naryagol; Salman Sanadi

doi:10.2352/EI.2026.38.12.GENAI-180

Abstract

With the proliferation of text-to-image generative AI models, understanding the fidelity of their output is critical. While these models can generate visually stunning images, their interpretation of nuanced, subjective concepts like color names remains largely unquantified. This paper introduces a systematic framework to evaluate how accurately leading generative AI models (including Flux, Ideogram, Kandinsky, Gemini, and Stable Diffusion) understand and reproduce colors from textual prompts. We prompted these models with both one-word (e.g., "blue") and two-word (e.g., "sky blue") color names to generate uniform color fields. The resulting images were analyzed by converting them to the approximately perceptually uniform CIELAB (CIE L*a*b*) color space. An adaptive k-means clustering algorithm was employed to extract the dominant color, mitigating issues of non-uniformity in the generated images. By calculating the perceptual color difference using CIEDE2000 (ΔE00) and the chromatic distance (Δab) between the AI-generated colors and standardized ground-truth values, we provide a quantitative benchmark of each model's color accuracy. Our findings reveal that while all models broadly understand the mapping between color names and hue, performance varies significantly among models, with systematic differences in lightness and chroma reproduction. Per-model analysis reveals a clear hierarchy in chromatic fidelity: Gemini and Flux demonstrate the strongest anchoring, while Kandinsky exhibits striking hue-dependent anisotropy and Stable Diffusion shows the broadest isotropic dispersion. Per-color analysis identifies systematic undersaturation of short-wavelength and high-chroma colors (blue, indigo, magenta) across all models, while warm colors (red, orange, yellow) are generally better grounded. We highlight that results vary significantly across random seeds for the same prompt and model, and that lexical specificity generally, but not universally, improves chromatic grounding. This work provides a robust methodology for auditing and improving color fidelity in future generative models.
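The two metrics named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. It assumes 8-bit sRGB input and a D65 white point, and it takes Δab to be the Euclidean distance in the a*b* plane, which the abstract does not spell out. The function names `srgb_to_lab`, `delta_ab`, and `ciede2000` are hypothetical, and the "generated" and "reference" colors at the end are made-up values for illustration only. The adaptive k-means step is omitted; the sketch starts from a single dominant color that has already been extracted.

```python
import math

def srgb_to_lab(r, g, b):
    """Convert an 8-bit sRGB triple to CIELAB (assumed D65 white point)."""
    def linearize(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear sRGB -> CIE XYZ (D65)
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    def f(t):
        d = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > d ** 3 else t / (3 * d * d) + 4.0 / 29.0
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

def delta_ab(lab1, lab2):
    """Chromatic distance: Euclidean distance in the a*b* plane (assumption)."""
    return math.hypot(lab1[1] - lab2[1], lab1[2] - lab2[2])

def ciede2000(lab1, lab2):
    """CIEDE2000 color difference with unit weights kL = kC = kH = 1."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    c_bar = (math.hypot(a1, b1) + math.hypot(a2, b2)) / 2
    g = 0.5 * (1 - math.sqrt(c_bar ** 7 / (c_bar ** 7 + 25.0 ** 7)))
    a1p, a2p = (1 + g) * a1, (1 + g) * a2
    c1p, c2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    h1p = math.degrees(math.atan2(b1, a1p)) % 360 if c1p else 0.0
    h2p = math.degrees(math.atan2(b2, a2p)) % 360 if c2p else 0.0
    d_l, d_c = L2 - L1, c2p - c1p
    # Hue difference and mean hue, with the wrap-around cases of the standard
    if c1p * c2p == 0:
        d_h_deg, h_bar = 0.0, h1p + h2p
    else:
        d_h_deg = h2p - h1p
        if d_h_deg > 180:
            d_h_deg -= 360
        elif d_h_deg < -180:
            d_h_deg += 360
        if abs(h1p - h2p) <= 180:
            h_bar = (h1p + h2p) / 2
        elif h1p + h2p < 360:
            h_bar = (h1p + h2p + 360) / 2
        else:
            h_bar = (h1p + h2p - 360) / 2
    d_h = 2 * math.sqrt(c1p * c2p) * math.sin(math.radians(d_h_deg / 2))
    l_bar, c_bar_p = (L1 + L2) / 2, (c1p + c2p) / 2
    t = (1 - 0.17 * math.cos(math.radians(h_bar - 30))
         + 0.24 * math.cos(math.radians(2 * h_bar))
         + 0.32 * math.cos(math.radians(3 * h_bar + 6))
         - 0.20 * math.cos(math.radians(4 * h_bar - 63)))
    s_l = 1 + 0.015 * (l_bar - 50) ** 2 / math.sqrt(20 + (l_bar - 50) ** 2)
    s_c = 1 + 0.045 * c_bar_p
    s_h = 1 + 0.015 * c_bar_p * t
    d_theta = 30 * math.exp(-(((h_bar - 275) / 25) ** 2))
    r_c = 2 * math.sqrt(c_bar_p ** 7 / (c_bar_p ** 7 + 25.0 ** 7))
    r_t = -math.sin(math.radians(2 * d_theta)) * r_c
    return math.sqrt((d_l / s_l) ** 2 + (d_c / s_c) ** 2 + (d_h / s_h) ** 2
                     + r_t * (d_c / s_c) * (d_h / s_h))

# Hypothetical example: a generated "red" compared with an assumed reference.
generated = srgb_to_lab(200, 30, 40)   # made-up dominant color
reference = srgb_to_lab(255, 0, 0)     # assumed ground truth, not from the paper
print("dE00:", round(ciede2000(generated, reference), 2))
print("d_ab:", round(delta_ab(generated, reference), 2))
```

Reporting both numbers separates lightness errors from chromatic ones: ΔE00 responds to differences in lightness, chroma, and hue together, whereas Δab ignores L* entirely and so isolates how far the generated color sits from the target in the chromatic plane.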
