
With the rapid progress of multi-modal large language models (mLLMs), there is growing interest in whether such models can act as judges of image quality. A fundamental question remains, however: can such models distinguish between different levels of image quality attributes such as sharpness and noise? This work represents one of the first systematic investigations of mLLMs as evaluators in classical paired comparison image quality assessment (IQA) experiments. Prior work in mLLM-based vision has focused on captioning or recognition tasks. Our study instead explicitly frames Gemini 2.0 Flash as a proxy subject in psychovisual testing, using the Kodak image quality ruler dataset as stimuli to establish just noticeable differences (JNDs) for sharpness and noise. For both attributes, the magnitudes of the JNDs were found to be proportional to the relative quality of the stimulus. Surprisingly, judgments of individual image pairs were found to be probabilistic rather than absolute, with more uncertainty observed for sharpness discrimination than for noise discrimination. The prompt engineering and the statistical analysis of the results are both detailed. Understanding the extent to which mLLMs can act as reliable perceptual proxies has transformative implications for automated IQA, dataset labeling, and adaptive imaging pipelines.
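The link between probabilistic pair judgments and JNDs can be illustrated with a minimal sketch. It is not the authors' analysis: it assumes the common psychometric convention that one JND corresponds to a 75:25 split in paired-comparison responses, together with a normal (Thurstone Case V style) response model. The function name `preference_to_jnd` and the trial counts used to exercise it are hypothetical.

```python
from statistics import NormalDist


def preference_to_jnd(wins: int, trials: int) -> float:
    """Convert a paired-comparison win count into a signed JND estimate.

    Assumptions (not taken from the paper): responses follow a normal
    model, and 1 JND is the separation that yields a 75:25 preference.
    """
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("need trials > 0 and 0 <= wins <= trials")

    # Clamp the proportion away from 0 and 1 so the inverse CDF stays finite
    # when one image is chosen in every repeated trial.
    p = min(max(wins / trials, 0.5 / trials), 1.0 - 0.5 / trials)

    standard_normal = NormalDist()
    z = standard_normal.inv_cdf(p)          # separation in z-score units
    z_per_jnd = standard_normal.inv_cdf(0.75)  # z-score of a 75:25 split
    return z / z_per_jnd
```

For example, if one image were preferred in 15 of 20 repeated presentations of the same pair, this sketch would place the pair about one JND apart under the stated assumptions, while a 10 of 20 split would place it at zero.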
Robin Jenkin, Preeti S Pillai, Aruna S Nayak, Abhishek A Joshi, Vasudhaika S, Sinchana C, Abhishek Patil, "Psychovisual Experimentation Using mLLMs as Observers," in Electronic Imaging, 2026, pp. 178-1 - 178-8, https://doi.org/10.2352/EI.2026.38.12.GENAI-178