
With the rapid progress of multi-modal large language models (mLLMs), there is growing interest in whether such models can act as judges of image quality. A fundamental question remains, however: can such models distinguish between levels of image quality attributes such as sharpness and noise? This work is one of the first systematic investigations of mLLMs as evaluators in classical paired-comparison image quality assessment (IQA) experiments. Prior work in mLLM-based vision has focused on captioning or recognition tasks, whereas our study explicitly frames Gemini 2.0 Flash as a proxy subject in psychovisual testing to establish just noticeable differences (JNDs) for sharpness and noise, using the Kodak image quality ruler dataset as the stimulus set. For both sharpness and noise, the magnitudes of the JNDs were found to be proportional to the relative quality of the stimulus. Surprisingly, judgments of individual image pairs were found to be probabilistic rather than absolute, with more uncertainty observed for sharpness discrimination than for noise. The prompt engineering and the statistical analysis of the results are described in detail. Understanding the extent to which mLLMs can act as reliable perceptual proxies has significant implications for automated IQA, dataset labeling, and adaptive imaging pipelines.
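To make the idea of a probabilistic paired-comparison judgment concrete, the following Python sketch estimates a JND from repeated trials. It is only an illustration under stated assumptions: it uses the common paired-comparison convention that one JND corresponds to 75% correct responses, and it locates that point by linear interpolation. The abstract does not say how the authors estimated their JNDs, and the function names and data values here are hypothetical.

```python
def proportion_correct(trials):
    """Fraction of repeated trials in which the better image was chosen.

    trials: list of booleans, one per repeated judgment of the same pair.
    """
    return sum(trials) / len(trials)


def jnd_by_interpolation(points, criterion=0.75):
    """Estimate the stimulus difference at which the criterion is reached.

    points: list of (stimulus_difference, proportion_correct) tuples.
    The 75% criterion is an assumed convention, not taken from the abstract.
    Returns None if the proportions never cross the criterion from below.
    """
    pts = sorted(points)
    for (x0, p0), (x1, p1) in zip(pts, pts[1:]):
        if p0 < criterion <= p1:
            # Linear interpolation between the two bracketing points.
            return x0 + (criterion - p0) * (x1 - x0) / (p1 - p0)
    return None


# Hypothetical data: proportion correct at four stimulus separations.
example = [(1, 0.55), (2, 0.70), (3, 0.80), (4, 0.95)]
estimated_jnd = jnd_by_interpolation(example)
```

A fitted psychometric function would normally replace the linear interpolation when enough trials are available; the interpolation is used here only to keep the sketch short.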

With the proliferation of text-to-image generative AI, understanding the fidelity of the generated output is critical. While these models can generate visually stunning images, their interpretation of nuanced, subjective concepts like color names remains largely unquantified. This paper introduces a systematic framework to evaluate how accurately leading generative AI models (including Flux, Ideogram, Kandinsky, Gemini and Stable Diffusion) understand and reproduce colors from textual prompts. We prompted these models with both one-word (e.g., “blue”) and two-word (e.g., “sky blue”) color names to generate uniform color fields. The resulting images were analyzed by converting them to the perceptually uniform CIE Lab color space. An adaptive k-means clustering algorithm was employed to extract the dominant color, mitigating issues of non-uniformity in the generated images. By calculating the perceptual color difference using CIEDE2000 (ΔE00) and the chromatic distance (Δab) between the AI-generated colors and standardized ground-truth values, we provide a quantitative benchmark of each model’s color accuracy. Our findings reveal that while all models broadly understand the mapping between color names and hue, performance varies significantly among models, with systematic differences in lightness and chroma reproduction. Per-model analysis reveals a clear hierarchy in chromatic fidelity: Gemini and Flux demonstrate the strongest anchoring, while Kandinsky exhibits striking hue-dependent anisotropy and Stable Diffusion shows the broadest isotropic dispersion. Per-color analysis identifies systematic undersaturation of short-wavelength and high-chroma colors (blue, indigo, magenta) across all models, while warm colors (red, orange, yellow) are generally better grounded. We highlight that results vary significantly across random seeds for the same prompt and model, and that lexical specificity generally, but not universally, improves chromatic grounding. This work provides a robust methodology for auditing and improving color fidelity in future generative models.
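The analysis steps named above (conversion to CIE Lab, dominant-color extraction, chromatic distance) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. Assumptions: input pixels are 8-bit sRGB with a D65 white point; a fixed-k k-means with farthest-point initialization stands in for the adaptive variant mentioned in the abstract; Δab is taken as the Euclidean distance in the a*b* plane; CIEDE2000 is omitted for brevity; all function names are hypothetical; and the reference color (pure sRGB blue) is chosen for illustration only and is not the paper's ground-truth value.

```python
import math

# D65 reference white, the usual choice for sRGB (assumption).
WHITE = (0.95047, 1.0, 1.08883)


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIE Lab under D65."""
    lin = []
    for c in rgb:
        c = c / 255.0
        # Undo the sRGB transfer curve to obtain linear light.
        lin.append(c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4)
    r, g, b = lin
    # Linear sRGB to CIE XYZ.
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        d = 6 / 29
        return t ** (1 / 3) if t > d ** 3 else t / (3 * d * d) + 4 / 29

    fx, fy, fz = f(x / WHITE[0]), f(y / WHITE[1]), f(z / WHITE[2])
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def dominant_lab(pixels_lab, k=2, iters=10):
    """Mean Lab color of the largest k-means cluster (fixed k, not adaptive)."""
    # Farthest-point initialization keeps the sketch deterministic.
    centers = [pixels_lab[0]]
    while len(centers) < k:
        centers.append(
            max(pixels_lab, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in pixels_lab:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[idx].append(p)
        # Recompute each center; an empty cluster keeps its previous center.
        centers = [
            tuple(sum(ch) / len(g) for ch in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    largest = max(groups, key=len)
    return tuple(sum(ch) / len(largest) for ch in zip(*largest))


def delta_ab(lab1, lab2):
    """Chromatic distance in the a*b* plane, ignoring lightness."""
    return math.hypot(lab1[1] - lab2[1], lab1[2] - lab2[2])


# Hypothetical image: mostly blue with a few white outlier pixels.
pixels = [(0, 0, 255)] * 90 + [(255, 255, 255)] * 10
dominant = dominant_lab([srgb_to_lab(p) for p in pixels])
reference = srgb_to_lab((0, 0, 255))  # illustrative reference only
distance = delta_ab(dominant, reference)
```

Taking the largest cluster rather than the image mean is what keeps the outlier pixels from pulling the extracted color toward white, which is the role the abstract assigns to the clustering step.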

Digital watermarks for texts come in numerous forms. Either the text itself or its appearance (font, letter spacing, or line spacing) can be modified. Here, we present an approach that marks the text itself by introducing changes to the written words. Numerous methods for this are known, such as switching from active to passive voice, modulating sentence lengths, or substituting synonyms. We use ChatGPT to supplement existing texts with suggestions for synonymous formulations. We also examine how ChatGPT can help evaluate the transparency of the marked texts.
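The synonym-replacement idea can be illustrated with a short Python sketch in which each synonym pair carries one bit of the watermark. This is an assumption-laden toy, not the approach described in the abstract: the synonym table is hand-written and hypothetical (the abstract obtains suggestions from ChatGPT), tokenization is plain whitespace splitting, and case and punctuation are ignored.

```python
# Hypothetical synonym pairs: the first word encodes bit 0, the second bit 1.
SYNONYMS = [("big", "large"), ("quick", "fast"), ("begin", "start")]

# Map every known word to its pair and to the bit it encodes.
LOOKUP = {word: (pair, bit) for pair in SYNONYMS for bit, word in enumerate(pair)}


def embed(text, bits):
    """Rewrite replaceable words so that they spell out the given bits."""
    remaining = list(bits)
    out = []
    for word in text.split():
        if word in LOOKUP and remaining:
            pair, _ = LOOKUP[word]
            out.append(pair[remaining.pop(0)])
        else:
            # Words outside the table, or beyond the payload, stay unchanged.
            out.append(word)
    return " ".join(out)


def extract(text):
    """Read one bit from every word that belongs to a synonym pair."""
    return [LOOKUP[word][1] for word in text.split() if word in LOOKUP]


marked = embed("they begin with a big and quick test", [1, 1, 0])
recovered = extract(marked)
```

Because extraction reads a bit from every replaceable word, a practical scheme would also need to signal the payload length, and the substitutions would need to be checked for transparency, which is the evaluation step the abstract assigns to ChatGPT.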

With the exponential growth of large language models (LLMs), enhancing model adaptability for diverse real-world applications has become crucial. This study critically examines domain-specific fine-tuning of ChatGPT and explores the potential of In-Context Learning (ICL) as a complementary strategy, highlighting the delicate balance between generalizability and specificity in health promotion communication. Employing two distinct fine-tuning strategies (single-prompt interactions and multi-turn conversation models), the research advances current methodologies for tailoring LLMs to specialized domains. By incorporating approaches such as data augmentation, transfer learning, and adaptive fine-tuning, alongside structured Meta-Prompting, the study systematically evaluates ChatGPT’s adaptability in handling health-specific dialogues, comparing model performance across varied interaction types. Case studies and targeted customization strategies underscore the practical utility of these adaptations in applied health communication contexts and demonstrate enhanced contextual understanding in multi-turn interactions. Results indicate the superior efficacy of the multi-turn approach in managing nuanced, contextually rich dialogues, underscoring the model’s capacity for sustained engagement in health-related discourse. ICL with Meta-Prompting, on the other hand, demonstrates notable flexibility and resource efficiency. These findings have significant implications for advancing AI in health communication, suggesting a developmental trajectory that integrates technological sophistication with a focus on empathetic user engagement.