We present the application of a multimodal large language model, specifically Gemini, to automating product image analysis for the retail industry. We demonstrate how Gemini's ability to generate text from mixed image-text prompts enables two key applications: 1) Product Attribute Extraction, where various attributes of a product in an image are extracted using open or closed vocabularies and used by retailers for downstream analytics, and 2) Product Recognition, where a product in a user-provided image is identified and its corresponding product information is retrieved from a retailer's search index and returned to the user. In both cases, Gemini acts as a powerful and easily customizable recognition engine, simplifying the processing pipeline for retailers' developer teams. Traditionally, these tasks required multiple models (object detection, OCR, attribute classification, embedding, etc.) working together, as well as extensive custom data collection and domain expertise. With Gemini, these tasks are streamlined by writing a set of prompts and straightforward logic to connect their outputs.
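To illustrate the prompt-based pattern the abstract describes, the following is a minimal sketch of product attribute extraction using the google-generativeai Python SDK. The model name, attribute schema, prompt wording, and placeholder API key are illustrative assumptions, not details taken from the paper; the authors' actual prompts and pipeline may differ.

```python
# Minimal sketch of prompt-based product attribute extraction with Gemini.
# Model name, attribute schema, and prompt wording are illustrative assumptions.
import json

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")          # hypothetical placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model choice

PROMPT = (
    "You are a retail catalog assistant. For the main product in the image, "
    "extract the following attributes and answer in JSON: "
    "product_type, brand (null if not visible), primary_color, material."
)

def extract_attributes(image_path: str) -> dict:
    """Send one mixed image-text prompt to Gemini and parse the JSON reply."""
    image = Image.open(image_path)
    response = model.generate_content(
        [PROMPT, image],
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)

if __name__ == "__main__":
    print(extract_attributes("product.jpg"))
```

In a Product Recognition setting, the same call could return identifying attributes (brand, product name, visible text) that are then passed to the retailer's search index as a query, with simple glue logic connecting the two steps.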
Tianli Yu, Daniel Vlasic, "Automating Product Image Analysis for Retail with Gemini," in Electronic Imaging, 2025, pp. 262-1 – 262-7. https://doi.org/10.2352/EI.2025.37.8.IMAGE-262