Enhancing Multimodal Product Retrieval in E-Commerce by Reversing Typographic Attacks

Published: 23 Sept 2025, Last Modified: 17 Nov 2025 · UniReps 2025 · CC BY 4.0
Track: Proceedings Track
Keywords: Multimodal Learning, Vision-Text Compression, Product Retrieval, CLIP
Abstract: Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by overlaying relevant textual content (e.g., titles, descriptions) directly onto product images to perform additional vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using five state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.
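The core idea in the abstract, overlaying product metadata on the image before encoding, can be illustrated with a minimal sketch. This is not the authors' implementation; the model checkpoint, banner layout, and font are illustrative assumptions, shown here with Pillow and the Hugging Face transformers CLIP API.

```python
# Minimal sketch (assumptions, not the paper's code): render product metadata
# onto the image, then encode the augmented image with a CLIP-style model.
import torch
from PIL import Image, ImageDraw, ImageFont
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # illustrative choice of backbone
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def overlay_metadata(image: Image.Image, text: str) -> Image.Image:
    """Draw the product title/description onto a white banner below the image."""
    banner_h = 40
    canvas = Image.new("RGB", (image.width, image.height + banner_h), "white")
    canvas.paste(image, (0, 0))
    draw = ImageDraw.Draw(canvas)
    draw.text((5, image.height + 5), text, fill="black",
              font=ImageFont.load_default())
    return canvas

@torch.no_grad()
def embed_product(image: Image.Image, title: str) -> torch.Tensor:
    """Encode the text-augmented image into a normalized CLIP image embedding."""
    augmented = overlay_metadata(image, title)
    inputs = processor(images=augmented, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Retrieval would then rank catalog items by cosine similarity between a query
# embedding (text or image) and these text-augmented image embeddings.
```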
Submission Number: 115