SCVO: Addressing Sparse But Critical Variable Overwhelm In VLMs For Advertising Image Preference Prediction Across Multi-Country Markets
Keywords: Vision-Language Models (VLMs), Sparse Critical Variable Overwhelm (SCVO), Multinational Advertising Image, Preference Prediction
TL;DR: This paper proposes NationalJudge, a framework that addresses Sparse Critical Variable Overwhelm (SCVO) in Vision-Language Models to improve multinational advertising image preference prediction.
Abstract: Vision-language models (VLMs) have demonstrated remarkable capabilities in multimodal tasks, yet their sensitivity to variables that are sparse but critical, and therefore easily overwhelmed, remains unexplored. Advertising image preference prediction across multi-country markets is a representative case. Specifically, when VLMs (e.g., QwenVL) are tasked with judging between two images (A and B) of the same product across diverse markets (e.g., Korea, France), the model's predictions often collapse to a single output (e.g., always "A") even though ground-truth preferences vary by country. We attribute this failure to Sparse Critical Variable Overwhelm (SCVO): the model is overwhelmed by dominant high-volume variables (e.g., product attributes and image patches consuming hundreds of tokens), while the critical low-volume variables (e.g., country names consuming only a few tokens) are statistically drowned out. To study this, we first collect a dataset of real-world advertising image click-through preferences across multi-country markets. We then present a novel training framework that strategically mitigates SCVO and train it on this dataset, yielding CountryReward, a judge model for advertising image preference prediction across multi-country markets. Our framework involves three tailored modules: (1) a cross-country retrieval-augmented generation module that injects historical click-through preferences aligned with the target market into model training, enhancing localized relevance prediction; (2) a country adapter module that dynamically modulates image representations based on textual country embeddings, enabling precise visual preference adaptation for diverse markets; and (3) a focus-driven penalty loss function that penalizes mispredictions related to the overlooked variable more heavily.
Finally, we apply CountryReward as the reward model to fine-tune VLMs through reinforcement learning (RL). The fine-tuned VLMs output background designs that are fed to a text-to-image model (e.g., SDXL) to generate effective e-commerce images for a target country. Experiments on the proposed dataset show that our approach significantly mitigates the SCVO effect and improves preference prediction accuracy. This work highlights the need for robust handling of sparse critical variables in VLMs and offers a scalable solution for real-world applications where subtle contextual shifts drive decision-making.
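To make the idea behind the focus-driven penalty loss concrete, the following is a minimal sketch and not the formulation used in the paper, which the abstract does not specify. It assumes a pairwise binary cross-entropy objective and treats an image pair as "country-sensitive" when its ground-truth label differs across countries; errors on those pairs are up-weighted. The function name `focus_penalty_loss`, the field names, and the `penalty` parameter are all hypothetical.

```python
import math
from collections import defaultdict


def focus_penalty_loss(samples, penalty=2.0, eps=1e-12):
    """Weighted binary cross-entropy over A-vs-B preference predictions.

    Each sample is a dict with (hypothetical) fields:
      pair_id : identifier of the (image A, image B) pair
      country : target market, e.g. "KR" (kept for readability, not used below)
      p_a     : predicted probability that image A is preferred
      label   : 1 if A is preferred in that country, else 0

    Assumption: a pair is "country-sensitive" if its label is not the same
    in every country present in `samples`. Those samples get weight
    `penalty`; all others get weight 1. The result is a weighted mean.
    """
    # Collect the set of labels observed for each pair across countries.
    labels_by_pair = defaultdict(set)
    for s in samples:
        labels_by_pair[s["pair_id"]].add(s["label"])

    total, weight_sum = 0.0, 0.0
    for s in samples:
        p = min(max(s["p_a"], eps), 1.0 - eps)  # clamp to avoid log(0)
        bce = -(s["label"] * math.log(p) + (1 - s["label"]) * math.log(1.0 - p))
        country_sensitive = len(labels_by_pair[s["pair_id"]]) > 1
        w = penalty if country_sensitive else 1.0
        total += w * bce
        weight_sum += w
    return total / weight_sum


# A "collapsed" judge that answers A with probability 0.9 for every input,
# regardless of country. Pair 1 flips between markets; pair 2 does not.
collapsed = [
    {"pair_id": 1, "country": "KR", "p_a": 0.9, "label": 1},
    {"pair_id": 1, "country": "FR", "p_a": 0.9, "label": 0},
    {"pair_id": 2, "country": "KR", "p_a": 0.9, "label": 1},
    {"pair_id": 2, "country": "FR", "p_a": 0.9, "label": 1},
]

plain = focus_penalty_loss(collapsed, penalty=1.0)     # ordinary mean BCE
focused = focus_penalty_loss(collapsed, penalty=2.0)   # country-sensitive pairs up-weighted
print(f"plain={plain:.4f} focused={focused:.4f}")
```

In this toy setting the collapsed judge's only error falls on the pair whose label depends on the country, so up-weighting that pair raises its loss relative to plain cross-entropy. That is the intended pressure against ignoring the country token.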
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16953