SCVO: Addressing Sparse But Critical Variable Overwhelm In VLMs For Advertising Image Preference Prediction Across Multi-Country Markets
Keywords: Vision-Language Models (VLMs), Sparse Critical Variable Overwhelm (SCVO), Multinational Advertising Image, Preference Prediction
TL;DR: This paper proposes NationalJudge, a framework that addresses Sparse Critical Variable Overwhelm (SCVO) in Vision-Language Models to improve multinational advertising image preference prediction.
Abstract: Vision-language models (VLMs) have demonstrated remarkable capabilities in multimodal tasks, yet their sensitivity to sparse but critical variables, which are easily overwhelmed by other inputs, remains unexplored. Image preference prediction across multi-country markets is a representative case. Here a VLM (e.g., Qwen-VL) must choose between two images (A and B) of the same product for different markets (e.g., Korea, France), and its predictions often collapse to a single output (e.g., always "A") even though ground-truth preferences vary by country. We attribute this failure to Sparse Critical Variable Overwhelm (SCVO): the model is overwhelmed by dominant high-volume variables (e.g., product attributes and image patches consuming hundreds of tokens), while the critical low-volume variables (e.g., country names consuming only a few tokens) are statistically drowned out. To study this, we first collect a dataset of real-world advertising image click-through preferences across multi-country markets, and then present a novel training framework that mitigates SCVO. Training on the dataset with this framework yields CountryReward, a reward model for advertising image preference prediction across multi-country markets. Our framework comprises three tailored modules: (1) a cross-country retrieval-augmented generation module that injects click-through preferences aligned with target markets into the training process, enhancing localized relevance prediction; (2) a country adapter module that dynamically modulates image representations based on textual country embeddings, enabling precise visual preference adaptation for diverse markets; and (3) a focus-driven penalty loss function that penalizes mispredictions related to the overlooked variable more heavily.
Finally, we use CountryReward as the reward model to fine-tune VLMs with reinforcement learning (RL), enabling them to output background designs that are fed to a text-to-image model (e.g., SDXL) to generate effective e-commerce images for a target country. Experiments on the proposed dataset show that our approach significantly mitigates the SCVO effect and improves preference prediction accuracy. This work highlights the need for robust handling of sparse critical variables in VLMs and offers a scalable solution for real-world applications where subtle contextual shifts drive decision-making.
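The abstract does not give the form of the focus-driven penalty loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes a binary cross-entropy over A-versus-B predictions, a hypothetical per-sample flag `country_sensitive` marking pairs whose preferred image differs between countries, and a hypothetical `penalty` multiplier applied to those samples.

```python
import math


def focus_weighted_bce(p_a, label_a, country_sensitive, penalty=2.0):
    """Weighted binary cross-entropy for A-vs-B preference prediction.

    p_a: predicted probabilities that image A is preferred.
    label_a: 1 if A is the ground-truth preference, else 0.
    country_sensitive: hypothetical flags, True when the preferred image
        for this pair changes with the country (assumption, not from the paper).
    penalty: hypothetical weight applied to country-sensitive samples.
    """
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for p, y, sensitive in zip(p_a, label_a, country_sensitive):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        w = penalty if sensitive else 1.0
        total += w * bce
        weight_sum += w
    return total / weight_sum  # weighted mean over the batch


# A collapsed predictor that always favours "A" with probability 0.9.
probs = [0.9, 0.9, 0.9]
labels = [1, 0, 1]              # e.g. Korea prefers A, France prefers B
flags = [True, True, False]     # the first two share a country-dependent pair

plain = focus_weighted_bce(probs, labels, flags, penalty=1.0)
focused = focus_weighted_bce(probs, labels, flags, penalty=2.0)
print(plain < focused)
```

Under these assumptions, a predictor that ignores the country and always answers "A" incurs a larger loss once country-sensitive pairs are up-weighted, which is the behaviour the abstract describes for mispredictions tied to the overlooked variable.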
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16953