Personalized Vision via Visual In-Context Learning

09 Apr 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 · CC BY 4.0
Keywords: personalization, visual in-context learning, diffusion models
Abstract: Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks such as segmentation but struggle to adapt flexibly to personalized vision tasks—tasks defined at test-time by users with customized objects or novel objectives. Existing personalization approaches typically rely on synthesizing additional training data or fine-tuning the entire model, limiting flexibility and incurring significant computational cost. Inspired by recent advances in natural language processing, we explore a new direction: leveraging visual generative models for personalized vision via in-context learning. We introduce a structured four-panel input format, where a single annotated example specifies the personalized visual task, allowing the model to interpret and generalize the task to new inputs without further fine-tuning. To enable this one-shot capability, we construct a Visual-Relation tuning dataset tailored to personalized vision in-context learning. Extensive experiments demonstrate that our approach (i) surpasses fine-tuning and synthetic-data baselines on personalized segmentation, (ii) enables test-time definition of novel personalized tasks, and (iii) generalizes across both visual recognition and generation settings. Our work establishes a new paradigm for personalized vision, combining the adaptability of in-context learning with the visual reasoning capabilities of generative models.
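The four-panel prompt described in the abstract can be pictured as a simple 2×2 grid composition: an annotated example pair on one row and the query plus a blank output panel on the other, which the generative model fills in. The sketch below is a hypothetical illustration only; the panel order, panel size, and blank-panel convention are assumptions, not the paper's exact format.

```python
# Minimal sketch of a four-panel in-context prompt (assumed layout, not the authors' code).
from PIL import Image

def make_four_panel(example_img, example_mask, query_img, panel_size=(256, 256)):
    """Compose a 2x2 prompt: the annotated example on the top row,
    the query image and a blank target panel on the bottom row."""
    w, h = panel_size
    canvas = Image.new("RGB", (2 * w, 2 * h), color=(255, 255, 255))
    canvas.paste(example_img.resize(panel_size), (0, 0))   # top-left: example image
    canvas.paste(example_mask.resize(panel_size), (w, 0))  # top-right: example annotation
    canvas.paste(query_img.resize(panel_size), (0, h))     # bottom-left: query image
    # Bottom-right panel stays blank; the generative model is expected to
    # in-paint the task output (e.g., a personalized segmentation mask) there.
    return canvas
```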
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 1441