Keywords: personalization, visual in-context learning, diffusion models
Abstract: Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks such as segmentation but struggle to adapt flexibly to personalized vision tasks: tasks defined at test time by users with customized objects or novel objectives.
Existing personalization approaches typically rely on synthesizing additional training data or fine-tuning the entire model, limiting flexibility and incurring significant computational cost.
Inspired by recent advances in natural language processing, we explore a new direction: leveraging visual generative models for personalized vision via in-context learning.
We introduce a structured four-panel input format, where a single annotated example specifies the personalized visual task, allowing the model to interpret and generalize the task to new inputs without further fine-tuning.
To enable this one-shot capability, we construct a Visual-Relation tuning dataset tailored to personalized vision in-context learning.
Extensive experiments demonstrate that our approach (i) surpasses fine-tuning and synthetic-data baselines on personalized segmentation, (ii) enables test-time definition of novel personalized tasks, and (iii) generalizes across both visual recognition and generation settings.
Our work establishes a new paradigm for personalized vision, combining the adaptability of in-context learning with the visual reasoning capabilities of generative models.
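To make the four-panel input format concrete, below is a minimal sketch of how such a prompt could be assembled. This is not the authors' code; the 2x2 layout (example image and its annotation on the top row, query image and a blank completion region on the bottom row), the per-panel resolution, and the function name are all assumptions for illustration.

```python
# Minimal sketch (not the paper's implementation): composing a four-panel
# visual in-context prompt. Assumed layout -- top row: (example image,
# example annotation); bottom row: (query image, blank panel to be filled
# in by the generative model).
from PIL import Image

PANEL = 256  # assumed per-panel resolution


def make_four_panel_prompt(example_img: Image.Image,
                           example_ann: Image.Image,
                           query_img: Image.Image) -> Image.Image:
    """Arrange one annotated example and a query image on a 2x2 canvas.

    The bottom-right quadrant is left blank; a diffusion model conditioned
    on the other three panels would be expected to complete it with the
    query's annotation (e.g., a personalized segmentation mask).
    """
    canvas = Image.new("RGB", (2 * PANEL, 2 * PANEL), color="white")
    canvas.paste(example_img.resize((PANEL, PANEL)), (0, 0))       # top-left
    canvas.paste(example_ann.resize((PANEL, PANEL)), (PANEL, 0))   # top-right
    canvas.paste(query_img.resize((PANEL, PANEL)), (0, PANEL))     # bottom-left
    # Bottom-right stays blank: this is the region the model generates.
    return canvas
```

In this hypothetical arrangement, swapping in a different annotated example at test time redefines the task without any fine-tuning, which is the one-shot behavior the abstract describes.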
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 1441