Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven LVMs

14 Sept 2025 (modified: 06 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Visual In-Context Learning; Vision-Language Models; Task Transfer; Low-Level; Low-Cost Reasoning
Abstract: In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on a few demonstrations provided in the input context, without any parameter updates. Recent advances in the visual domain, i.e., visual in-context learning (VICL), demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). However, the boundaries of cross-task transfer in VICL remain largely unexplored, particularly given the heterogeneity across low-level vision tasks. This naturally raises the question: \textit{When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL?} In this paper, we propose a fully collaborative pipeline, T2T-VICL, for VLMs to investigate the potential of cross-task VICL. At its core, we design a mechanism to generate and select text prompts that best implicitly capture the differences between two distinct low-level vision tasks, and we construct the first cross-task VICL dataset. Building upon this, we present a training strategy that transfers knowledge from a large VLM to a small vision-language model (sVLM), together with a deployment framework from the sVLM back to the large VLM. Furthermore, we propose a novel inference framework that combines perceptual score-based reasoning with standard evaluation metrics to perform cross-task VICL. Our approach achieves stable results across multiple low-level cross-task pairs. During inference, T2T-VICL demonstrates promising performance without requiring any image-based training or model fine-tuning. Our findings highlight the feasibility of enabling cross-task VICL within VLMs, underscoring its utility as a supplementary, generalizable paradigm for low-cost vision-language reasoning.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4967