ConText: Driving In-context Learning for Text Removal and Segmentation

Fei Zhang; Pei Zhang; Baosong Yang; Fei Huang; Yanfeng Wang; Ya Zhang

Published: 01 May 2025, Last Modified: 23 Jul 2025

Venue: ICML 2025 poster

License: CC BY 4.0

TL;DR: This paper is the first to explore visual in-context learning for OCR tasks, specifically text removal and segmentation.

Abstract: This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they use a straightforward image-label compositor as the prompt and query input, then mask the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, which integrates the chained prompt pattern into the latent query representation, thereby strengthening the model's in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. We address this issue with a simple self-prompting strategy, which prevents the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art results across both in- and out-of-domain benchmarks. The code is available at https://github.com/Ferenas/ConText.
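To make the difference between the two prompt formats concrete, here is a minimal sketch of an image-label compositor versus an image-removal-segmentation compositor, together with the mask over the query's label panels. It is not the authors' implementation: the spatial layout (prompt row stacked above query row, panels side by side), the toy panel size, and the names `compose_canvas`, `query_label_mask`, and `toy_panel` are all assumptions made for illustration.

```python
import numpy as np

H, W = 4, 4  # toy panel size, chosen only for illustration


def toy_panel(value):
    """Hypothetical stand-in for an H x W RGB panel (image, removal, or segmentation)."""
    return np.full((H, W, 3), value, dtype=np.float32)


def compose_canvas(prompt_panels, query_panels):
    """Place the prompt panels in the top row and the query panels in the bottom row.

    Assumed layout: panels within a row are concatenated left to right.
    """
    prompt_row = np.concatenate(prompt_panels, axis=1)
    query_row = np.concatenate(query_panels, axis=1)
    return np.concatenate([prompt_row, query_row], axis=0)


def query_label_mask(num_panels, height=H, width=W):
    """Boolean mask that is True over the query's label panels.

    The first panel of the query row (the query image) stays visible;
    every later panel in that row is masked and must be reconstructed.
    """
    mask = np.zeros((2 * height, num_panels * width), dtype=bool)
    mask[height:, width:] = True
    return mask


# Direct image-label compositor: one masked panel, single-step reasoning.
direct_canvas = compose_canvas(
    [toy_panel(0.1), toy_panel(0.9)],   # prompt: image, label
    [toy_panel(0.2), toy_panel(0.0)],   # query: image, placeholder label
)
direct_mask = query_label_mask(num_panels=2)

# Task-chaining compositor: image, removal, segmentation.
chained_canvas = compose_canvas(
    [toy_panel(0.1), toy_panel(0.5), toy_panel(0.9)],  # prompt chain
    [toy_panel(0.2), toy_panel(0.0), toy_panel(0.0)],  # query with placeholders
)
chained_mask = query_label_mask(num_panels=3)
```

In this sketch the chained canvas exposes an intermediate removal panel in the prompt and asks for two masked outputs in the query row, which is one way to read the "enriched intermediates" described above.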

Lay Summary: This paper introduces the first application of visual in-context learning to optical character recognition problems, focusing on text removal and segmentation. The authors propose a step-by-step prompting method (image-removal-segmentation) that helps models reason more effectively through task-in-chain examples. Their model, ConText, achieves state-of-the-art results across multiple benchmarks and shows a surprising ability to interact with user instructions.

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Link To Code: https://github.com/Ferenas/ConText

Primary Area: Applications->Computer Vision

Keywords: In-context learning, text removal, text segmentation

Submission Number: 1480
