Abstract: Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed "In-Context Learning" (ICL). Nevertheless, applying LLMs as text recognizers is unacceptably resource-consuming. Moreover, our pilot experiments on LLMs show that ICL fails in STR, mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end, we introduce E2STR, an STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy. E2STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E2STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks. The code is released at https://github.com/bytedance/E2STR.
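To make the training-free adaptation idea concrete, the sketch below illustrates the generic ICL-for-STR recipe the abstract alludes to: at inference time, a few labeled demonstrations are retrieved from a small in-context pool by embedding similarity and presented to the model alongside the query, with no parameter updates. This is a minimal illustration under our own assumptions (the `retrieve_demonstrations` helper and the embedding source are hypothetical placeholders), not the released E2STR pipeline.

```python
import numpy as np

def retrieve_demonstrations(query_emb, pool_embs, pool_labels, k=2):
    """Return the indices and labels of the k pool samples whose
    embeddings are most cosine-similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q                   # cosine similarity to every pool sample
    top = np.argsort(-sims)[:k]    # indices of the k nearest neighbors
    return [(int(i), pool_labels[i]) for i in top]

# Toy usage: in practice the embeddings would come from the recognizer's
# vision encoder, and the pool would hold a handful of labeled samples
# from the target scenario.
pool_embs = np.random.randn(10, 256)
pool_labels = [f"word{i}" for i in range(10)]
query_emb = np.random.randn(256)
demos = retrieve_demonstrations(query_emb, pool_embs, pool_labels, k=2)
# The retrieved (image, label) demonstrations and the query image would then
# be fed to the model together as one context sequence, with no weight update.
```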