Explore the Textual Perception Ability on the Images for Multimodal Large Language Models

Jiayi Kuang, Jiarui Ouyang, Ying Shen

Published: 01 Jan 2024, Last Modified: 19 May 2025. NLPCC (5) 2024. License: CC BY-SA 4.0
Abstract: Chinese Spell Checking, also known as Chinese Spelling Correction (CSC), has significantly enhanced users' text quality in writing assistants. Multimodal CSC extends this task into the multimodal domain and introduces a new research challenge: understanding textual information at the image level. The traditional two-stage strategy, which directly maps images to text before performing checking and correction, often suffers from issues such as the many-to-one mapping of fake characters and the misrecognition of fake characters as correct ones. To address these issues, we propose a unified one-stage framework based on emerging Multimodal Large Language Models (MLLMs) that achieves a discrete semantic understanding of text directly at the image level, thus alleviating the image-to-text mapping problem. Additionally, we introduce an adaptation strategy for MLLMs in the multimodal CSC task, enhancing their few-shot learning capability through in-context learning with prompt design. We evaluate different MLLM base models and verify the effectiveness of this one-stage framework, demonstrating performance that matches or even surpasses fine-tuned baseline models without any fine-tuning. Furthermore, we conduct a series of analyses that provide insights into visual perception and text correction.
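The abstract's mention of "in-context learning with prompt design" can be illustrated by a minimal sketch of few-shot prompt assembly for multimodal CSC. The prompt wording, example format, and the way images are referenced are assumptions for illustration only; the paper's actual prompts and MLLM interface are not specified here.

```python
# Hypothetical sketch: assemble a few-shot in-context prompt for multimodal
# CSC. Each in-context example pairs an image of a (possibly corrupted)
# Chinese sentence with its corrected text; the query image is appended last.
# All names and formats below are illustrative assumptions, not the authors'
# actual prompt design.

def build_csc_prompt(few_shot_examples, query_image):
    """Build a text prompt with few-shot examples for an MLLM.

    few_shot_examples: list of (image_reference, corrected_sentence) pairs.
    query_image: image reference for the sentence to be corrected.
    """
    instruction = (
        "You are given an image of a Chinese sentence that may contain "
        "misspelled or fake characters. Output the corrected sentence."
    )
    parts = [instruction]
    for image_ref, corrected in few_shot_examples:
        parts.append(f"Image: <{image_ref}>")
        parts.append(f"Corrected: {corrected}")
    # The query ends with an open "Corrected:" slot for the model to fill.
    parts.append(f"Image: <{query_image}>")
    parts.append("Corrected:")
    return "\n".join(parts)

if __name__ == "__main__":
    demo = [
        ("example_1.png", "我今天很高兴。"),
        ("example_2.png", "他正在学习中文。"),
    ]
    print(build_csc_prompt(demo, "query.png"))
```

In a real pipeline, the image references would be replaced by the MLLM's native image tokens or attachments; the few-shot pairs give the model the task format without any parameter updates, which is the few-shot setting the abstract evaluates.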