Rethinking OCR-based Long-Context Modeling via Hybrid Visual–Textual Inputs

08 Nov 2025 (modified: 11 Nov 2025) · THU 2025 Fall AML Submission · CC BY 4.0
Keywords: Large Language Models, Long-context Processing, OCR
Abstract: Effectively extending the context length of Large Language Models (LLMs) remains a fundamental bottleneck in advancing their capabilities. Traditional token-based approaches typically rely on lightweight long-context continued pretraining to expand the supported context window, while recent OCR-based methods render long documents into images and feed their compressed visual embeddings into the LLM backbone. However, neither paradigm achieves a satisfactory trade-off between length generalizability and fine-grained detail preservation: token-based methods suffer from out-of-distribution issues on inputs longer than those seen during training, whereas OCR-based methods sacrifice fine-grained information because token-level inputs are removed entirely. To address these limitations, we propose a hybrid framework that feeds the LLM both visual embeddings of the full long-context input and token embeddings of the text near the query. This design aims to strike a better balance between generalizability to longer sequences and retention of fine-grained details.
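
To make the hybrid input design concrete, the sketch below shows one plausible way to assemble such a sequence: compressed visual embeddings of the rendered full document are projected into the LLM embedding space and prepended to ordinary token embeddings for the text window surrounding the query. All names, dimensions, and the compression scheme here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the hybrid visual-textual input construction.
# Module names, dimensions, and the vision encoder are hypothetical.
import torch
import torch.nn as nn


class HybridLongContextInput(nn.Module):
    """Builds one embedding sequence from (a) compressed visual embeddings
    of the rendered full document and (b) token embeddings for the text
    near the query."""

    def __init__(self, d_model: int = 1024, vision_dim: int = 768,
                 vocab_size: int = 32000):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Projects patch features from a (hypothetical) vision encoder of
        # the rendered pages into the LLM embedding space.
        self.vision_proj = nn.Linear(vision_dim, d_model)

    def forward(self, page_features: torch.Tensor,
                local_token_ids: torch.Tensor,
                query_token_ids: torch.Tensor) -> torch.Tensor:
        # page_features: (num_patches, vision_dim), a compressed visual
        # representation of the *entire* long document.
        visual = self.vision_proj(page_features)
        # Fine-grained token embeddings only for the window near the query,
        # so the token budget stays bounded regardless of document length.
        local = self.token_embed(local_token_ids)
        query = self.token_embed(query_token_ids)
        # One hybrid sequence fed to the LLM backbone:
        # [global visual context][local text window][query]
        return torch.cat([visual, local, query], dim=0)


if __name__ == "__main__":
    builder = HybridLongContextInput()
    pages = torch.randn(256, 768)            # compressed patches, all pages
    local = torch.randint(0, 32000, (512,))  # tokens surrounding the query
    query = torch.randint(0, 32000, (32,))   # the query itself
    print(builder(pages, local, query).shape)  # -> torch.Size([800, 1024])
```

Under these assumptions, the visual portion grows only with the number of compressed patches rather than with raw token count, while the local token window preserves exact wording where it matters most, near the query.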
Submission Number: 10