INDEX-PRESERVING LIGHTWEIGHT TOKEN PRUNING FOR EFFICIENT DOCUMENT UNDERSTANDING IN VISION-LANGUAGE MODELS
Track: tiny paper (up to 4 pages)
Keywords: Token pruning, Document understanding, Vision-language models
TL;DR: We introduce an index-preserving, lightweight token pruning method that removes background patches before visual encoding, cutting document-understanding VLM FLOPs by up to ~60% while maintaining near-original accuracy.
Abstract: Recent progress in vision-language models (VLMs) has led to strong accuracy on document understanding tasks such as parsing and key information extraction, but processing high-resolution document images remains computationally expensive. We propose a lightweight pre-encoder token pruning framework that removes non-informative background patches using a binary text-region classifier with a max-pooling refinement step. The framework preserves token indices to maintain the spatial correspondence required for layout-sensitive recognition. Experiments on real-world document benchmarks show a 40–60% FLOPs reduction while maintaining comparable accuracy.
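The pipeline in the abstract — classify patches as text/background, refine the binary mask with max-pooling, then drop background patches while keeping each survivor's original token index — might be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the function name, NumPy formulation, and parameters (`threshold`, `pool`) are all assumptions.

```python
import numpy as np

def prune_background_patches(patches, text_scores, grid_hw,
                             threshold=0.5, pool=3):
    """Index-preserving background-patch pruning (illustrative sketch).

    patches:     (N, D) patch embeddings, row-major on an H x W grid.
    text_scores: (N,) probabilities from a binary text-region classifier.
    Returns the kept embeddings and their ORIGINAL token indices, so
    spatial correspondence for layout-sensitive recognition is preserved.
    """
    H, W = grid_hw
    mask = (text_scores >= threshold).reshape(H, W)

    # Max-pooling refinement: dilate the text mask so patches adjacent
    # to detected text regions are also retained (local layout context).
    pad = pool // 2
    padded = np.pad(mask, pad, mode="constant")
    refined = np.zeros_like(mask)
    for i in range(H):
        for j in range(W):
            refined[i, j] = padded[i:i + pool, j:j + pool].max()

    keep_idx = np.flatnonzero(refined.reshape(-1))  # original indices
    return patches[keep_idx], keep_idx
```

Because pruning happens before the visual encoder, the encoder's FLOPs scale with the number of kept patches rather than the full grid; the retained `keep_idx` values can be fed to the model's positional embedding so pruned sequences stay spatially aligned.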
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 21