INDEX-PRESERVING LIGHTWEIGHT TOKEN PRUNING FOR EFFICIENT DOCUMENT UNDERSTANDING IN VISION-LANGUAGE MODELS
Track: tiny paper (up to 4 pages)
Keywords: Token pruning, Document understanding, Vision-language models
TL;DR: We introduce an index-preserving, lightweight token pruning method that removes background patches before visual encoding, cutting document-understanding VLM FLOPs by up to ~60% while maintaining near-original accuracy.
Abstract: Recent progress in vision-language models (VLMs) has led to strong accuracy on document understanding tasks such as parsing and key information extraction, but processing high-resolution document images remains computationally expensive. We propose a lightweight pre-encoder token pruning framework that removes non-informative background patches using a binary text-region classifier with a max-pooling refinement step. The framework preserves token indices to maintain the spatial correspondence required for layout-sensitive recognition. Experiments on real-world document benchmarks show a 40–60% FLOPs reduction while maintaining comparable accuracy.
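The pipeline in the abstract — classify patches as text/background, refine the binary mask with max-pooling, then drop background patches while keeping each survivor's original token index — might be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the function name, NumPy formulation, and parameters (`threshold`, `pool`) are all assumptions.

```python
import numpy as np

def prune_background_patches(patches, text_scores, grid_hw,
                             threshold=0.5, pool=3):
    """Index-preserving background-patch pruning (illustrative sketch).

    patches:     (N, D) patch embeddings, row-major on an H x W grid.
    text_scores: (N,) probabilities from a binary text-region classifier.
    Returns the kept embeddings and their ORIGINAL token indices, so
    spatial correspondence for layout-sensitive recognition is preserved.
    """
    H, W = grid_hw
    mask = (text_scores >= threshold).reshape(H, W)

    # Max-pooling refinement: dilate the text mask so patches adjacent
    # to detected text regions are also retained (local layout context).
    pad = pool // 2
    padded = np.pad(mask, pad, mode="constant")
    refined = np.zeros_like(mask)
    for i in range(H):
        for j in range(W):
            refined[i, j] = padded[i:i + pool, j:j + pool].max()

    keep_idx = np.flatnonzero(refined.reshape(-1))  # original indices
    return patches[keep_idx], keep_idx
```

Because pruning happens before the visual encoder, the encoder's FLOPs scale with the number of kept patches rather than the full grid; the retained `keep_idx` values can be fed to the model's positional embedding so pruned sequences stay spatially aligned.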
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 21