SlimDoc: lightweight distillation of document transformer models

Published: 2025, Last Modified: 14 Nov 2025, Int. J. Document Anal. Recognit. 2025, CC BY-SA 4.0
Abstract: Deploying state-of-the-art document understanding models remains resource-intensive and impractical in many real-world scenarios, particularly where labeled data is scarce and computational budgets are constrained. To address these challenges, this work proposes a novel approach toward parameter-efficient document understanding models capable of adapting to specific tasks and document types without the need for labeled data. Specifically, we propose an approach, coined SlimDoc, to distill multimodal document transformer encoder models into smaller student models, using internal signals at different training stages, followed by external signals. Our approach is inspired by TinyBERT and adapted to the domain of document understanding transformers. We show that SlimDoc outperforms both single-stage distillation and direct fine-tuning of the student. Experimental results across six document understanding datasets demonstrate our approach’s effectiveness: our distilled student models achieve on average \(93.0\%\) of the teacher’s performance, while the fine-tuned students achieve \(87.0\%\) of the teacher’s performance. Without requiring any labeled data, we create a compact student that achieves \(96.0\%\) of the performance of its supervised-distilled counterpart and \(86.2\%\) of the performance of a supervised fine-tuned teacher model. We further show that our distillation approach captures document geometry and is effective on the two popular document understanding models LiLT and LayoutLMv3. Our implementation and training data are available at https://github.com/marcel-lamott/SlimDoc.
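To make the staged distillation idea concrete, the following is a minimal PyTorch sketch of TinyBERT-style two-stage distillation as described in the abstract: a first stage matching internal signals (teacher hidden states mapped onto fewer student layers) and a second stage matching external signals (teacher output logits). All function and variable names are illustrative assumptions, not the SlimDoc implementation.

```python
import torch
import torch.nn.functional as F

def internal_distillation_loss(student_hidden, teacher_hidden, layer_map):
    """Stage 1 (internal signals): align selected teacher hidden states
    with the student's layers.

    student_hidden / teacher_hidden: lists of [batch, seq, dim] tensors.
    layer_map: dict mapping a student layer index to a teacher layer index.
    Assumes matching hidden sizes; otherwise a learned projection is needed.
    """
    loss = torch.zeros((), device=student_hidden[0].device)
    for s_idx, t_idx in layer_map.items():
        loss = loss + F.mse_loss(student_hidden[s_idx], teacher_hidden[t_idx])
    return loss

def external_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Stage 2 (external signals): match the teacher's softened output
    distribution with a temperature-scaled KL divergence."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In such a setup, the student would first be trained with `internal_distillation_loss` on unlabeled documents, then with `external_distillation_loss`; the exact losses, layer mapping, and schedule used by SlimDoc are described in the paper itself.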