GSDistill: A Unified Paradigm for Geometry- and Semantics-Aware Document Pretraining

18 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multi-Positive Contrastive Learning, Self-Distillation, Hybrid Regularization, Vision–Encoder Pretraining, Document Layout Understanding, Zero-Shot Retrieval
TL;DR: We propose GSDistill, a pretraining framework that combines multi-positive contrastive learning with self-distillation and hybrid regularization to produce geometry- and semantics-aware document representations.
Abstract: We introduce a unified pretraining paradigm for document understanding, grounded in a probability-theoretic formulation of multi-positive alignment and hierarchical self-distillation, which operate as complementary principles under a single objective. Unlike prior modular approaches, our framework redefines document pretraining as multi-positive, layout- and semantics-aware stochastic alignment rather than a collection of heuristic recipes. The model employs two complementary alignment heads: a semantic head, aligning page-level embeddings with OCR-derived text spans, and a geometric head, aligning representations with compact "box-text" descriptors that capture class type and structural layout. Both heads are trained with a multi-positive InfoNCE objective that supports one-to-many correspondences, alleviating the text-body bias of single-positive CLIP-style training and delivering markedly improved zero-shot document retrieval accuracy. To further strengthen representation quality, we incorporate a teacher-student self-distillation module with local-global hybrid regularization, enforcing patch-level consistency, global invariance, and embedding diversity. The resulting backbone produces layout-aware, language-grounded document representations that not only accelerate convergence and achieve results competitive with the state of the art on layout detection benchmarks but also produce structured, consistent page-level embeddings that are naturally compatible with large language models, opening a path to advanced document reasoning and question-answering (QA).
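The paper's loss is not given in this listing; as a rough illustration of the kind of multi-positive InfoNCE objective with one-to-many page-to-span correspondences that the abstract describes, the PyTorch sketch below averages the contrastive log-likelihood over each page's full positive set rather than a single positive. The function name, the mask convention, and the temperature value are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def multi_positive_info_nce(page_emb, span_emb, positive_mask, temperature=0.07):
    """
    page_emb:      (N, D) page-level embeddings
    span_emb:      (M, D) embeddings of OCR text spans or box-text descriptors
    positive_mask: (N, M) boolean, True where span j is a positive for page i
                   (a single page may have many positive spans)
    """
    page_emb = F.normalize(page_emb, dim=-1)
    span_emb = F.normalize(span_emb, dim=-1)

    # Similarities between every page and every candidate span.
    logits = page_emb @ span_emb.t() / temperature            # (N, M)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average the log-likelihood over each page's positives so that
    # all correct spans contribute, unlike single-positive CLIP-style loss.
    pos_count = positive_mask.sum(dim=1).clamp(min=1)
    loss_per_page = -(log_prob * positive_mask).sum(dim=1) / pos_count
    return loss_per_page.mean()
```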
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11049