Low-Dimensional Document Structure Subspaces in Specialized vs. Emergent OCR Models: A Mechanistic Interpretability Study of Three Architectures

Guus Bouwens

Published: 11 Jun 2026, Last Modified: 20 Jun 2026 | Mech Interp Workshop ICML 2026 Virtual poster | CC BY 4.0

Keywords: Feature Geometry, Applications of interpretability, Methods (probing, steering, causal interventions)

Other Keywords: OCR models, vision-language models, subspace analysis, representation geometry, Document Structure Modularity, cross-model CKA

TL;DR: Through PCA subspace analysis and causal interventions, we uncover that specialized OCR models exhibit extreme representational modularity and compression compared to general-purpose VLMs, despite high cross-architecture alignment.

Abstract: Optical character recognition (OCR) models have become critical infrastructure for document intelligence, yet their internal representations remain mechanistically unexplored. We present the first mechanistic interpretability study comparing three architectures: GLM-OCR (0.9B), a purpose-built document recognition system; PaddleOCR-VL (1.5B), a second specialized OCR model; and Qwen3.5-2B, a general-purpose VLM with emergent OCR capability. Using PCA-based subspace analysis on 300 real RVL-CDIP document images per model, we find that document structure capabilities occupy partially disentangled, low-dimensional subspaces in all three models. PaddleOCR-VL exhibits the most concentrated representations (PC1 explains 84.0% of variance; effective rank 2.0 at its bottleneck), while Qwen3.5-2B is the most distributed (PC1 = 63.7%; effective rank 3.8). We introduce the Document Structure Modularity Index ($\modularity$) and find that both specialized models achieve higher modularity (GLM-OCR: 0.715; PaddleOCR-VL: 0.774) than the general-purpose baseline (0.704). Cross-model CKA reveals high representational alignment across all pairs ($>0.90$), with the specialized-to-emergent CKA marginally exceeding the specialized-to-specialized CKA, a finding with implications for representational universality in OCR.
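The abstract reports three kinds of quantities: the variance explained by PC1, an effective rank, and cross-model CKA. It does not say which effective-rank definition or CKA variant is used, so the following is only a minimal sketch under two assumptions: that effective rank is the entropy-based one (the exponential of the Shannon entropy of the normalized PCA variances), and that CKA is the linear variant. The function names and the `(n_samples, n_features)` activation layout are illustrative and are not taken from the paper.

```python
import numpy as np


def pc1_fraction_and_effective_rank(acts):
    """Return (PC1 variance fraction, entropy-based effective rank).

    acts: array of shape (n_samples, n_features), one row per document image.
    ASSUMPTION: effective rank = exp(Shannon entropy of the normalized
    PCA variances); the paper may define it differently.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Squared singular values of centered data are proportional to the
    # variances along the principal components (sorted descending).
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    p = variances / variances.sum()
    nonzero = p[p > 0]  # drop exact zeros so log() is defined
    entropy = -(nonzero * np.log(nonzero)).sum()
    return float(p[0]), float(np.exp(entropy))


def linear_cka(x, y):
    """Linear CKA between two activation matrices with matching rows.

    x: (n_samples, d1), y: (n_samples, d2); row i of both must come from
    the same input. ASSUMPTION: the linear variant, not a kernel variant.
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    # Synthetic stand-in for pooled hidden states; not data from the paper.
    rng = np.random.default_rng(0)
    model_a = rng.normal(size=(300, 64))
    rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
    model_b = 2.0 * model_a @ rotation  # same geometry, rotated and rescaled
    print(pc1_fraction_and_effective_rank(model_a))
    print(linear_cka(model_a, model_b))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is why it can compare models whose hidden dimensions and bases differ. Under the entropy-based definition, an effective rank near 2 means the variance behaves as if spread over roughly two equally weighted directions.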

Submission Number: 12
