Low-Dimensional Document Structure Subspaces in Specialized vs. Emergent OCR Models: A Mechanistic Interpretability Study of Three Architectures
Keywords: Feature Geometry, Applications of interpretability, Methods (probing, steering, causal interventions)
Other Keywords: OCR models, vision-language models, subspace analysis, representation geometry, Document Structure Modularity, cross-model CKA
TL;DR: Using PCA subspace analysis and causal interventions, we find that specialized OCR models exhibit extreme representational modularity and compression relative to general-purpose VLMs, despite high cross-architecture alignment.
Abstract: Optical character recognition (OCR) models have become critical infrastructure
for document intelligence, yet their internal representations remain
mechanistically unexplored. We present the first mechanistic interpretability
study comparing three architectures: GLM-OCR (0.9B), a purpose-built document
recognition system; PaddleOCR-VL (1.5B), a second specialized OCR model; and
Qwen3.5-2B, a general-purpose VLM with emergent OCR capability. Using
PCA-based subspace analysis on 300 real RVL-CDIP document images per model, we
find that document structure capabilities occupy partially disentangled,
low-dimensional subspaces in all three models. PaddleOCR-VL exhibits the most
concentrated representations (PC1 explains 84.0% of variance; effective
rank 2.0 at its bottleneck), while Qwen3.5-2B is the most distributed
(PC1 = 63.7%; effective rank 3.8). We introduce the Document
Structure Modularity Index ($\modularity$), and find that both specialized
models achieve higher modularity (GLM-OCR: 0.715; PaddleOCR-VL: 0.774) than
the general-purpose baseline (0.704). Cross-model CKA reveals high
representational alignment across all pairs ($>0.90$), with the
specialized-to-emergent CKA marginally exceeding the specialized-to-specialized
CKA---a finding with implications for representational universality in OCR.
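
The three quantities reported above (PC1 explained variance, effective rank, and cross-model CKA) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say which definition of effective rank or which CKA variant is used, so the code assumes the participation ratio for effective rank and linear CKA for alignment. The function names and the `(n_samples, n_features)` activation layout are likewise hypothetical.

```python
import numpy as np


def pc1_and_effective_rank(X):
    """Return (PC1 explained-variance ratio, effective rank) for activations X.

    X has shape (n_samples, n_features).
    ASSUMPTION: effective rank is the participation ratio
    (sum of eigenvalues)^2 / sum of squared eigenvalues; the paper may use
    a different definition (e.g. an entropy-based one).
    """
    Xc = X - X.mean(axis=0, keepdims=True)      # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
    p = s**2 / np.sum(s**2)                     # explained-variance ratios
    return float(p[0]), float(1.0 / np.sum(p**2))


def linear_cka(X, Y):
    """Linear CKA between two activation matrices over the same samples.

    X: (n_samples, d1), Y: (n_samples, d2). Rows must correspond to the
    same inputs (here, the same document images) in both models.
    ASSUMPTION: linear kernel; the paper may use a different kernel.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    norm_x = np.linalg.norm(Xc.T @ Xc, "fro")
    norm_y = np.linalg.norm(Yc.T @ Yc, "fro")
    return float(cross / (norm_x * norm_y))
```

Under these assumptions, each model's pooled hidden states for the same 300 images would be stacked into one matrix per layer, with `pc1_and_effective_rank` applied per model and `linear_cka` applied per model pair. Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is what makes it usable across models with different hidden sizes.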
Submission Number: 12