Phase Incompatibility Explains Cross-Modal Alignment Failure: Evidence from 144 Model Pairs

NeurIPS 2025 Workshop NeurReps Submission 137 Authors

01 Sept 2025 (modified: 29 Oct 2025) · Submitted to NeurReps 2025 · CC BY 4.0
Keywords: Neural Manifold, Average Gradient Outer Product
Abstract: Why do pretrained vision and language models fail catastrophically at cross-modal alignment, achieving less than 3\% accuracy across modalities while exceeding 90\% within their own? We investigate this paradox through the lens of dynamical systems theory, analyzing 144 vision-language model pairs to uncover the mechanisms behind universal alignment failure. Our investigation reveals a fundamental cause: independent pretraining drives models into incompatible dynamical phases. Using Neural Tangent Kernel (NTK) analysis, we find that 75\% of model pairs occupy chaotic phases in which gradient directions across modalities are nearly orthogonal ($S_{\text{NTK}}^{\text{cross}} < 0.25$). This phase incompatibility persists across all architectural combinations, even between Vision Transformers and BERT variants that share similar architectures. We establish that phase metrics can predict alignment failure before training begins: models with Average Gradient Outer Product (AGOP) ratios exceeding $10^6$ fail, and this outcome is predicted with 94\% accuracy. These findings challenge the Platonic Representation Hypothesis by demonstrating that while models may converge to similar representations, they embed them in incompatible coordinate systems of their optimization landscapes. Our results explain why standard transfer learning fails across modalities and suggest that successful cross-modal learning requires phase-aware training methods that maintain dynamical compatibility from the outset.
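The submission page does not include code, so the following is a minimal sketch, under assumed readings of the two phase metrics quoted in the abstract: $S_{\text{NTK}}^{\text{cross}}$ is taken here as the Frobenius alignment between the empirical NTK Gram matrices of the two encoders on paired examples, and the AGOP ratio as the ratio of the largest eigenvalues of each model's Average Gradient Outer Product. The helper names (`ntk_gram`, `s_ntk_cross`, `agop_top_eig`) and the toy encoders are illustrative and not from the paper.

```python
# Hedged sketch (assumed definitions, not the paper's reference implementation) of the two
# phase metrics quoted in the abstract, using PyTorch and tiny stand-in encoders.
import torch
import torch.nn as nn


def ntk_gram(model: nn.Module, inputs: torch.Tensor) -> torch.Tensor:
    """Empirical NTK Gram matrix: K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x in inputs:
        out = model(x.unsqueeze(0)).sum()  # scalar readout of the encoder output
        g = torch.autograd.grad(out, params)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    G = torch.stack(grads)                 # (n_examples, n_params)
    return G @ G.T


def s_ntk_cross(model_a: nn.Module, model_b: nn.Module,
                batch_a: torch.Tensor, batch_b: torch.Tensor) -> float:
    """Assumed reading of S_NTK^cross: Frobenius alignment of the two NTK Gram matrices
    on paired examples, in [0, 1]; values near 0 mean nearly orthogonal gradient geometry."""
    ka, kb = ntk_gram(model_a, batch_a), ntk_gram(model_b, batch_b)
    return ((ka * kb).sum() / (ka.norm() * kb.norm())).item()


def agop_top_eig(model: nn.Module, inputs: torch.Tensor) -> float:
    """Largest eigenvalue of the AGOP: (1/n) sum_i grad_x f(x_i) grad_x f(x_i)^T."""
    d = inputs.shape[1]
    agop = torch.zeros(d, d)
    for x in inputs:
        x = x.clone().requires_grad_(True)
        out = model(x.unsqueeze(0)).sum()
        (gx,) = torch.autograd.grad(out, x)
        agop += torch.outer(gx, gx)
    return torch.linalg.eigvalsh(agop / len(inputs)).max().item()


if __name__ == "__main__":
    # Toy encoders standing in for the pretrained vision/text models of the 144 pairs.
    vision = nn.Sequential(nn.Linear(32, 64), nn.Tanh(), nn.Linear(64, 16))
    text = nn.Sequential(nn.Linear(48, 64), nn.Tanh(), nn.Linear(64, 16))
    imgs, txts = torch.randn(8, 32), torch.randn(8, 48)  # 8 paired (image, text) examples

    s_cross = s_ntk_cross(vision, text, imgs, txts)
    # The direction of the ratio (vision over text) is an assumption of this sketch.
    agop_ratio = agop_top_eig(vision, imgs) / agop_top_eig(text, txts)
    print(f"S_NTK^cross = {s_cross:.3f}   (abstract's chaotic-phase criterion: < 0.25)")
    print(f"AGOP ratio  = {agop_ratio:.3e} (abstract's failure criterion: > 1e6)")
```

Under the abstract's thresholds, a pair in the chaotic phase would show $S_{\text{NTK}}^{\text{cross}} < 0.25$ and an AGOP ratio above $10^6$; the randomly initialized toy models here are only meant to exercise the metric code, not to reproduce those values.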
Submission Number: 137