Establishing a Scale for Kullback–Leibler Divergence in Language Models Across Various Settings

ACL ARR 2026 January Submission 1668 Authors

31 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Language Models, KL Divergence, Training Dynamics, Model Comparison, Log-likelihood Vector
Abstract: Log-likelihood vectors define a common space for comparing language models as probability distributions, enabling unified comparisons across heterogeneous settings. We extend this framework to training checkpoints and intermediate layers, and establish a consistent scale for KL divergence across pretraining, model size, random seeds, quantization, fine-tuning, and layers. Analysis of Pythia pretraining trajectories further shows that changes in log-likelihood space are much smaller than in weight space, resulting in subdiffusive learning trajectories and early stabilization of language-model behavior despite weight drift.
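The framework the abstract describes can be illustrated with a minimal sketch: each model is represented by a vector of its log-likelihoods over a fixed probe text set, and a divergence between two models is estimated from those vectors. This is not the authors' released implementation; the model names (Pythia checkpoints, which the abstract mentions), the toy probe texts, and the Monte Carlo estimator below are illustrative assumptions.

```python
# Minimal sketch, assuming Hugging Face transformers and two small causal LMs.
# NOT the paper's released code; probe texts and estimator are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_likelihood_vector(model_name: str, texts: list[str]) -> torch.Tensor:
    """Return a vector of total log-likelihoods, one entry per probe text."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    lls = []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # HF returns the mean cross-entropy over predicted tokens; multiply
            # by the number of predicted tokens to recover the total log-likelihood.
            out = model(ids, labels=ids)
        lls.append(-out.loss.item() * (ids.shape[1] - 1))
    return torch.tensor(lls)

texts = ["The cat sat on the mat.", "Entropy measures uncertainty."]  # toy probe set
ll_p = log_likelihood_vector("EleutherAI/pythia-70m", texts)   # model P
ll_q = log_likelihood_vector("EleutherAI/pythia-160m", texts)  # model Q

# If the probe texts were samples from P, the mean log-likelihood gap would be
# a Monte Carlo estimate of KL(P || Q); over a fixed corpus it is only a proxy.
kl_proxy = (ll_p - ll_q).mean()
print(f"KL(P||Q) proxy over {len(texts)} texts: {kl_proxy:.3f} nats")
```

Because both vectors are computed on the same probe set, the same machinery extends directly to the settings the abstract lists (training checkpoints, sizes, seeds, quantized or fine-tuned variants): each variant contributes one vector in the shared space.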
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1668