Harmonic Loss Trains Interpretable AI Models

TMLR Paper 5549 Authors

05 Aug 2025 (modified: 06 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: In this paper, we introduce harmonic loss as an alternative supervisory signal for training neural networks and large language models (LLMs). Harmonic loss differs from standard cross-entropy loss in two ways: (a) it replaces the usual SoftMax normalization with a scale-invariant HarMax function, and (b) it computes logits via Euclidean distance rather than a dot product. Owing to its scale invariance and its finite convergence point, which by design can be interpreted as a class center, harmonic loss enables improved interpretability and faster convergence. We first validate the performance of harmonic models across algorithmic, vision, and language datasets. Through extensive experiments, we demonstrate that models trained with harmonic loss outperform standard models by (a) enhancing interpretability (i.e., the geometry of representations), (b) requiring less data to generalize, and (c) reducing grokking. Moreover, we compare a GPT-2 model trained with harmonic loss to standard GPT-2, showing that the harmonic model develops more interpretable representations. We hope our work will inspire future research into methods that improve the geometry of representations, paving the way toward more interpretable AI models.
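To make the abstract's definition concrete, the snippet below is a minimal sketch of a harmonic loss, assuming the HarMax form p_i ∝ d_i^{-n}, where d_i is the Euclidean distance between an input representation and the i-th class weight vector (interpreted as a class center) and n is a positive exponent hyperparameter. The function name `harmonic_loss`, its argument names, and the default value of `n` are illustrative assumptions, not the paper's released API.

```python
import torch
import torch.nn.functional as F

def harmonic_loss(x, centers, targets, n=1.0, eps=1e-12):
    """Sketch of a harmonic loss under the assumptions stated above.

    x:       (batch, dim) input representations
    centers: (classes, dim) per-class weight vectors (class centers)
    targets: (batch,) integer class labels
    n:       harmonic exponent (hyperparameter)
    """
    # Logits come from Euclidean distances to class centers,
    # replacing the usual dot-product logits.
    d = torch.cdist(x, centers) + eps            # (batch, classes)

    # HarMax: probabilities proportional to d^{-n}, normalized over
    # classes. Computed in log space for numerical stability.
    log_p = -n * torch.log(d)
    log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)

    # Negative log-likelihood of the target class.
    return F.nll_loss(log_p, targets)

# Usage example: 8 inputs in a 16-d space, 5 classes.
x = torch.randn(8, 16)
centers = torch.randn(5, 16)
targets = torch.randint(0, 5, (8,))
loss = harmonic_loss(x, centers, targets, n=2.0)
```

The scale invariance described in the abstract is visible here: multiplying every distance by a constant shifts `log_p` uniformly, and that shift is removed by the `logsumexp` normalization, so the class probabilities are unchanged.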
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: ~Quanshi_Zhang1
Submission Number: 5549