HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, NeurIPS 2025 poster, CC BY 4.0
Keywords: large language models, hyperbolic geometry, mixture-of-experts
TL;DR: We introduce HELM, a family of hyperbolic large language models operating fully in hyperbolic space.
Abstract: Frontier large language models (LLMs) have shown great success in text modeling and generation tasks across domains. However, natural language exhibits inherent semantic hierarchies and nuanced geometric structure, which current LLMs do not capture completely owing to their reliance on Euclidean operations such as dot-products and norms. Furthermore, recent studies have shown that disregarding the underlying geometry of token embeddings leads to training instabilities and degraded generative capabilities. These findings suggest that shifting to non-Euclidean geometries can better align language models with the underlying geometry of text. We thus propose to operate fully in $\textit{hyperbolic space}$, known for its expansive, scale-free, and low-distortion properties. To this end, we introduce $\textbf{HELM}$, a family of $\textbf{H}$yp$\textbf{E}$rbolic Large $\textbf{L}$anguage $\textbf{M}$odels, offering a geometric rethinking of the Transformer-based LLM that addresses the representational inflexibility, missing essential operations, and poor scalability of existing hyperbolic LMs. We additionally introduce a $\textbf{Mi}$xture-of-$\textbf{C}$urvature $\textbf{E}$xperts model, $\textbf{HELM-MiCE}$, in which each expert operates in a distinct curvature space to encode finer-grained geometric structure from text, as well as a dense model, $\textbf{HELM-D}$. For $\textbf{HELM-MiCE}$, we further develop hyperbolic Multi-Head Latent Attention ($\textbf{HMLA}$) for efficient training and inference with a reduced KV cache. For both models, we also develop essential hyperbolic equivalents of rotary positional encodings and root mean square normalization. We are the first to train fully hyperbolic LLMs at billion-parameter scale, and we evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains of up to 4\% from our $\textbf{HELM}$ architectures over the popular Euclidean architectures used in LLaMA and DeepSeek, along with superior semantic hierarchy modeling, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale language model pretraining.
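
To give a concrete flavor of what a "hyperbolic equivalent of root mean square normalization" could look like, below is a minimal PyTorch sketch. It is an illustration only, not the authors' implementation: the class name `LorentzRMSNorm`, the choice of the Lorentz (hyperboloid) model, and the strategy of normalizing only the space-like coordinates and then recomputing the time-like coordinate are all assumptions made here for exposition.

```python
import torch
import torch.nn as nn

class LorentzRMSNorm(nn.Module):
    """Sketch of an RMSNorm analogue on the Lorentz hyperboloid of curvature -c (c > 0).

    Assumption (not from the paper): apply a standard RMSNorm to the space-like
    coordinates, then recompute the time-like coordinate so the output stays on
    the manifold <x, x>_L = -1/c.
    """

    def __init__(self, dim: int, c: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.c, self.eps = c, eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain on the spatial part

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim + 1), with x[..., 0] the time-like coordinate
        space = x[..., 1:]
        rms = space.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        space = space * rms * self.weight  # ordinary RMSNorm on the spatial coordinates
        # recompute the time coordinate so that -time^2 + ||space||^2 = -1/c
        time = torch.sqrt(space.pow(2).sum(dim=-1, keepdim=True) + 1.0 / self.c)
        return torch.cat([time, space], dim=-1)
```

The design choice sketched here (transform the spatial part, then restore the manifold constraint analytically) is a common pattern in fully hyperbolic networks; whether HELM normalizes this way, or instead works through tangent-space maps, is specified in the paper itself rather than here.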
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 23148