Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA

Published: 03 Mar 2026, Last Modified: 03 Mar 2026, SPOT, Everyone, CC BY 4.0
Keywords: LLM, JEPA, data efficiency, Geodesic Hypothesis, Semantic Tube Prediction
Abstract: Large Language Models (LLMs) obey consistent scaling laws, empirical power-law fits that predict how loss decreases with compute, data, and parameters. While predictive, these laws are descriptive rather than prescriptive: they characterize typical training, not optimal training. Surprisingly few works have successfully challenged the data-efficiency bounds implied by these laws, and doing so is our primary focus. To that end, we introduce the Geodesic Hypothesis, which posits that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. Building on this principle, we propose Semantic Tube Prediction (STP), a JEPA-style regularization task that confines hidden-state trajectories to a tubular neighborhood of the geodesic. STP generalizes JEPA to language without requiring explicit multi-view augmentations. We show that this constraint improves the signal-to-noise ratio and consequently preserves diversity by preventing trajectory collisions during inference. Empirically, STP allows LLMs to match baseline accuracy with 16$\times$ less training data, directly violating the data term of Chinchilla-style scaling laws and demonstrating that principled geometric priors can surpass brute-force scaling.
Submission Number: 21
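
Illustrative sketch: the abstract describes confining a hidden-state trajectory to a tubular neighborhood of a locally linear path. One way to picture such a constraint is to measure how far each hidden state lies from the straight chord joining the first and last states, and to penalize only the distance that exceeds a tube radius. The function name `tube_penalty`, the `radius` parameter, and the squared-hinge form are assumptions made for illustration; this is not the STP loss as defined by the authors, which the abstract does not specify.

```python
import numpy as np

def tube_penalty(hidden, radius=0.0):
    """Mean squared excess distance of hidden states from the chord h[0] -> h[-1].

    Hypothetical illustration only; not the loss defined in the paper.
    `hidden` has shape (num_tokens, hidden_dim).
    """
    hidden = np.asarray(hidden, dtype=float)
    start, end = hidden[0], hidden[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    offsets = hidden - start
    if norm == 0.0:
        # Degenerate chord: fall back to distance from the start point.
        dist = np.linalg.norm(offsets, axis=1)
    else:
        unit = chord / norm
        # Remove the component along the chord; the remainder is perpendicular.
        along = offsets @ unit
        perp = offsets - np.outer(along, unit)
        dist = np.linalg.norm(perp, axis=1)
    # States inside the tube (dist <= radius) contribute nothing.
    excess = np.maximum(dist - radius, 0.0)
    return float(np.mean(excess ** 2))

# A straight trajectory stays on the chord; a bent one leaves it.
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
bent = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
print(tube_penalty(straight), tube_penalty(bent), tube_penalty(bent, radius=1.0))
```

Under these assumptions, a trajectory that is exactly linear incurs zero penalty, and widening the radius tolerates bounded deviation from the chord. In training, a term of this kind would be added to the usual next-token loss with some weight, which is again an assumption rather than a detail taken from the abstract.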