In-Context Learning for Discrete Optimal Transport: Can Transformers Sort?
Abstract: The rapid growth of model sizes and training datasets has created a strong demand for *test-time compute*, the ability to solve new problems at inference time without additional training. At the core of test-time compute is *in-context learning*, an emerging capability of large language models (LLMs) that enables them to perform statistical inference directly at test time. Recent progress has shed light on the mechanisms of in-context learning for statistical tasks: language models can implement linear regression and classification by iteratively extracting features at test time. This naturally raises a broader question: *can in-context learning extend beyond statistical learning to combinatorial problems and integer programs?*
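As a point of reference for the kind of mechanism meant here, the sketch below is our own illustration (not a construction from this submission): it checks numerically that a single linear-attention read-out over in-context examples reproduces the prediction of one gradient-descent step on least-squares regression. The learning rate `eta` and all variable names are illustrative assumptions.

```python
# Minimal sketch, assuming a linear (softmax-free) attention read-out:
# one gradient-descent step on in-context least-squares regression gives
# the same query prediction as an attention layer with keys x_i, values y_i.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 32
W_true = rng.normal(size=(1, d))
X = rng.normal(size=(n, d))           # in-context inputs x_1..x_n
y = X @ W_true.T                      # in-context targets y_i = W* x_i
x_q = rng.normal(size=(d, 1))         # query point
eta = 1.0 / n                         # illustrative step size

# One gradient step on L(W) = 0.5 * sum_i ||W x_i - y_i||^2, from W = 0:
# the negative gradient at W = 0 is sum_i y_i x_i^T.
W1 = eta * (y.T @ X)
pred_gd = W1 @ x_q

# The same prediction written as a linear attention read-out:
# query = x_q, keys = x_i, values = y_i, no parameter update.
pred_attn = eta * (y.T @ (X @ x_q))

print(np.allclose(pred_gd, pred_attn))   # True
```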
We take a first step toward addressing this question by analyzing the fundamental task of *sorting* and the more general problem of *discrete optimal transport*. We prove that transformers with softmax self-attention layers can solve discrete optimal transport through in-context learning even when model parameters remain fixed and only the input length and distribution vary. Consequently, transformers can sort lists of arbitrary length to within a provable approximation factor. This sorting mechanism fundamentally differs from classical discrete sorting algorithms: transformers implicitly simulate a *continuous optimization process* through their internal feature-extraction dynamics.
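To make the reduction concrete, here is a minimal numerical sketch of our own (not the paper's transformer construction): sorting a list can be posed as discrete optimal transport from the list's empirical measure to a fixed increasing grid, and a softmax-style continuous iteration (entropic OT solved by Sinkhorn updates) approximately recovers the same permutation. The grid `b`, the regularization `eps`, and the iteration count are assumptions made for this demo.

```python
# Illustrative sketch only: sorting as discrete optimal transport, plus a
# softmax-style continuous relaxation (entropic OT / Sinkhorn). This is NOT
# the submission's construction; `b`, `eps`, and `n_iters` are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.special import logsumexp

a = np.array([0.65, 0.05, 0.95, 0.30, 0.50, 0.80, 0.15, 0.40])  # unsorted list
n = a.size
b = np.linspace(0.0, 1.0, n)                  # fixed increasing target grid
C = (a[:, None] - b[None, :]) ** 2            # transport cost matrix

# Exact discrete OT between two uniform n-point measures is an assignment
# problem; with a convex cost on the line the optimal matching is monotone,
# so col[i] is the rank of a[i] and the assignment sorts the list.
_, col = linear_sum_assignment(C)
print(np.array_equal(a[np.argsort(col)], np.sort(a)))   # True

# Continuous relaxation: entropic OT solved by Sinkhorn iterations, i.e.
# alternating softmax-style (log-sum-exp) normalizations of the coupling.
def sinkhorn_log(C, eps=0.01, n_iters=2000):
    n = C.shape[0]
    log_mu = np.full(n, -np.log(n))           # uniform marginals
    f = np.zeros(n)
    g = np.zeros(n)
    for _ in range(n_iters):
        f = eps * (log_mu - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_mu - logsumexp((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)   # soft coupling

P = sinkhorn_log(C)
soft_ranks = P.argmax(axis=1)                 # round the soft coupling per row
# For sufficiently small eps and enough iterations this recovers the exact sort.
print(np.array_equal(a[np.argsort(soft_ranks)], np.sort(a)))
```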
Submission Number: 390