In-Context Learning for Discrete Optimal Transport: Can Transformers Sort?
Abstract: The rapid growth of model sizes and training datasets has created a strong demand for *test-time compute*: the ability to perform inference without additional training. At the core of test-time compute is *in-context learning* (ICL), an emergent capability of large language models (LLMs) that enables them to perform statistical inference directly at test time. Recent progress has shed light on the mechanisms underlying ICL in statistical tasks: language models can implement linear regression and classification by iteratively extracting features at test time. This naturally raises a broader question: *Can we analyze ICL beyond statistical learning and extend it to discrete algorithmic tasks relevant to NLP?*
A fundamental class of NLP tasks can be formulated as discrete optimal transport between sets of tokens, with applications ranging from machine translation to mixture-of-experts routing. We show that transformers with softmax self-attention can solve discrete optimal transport via in-context learning when the model parameters are fixed and only the input length and data distribution vary. One implication of this result is that transformers can sort lists of arbitrary length up to a provable approximation guarantee.
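To make the connection between discrete optimal transport and sorting concrete, here is a minimal sketch (not the paper's transformer construction): for two equal-size 1D point sets under a convex cost such as squared distance, the optimal one-to-one transport plan matches the points in sorted order, which is why solving discrete OT on a list also recovers a sorting of it. The function names below are illustrative, and optimality is checked by brute force over all permutations.

```python
from itertools import permutations

def discrete_ot_1d(xs, ys):
    """Match xs to ys in sorted order; optimal for convex costs in 1D.

    Returns a transport plan as a list of index pairs (i, j),
    pairing source point xs[i] with target point ys[j].
    """
    order_x = sorted(range(len(xs)), key=lambda i: xs[i])
    order_y = sorted(range(len(ys)), key=lambda j: ys[j])
    return list(zip(order_x, order_y))

def transport_cost(xs, ys, plan):
    """Total squared-distance cost of a transport plan."""
    return sum((xs[i] - ys[j]) ** 2 for i, j in plan)

def brute_force_cost(xs, ys):
    """Minimum cost over all one-to-one matchings (exponential; tiny n only)."""
    n = len(xs)
    return min(
        sum((xs[i] - ys[p[i]]) ** 2 for i in range(n))
        for p in permutations(range(n))
    )

xs = [3.0, 1.0, 2.0]
ys = [0.5, 2.5, 1.5]
plan = discrete_ot_1d(xs, ys)
# Sorted-order matching: 1.0->0.5, 2.0->1.5, 3.0->2.5
assert transport_cost(xs, ys, plan) == brute_force_cost(xs, ys)
```

Reading the plan's source indices in target order yields the argsort of the input list, which is the sense in which an OT solver doubles as a sorter.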
Submission Number: 390