How Do Transformers Align Tokens?

NeurIPS 2025 Workshop NeurReps · Submission 5

13 Aug 2025 (modified: 29 Oct 2025) · Submitted to NeurReps 2025 · CC BY 4.0
Keywords: Sorting, Optimal Transport, In-context Learning, Transformers
Abstract: How can language models align words in translated sentences with different syntactic structures? Can they compute edit distances—or even sort arbitrary sequences? These tasks are examples of *assignment problems*. We prove a carefully engineered prompt enables a transformer to approximate the solution of assignment. This prompt induces attention layers to simulate gradient descent on the dual objective of assignment. We establish an **explicit approximation bound** that improves with transformer depth. A striking implication is that a single transformer can sort inputs of **arbitrary** length—proving a form of *out-of-distribution generalization*.
Submission Number: 5
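
The mechanism the abstract describes can be illustrated numerically. Below is a minimal sketch, assuming an entropy-regularized formulation of the assignment problem: explicit gradient steps on its dual objective sort a sequence by transporting each entry toward one of *n* evenly spaced anchor slots. This is not the paper's transformer construction; the function name, the anchor grid, and the hyperparameters `eps`, `lr`, and `n_steps` are illustrative choices, not values taken from the submission.

```python
import numpy as np

def sort_by_dual_gradient_steps(x, eps=0.05, lr=0.025, n_steps=4000):
    """Sort x by solving an assignment between its entries and sorted anchor slots,
    using gradient ascent on the entropy-regularized dual objective."""
    x = np.asarray(x, dtype=float)
    n = x.size
    span = x.max() - x.min()
    xs = (x - x.min()) / (span if span > 0 else 1.0)   # rescale inputs to [0, 1]
    anchors = np.linspace(0.0, 1.0, n)                 # sorted target slots
    C = (xs[:, None] - anchors[None, :]) ** 2          # assignment cost matrix
    u = np.zeros(n)                                    # dual variables (rows)
    v = np.zeros(n)                                    # dual variables (columns)
    for _ in range(n_steps):
        # Transport plan implied by the current duals; its row/column sums give
        # the gradient of the (concave) entropic dual, so the update below is a
        # plain gradient step on that objective.
        P = np.exp((u[:, None] + v[None, :] - C) / eps)
        u += lr * (1.0 - P.sum(axis=1))
        v += lr * (1.0 - P.sum(axis=0))
    P = np.exp((u[:, None] + v[None, :] - C) / eps)
    # Order inputs by the (soft) anchor position each one is transported to.
    positions = (P @ anchors) / P.sum(axis=1)
    return x[np.argsort(positions)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=8)
    print(sort_by_dual_gradient_steps(x))
    print(np.sort(x))  # should agree up to the entropic blur
```

The decoding step `(P @ anchors) / P.sum(axis=1)` is a softmax-weighted average of the anchors, the same exponentiate-and-normalize pattern that attention layers compute, which is roughly the structural parallel the abstract points to.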