Keywords: Optimal Transport, Assignment, In-context Learning, Transformers, Sorting
Abstract: How can language models align words in translated sentences that have different syntactic structures? Can they compute edit distances, or even sort arbitrary sequences? These tasks are all instances of *assignment problems*. We prove that a carefully engineered prompt enables a transformer to approximate the solution of the assignment problem: the prompt induces the attention layers to simulate gradient descent on the dual objective of the assignment problem. We establish an **explicit approximation bound** that improves with transformer depth. A striking implication is that a single transformer can sort inputs of **arbitrary** length, proving a form of *out-of-distribution generalization*.
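For concreteness, below is a minimal NumPy sketch (not from the paper) of the mechanism the abstract alludes to: gradient ascent on the entropy-regularized dual of the assignment problem (equivalently, gradient descent on its negation), followed by sorting-as-assignment against a fixed sorted reference grid. The function name, step size, regularization strength `eps`, and the reference grid `y` are illustrative assumptions, not details taken from the submission.

```python
import numpy as np

def dual_gradient_ascent(C, eps=0.05, lr=0.05, n_steps=5000):
    """Gradient ascent on the entropy-regularized dual of the assignment
    problem with uniform marginals (illustrative sketch, not the paper's code).

    Dual objective (up to an additive constant):
        D(f, g) = <f, a> + <g, b>
                  - eps * sum_ij a_i b_j exp((f_i + g_j - C_ij) / eps)
    """
    n, m = C.shape
    a = np.full(n, 1.0 / n)   # uniform source weights
    b = np.full(m, 1.0 / m)   # uniform target weights
    f = np.zeros(n)           # dual potential on rows
    g = np.zeros(m)           # dual potential on columns
    for _ in range(n_steps):
        K = a[:, None] * b[None, :] * np.exp((f[:, None] + g[None, :] - C) / eps)
        f += lr * (a - K.sum(axis=1))   # grad_f D = a - row sums of K
        g += lr * (b - K.sum(axis=0))   # grad_g D = b - column sums of K
    P = a[:, None] * b[None, :] * np.exp((f[:, None] + g[None, :] - C) / eps)
    return f, g, P

# Sorting as assignment: match each input x_i to a slot of an increasing
# reference grid y_j; reading inputs off slot by slot sorts x.
x = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
y = np.linspace(0.0, 1.0, len(x))        # fixed sorted "anchor" sequence
C = (x[:, None] - y[None, :]) ** 2       # squared-distance cost matrix
_, _, P = dual_gradient_ascent(C)
order = P.argmax(axis=0)                 # slot j is filled by input order[j]
print(x[order])                          # approximately sorted(x)
```

In this reading, the cost matrix plays the role of query-key interactions and the exponential of the (shifted) potentials is a softmax-like attention pattern; the paper's construction shows how a prompt can make actual attention layers carry out updates of this kind.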
Submission Number: 2