Is Sorting Hard for Transformers?

Nathan W. Henry; Shan Chen; Huangyuan Su; Khashayar Gatmiry; Jonathan Kelner

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | MSLD 2026 Poster | CC BY 4.0

Keywords: transformer mechanistic interpretability, sorting, length generalization

TL;DR: We dissect why transformers that can sort in theory often fail in practice, showing that positional biases are the main culprit and that width and MLPs are key to robust sorting and length generalization.

Abstract: Sorting is a canonical algorithmic primitive that also appears implicitly in practical settings like information retrieval and ranking, yet trained transformers can remain brittle despite their theoretical capacity for perfect sorting. We present an empirical and mechanistic study of small decoder-only transformers trained from scratch on listwise integer sorting, sweeping architecture choices across vocabulary size, width (embedding dimension), depth, MLP presence, and positional encoding, and separately evaluating regimes with and without duplicates. Duplicates induce a sharp shift from an “easy” regime to a capacity-limited one: accuracy becomes bimodal and often collapses as models struggle to implement robust tie-handling logic. Across this sweep, embedding dimension is the dominant driver of success; depth provides consistent but secondary gains, while MLPs are especially important under duplicates, supporting the view that nonlinear processing helps resolve ambiguities that attention alone cannot disambiguate. Counter to intuition, absolute positional encodings systematically hurt performance on this permutation-equivariant task. We further analyze internal computation, finding that early-layer attention produces the key intermediate signals for sorting and that later attention can play a comparatively smaller role in the settings we study. Finally, we study length generalization by comparing curriculum learning to training on fully mixed sequence lengths, and we provide a constructive condition showing that, in the no-duplicate setting, a single-layer, single-head transformer can in principle sort lists of arbitrary length.
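To make the task concrete, the following is a minimal sketch of how listwise integer sorting examples might be generated for the two regimes the abstract distinguishes (with and without duplicates). The abstract does not specify the data format, so the function names, the separator token, and the "unsorted list, separator, sorted list" layout are all illustrative assumptions rather than the authors' pipeline.

```python
import random

def make_sorting_example(vocab_size, length, allow_duplicates, rng):
    """Return (unsorted, sorted) lists of integers drawn from [0, vocab_size).

    Hypothetical helper: the sampling scheme is an assumption,
    not taken from the paper.
    """
    if allow_duplicates:
        # Sample with replacement, so ties can occur.
        xs = [rng.randrange(vocab_size) for _ in range(length)]
    else:
        # Sample without replacement, so every value is distinct.
        if length > vocab_size:
            raise ValueError("length must not exceed vocab_size without duplicates")
        xs = rng.sample(range(vocab_size), length)
    return xs, sorted(xs)

def to_token_sequence(xs, ys, sep_token):
    """Concatenate input, separator, and target for a decoder-only model.

    The separator convention is an assumption made for illustration.
    """
    return xs + [sep_token] + ys

rng = random.Random(0)
vocab_size = 16
SEP = vocab_size  # hypothetical separator id, placed just outside the integer vocabulary

xs, ys = make_sorting_example(vocab_size, 8, allow_duplicates=False, rng=rng)
seq = to_token_sequence(xs, ys, SEP)

dup_xs, dup_ys = make_sorting_example(vocab_size, 8, allow_duplicates=True, rng=rng)
```

In this sketch the no-duplicate regime requires the list length to be at most the vocabulary size, which is one reason vocabulary size and sequence length interact when sweeping configurations.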

Submission Number: 66