Can Transformers Really Do It All? On the Compatibility of Inductive Biases Across Tasks

ICLR 2026 Conference Submission6301 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026 · ICLR 2026 · CC BY 4.0
Keywords: Transformers, language models, inductive biases, length generalization, activation functions
TL;DR: We optimize transformers by replacing GeLUs/softmaxes with parametrized splines tuned on held-out data. This tool reveals when task-specific architectures can be better.
Abstract: Transformers are remarkably versatile, and their design is largely consistent across a variety of applications. But are they optimal for any given task or dataset? The answer may be key to pushing AI beyond the mere scaling of current designs. **Method.** We present a method to optimize a transformer architecture for a given dataset, which we use as a tool to study optimal task-specific inductive biases. The method replaces the most important non-linearities (GeLUs, softmax) with components optimized on held-out data. We then apply each resulting architecture to other datasets as a way to evaluate the compatibility between pairs of tasks. **Findings.** On a range of popular algorithmic tasks, our method identifies new architectures with dramatic improvements in learning speed, generalization, and stability across seeds. These designs prove highly task-specific, which means that the tasks require inductive biases very different from those of standard transformers. On a range of code and language modeling datasets, we also find architectures with consistent, yet smaller, improvements. These designs transfer much better across datasets, domains (English vs. computer code), and tokenizations. **Implications.** These results show that standard transformers are rarely a local optimum in the space of architectures. We show that alternative designs can perform better, but they often sacrifice universality. This calls for future work on architectures that could serve multiple objectives, such as fluency and robust reasoning.
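To make the method concrete, here is a minimal sketch of the kind of component the abstract describes: a piecewise-linear spline with learnable knot values that can stand in for a GeLU. The grid range, knot count, and initialization below are illustrative assumptions, not the paper's actual parametrization; in the paper, such parameters would be optimized on held-out data.

```python
import math


class SplineActivation:
    """Piecewise-linear spline on a fixed knot grid (hypothetical sketch).

    The knot values `ys` are the learnable parameters. We initialize them to
    GeLU(x) so the spline starts as a drop-in replacement for the standard
    non-linearity; training would then move the knots away from GeLU.
    """

    def __init__(self, lo=-4.0, hi=4.0, num_knots=17):
        self.lo, self.hi = lo, hi
        step = (hi - lo) / (num_knots - 1)
        self.xs = [lo + i * step for i in range(num_knots)]
        # GeLU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
        self.ys = [0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))) for x in self.xs]

    def __call__(self, x):
        # Clamp outside the grid, interpolate linearly between knots inside.
        if x <= self.lo:
            return self.ys[0]
        if x >= self.hi:
            return self.ys[-1]
        step = (self.hi - self.lo) / (len(self.xs) - 1)
        i = int((x - self.lo) / step)
        t = (x - self.xs[i]) / step
        return (1.0 - t) * self.ys[i] + t * self.ys[i + 1]


act = SplineActivation()
print(act(0.0))  # matches GeLU(0) = 0 at initialization
```

The same idea extends to the attention softmax: replace the fixed normalizer with a parametrized function and fit its parameters on held-out data, then test the resulting architecture on other tasks to measure compatibility.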
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 6301