Why are small transformers not better?

ICLR 2025 Workshop ICBINB Submission 9 Authors

30 Jan 2025 (modified: 05 Mar 2025), Submitted to ICLR 2025 Workshop ICBINB, CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: transformer, attention, task-switching
TL;DR: Standard transformers are not better than recurrent networks or MLPs in the small size regime, up to a few million parameters. An exception is transformers based on expressive attention.
Abstract: The attention mechanism is powering rapid progress in large-scale generative AI. Conversely, it is exceedingly difficult to find small-scale applications for which attention-based models outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of `task switching'. In this framework, models work on ongoing token sequences, with the current task determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic reference model, IARC, which is based on finite-domain arithmetic. The model contains a trivial unary operation (I: increment the current input), a likewise trivial binary operation (A: add the last two inputs), and reverse copy (R), a standard memory task. A fourth control token (C) adds recursive context dependency by modifying the current task. Tasks are maintained as long as no new control token appears in the prompt, which happens stochastically every 3-9 steps. We show that transformers, LSTM recurrent networks and plain MLPs of similar sizes ($\sim$1.5M parameters) achieve only modest prediction accuracies of about 45\%. As a countertest, we trained transformers containing a modified attention mechanism, expressive attention, finding performance levels of around 95\%. Our results indicate that the workings of attention can be understood better, and even improved, by comparing qualitatively different formulations in a task-switching setting.
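To make the task-switching setup concrete, below is a minimal Python sketch of how an IARC-style token stream could be generated, based only on the description in the abstract. The domain size (arithmetic mod 10), the simplified stand-ins for the R and C tasks, and all names (`iarc_stream`, `MOD`, `CONTROL`) are illustrative assumptions, not the paper's specification.

```python
import random

MOD = 10                      # assumed finite arithmetic domain: integers mod 10
CONTROL = ["I", "A", "R", "C"]

def iarc_stream(n_steps, seed=0):
    """Yield (token, target) pairs for a toy IARC-style sequence.

    A control token is interspersed stochastically every 3-9 steps and
    switches the active task; between control tokens the model must keep
    applying the current task to the incoming data tokens.
    """
    rng = random.Random(seed)
    task = "I"
    history = []              # data tokens seen since the last control token
    until_switch = rng.randint(3, 9)

    for _ in range(n_steps):
        if until_switch == 0:
            task = rng.choice(CONTROL)
            history = []
            until_switch = rng.randint(3, 9)
            yield task, None  # control tokens carry no prediction target
            continue

        x = rng.randrange(MOD)
        history.append(x)
        if task == "I":       # unary: increment the current input
            target = (x + 1) % MOD
        elif task == "A":     # binary: add the last two inputs
            target = (history[-1] + history[-2]) % MOD if len(history) > 1 else None
        elif task == "R":     # reverse copy: crude stand-in for the memory task
            target = history[-2] if len(history) > 1 else None
        else:                 # "C": recursive context dependency; semantics are
            target = None     # defined in the paper and left unspecified here
        until_switch -= 1
        yield x, target

# Example: print the first few (token, target) pairs
for tok, tgt in iarc_stream(12):
    print(tok, tgt)
```

The point of the sketch is only to show the structure the models face: the correct output for a given data token depends on which control token is currently active, so a predictor has to track task identity across a stochastic number of steps.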
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 9