Track: long paper (up to 4 pages)
Keywords: transformer, attention, task-switching
TL;DR: Standard transformers are not better than recurrent networks or MLPs in the small-size regime, up to a few million parameters. An exception is transformers based on expressive attention.
Abstract: The attention mechanism is powering rapid progress in large-scale generative AI. Conversely, it is exceedingly difficult to find small-scale applications for which attention-based models outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of `task switching'. In this framework, models work on ongoing token sequences, with the current task being determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic reference model, IARC, which is based on finite-domain arithmetic. The model contains a trivial unary operation (I: increment the current input), a likewise trivial binary operation (A: add the last two inputs), and reverse copy (R), a standard memory task. A fourth control token (C) adds recursive context dependency by modifying current tasks. Tasks are maintained as long as no new control token appears in the prompt, which happens stochastically every 3-9 steps. We show that transformers, LSTM recurrent networks, and plain MLPs of similar sizes ($\sim$1.5M parameters) achieve only modest prediction accuracies of about 45\%. As a counter-test, we trained transformers containing a modified attention mechanism, expressive attention, finding performance levels of around 95\%. Our results indicate that the workings of attention can be understood better, and even improved, by comparing qualitatively different formulations in a task-switching setting.
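For concreteness, the sketch below illustrates how an IARC-style token stream of the kind described in the abstract could be generated. It is a minimal, hedged example: the domain size N, the treatment of the C token (omitted here), the absence of a target on control tokens, and the simplified reverse-copy target are assumptions not specified in the abstract.

```python
import random

# Illustrative IARC-style stream generator (details are assumptions, not the
# paper's exact setup). Numbers live in a finite domain Z_N; a control token
# (I, A, or R) is interspersed stochastically every 3-9 steps and fixes the
# task applied to the number tokens that follow it.
N = 10                       # assumed domain size
CONTROL = ["I", "A", "R"]    # C (recursive context modification) omitted here

def generate_stream(length, seed=0):
    rng = random.Random(seed)
    stream, targets, history = [], [], []
    task, steps_left = None, 0
    for _ in range(length):
        if steps_left == 0:                  # emit a new control token every 3-9 steps
            task = rng.choice(CONTROL)
            steps_left = rng.randint(3, 9)
            stream.append(task)
            targets.append(None)             # assumed: no prediction target on control tokens
            continue
        x = rng.randrange(N)                 # next number token from the finite domain
        history.append(x)
        if task == "I":                      # I: increment the current input
            y = (x + 1) % N
        elif task == "A":                    # A: add the last two inputs
            y = (history[-1] + history[-2]) % N if len(history) > 1 else x
        else:                                # R: simplified stand-in for reverse copy,
            y = history[-2] if len(history) > 1 else x  # echoing a recent earlier input
        stream.append(x)
        targets.append(y)
        steps_left -= 1
    return stream, targets

# Example usage: a short stream of mixed control and number tokens.
tokens, labels = generate_stream(20)
print(tokens)
print(labels)
```

A model trained on such a stream must infer the active task from the most recent control token and apply it to the subsequent numbers, which is the task-switching behaviour the abstract evaluates.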
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 9