Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We show that transformers adopt distinct counting strategies, relation-based and inventory-based, which shape learning regimes and are influenced by architectural choices, impacting performance and robustness in a simple histogram task.
Abstract: Beyond scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allows for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely confirms both the learning regimes of the various architectures and the formation of these strategies during training. We demonstrate how a basic task that requires only aggregation and selection is significantly impacted by minor design changes.
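For readers unfamiliar with the histogram task, the following minimal sketch illustrates the input-output format described in the abstract: given a sequence of tokens, the target at each position is the number of times that position's token appears in the whole sequence. The vocabulary size, sequence length, function name, and the choice of a reserved beginning-of-sequence token are illustrative assumptions, not the paper's exact setup.

```python
import random

def histogram_example(vocab_size=16, seq_len=10, bos_token=0):
    """Generate one (tokens, counts) pair for the histogram task.

    The target at each position is the number of occurrences of that
    position's token in the sequence. Sizes and the BOS convention are
    illustrative, not taken from the paper.
    """
    # Sample tokens uniformly from {1, ..., vocab_size}; 0 is reserved for BOS.
    seq = [random.randint(1, vocab_size) for _ in range(seq_len)]
    counts = [seq.count(t) for t in seq]
    # Prepend the BOS token; its target is set to 0 as a placeholder.
    return [bos_token] + seq, [0] + counts

if __name__ == "__main__":
    tokens, targets = histogram_example()
    print(tokens)   # e.g. [0, 7, 3, 7, 12, ...]
    print(targets)  # e.g. [0, 2, 1, 2, 1, ...]
```

A single transformer block trained on such pairs must both aggregate information across positions (via token mixing) and select the count for the query token, which is where the relation-based versus inventory-based distinction arises.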
Lay Summary: We studied how small changes to a popular type of AI model for language modelling, a transformer, affect how it solves a very basic task: counting how many times each item appears in a list. Even though this sounds simple, the way the model goes about solving it can vary a lot depending on how it is built. We found that transformers use two main strategies to count: one compares items in the list directly, while the other keeps track of everything and pulls out the answer later. These strategies split the work differently across parts of the model: sometimes the part that looks at all the items together does most of the work, and other times it is the part that processes each item one by one. Surprisingly, even small changes can make the model much better at counting, like adding a special symbol at the start of the list or tweaking how the model blends information. This shows that model design really matters, even for simple tasks, and that those tasks might not be so simple after all.
Link To Code: https://github.com/SPOC-group/counting-attention
Primary Area: Deep Learning->Theory
Keywords: learning theory, representation learning, algorithmic tasks, attention, associative memories, implicit bias, learning regimes
Submission Number: 2598