Keywords: Attention Mechanism, Transformer, Boolean Function, Hardness
Abstract: We study the computational limits of learning $k$-bit Boolean functions (specifically, $\mathrm{AND}$, $\mathrm{OR}$, and their noisy variants) with a minimalist single-head softmax-attention mechanism, where $k=\Theta(d)$ relevant bits must be selected from $d$ inputs.
We show that these simple $\mathrm{AND}$ and $\mathrm{OR}$ functions cannot be solved by
a single-head softmax-attention mechanism alone.
However, with \textit{teacher forcing}, the same minimalist attention can solve them.
These findings offer two key insights:
Architecturally, solving these Boolean tasks requires only \textit{minimalist attention}, without deep Transformer blocks or FFNs.
Methodologically, a single gradient-descent update with supervision suffices, replacing the multi-step Chain-of-Thought (CoT) reasoning scheme of [Kim and Suzuki, ICLR 2025] for solving these Boolean problems.
Together, the bounds expose a fundamental gap between what this minimal architecture achieves
under ideal supervision and what is provably impossible under standard training.
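For intuition, the sketch below is a minimal, self-contained illustration of a single-head softmax-attention readout on Boolean inputs; the score vector `w`, the relevant set `S`, and the thresholding rule are illustrative assumptions for this sketch, not the paper's construction.

```python
import numpy as np

# Illustration only (not the paper's construction): a single-head softmax-
# attention "reader" over d Boolean input positions. A score vector w plays
# the role of the query-key logits; softmax(w) gives attention weights, and
# the head outputs the attention-weighted average of the bits. When the
# weights concentrate on the k relevant positions, that average equals 1
# exactly when all relevant bits are 1, so thresholding just below 1
# recovers AND over the relevant set.

def single_head_attention(x, w):
    """x: (d,) array of {0,1} bits; w: (d,) attention logits."""
    attn = np.exp(w - w.max())
    attn /= attn.sum()          # softmax over the d input positions
    return attn @ x             # attention-weighted average of the bits

# Toy usage with a hypothetical relevant set S = {0, 1, 2} out of d = 8 bits.
d, S = 8, [0, 1, 2]
w = np.full(d, -10.0)
w[S] = 10.0                            # attention concentrated on the relevant bits
threshold = 1.0 - 1.0 / (2 * len(S))   # separates "all ones" from "one zero"

x = np.ones(d)
print(int(single_head_attention(x, w) > threshold))   # 1: AND is satisfied
x[1] = 0.0
print(int(single_head_attention(x, w) > threshold))   # 0: one relevant bit is off
```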
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12380