Keywords: LLM, Small Models, Emergent Abilities, ListOps, Joint Training
TL;DR: Joint training on compositional tasks reduces the model size required to learn difficult operations.
Abstract: The ability of a model to learn a task depends critically on both task difficulty and
model size. We study this relationship for compositional operations, focusing on
nested ListOps and extending beyond arithmetic to permutation groups, with the
goal of determining how task difficulty sets the minimum parameter requirements
for small transformer models. We vary task difficulty by introducing new operations
or combinations of operations into the training data. We find that while operations
such as modular addition or permutation group products are difficult in isolation,
joint training with other operations, such as product, maximum, or auxiliary
sub-block operations, reduces the parameter requirements by factors of 2 to 7.
Analysis of learned embeddings using PCA reveals that when joint training helps,
it is usually accompanied by an increase in highly regular structure in the
embeddings of inputs. These results suggest that joint training leads to qualitatively different
learning trajectories than learning operations in isolation, with shared number
representations supporting difficult tasks such as addition. Our findings further
demonstrate the importance of the training curriculum for the emergence of abilities in
language models.
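
To make the task family concrete, here is a minimal sketch of nested ListOps-style data over modular addition, maximum, and permutation group products. The operation names (`SMOD`, `MAX`), the modulus of 10, and the choice of S3 are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of nested ListOps-style samples; vocabulary and
# difficulty parameters are assumptions for illustration.
import random
from itertools import permutations

MOD = 10                            # assumed modulus for modular addition
S3 = list(permutations(range(3)))   # elements of the permutation group S3

def compose(p, q):
    """Permutation group product: (p o q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(q)))

def sample_expr(depth=2):
    """Recursively build a nested expression over {SMOD, MAX}."""
    if depth == 0:
        return random.randrange(MOD)
    op = random.choice(["SMOD", "MAX"])
    args = [sample_expr(depth - 1) for _ in range(random.randint(2, 3))]
    return [op] + args

def evaluate(expr):
    """Evaluate a nested expression bottom-up."""
    if isinstance(expr, int):
        return expr
    op, *args = expr
    vals = [evaluate(a) for a in args]
    return sum(vals) % MOD if op == "SMOD" else max(vals)

expr = sample_expr(depth=2)
print(expr, "->", evaluate(expr))

# The non-arithmetic operation family: a product of two S3 elements.
p, q = S3[1], S3[4]
print(p, "o", q, "=", compose(p, q))
```

Jointly training one model on sequences drawn from several such operation families is the intervention whose effect on parameter requirements the abstract describes.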
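The embedding analysis mentioned above can be sketched as a plain PCA of the input-token embedding matrix. The model handle and embedding attribute below are placeholders, assumed for illustration; regular structure (e.g., digit tokens arranged on a circle) would show up in the projected coordinates.

```python
# Hedged sketch of PCA over learned input-token embeddings.
import numpy as np

def pca_project(emb, k=2):
    """Project an (n_tokens, d_model) embedding matrix onto its top-k PCs."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    # SVD of the centered matrix: rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# e.g. emb = model.token_embedding.weight.detach().numpy()  # hypothetical
emb = np.random.randn(10, 64)   # stand-in for 10 digit-token embeddings
coords = pca_project(emb, k=2)
print(coords.round(2))
```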
Primary Area: generative models
Submission Number: 15279