Keywords: LLM, Small Models, Emergent Abilities, ListOps, Joint Training
TL;DR: Joint training on compositional tasks reduces the model size required to learn difficult operations.
Abstract: The ability of a model to learn a task depends critically on both task difficulty and
model size. We study this relationship for compositional operations, focusing on
nested ListOps and extending beyond arithmetic to permutation groups, with the
goal of determining how task difficulty sets the minimum parameter requirements
for small transformer models. We vary task difficulty by introducing new operations
or combinations of operations into the training data. We find that while operations
such as modular addition or permutation group products are difficult in isolation,
joint training with other operations, such as product, maximum, or auxiliary
sub-block operations, reduces the parameter requirements by factors of 2 to 7.
Analysis of learned embeddings using PCA reveals that when joint training helps,
it is usually accompanied by an increase in highly regular structure in the
embeddings of inputs. These results suggest that joint training leads to qualitatively different
learning trajectories than learning operations in isolation, with shared number
representations supporting difficult tasks such as addition. Our findings further
demonstrate the importance of the training curriculum for the emergence of abilities in
language models.
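
To make the task family concrete, here is a minimal sketch of nested ListOps-style data over modular addition, maximum, and permutation group products. The operation names (`SMOD`, `MAX`), the modulus of 10, and the choice of S3 are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of nested ListOps-style samples; vocabulary and
# difficulty parameters are assumptions for illustration.
import random
from itertools import permutations

MOD = 10                            # assumed modulus for modular addition
S3 = list(permutations(range(3)))   # elements of the permutation group S3

def compose(p, q):
    """Permutation group product: (p o q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(q)))

def sample_expr(depth=2):
    """Recursively build a nested expression over {SMOD, MAX}."""
    if depth == 0:
        return random.randrange(MOD)
    op = random.choice(["SMOD", "MAX"])
    args = [sample_expr(depth - 1) for _ in range(random.randint(2, 3))]
    return [op] + args

def evaluate(expr):
    """Evaluate a nested expression bottom-up."""
    if isinstance(expr, int):
        return expr
    op, *args = expr
    vals = [evaluate(a) for a in args]
    return sum(vals) % MOD if op == "SMOD" else max(vals)

expr = sample_expr(depth=2)
print(expr, "->", evaluate(expr))

# The non-arithmetic operation family: a product of two S3 elements.
p, q = S3[1], S3[4]
print(p, "o", q, "=", compose(p, q))
```

Jointly training one model on sequences drawn from several such operation families is the intervention whose effect on parameter requirements the abstract describes.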
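The embedding analysis mentioned above can be sketched as a plain PCA of the input-token embedding matrix. The model handle and embedding attribute below are placeholders, assumed for illustration; regular structure (e.g., digit tokens arranged on a circle) would show up in the projected coordinates.

```python
# Hedged sketch of PCA over learned input-token embeddings.
import numpy as np

def pca_project(emb, k=2):
    """Project an (n_tokens, d_model) embedding matrix onto its top-k PCs."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    # SVD of the centered matrix: rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# e.g. emb = model.token_embedding.weight.detach().numpy()  # hypothetical
emb = np.random.randn(10, 64)   # stand-in for 10 digit-token embeddings
coords = pca_project(emb, k=2)
print(coords.round(2))
```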
Primary Area: generative models
Submission Number: 15279