The Role of Task Complexity in Emergent Abilities of Small Language Models

25 Sept 2024 (modified: 27 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM, ListOps, Emergent Abilities, Scaling Laws
TL;DR: We investigate how task complexity impacts the model size required for small transformers to learn specific tasks, showing that parameter requirements grow with complexity, while multitask learning accelerates the learning of more challenging tasks.
Abstract: We investigate the relationship between task complexity and the minimum model size required to learn specific tasks in small transformer models. We focus on the ListOps dataset, which consists of nested arithmetic operations. We define task complexity as the Kolmogorov complexity (KC) of the code that solves the task, estimated with a rough proxy for KC. We find a power-law relation between KC and the number of parameters required to learn a task, suggesting that the parameter count needed for harder tasks grows almost cubically in KC. Among the individual operations, sum modulo 10 is the hardest to learn. Surprisingly, when tasks are combined, sum is learned earlier and with fewer parameters when trained alongside max and median. Analyzing the trained models, we find strong evidence that models trained on sum alone and models trained jointly converge to different algorithms. Concretely, the sum-only model does not appear to learn number-like structure in its embedding layer and likely memorizes the sum table. In contrast, models trained jointly on the three tasks (maximum, median, and sum) develop clear number-like structure in their embeddings. Finally, we find evidence that the sum-only model relies more heavily on its feedforward layer, whereas the jointly trained model activates its attention layer more. Our findings suggest another dimension to emergent abilities in language models, namely which algorithm is learned, with potential consequences for scaling laws.
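To make the setup concrete: the reported relation can be read as $N_{\min} \propto \mathrm{KC}^{\alpha}$ with $\alpha \approx 3$ ("almost cubic"); the exact exponent and proxy are not given in the abstract, so this form is only a reading of its wording. Below is a minimal sketch of a ListOps-style evaluator for nested prefix expressions over single digits with the three operations discussed (MAX, MED, and SM for sum modulo 10); the exact token format and operator set used in the paper may differ.

```python
# Minimal sketch of a ListOps-style evaluator (illustrative; not the paper's code).
# Assumes prefix expressions over single digits with MAX, MED, and SM (sum mod 10).
import statistics

OPS = {
    "MAX": max,
    "MED": lambda xs: int(statistics.median(xs)),
    "SM": lambda xs: sum(xs) % 10,
}

def evaluate(tokens):
    """Recursively evaluate a tokenized prefix expression such as
    ['[SM', '[MAX', '3', '7', ']', '2', ']'] -> (max(3, 7) + 2) mod 10 = 9."""
    def parse(i):
        tok = tokens[i]
        if tok.startswith("["):            # an operator opens a nested list
            op, args, i = OPS[tok[1:]], [], i + 1
            while tokens[i] != "]":
                val, i = parse(i)
                args.append(val)
            return op(args), i + 1         # skip the closing bracket
        return int(tok), i + 1             # a single-digit operand

    value, _ = parse(0)
    return value

if __name__ == "__main__":
    expr = "[SM [MAX 3 7 ] 2 ]".split()
    print(evaluate(expr))  # (max(3, 7) + 2) % 10 = 9
```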
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4114