Analysis of Emergence of Reasoning in Language Models: Factors, Thresholds and Interpretations

Published: 16 Oct 2025, Last Modified: 10 Nov 2025
Venue: NeurIPS 2025 ER Workshop
License: CC BY 4.0
Keywords: In-context learning, language models, chain-of-thought, explainable AI, reasoning ability of language models
TL;DR: This work analyzes when and how reasoning abilities emerge in small (<3B) decoder-only language models, examining in-context learning, zero-shot prompting, and fine-tuning, with attention analysis revealing key factors in reasoning performance.
Abstract: This work investigates the reasoning thresholds of open-source, decoder-only, transformer-based language models (LMs) with fewer than three billion parameters by studying three key aspects: reasoning with in-context learning, zero-shot reasoning, and fine-tuning of small models for zero-shot reasoning. We evaluate the reasoning ability of LMs under in-context learning on deductive reasoning tasks, showing that it is influenced by model size, by architectural choices such as feedforward width and the number of attention heads, and by properties of the pretraining data, including its scale, diversity, and long-range coherence, as well as by the ordering of in-context demonstrations. For zero-shot reasoning, we show that fine-tuning LMs on instruction and code data, prompting strategies such as plan-and-solve and role-play, and model depth can all contribute to improved zero-shot reasoning performance. Regarding the fine-tuning of small LMs, we show that LMs can acquire logical reasoning abilities through instruction tuning with chain-of-thought data, with or without exemplars, and through knowledge distillation. To support these conclusions, we analyze multi-head attention, correlating individual attention heads with multiple reasoning paths, and apply attention unembedding to identify which tokens each head writes to the residual stream. These findings provide a clearer understanding of the conditions under which reasoning abilities emerge in LMs.
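The attention-unembedding analysis mentioned in the abstract belongs to the logit-lens family of techniques: a head's write to the residual stream is projected through the model's unembedding matrix to read off which tokens it promotes. Below is a minimal sketch of that idea, assuming a HuggingFace GPT-2 model; the layer/head indices and the use of the final layer norm on a single head's output are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of attention unembedding (logit-lens style), assuming a
# HuggingFace GPT-2 model. The layer/head choice is hypothetical, not the
# paper's exact configuration.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

text = "All birds can fly. Tweety is a bird. Therefore, Tweety can"
ids = tok(text, return_tensors="pt").input_ids

layer, head = 8, 5  # hypothetical head to inspect
d_head = model.config.n_embd // model.config.n_head

# Capture the concatenated per-head outputs (the input to the attention
# output projection c_proj) with a forward pre-hook.
captured = {}
hook = model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(
    lambda mod, args: captured.update(z=args[0].detach())
)
with torch.no_grad():
    model(ids)
hook.remove()

# The head's write to the residual stream at the last position is its slice
# of the concatenated head outputs times its slice of W_O.
z = captured["z"][0, -1]                             # (n_embd,)
W_O = model.transformer.h[layer].attn.c_proj.weight  # (n_embd, n_embd)
head_write = (
    z[head * d_head:(head + 1) * d_head]
    @ W_O[head * d_head:(head + 1) * d_head, :]
)

# Unembed: apply the final layer norm (a common approximation when reading a
# single component in isolation) and the unembedding matrix, then list the
# tokens this head most strongly promotes.
logits = model.lm_head(model.transformer.ln_f(head_write))  # (vocab_size,)
print(tok.convert_ids_to_tokens(logits.topk(10).indices.tolist()))
```

The top-ranked tokens give a rough picture of what the inspected head contributes at that position; repeating this over layers and heads is one way to localize which components carry a reasoning step.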
Submission Number: 231