Evaluating Generalization and Representation Stability in Small LMs via Prompting, Fine-Tuning and Out-of-Distribution Prompts

Published: 10 Jun 2025, Last Modified: 15 Jul 2025 · MOSS@ICML2025 · CC BY 4.0
Keywords: small language models, few-shot prompting, supervised fine-tuning, model adaptation, representation learning, generalization, low-resource NLP, out-of-distribution robustness, prompt engineering, parameter efficiency, transformer models, task-specific tuning, in-context learning, experimental analysis, model scaling
Abstract: We investigate the generalization capabilities of small language models under two popular adaptation paradigms: few-shot prompting and supervised fine-tuning. While prompting is often favored for its parameter efficiency and flexibility, it remains unclear how robust this approach is in low-resource settings and under distributional shifts. This paper presents a comparative study of prompting and fine-tuning across task formats, prompt styles, and model scales, with a focus on their behavior in both in-distribution and out-of-distribution (OOD) settings. Beyond accuracy, we analyze the internal representations learned by each approach to assess the stability and abstraction of task-specific features. Our findings highlight critical differences in how small models internalize and generalize knowledge under different adaptation strategies. This work offers practical guidance for model selection in low-data regimes and contributes empirical insight into the ongoing debate over prompting versus fine-tuning.
Code: ipynb
Submission Number: 35
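
As an illustration of the comparison described in the abstract, the following is a minimal sketch of the two adaptation paths (few-shot prompting vs. supervised fine-tuning) together with one simple representation-similarity probe. It is not the attached notebook: the model ("gpt2" as a stand-in small LM), the toy sentiment task, and the linear-CKA probe are illustrative assumptions, not details taken from the submission.

```python
# Illustrative sketch only; model, task, and probe are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a "small language model"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# --- Path 1: few-shot prompting (task specified entirely in the context window) ---
few_shot_prompt = (
    "Review: great acting and a moving story. Sentiment: positive\n"
    "Review: dull plot and wooden dialogue. Sentiment: negative\n"
    "Review: a delightful surprise from start to finish. Sentiment:"
)
inputs = tok(few_shot_prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs, max_new_tokens=2, do_sample=False, pad_token_id=tok.eos_token_id
    )
print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))

# --- Path 2: supervised fine-tuning would instead update the weights on labeled
#     (review, label) pairs, e.g. with a standard Trainer loop (omitted here). ---

# --- Representation probe: linear CKA between activation matrices. In an analysis
#     of representation stability, X and Y could be the same inputs' hidden states
#     before vs. after fine-tuning, or under in-distribution vs. OOD prompts; here
#     we compare an early and a late layer of the base model to keep the sketch
#     self-contained and runnable. ---
def mean_pooled_hidden(texts, layer):
    """Mean-pooled hidden states at one layer; returns (n_texts, hidden_dim)."""
    reps = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states[layer]
        reps.append(hs.mean(dim=1).squeeze(0))
    return torch.stack(reps)

def linear_cka(X, Y):
    """Linear CKA similarity between two (n, d) activation matrices."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

texts = [
    "great acting and a moving story.",
    "dull plot and wooden dialogue.",
    "a delightful surprise from start to finish.",
    "the checkout process on this website was confusing.",  # toy domain shift
]
print("CKA(layer 2, last layer):",
      round(linear_cka(mean_pooled_hidden(texts, 2), mean_pooled_hidden(texts, -1)), 3))
```

Linear CKA is one common choice for comparing activation spaces; adapting the probe to a base-vs-fine-tuned or in-distribution-vs-OOD comparison only changes which model or which inputs produce the two activation matrices.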