Do Generalisation Results Generalise?

ACL ARR 2026 January Submission9422 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: generalization, data influence, natural language inference, evaluation
Abstract: A large language model's (LLM's) out-of-distribution (OOD) generalisation is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered during deployment are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model's performance across multiple OOD testsets throughout a finetuning run; we then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated are generalisation performances once in-domain performance is controlled for. Analysing OLMo, OPT and SmolLM, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: generalization, data influence, natural language inference, evaluation
Languages Studied: English
Submission Number: 9422