The Surprising Effectiveness of Test-Time Training for Few-Shot Learning

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Test-time training on in-context examples outperforms standard few-shot learning, boosting LM abstract reasoning and adaptability.
Abstract: Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT)—temporarily updating model parameters during inference using a loss derived from input data—as a mechanism for improving LMs' reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to $6\times$ higher accuracy compared to fine-tuned baselines—reaching $53.0\%$ on the public validation set with an 8B-parameter LM and $61.9\%$ when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the $10$-shot setting by $7.3$ percentage points ($50.5\%$ to $57.8\%$). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
Lay Summary: Language models are good at solving problems they've seen before, but struggle when faced with unfamiliar challenges. Our research shows that we can significantly boost their performance by allowing them to "update" themselves a little during the test itself, without retraining them from scratch. We apply this idea, called test-time training (TTT), to two especially difficult benchmarks: one with visual puzzles (ARC), similar to IQ tests, and another with complex language tasks (BIG-Bench Hard). By letting the model adjust itself slightly using the examples it sees during testing, we achieved up to six times better accuracy on ARC and over 7 percentage points of improvement on BIG-Bench Hard compared to traditional approaches. Our method helps models adapt to unfamiliar tasks in real time, making them more useful in situations where they encounter something new or unexpected.
Link To Code: https://github.com/ekinakyurek
Primary Area: Deep Learning->Large Language Models
Keywords: Test-Time Training, In-Context Learning, Few-Shot Learning, ARC, BIG-Bench Hard
Submission Number: 15418