Many-Shot In-Context Learning

Published: 25 Sept 2024 · Last Modified: 06 Nov 2024 · NeurIPS 2024 spotlight · CC BY-NC-SA 4.0
Keywords: large language models, in-context learning, long-context models
TL;DR: We investigate the many-shot in-context learning regime -- prompting large language models with hundreds or thousands of examples -- for a wide range of tasks.
Abstract: Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the amount of available human-generated outputs. To mitigate this limitation, we explore two new settings: (1) "Reinforced ICL", which uses model-generated chain-of-thought rationales in place of human rationales, and (2) "Unsupervised ICL", where we remove rationales from the prompt altogether and prompt the model only with domain-specific inputs. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. We demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to supervised fine-tuning. Finally, we reveal the limitations of next-token prediction loss as an indicator of downstream ICL performance.
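To make the two prompt-construction settings concrete, here is a minimal Python sketch of how many-shot prompts could be assembled for Reinforced ICL (model-generated rationales filtered by answer correctness) and Unsupervised ICL (inputs only, no rationales). The function names, prompt format, and the `generate`/`extract_final_answer` callables are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of many-shot prompt construction; not the authors' code.
from typing import Callable, List


def build_reinforced_icl_prompt(
    problems: List[str],
    reference_answers: List[str],
    generate: Callable[[str], str],          # assumed LLM sampling call
    extract_final_answer: Callable[[str], str],
    num_samples: int = 4,
) -> str:
    """Reinforced ICL: use model-generated chain-of-thought rationales,
    kept only when their final answer matches the reference, as in-context
    examples in place of human-written rationales."""
    examples = []
    for problem, reference in zip(problems, reference_answers):
        for _ in range(num_samples):
            rationale = generate(f"Q: {problem}\nA: Let's think step by step.")
            if extract_final_answer(rationale) == reference:
                examples.append(f"Q: {problem}\nA: {rationale}")
                break  # keep one verified rationale per problem
    return "\n\n".join(examples)


def build_unsupervised_icl_prompt(problems: List[str]) -> str:
    """Unsupervised ICL: the prompt contains only domain-specific inputs,
    with no rationales or answers."""
    return "\n\n".join(f"Q: {problem}" for problem in problems)


# Usage (illustrative): append the test question to the many-shot prompt.
# prompt = build_reinforced_icl_prompt(train_qs, train_answers, llm, parse)
# prompt += f"\n\nQ: {test_question}\nA:"
```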
Primary Area: Natural language processing
Submission Number: 3528