Abstract: In-context learning (ICL) has emerged as a key capability that enables large language models (LLMs) to adapt effectively to specific tasks, offering both flexibility and improved performance. Recent advances in extending the context window allow LLMs to process longer sequences and thus to benefit from many-shot in-context learning, which has been shown to further enhance performance. To systematically evaluate this emerging many-shot capability, we introduce \emph{MICLBench}, a \textbf{M}any-shot \textbf{ICL} \textbf{Bench}mark spanning four core categories (classification, translation, summarization, and question answering) across 14 diverse tasks, unified under a standardized prompt and API framework. We use this benchmark to evaluate over 19 high-performing models, including both proprietary and open-source ones. Our experiments yield critical insights: (i) most models benefit from many-shot ICL across various tasks, (ii) many-shot prompts are more effective for coarse-grained tasks that require less reasoning, and (iii) few-shot performance is not necessarily positively correlated with the ability to effectively utilize many-shot prompts. The benchmark highlights pronounced performance disparities across tasks and models in many-shot settings, while exposing limitations in current LLMs' ability to harness increasing numbers of in-context examples. These findings provide actionable pathways for building models that better leverage expanding context windows. By offering a rigorous, automated evaluation framework, this work underscores the challenges and opportunities in scaling in-context learning to many-shot paradigms. The code is available at \url{https://anonymous.4open.science/r/MICLBench}.
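To make the many-shot setup concrete, the sketch below illustrates how a many-shot ICL prompt might be assembled and sent to an OpenAI-compatible chat API. This is a hypothetical illustration, not the paper's actual prompt template or evaluation code; the task, labels, demonstration data, and model name are placeholders.

```python
# Minimal sketch (not the authors' implementation): build a many-shot ICL
# prompt for a classification task and query an OpenAI-compatible chat API.
# All task details, example data, and the model name below are hypothetical.

from openai import OpenAI


def build_many_shot_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, N in-context demonstrations, and a query."""
    lines = [instruction, ""]
    for text, label in demonstrations:
        lines.append(f"Input: {text}\nLabel: {label}\n")
    lines.append(f"Input: {query}\nLabel:")
    return "\n".join(lines)


# Hypothetical demonstration pool; a many-shot setting may use hundreds of shots.
demos = [
    ("The movie was wonderful.", "positive"),
    ("I would not recommend this.", "negative"),
] * 100  # 200 in-context demonstrations

prompt = build_many_shot_prompt(
    instruction="Classify the sentiment of each input as positive or negative.",
    demonstrations=demos,
    query="An unexpectedly moving and well-acted film.",
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content.strip())
```

Under this kind of unified template, scaling from few-shot to many-shot only changes the number of demonstrations passed in, which is what allows the same prompt and API framework to cover all tasks and models in the benchmark.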
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 8049