Abstract: In-context learning (ICL) has emerged as a key capability that enables large language models (LLMs) to adapt effectively to specific tasks, offering both flexibility and improved performance. Recent advances in extending the context window allow LLMs to process longer sequences and thus to benefit from many-shot in-context learning, which has been shown to further enhance performance. To systematically evaluate this emerging many-shot capability, we introduce \emph{MICLBench}, a \textbf{M}any-shot \textbf{ICL} \textbf{Bench}mark spanning four core categories (classification, translation, summarization, and question answering) across 14 diverse tasks, unified under a standardized prompt and API framework. We use this benchmark to evaluate over 19 high-performing models, including both proprietary and open-source ones. Our experiments yield critical insights: (i) most models benefit from many-shot ICL across various tasks, (ii) many-shot prompts are more effective for coarse-grained tasks that require less reasoning, and (iii) few-shot performance is not necessarily positively correlated with the ability to effectively utilize many-shot prompts. The benchmark highlights pronounced performance disparities across tasks and models in many-shot settings, while exposing limitations in current LLMs' ability to harness increasing numbers of in-context examples. These findings provide actionable pathways for building models that better leverage expanding context windows. By offering a rigorous, automated evaluation framework, this work underscores the challenges and opportunities in scaling in-context learning to many-shot paradigms. The code is available at \url{https://anonymous.4open.science/r/MICLBench}.
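To make the many-shot setup concrete, the sketch below illustrates how a many-shot ICL prompt might be assembled and sent to an OpenAI-compatible chat API. This is a hypothetical illustration, not the paper's actual prompt template or evaluation code; the task, labels, demonstration data, and model name are placeholders.

```python
# Minimal sketch (not the authors' implementation): build a many-shot ICL
# prompt for a classification task and query an OpenAI-compatible chat API.
# All task details, example data, and the model name below are hypothetical.

from openai import OpenAI


def build_many_shot_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, N in-context demonstrations, and a query."""
    lines = [instruction, ""]
    for text, label in demonstrations:
        lines.append(f"Input: {text}\nLabel: {label}\n")
    lines.append(f"Input: {query}\nLabel:")
    return "\n".join(lines)


# Hypothetical demonstration pool; a many-shot setting may use hundreds of shots.
demos = [
    ("The movie was wonderful.", "positive"),
    ("I would not recommend this.", "negative"),
] * 100  # 200 in-context demonstrations

prompt = build_many_shot_prompt(
    instruction="Classify the sentiment of each input as positive or negative.",
    demonstrations=demos,
    query="An unexpectedly moving and well-acted film.",
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content.strip())
```

Under this kind of unified template, scaling from few-shot to many-shot only changes the number of demonstrations passed in, which is what allows the same prompt and API framework to cover all tasks and models in the benchmark.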
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 8049