Abstract: Owing to their capability for in-context learning, large language models (LLMs) have shown impressive performance across diverse mathematical reasoning benchmarks. However, we find that few-shot demonstrations can sometimes degrade performance, and their effect on LLMs' reasoning abilities remains unreliable. To this end, in this paper, we theoretically analyze the impact of in-context demonstrations on LLMs' reasoning performance. We prove that the reasoning efficacy (measured by empirical prediction loss) can be bounded by an \emph{LLM-oriented semantic similarity} and an \emph{inference stability of demonstrations}, a result that holds for both one-shot and few-shot scenarios. Based on this finding, we propose a straightforward, generalizable, and low-complexity demonstration selection method named LMS3. It selects the most pertinent samples for different LLMs and includes a novel demonstration rejection mechanism that automatically filters out samples unsuitable for few-shot learning. Through experiments on three representative benchmarks, two LLM backbones, and multiple few-shot settings, we verify the superiority of LMS3, which achieves consistent improvements on all datasets, a result that existing methods have been unable to accomplish. Our code is available at \url{https://github.com/Ljyustc/LMS3}.
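To make the selection idea concrete, below is a minimal sketch of scoring candidate demonstrations by a weighted combination of semantic similarity and an inference-stability estimate, with a rejection threshold. The function names, the weighting scheme (alpha), and the threshold (tau) are illustrative assumptions, not the paper's actual LMS3 implementation.

```python
# Illustrative sketch only: the scoring weights, helper names, and the
# rejection threshold tau are assumptions, not the LMS3 method itself.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_demonstrations(query_emb, demo_embs, demo_stabilities,
                          k=1, alpha=0.5, tau=0.3):
    """Rank candidate demonstrations by a combination of
    (i) similarity to the query in the LLM's representation space and
    (ii) a per-demonstration inference-stability score, then keep at most
    k candidates whose combined score clears the rejection threshold tau.

    query_emb:        embedding of the target problem (1-D array)
    demo_embs:        list of candidate demonstration embeddings
    demo_stabilities: list of stability scores in [0, 1] (higher = more stable)
    """
    scores = [alpha * cosine_similarity(query_emb, emb) + (1 - alpha) * stab
              for emb, stab in zip(demo_embs, demo_stabilities)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Rejection mechanism: candidates scoring below tau are dropped; an empty
    # result means falling back to zero-shot prompting for this query.
    return [i for i in ranked[:k] if scores[i] >= tau]
```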
Lay Summary: Large language models (LLMs), like GPT-4, can solve math problems just by seeing a few examples — a skill called in-context learning. But surprisingly, these examples don't always help; sometimes they even hurt performance. Why?
Our research provides a theoretical explanation. We show that the effectiveness of examples depends on two key factors: how semantically similar they are to the new problem, and how stable the model's reasoning is on the examples themselves. This helps us understand when and why examples help or hurt — moving beyond trial-and-error toward a more principled understanding. Based on this insight, we introduce LMS3, a simple and general method that selects the most helpful examples and filters out harmful ones.
Our work provides deeper insights into how LLMs actually reason with examples — shedding light on the inner workings of these powerful models. This makes LLMs more trustworthy for solving math problems and could benefit broader automated reasoning tasks.
Link To Code: https://github.com/Ljyustc/LMS3
Primary Area: Deep Learning->Large Language Models
Keywords: In context learning, Large Language Models
Submission Number: 9433