Abstract: In-context learning (ICL) is highly sensitive to which demonstrations appear in the prompt, but selecting them is expensive because candidate contexts must be validated with repeated LLM calls. We argue that demonstration selection is \emph{easier to judge than to find}: predicting whether a specific query--context pair $(q,D)$ will succeed is cheaper and more general than searching for an optimal $D^\star$. Based on this insight, we propose DiSP, a sample-and-judge framework that stratifies queries by difficulty. DiSP runs random demonstration trials to estimate each training query's success rate, trains a lightweight router to predict difficulty from the query, and trains level-specific judges to score sampled contexts. At inference, DiSP performs stop-on-acceptance judging under an explicit budget and typically makes a single LLM call, emitting diagnostic risk tags when no suitable context is found. Across five classification datasets with Llama 3–8B and Qwen 2.5–7B, DiSP achieves the best average accuracy, improving over strong learned selection baselines by up to 3.4%, and delivers up to a 23× end-to-end wall-clock speedup.
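The difficulty-estimation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the number of trials, the context size `k`, the three level names, and the thresholds are all assumptions made for illustration, and `is_correct` is a hypothetical callback standing in for an LLM call followed by a correctness check.

```python
import random

def estimate_success_rate(query, pool, is_correct, k=4, trials=8, rng=None):
    """Fraction of random k-demonstration contexts on which the LLM succeeds.

    `is_correct(query, demos)` is a hypothetical callback that would run the
    LLM on the query with the given demonstrations and return True/False.
    """
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        demos = rng.sample(list(pool), k)  # one random demonstration trial
        hits += bool(is_correct(query, demos))
    return hits / trials

def difficulty_level(rate, easy_at=0.8, hard_below=0.2):
    """Map a success rate to a difficulty level (thresholds are assumed)."""
    if rate >= easy_at:
        return "easy"
    if rate < hard_below:
        return "hard"
    return "medium"
```

Under these assumptions, the resulting level would serve as the training label for the router, and each level-specific judge would be trained on the (query, context, outcome) triples collected during the same trials.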
Lay Summary: When people use a large language model for a new task, they often give it a few example questions and answers to imitate. Which examples are chosen can strongly affect whether the model answers correctly, but searching through possible example sets is slow and expensive. This paper asks a simpler question: instead of trying to find the perfect examples, can we quickly judge whether a sampled set is good enough? We introduce DiSP, a method that first estimates how difficult a question is, then tests random example sets with helper models that predict whether the large model is likely to succeed. If a set looks promising, DiSP stops searching and sends it to the large model; if none look reliable within the budget, it falls back and marks the case as risky. This makes the cost of using examples more predictable and avoids wasting computation on questions where extra examples are unlikely to help. On five text classification tasks, DiSP improved average accuracy over strong example-selection methods while cutting total running time by a factor of up to 23. The work suggests that practical language-model systems can benefit more from quickly judging candidate examples than from exhaustively searching for the best ones.
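The stop-when-promising loop described above might look roughly like the following sketch. The names `select_context` and `Selection`, the acceptance threshold, the budget, and the risk-tag string are all hypothetical. `judge` stands in for the helper model already chosen for the question's difficulty level. The summary does not say what the fallback is, so returning the highest-scoring rejected set is purely an assumption here.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Selection:
    demos: list              # example set to send to the large model
    accepted: bool           # True if a helper model accepted this set
    judged: int              # number of example sets scored
    risk_tag: Optional[str]  # set only when nothing was accepted

def select_context(query, pool, judge, k=4, budget=5, threshold=0.5, rng=None):
    """Sample example sets and stop at the first one the judge accepts."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    rng = rng or random.Random(0)
    best_demos, best_score = None, float("-inf")
    for n in range(1, budget + 1):
        demos = rng.sample(list(pool), k)
        score = judge(query, demos)  # predicted chance that the LLM succeeds
        if score >= threshold:
            return Selection(demos, True, n, None)  # stop on acceptance
        if score > best_score:
            best_demos, best_score = demos, score
    # Budget exhausted: assumed fallback is the best rejected set, tagged risky.
    return Selection(best_demos, False, budget, "no_context_accepted")
```

In this sketch the large model is called once, on whichever set is returned, so the per-question cost is bounded by `budget` cheap judge calls plus a single LLM call.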
Primary Area: Deep Learning->Large Language Models
Keywords: large language model, in-context learning, demonstration selection
Submission Number: 5117