TL;DR: We propose CROQ, which improves LLM accuracy by refining MCQ answer choices using conformal prediction. To enhance CROQ, we introduce CP-OPT, optimizing scores for smaller prediction sets. Experiments show CROQ is effective, especially with CP-OPT.
Abstract: Large language models (LLMs) increasingly support decision-making in applications including tool or API usage and answering multiple-choice questions (MCQs). However, incorrect outputs pose significant risks in high-stakes domains like healthcare and finance. To quantify LLM uncertainty and thereby mitigate these risks, recent works employ conformal prediction (CP), a model- and distribution-agnostic framework that uses LLM outputs to generate a \emph{prediction set} containing the true answer with high probability. Leveraging CP, we propose \emph{conformal revision of questions} (CROQ), which revises the question by narrowing the available choices to those in the prediction set and asking the LLM the revised question. We expect LLMs to be more accurate on revised questions with fewer choices. Furthermore, we expect CROQ to be most effective when the prediction sets from CP are small. Commonly used logit scores often lead to large sets, diminishing CROQ's effectiveness. To overcome this, we propose CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage. Our extensive experiments on MMLU, ToolAlpaca, and TruthfulQA datasets with multiple LLMs show that CROQ improves accuracy over standard inference, with more pronounced gains when paired with CP-OPT.
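To make the CROQ pipeline concrete, the following is a minimal sketch of split conformal prediction over per-option scores followed by question revision. It assumes per-option softmax-style scores are already available, and `ask_llm` is a hypothetical, user-supplied helper that re-queries the model with the pruned option list; this is an illustration of the idea, not the paper's exact implementation.

```python
import numpy as np

def conformal_quantile(cal_true_scores, alpha=0.1):
    """Split-conformal threshold from calibration data.

    cal_true_scores: score assigned to the true option for each calibration question.
    Nonconformity is taken as 1 - score; requires numpy >= 1.22 for method="higher".
    """
    n = len(cal_true_scores)
    nonconf = 1.0 - np.asarray(cal_true_scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(nonconf, q_level, method="higher")

def prediction_set(option_scores, qhat):
    """Keep every option whose nonconformity (1 - score) is within the threshold."""
    return [i for i, s in enumerate(option_scores) if 1.0 - s <= qhat]

def croq(question, options, option_scores, qhat, ask_llm):
    """CROQ-style revision: prune options to the conformal set, then re-ask the LLM."""
    kept = prediction_set(option_scores, qhat)
    if len(kept) <= 1:  # nothing left to revise; answer directly
        return options[kept[0]] if kept else None
    revised_options = [options[i] for i in kept]
    return ask_llm(question, revised_options)
```

In use, `conformal_quantile` is computed once on a held-out calibration split, and `croq` is then applied per test question; the coverage guarantee of CP ensures the true answer survives the pruning with probability at least 1 - alpha.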
Lay Summary: Large language models (LLMs) are increasingly used to make decisions in tasks like answering multiple-choice questions or selecting tools in software systems. But they may make mistakes, often with high confidence, which is risky in areas like healthcare or finance. Our paper introduces a method called CROQ (Conformal Revision of Questions) that helps LLMs make better decisions by narrowing down their options before they answer. Inspired by the human test-taking strategy of eliminating obviously wrong choices, CROQ uses a statistical technique called conformal prediction to remove unlikely answers, then re-asks the question with just the remaining options. With fewer choices, the model is more likely to choose correctly. However, how many options are pruned depends on the quality of the scoring used in conformal prediction, and standard scores often leave many options in the set. To fix this, we propose CP-OPT, an optimized way to score options that removes more irrelevant answers while still ensuring the correct one is likely to stay. Across a variety of benchmarks and models, our approach consistently improves accuracy, offering a practical way to make LLMs more trustworthy in high-stakes settings.
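CP-OPT is described here only at the level of its goal: learn scores that shrink prediction sets while keeping the target coverage. The sketch below shows one plausible differentiable surrogate for that goal, assuming a sigmoid-relaxed set membership, a learnable threshold `tau`, and a penalty when soft coverage falls below 1 - alpha; the actual CP-OPT objective and training procedure may differ.

```python
import torch

def cp_opt_style_loss(scores, true_idx, tau, temp=0.1, lam=5.0, alpha=0.1):
    """Illustrative (assumed) surrogate for the CP-OPT goal, not the paper's exact objective.

    scores   : (batch, num_options) learned scores for each answer option
    true_idx : (batch,) index of the correct option
    tau      : learnable threshold defining the soft set {option : score >= tau}
    """
    # Soft set size: each option contributes sigmoid((score - tau) / temp)
    soft_membership = torch.sigmoid((scores - tau) / temp)
    set_size = soft_membership.sum(dim=1).mean()

    # Soft coverage: relaxed membership of the true option in the set
    true_scores = scores.gather(1, true_idx.unsqueeze(1)).squeeze(1)
    coverage = torch.sigmoid((true_scores - tau) / temp).mean()

    # Penalize coverage dropping below the 1 - alpha target
    coverage_penalty = torch.relu((1 - alpha) - coverage)
    return set_size + lam * coverage_penalty
```

A score model trained with a loss of this form would then be plugged into the same split-conformal calibration step sketched above, so the finite-sample coverage guarantee is still obtained from held-out data rather than from the training objective.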
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Conformal Prediction, Uncertainty Quantification, Prompting, MCQ, Tool Learning, Agentic AI, Test-time Scaling
Submission Number: 14662