Pitfalls in Evaluating Inference-time Methods for Improving LLM Reliability

Published: 13 Jun 2025, Last Modified: 13 Jun 2025, Accepted by TMLR, License: CC BY 4.0
Abstract: Though Large Language Models (LLMs) have demonstrated remarkable capabilities, they are still prone to outputting falsehoods in seemingly persuasive language. Many recent works attempt to address this problem by using LLMs in a framework where a single seed prompt results in a series of interactions involving augmented prompts with an otherwise unchanged LLM, and the results are aggregated with the goal of producing a more reliable output. We consider the replicability and generalizability of evaluations of such inference-time methods intended to improve the reliability of responses from a base LLM. We survey how these methods have been evaluated in the literature and find a great variety of benchmarks and models in use. Motivated by this, we conduct our own evaluation of the effectiveness of several methods across a range of benchmarks and models. Our evaluation reveals that while these techniques show promise in improving reliability, there is still significant variability in performance across domains and tasks, and methods that show substantial improvements on weaker base models often do not improve reliability for stronger base models.
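The framework described in the abstract (one seed prompt, several interactions with an otherwise unchanged LLM, and aggregation of the results) can be illustrated with a minimal sketch, using self-consistency-style majority voting as one concrete instance. The `call_llm` stub, the prompt augmentation, and the answer extraction below are hypothetical placeholders for illustration only, not the paper's or the released framework's implementation.

```python
import random
from collections import Counter


def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for a single call to an unchanged base LLM.

    Replace with a real model/API call; here it returns a canned answer
    so the sketch runs end to end.
    """
    return random.choice(["Answer: 42", "Answer: 42", "Answer: 41"])


def aggregate_inference(seed_prompt: str, n_samples: int = 5) -> str:
    """One seed prompt -> several augmented calls -> one aggregated answer."""
    # Augment the seed prompt (e.g. with a reasoning instruction);
    # the base LLM itself is not modified.
    augmented = (
        f"{seed_prompt}\n"
        "Think step by step, then give a final line of the form 'Answer: ...'."
    )
    answers = []
    for _ in range(n_samples):
        response = call_llm(augmented)
        # Crude final-answer extraction: take the last line of the response.
        answers.append(response.strip().splitlines()[-1])
    # Aggregate the sampled answers by majority vote (self-consistency style).
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    print(aggregate_inference("What is six times seven?"))
```

Methods in this family differ mainly in how the prompt is augmented and how the sampled results are aggregated; the base model is left untouched in all cases.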
Certifications: Reproducibility Certification, Survey Certification
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:

## 1. Strengthen Connection Between Literature Analysis and Reproduction Study

> _Improve organization building stronger connection between literature analysis and reproduction study. Explain how literature findings inform reproduction study (Reviewer bvGn)._

**Changes:**
- **Added bridging paragraph**: End of Section 3.3 (Models), connects literature findings to experimental design: "Our analysis highlights key gaps shaping our experimental design. Fragmentation of benchmark usage motivates evaluation across diverse tasks. Variability in model choice suggests reported gains may hinge on specific traits. Underuse of strong, modern models motivates our focus on contemporary state-of-the-art systems."

**Location**: End of Section 3.3, page 7

## 2. Paper Citation Analysis and Model Name Standardization

> _Provide paper citation analysis and rigorous model name standardization to enhance accuracy and consistency._

**Changes:**
- **Model standardization**: Consolidated model variants (base, chat, instruct) within each family and size.
- **Literature evaluation rewrite**: Fixed anachronisms and corrected publication dates.
- **Enhanced accuracy**: Improved consistency through rigorous data extraction methodology.

**Location**: Section 3, specifically 3.1 (Literature selection) and 3.3 (Models)

## 3. Distinguish from Other Existing Work

> _Provide clear discussion distinguishing from existing work: Welleck et al., 2024; Chu et al., 2024; Sprague et al., 2024._

**Changes:**
- **Added Chu et al. distinction**: Distinguished their taxonomy/categorization focus from our evaluation focus.
- **Enhanced Welleck et al. distinction**: "Authors include methods requiring additional training or external tools, whereas we focus on methods working with any LLM without modification."
- **Strengthened Sprague et al. distinction**: Emphasized our unique combination of literature analysis with empirical evaluation.

**Location**: Section 2 (Related Work), pages 2-3

## 4. Justify GPT-3.5 Choice

> _Explain GPT-3.5 choice to justify experimental design for reproducibility and cost analysis._

**Changes:**
- **Added explanation**: Majority of original models (text-ada-001, text-babbage-001, code-davinci-001, text-davinci-002, text-davinci-003) deprecated by OpenAI; others such as LaMDA and PaLM inaccessible.
- **Justified selection**: GPT-3.5-turbo emerged as the only model offering the necessary capabilities and long-term availability.
- **Cost analysis rationale**: Focus on quantifying computational requirements by API calls, with GPT-3.5-turbo as the baseline (see the sketch following these notes).
- **Validated assumptions**: Experiments confirm relative cost patterns between methods remain consistent despite absolute variations.

**Location**: Section 4.2 (Setup) and Appendix B

## 5. Add Actionable Recommendations

> _Add actionable recommendations addressing performance variation limitation._

**Changes:**
- **Added recommendations**:
  - **Evaluate across diverse models** - methods improving weaker models often underperform on stronger models.
  - **Select representative benchmark suite** - no single benchmark captures performance across domains.
  - **Prioritize benchmark diversity over quantity** - narrow evaluation risks overfitting.
  - **Adopt standardized protocols** - evaluation variation undermines comparability.
- **Connected to findings**: Tied recommendations to observed research fragility.

**Location**: Section 6 (Discussion), pages 15-16

## 6. Expand Evaluation to Other Domains

> _Add results expanding evaluation to other domains, such as medical and legal._

**Changes:**
- **Added medical domain**: Included MedQA dataset with multiple-choice questions from medical board exams across three languages.
- **Added legal domain**: Included LegalBench "privacy_policy_qa" subset focusing on privacy policy question answering from 162 legal reasoning tasks.
- **Broadened coverage**: Expanded beyond standard benchmarks to specialized professional domains.

**Location**: Section 4.2 (Setup), page 10; results in Section 5 and Appendix D
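The cost-analysis note in item 4 measures computational requirements by counting API calls per method rather than tokens or wall-clock time. A minimal sketch of such bookkeeping appears below; the `CallCounter` class and the method names are illustrative assumptions, not the API of the released LLM-Evaluation-Framework.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CallCounter:
    """Tracks the number of API calls per method as a simple cost proxy."""
    counts: Dict[str, int] = field(default_factory=dict)

    def record(self, method: str, n_calls: int = 1) -> None:
        """Add n_calls to the running total for the given method."""
        self.counts[method] = self.counts.get(method, 0) + n_calls


counter = CallCounter()
# A single-pass baseline issues one call per question, while a
# five-sample aggregation method issues five.
counter.record("baseline", 1)
counter.record("self_consistency_5", 5)
print(counter.counts)  # {'baseline': 1, 'self_consistency_5': 5}
```

Counting calls in this way keeps relative cost comparisons between methods stable even when per-call prices change across providers or model versions.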
Code: https://github.com/mmjerge/LLM-Evaluation-Framework
Assigned Action Editor: ~Rui_Zhang7
Submission Number: 4178