A $\texttt{Min-p}$ Blueprint for More Rigorous Science in Empirical Machine Learning Research

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: language models, sampling, samplers, min-p, large language models, evaluations, reproducibility, peer review, ML conferences
TL;DR: Min-p sampling does not improve quality, diversity, or the trade-off between the two.
Abstract: In light of a growing crisis of rigor in empirical machine learning research, this paper provides a blueprint for conducting more meticulous science. We present a detailed case study of "Turning Up the Heat: $\texttt{Min-P}$ Sampling for Creative and Coherent LLM Outputs" (Nguyen et al., 2024), a high-visibility ICLR 2025 Oral paper that introduced a new method for sampling from language models called $\texttt{min-p}$. The original work claimed that $\texttt{min-p}$ sampling achieves superior quality and diversity over established methods. However, our comprehensive re-examination of the original paper's four main lines of evidence demonstrates that its conclusions are invalidated by its own data. Our re-analysis reveals that: (1) The original human evaluations omitted one-third of the collected data, applied statistical tests incorrectly, and inaccurately described qualitative feedback; a correct analysis shows $\texttt{min-p}$ did not outperform baselines. (2) Extensive hyperparameter sweeps on NLP benchmarks show that $\texttt{min-p}$'s claimed superiority vanishes when controlling for the volume of hyperparameter tuning. (3) The LLM-as-a-Judge evaluations suffered from methodological ambiguity and appear to have reported results inconsistently, in ways that favored $\texttt{min-p}$. (4) Claims of widespread community adoption were found to be unsubstantiated and were retracted. From this case study, we derive a blueprint for more rigorous research. Key lessons include the critical need to compare methods fairly by controlling for the amount of hyperparameter tuning, to apply statistical tests transparently and correctly (e.g., correcting for multiple comparisons), to practice full data transparency, to scrutinize qualitative summaries, to report methodology unambiguously, and to guard against selective reporting. Adhering to these principles is essential for ensuring the validity of scientific claims and fostering genuine progress in machine learning research.
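For readers unfamiliar with the method under re-examination, the following is a minimal illustrative sketch of $\texttt{min-p}$ sampling as it is commonly described: tokens whose probability falls below a fraction (`p_base`) of the most likely token's probability are discarded, and the remainder are renormalized and sampled. The function name, the `p_base` default, and the exact ordering of temperature scaling and truncation here are assumptions for illustration only; consult the original paper for the precise formulation.

```python
import numpy as np

def min_p_sample(logits, p_base=0.1, temperature=1.0, rng=None):
    """Sample one token id using min-p truncation (illustrative sketch).

    Keeps only tokens whose probability is at least p_base * max probability,
    renormalizes, and samples from the surviving tokens.
    """
    rng = rng or np.random.default_rng()
    # Temperature-scaled softmax (numerically stabilized).
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # Dynamic truncation threshold scaled by the top token's probability.
    threshold = p_base * probs.max()
    probs = np.where(probs >= threshold, probs, 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```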
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22905