Prompt Optimization Improves Robustness of Language Model Benchmarks for Medical Tasks
Keywords: Prompt Optimization, Language Models, Medical Benchmarks
TL;DR: Scalable and automated prompt optimization enables more robust, decision-useful benchmarking of language models for medical tasks.
Track: Findings
Abstract: While language model (LM) benchmarking frameworks such as MedHELM enable holistic evaluation across medical tasks, their leaderboards often rely on a fixed prompt per benchmark, without invoking step-by-step reasoning. Fixed prompts, however, are known not to generalize well across LMs, potentially yielding unrepresentative estimates: unless we estimate each LM's performance ceiling, we risk underestimating its performance. Prompting frameworks such as DSPy offer a scalable alternative to manual prompt engineering that can alleviate this challenge. We present a framework that integrates DSPy with MedHELM, introducing structured prompting that elicits reasoning. Using four prompting methods, we evaluate four LMs across four benchmarks against MedHELM's leaderboard. We find that structured prompting: (i) outperforms the MedHELM baseline (+3.2\% absolute across-benchmark average), (ii) reduces variance associated with prompt design (-2.9\% absolute across-benchmark standard deviation), (iii) alters performance gaps (flips LM rankings on one benchmark), and (iv) shows that LMs prompted with chain-of-thought reasoning are relatively insensitive to prompt design. We conclude that scalable and automated performance ceiling estimation enables more robust medical benchmarks.
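For intuition, the sketch below shows how a DSPy chain-of-thought module could wrap a multiple-choice medical QA item, the kind of structured, reasoning-eliciting prompting the abstract describes. The signature fields, model identifier, and example question are illustrative assumptions, not taken from the submission's code (see the Code URL for the actual implementation).

```python
import dspy

# Illustrative signature for a multiple-choice medical QA item
# (field names are assumptions, not from the paper's codebase).
class MedicalQA(dspy.Signature):
    """Answer a multiple-choice medical question."""
    question: str = dspy.InputField()
    options: str = dspy.InputField(desc="lettered answer choices")
    answer: str = dspy.OutputField(desc="letter of the best choice")

# ChainOfThought inserts a reasoning step before the final answer,
# i.e., structured prompting that elicits step-by-step reasoning.
predict = dspy.ChainOfThought(MedicalQA)

# Any DSPy-supported LM can be plugged in; the model name is a placeholder.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

pred = predict(
    question="Which electrolyte abnormality is most associated with torsades de pointes?",
    options="A) Hypernatremia  B) Hypomagnesemia  C) Hyperkalemia  D) Hypercalcemia",
)
print(pred.reasoning)  # intermediate reasoning (attribute name may vary by DSPy version)
print(pred.answer)     # final letter choice, scored against the benchmark reference
```

In practice, a DSPy optimizer (e.g., MIPROv2) would presumably be compiled against each benchmark's metric to search over instructions and few-shot demonstrations, approximating the per-LM performance ceiling the abstract refers to.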
General Area: Applications and Practice
Specific Subject Areas: Evaluation Methods & Validity
Data And Code Availability: No
Ethics Board Approval: No
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Code URL: https://github.com/StanfordMIMI/dspy-helm
Submission Number: 65