Abstract: As large language models (LLMs) are integrated into everyday applications, research into prompt engineering techniques (PET) to improve these models’ behavior has surged. However, clear methodological guidelines for evaluating these techniques are lacking. This raises concerns about the replicability and generalizability of the benefits attributed to prompt engineering techniques. We support our concerns with a series of replication experiments focused on zero-shot prompt engineering techniques purported to influence reasoning abilities in LLMs. We tested GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3, Vicuna, and BLOOM with the chain-of-thought, EmotionPrompting, Sandbagging, Re-Reading, Rephrase-and-Respond (RaR), and ExpertPrompting prompt engineering techniques. We applied them to manually double-checked subsets of reasoning benchmarks including CommonsenseQA, CRT, NumGLUE, ScienceQA, and StrategyQA. Our findings reveal a general lack of statistically significant differences across nearly all techniques tested, highlighting, among other issues, several methodological weaknesses in previous research. To counter these issues, we propose recommendations for establishing sound benchmarks and designing rigorous experimental frameworks to ensure accurate and reliable assessments of model outputs.
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=Ty09cYE7pE&noteId=Ty09cYE7pE
Changes Since Last Submission: I have fixed some formatting issues.
Assigned Action Editor: ~Sheng_Li3
Submission Number: 5653