Abstract: As large language models (LLMs) are integrated into everyday applications, research into prompt engineering techniques (PET) to improve these models’ behavior has surged. However, clear methodological guidelines for evaluating these techniques are lacking. This raises concerns about the replicability and generalizability of the benefits attributed to prompt engineering techniques. We support our concerns with a series of replication experiments focused on zero-shot prompt engineering techniques purported to influence reasoning abilities in LLMs. We tested GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3, Vicuna, and BLOOM with the chain-of-thought, EmotionPrompting, Sandbagging, Re-Reading, Rephrase-and-Respond (RaR), and ExpertPrompting prompt engineering techniques. We applied them to manually double-checked subsets of reasoning benchmarks including CommonsenseQA, CRT, NumGLUE, ScienceQA, and StrategyQA. Our findings reveal a general lack of statistically significant differences across nearly all techniques tested, highlighting, among other issues, several methodological weaknesses in previous research. To counter these issues, we propose recommendations for establishing sound benchmarks and designing rigorous experimental frameworks to ensure accurate and reliable assessments of model outputs.
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=Ty09cYE7pE&noteId=Ty09cYE7pE
Changes Since Last Submission: I have fixed some formatting issues.
Assigned Action Editor: ~Sheng_Li3
Submission Number: 5653