Prompt Genotyping: Quantifying the Evaluation Gap Between Synthetic Benchmarks and Real LLM Performance
Keywords: prompt engineering, LLM evaluation, benchmark validation, predictability gap, failure prediction
TL;DR: Surface-level prompt features predict synthetic benchmark performance with high accuracy (R²=0.86) but fail entirely on real LLM outputs (R²=-0.13), revealing that benchmark optimization doesn't transfer to deployment.
Abstract: LLM evaluation relies heavily on synthetic benchmarks, but how well do these predict real-world performance? We introduce Prompt Genotyping, a framework treating prompts as measurable "genomes" of 14 linguistic features to predict LLM "phenotypes" (performance outcomes). Using 1,112 real prompt-response pairs from MT-Bench and HELM plus 1,388 synthetic controls, we reveal a dramatic predictability gap: surface features explain 86% of variance on algorithmic labels (R² = 0.86 ± 0.02) but perform worse than a mean-prediction baseline on authentic GPT-4o-mini outputs (R² = -0.134). This gap of nearly 1.0 in R² quantifies a fundamental challenge in LLM evaluation methodology: synthetic benchmark optimization may not generalize to deployment scenarios. We establish the first leakage-free baseline for prompt failure prediction (F1=0.56, AUC=0.65) and release comprehensive evaluation resources to advance systematic, data-driven prompt assessment.
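The sketch below illustrates the kind of pipeline the abstract describes: extracting surface-level prompt features and fitting a leakage-free failure-prediction baseline scored with F1 and AUC. The specific features, labels, and split strategy here are illustrative assumptions, not the paper's actual 14-feature genotype or evaluation protocol.

```python
# Minimal sketch (assumed, not the paper's exact pipeline): surface-level
# prompt "genotype" features + a leakage-free failure-prediction baseline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def surface_features(prompt: str) -> list[float]:
    """A few illustrative surface features; the paper uses 14 such features."""
    tokens = prompt.split()
    return [
        len(prompt),                                            # character length
        len(tokens),                                            # word count
        np.mean([len(t) for t in tokens]) if tokens else 0.0,   # mean word length
        prompt.count("?"),                                      # question marks
        float(sum(t.isupper() for t in tokens)),                # all-caps words
    ]

def failure_baseline(prompts: list[str], failed: list[int], seed: int = 0):
    """Fit a prompt-only classifier (no response-derived features, so no leakage)."""
    X = np.array([surface_features(p) for p in prompts])
    y = np.array(failed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    prob = clf.predict_proba(X_te)[:, 1]
    return f1_score(y_te, prob > 0.5), roc_auc_score(y_te, prob)
```

Because the features are computed from the prompt alone and the split is made before fitting, the baseline cannot leak information from model responses into the predictor, which is the property the reported F1/AUC numbers depend on.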
Submission Number: 215