Knowing How to Edit: Reliable Evaluation Signals for Diagnosing and Optimizing Prompts at Query Level
Abstract: Prompt optimization has become a central mechanism for eliciting strong performance from LLMs, and recent work has made substantial progress by proposing diverse prompt evaluation metrics and optimization strategies. Despite these advances, prompt evaluation and prompt optimization are often developed in isolation, limiting the extent to which evaluation can effectively inform prompt refinement. In this work, we study prompt optimization as a process guided by performance-relevant evaluation signals. To address the disconnect between evaluation and optimization, we propose an evaluation-instructed prompt optimization approach that explicitly connects prompt evaluation with query-dependent optimization. Our method integrates multiple complementary prompt quality metrics into a performance-reflective evaluation framework and trains an execution-free evaluator that predicts prompt quality directly from text, avoiding repeated model executions. These evaluation signals then guide prompt refinement in a targeted and interpretable manner. Empirically, the proposed evaluator achieves 83.7% accuracy in predicting prompt performance. When incorporated into the optimization process, our approach consistently outperforms existing optimization baselines across eight benchmark datasets and three backbone LLMs. Overall, our results demonstrate that reliable and efficient evaluation signals can serve as an effective foundation for robust and interpretable prompt optimization.
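To make the described pipeline concrete, the following is a minimal sketch of an evaluation-instructed, query-level optimization loop: an execution-free evaluator scores a prompt per query from text alone, and its diagnostics drive targeted refinement. All names (`EvalSignal`, `optimize_prompt`, the `evaluate` and `refine` callables) are hypothetical illustrations, not the paper's actual implementation.

```python
# Illustrative sketch only; the paper's evaluator, metrics, and refinement
# operators may differ. Assumes callers supply the evaluator and refiner.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalSignal:
    """Per-query, execution-free quality estimate for a candidate prompt."""
    score: float            # predicted task performance in [0, 1]
    diagnostics: Dict[str, float]  # per-metric breakdown used to target the rewrite


def optimize_prompt(
    prompt: str,
    queries: List[str],
    evaluate: Callable[[str, str], EvalSignal],     # text-only evaluator, no LLM calls
    refine: Callable[[str, str, EvalSignal], str],  # rewrites the prompt using diagnostics
    rounds: int = 3,
    threshold: float = 0.8,
) -> str:
    """Refine `prompt` only for queries whose predicted quality is low."""
    for _ in range(rounds):
        scored = [(q, evaluate(prompt, q)) for q in queries]
        weak = [(q, s) for q, s in scored if s.score < threshold]
        if not weak:
            break  # every query is predicted to perform well enough
        # Target the lowest-scoring query; its diagnostics indicate what to edit.
        query, signal = min(weak, key=lambda qs: qs[1].score)
        prompt = refine(prompt, query, signal)
    return prompt
```

Because the evaluator reads only the prompt and query text, each optimization round costs a single forward pass of the scorer per query rather than repeated executions of the backbone LLM.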
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Amit_Sharma3
Submission Number: 6898