PrEAM: Prompt Optimization using Evaluations by Automated Micro-judges

ACL ARR 2025 July Submission103 Authors

22 Jul 2025 (modified: 29 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Prompt quality plays an important role in the performance of LLM-powered question-answering (QA) systems, yet maintaining high-quality prompts remains a labor-intensive and brittle task. We introduce PrEAM, the first continual prompt optimization framework for QA tasks that leverages automated LLM-as-judge feedback. PrEAM closes the loop between generation, evaluation, and prompt improvement by processing feedback from one or more specialized, LLM-based micro-judges that independently score every answer on each turn along task-critical axes such as faithfulness, relevance, completeness, and conciseness. Within each micro-judge, errors are investigated and classified by root cause, and the top errors across micro-judges are then aggregated into targeted edits of the system prompt. The process repeats until performance on the train and test sets diverges, producing a self-healing system prompt that adapts as the knowledge base, user mix, or model version evolves. Experiments with GPT-4o on a dataset of 400 multi-turn QA tasks and with GPT-4.1 on an Arena-Hard dataset show marked improvement within a handful of iterations, with each run requiring only a few minutes.
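The abstract describes a concrete loop: generate answers, score each one with per-axis micro-judges, cluster failures by root cause, rewrite the system prompt to fix the top errors, and stop once train and test scores diverge. The Python sketch below illustrates that loop under stated assumptions; every name in it (call_llm, AXES, judge, pass_rate, pream, gap_limit) is a hypothetical stand-in, not the authors' implementation.

```python
from collections import Counter

# Example micro-judge axes, taken from the abstract.
AXES = ["faithfulness", "relevance", "completeness", "conciseness"]

def call_llm(prompt: str, system: str = "") -> str:
    """Placeholder for any chat-completion API; wire up a real client here."""
    raise NotImplementedError("swap in an actual LLM call")

def judge(question: str, answer: str, axis: str) -> tuple[bool, str | None]:
    """One micro-judge: PASS/FAIL on a single axis, plus a root-cause label."""
    verdict = call_llm(
        f"Judge the answer for {axis}. Reply 'PASS' or 'FAIL: <root cause>'.\n"
        f"Q: {question}\nA: {answer}")
    if verdict.startswith("PASS"):
        return True, None
    return False, verdict.partition(":")[2].strip()

def pass_rate(system_prompt: str, questions: list[str]) -> float:
    """Fraction of (question, axis) checks that pass under a given prompt."""
    results = [judge(q, call_llm(q, system=system_prompt), axis)[0]
               for q in questions for axis in AXES]
    return sum(results) / len(results)

def pream(system_prompt: str, train: list[str], test: list[str],
          max_iters: int = 10, gap_limit: float = 0.1) -> str:
    for _ in range(max_iters):
        # 1. Generate each training answer once; score it with every micro-judge
        #    and tally failure root causes per axis.
        causes: Counter = Counter()
        for q in train:
            answer = call_llm(q, system=system_prompt)
            for axis in AXES:
                passed, cause = judge(q, answer, axis)
                if not passed:
                    causes[(axis, cause)] += 1
        # 2. Aggregate the top errors into targeted edits of the system prompt.
        top = causes.most_common(3)
        system_prompt = call_llm(
            f"Edit this system prompt to fix these recurring failures {top}. "
            f"Return only the revised prompt.\n---\n{system_prompt}")
        # 3. Stop once train and test performance diverge (overfitting signal).
        if pass_rate(system_prompt, train) - pass_rate(system_prompt, test) > gap_limit:
            break
    return system_prompt
```

Stopping on train/test divergence rather than a fixed iteration count mirrors the paper's stopping criterion and guards the edited prompt against overfitting to the training errors.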
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Prompt Optimization, LLM-as-a-Judge, Automated Evaluation, Question Answering, Self-healing Prompts
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section 9
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: "References"
B2 Discuss The License For Artifacts: No
B2 Elaboration: No. We did not see a need to discuss the licenses, as they are relatively standard and allow use of the artifacts in work like ours.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: No. We did not see a need to discuss this, as the licenses are relatively standard and allow use of the artifacts in work like ours.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We did not see a need to discuss this, as none of the data used in our work contains personally identifying information.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: We note the language and source of our data in Section 5.
B6 Statistics For Data: Yes
B6 Elaboration: Section 6
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: The LLMs used in the paper are all available to the public. No training was done.
C2 Experimental Setup And Hyperparameters: N/A
C2 Elaboration: The LLMs used in the paper are all available to the public. Since no training was done as part of this paper, there are no hyperparameters to tune.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 6
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E AI Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 103