PrEAM: Prompt Optimization using Evaluations by Automated Micro-judges

ACL ARR 2025 July Submission103 Authors

22 Jul 2025 (modified: 29 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Prompt quality plays an important role in the performance of LLM-powered question-answering (QA) systems, yet maintaining high-quality prompts remains a labor-intensive and brittle task. We introduce PrEAM, the first continual prompt optimization framework for QA tasks that leverages automated LLM-as-judge feedback. PrEAM closes the loop between generation, evaluation, and prompt improvement by processing feedback from one or more specialized, LLM-based micro-judges that independently score every answer on each turn along task-critical axes such as faithfulness, relevance, completeness, and conciseness. Within each micro-judge, errors are investigated and classified by root cause, and the top errors across micro-judges are then aggregated into targeted edits of the system prompt. The process repeats until performance on the train and test sets diverges, producing a self-healing system prompt that adapts as the knowledge base, user mix, or model version evolves. Experiments with GPT-4o on a dataset of 400 multi-turn QA tasks and with GPT-4.1 on an Arena-Hard dataset show marked improvement within a handful of iterations, with each run requiring only a few minutes.
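The abstract describes a concrete loop: generate answers, score each one with per-axis micro-judges, cluster failures by root cause, rewrite the system prompt to fix the top errors, and stop once train and test scores diverge. The Python sketch below illustrates that loop under stated assumptions; every name in it (call_llm, AXES, judge, pass_rate, pream, gap_limit) is a hypothetical stand-in, not the authors' implementation.

```python
from collections import Counter

# Example micro-judge axes, taken from the abstract.
AXES = ["faithfulness", "relevance", "completeness", "conciseness"]

def call_llm(prompt: str, system: str = "") -> str:
    """Placeholder for any chat-completion API; wire up a real client here."""
    raise NotImplementedError("swap in an actual LLM call")

def judge(question: str, answer: str, axis: str) -> tuple[bool, str | None]:
    """One micro-judge: PASS/FAIL on a single axis, plus a root-cause label."""
    verdict = call_llm(
        f"Judge the answer for {axis}. Reply 'PASS' or 'FAIL: <root cause>'.\n"
        f"Q: {question}\nA: {answer}")
    if verdict.startswith("PASS"):
        return True, None
    return False, verdict.partition(":")[2].strip()

def pass_rate(system_prompt: str, questions: list[str]) -> float:
    """Fraction of (question, axis) checks that pass under a given prompt."""
    results = [judge(q, call_llm(q, system=system_prompt), axis)[0]
               for q in questions for axis in AXES]
    return sum(results) / len(results)

def pream(system_prompt: str, train: list[str], test: list[str],
          max_iters: int = 10, gap_limit: float = 0.1) -> str:
    for _ in range(max_iters):
        # 1. Generate each training answer once; score it with every micro-judge
        #    and tally failure root causes per axis.
        causes: Counter = Counter()
        for q in train:
            answer = call_llm(q, system=system_prompt)
            for axis in AXES:
                passed, cause = judge(q, answer, axis)
                if not passed:
                    causes[(axis, cause)] += 1
        # 2. Aggregate the top errors into targeted edits of the system prompt.
        top = causes.most_common(3)
        system_prompt = call_llm(
            f"Edit this system prompt to fix these recurring failures {top}. "
            f"Return only the revised prompt.\n---\n{system_prompt}")
        # 3. Stop once train and test performance diverge (overfitting signal).
        if pass_rate(system_prompt, train) - pass_rate(system_prompt, test) > gap_limit:
            break
    return system_prompt
```

Stopping on train/test divergence rather than a fixed iteration count mirrors the paper's stopping criterion and guards the edited prompt against overfitting to the training errors.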
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Prompt Optimization, LLM-as-a-Judge, Automated Evaluation, Question Answering, Self-healing Prompts
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section 9
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: "References"
B2 Discuss The License For Artifacts: No
B2 Elaboration: No. We did not see a need to discuss the licenses, as they are relatively standard and allow use of the artifacts in work like ours.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: No. We did not see a need to discuss this, as the licenses are relatively standard and allow use of the artifacts in work like ours.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We did not see a need to discuss this, as none of the data used in our work contains personally identifying information.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: We note the language and source of our data in Section 5.
B6 Statistics For Data: Yes
B6 Elaboration: Section 6
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: The LLMs used in the paper are all available to the public. No training was done.
C2 Experimental Setup And Hyperparameters: N/A
C2 Elaboration: The LLMs used in the paper are all available to the public. Since no training was done as part of this paper, there are no hyperparameters to tune.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 6
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E AI Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 103