PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

ACL ARR 2026 January Submission 7402 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Agent Evaluation, Prompt Refinement, Query Refinement
Abstract: Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior work primarily focuses on automatically discovering agent failures induced by adversarial users, overlooking queries that reflect real user intents yet still trigger agent failures. We introduce PQR, a framework that not only surfaces agent failures with respect to specific objectives (e.g., helpfulness, hallucination) but also generates queries that resemble real users' intents. PQR operates through an iterative interaction between two complementary modules: the query refinement module performs diverse rewrites to explore nearby query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting an e-commerce QA agent's unhelpful responses. Our method uncovers 23%–78% more unhelpful responses than previous methods, and our generated queries are more diverse and realistic.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Evaluation, Agent Evaluation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7402
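To make the abstract's two-module loop concrete, here is a minimal Python sketch of how the query refinement and prompt refinement modules could interact. Everything below is an assumption drawn only from the abstract: the function names (pqr_loop, the injected rewrite, refine_prompt, generate, and judge callables) and the Verdict structure are hypothetical, not the authors' released code or interfaces.

```python
# Hypothetical sketch of the PQR loop as described in the abstract.
# All names and interfaces are assumptions, not the paper's actual code.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    violates_objective: bool  # did the response fail the objective (e.g., helpfulness)?
    is_realistic: bool        # does the query resemble a real user intent?
    notes: str                # feedback consumed by the prompt refinement module


def pqr_loop(
    agent: Callable[[str], str],                          # QA agent under test
    rewrite: Callable[[str], List[str]],                  # query refinement module
    refine_prompt: Callable[[str, List[Verdict]], str],   # prompt refinement module
    generate: Callable[[str], List[str]],                 # prompt -> new candidate queries
    judge: Callable[[str, str], Verdict],                 # (query, response) -> verdict
    seed_queries: List[str],
    init_prompt: str,
    n_iters: int = 5,
) -> List[Tuple[str, str]]:
    """Collect (query, response) pairs that are realistic yet violate the objective."""
    prompt, queries = init_prompt, list(seed_queries)
    failures: List[Tuple[str, str]] = []
    feedback: List[Verdict] = []
    for _ in range(n_iters):
        # Query refinement: explore diverse rewrites near each current query.
        candidates = [q2 for q in queries for q2 in rewrite(q)]
        for q in candidates:
            response = agent(q)
            verdict = judge(q, response)
            feedback.append(verdict)
            if verdict.violates_objective and verdict.is_realistic:
                failures.append((q, response))
        # Prompt refinement: fold accumulated feedback into new
        # objective-violating strategies and realism policies, then
        # generate the next batch of failure-triggering queries.
        prompt = refine_prompt(prompt, feedback)
        queries = generate(prompt)
    return failures
```

In this reading, the two modules alternate: rewrites supply local exploration around known queries, while the refined prompt supplies global exploration via newly derived strategies; the judge's feedback is the signal that couples them.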