Unified Plan Verification with Static Rubrics and Dynamic Policies for Reliable LLM Planning

ICLR 2026 Conference Submission 20862 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Prompt Optimization, Large Language Models, Planning, Natural Language Processing
TL;DR: We propose a verify-then-control framework for tool-using LLM agents: task-specific rubrics statically screen and repair plans (SVR), and a learned rulebook (DVP) guides execution with symbolic actions (accept/next/alt/skip/backtrack) for auditable, reliable results.
Abstract: Large language model (LLM) agents can decompose tasks, call tools, and execute multi-step plans, yet they frequently fail for two reasons: (i) pre-execution plans look plausible but are incomplete, inconsistent, or ill-posed; and (ii) during execution, tool outputs reveal conflicts or policy violations that the agent neither detects nor repairs. Existing "LLM-as-judge" scoring is unstable and opaque, while reactive agents lack grounded, learnable control. We introduce a VERification-Aware planning infrastructure that inserts explicit checks both before and during execution. First, Static Verification via Rubrics (SVR) instantiates an instance-specific, binary checklist from a general taxonomy (completeness, correctness, executability), yielding auditable, stable decisions and actionable feedback for plan revision. Second, a Dynamic Verification Policy (DVP) enforces run-time control: a prompt-optimized rulebook (learned via MCTS-style discrete search, with no weight updates) consumes the step context and tool outputs to emit symbolic actions, e.g., browse more candidates, switch tool, skip, backtrack, or accept. The framework is representation-agnostic and applies to structured plans with schemas and tools, unstructured conversational plans, and natural-language plans without tools. Across these three regimes, it consistently improves task success and constraint satisfaction over strong prompting and agent baselines, reduces temporal, budget, and policy violations, and provides rubric-level diagnostics that localize errors. Ablations show that SVR (pre-execution screening) and DVP (execution-time control) are complementary, and that learned rulebooks outperform human-written heuristics with modest extra compute. We release prompts, rulebooks, and evaluation code to facilitate verification-aware agent research.
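To make the two components concrete, the sketch below illustrates one plausible reading of the abstract: SVR as a set of binary, instance-specific rubric checks that yield an auditable accept/revise decision with actionable feedback, and DVP as a rulebook that maps step context and tool outputs to one of the symbolic actions named above. All class names, rule conditions, and the toy plan are illustrative assumptions, not the paper's actual implementation, and the prompt-optimization search that learns the rulebook is not shown.

```python
# Hypothetical sketch of the two verification stages described in the abstract.
# Names, rules, and data here are illustrative assumptions, not the paper's code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# --- Static Verification via Rubrics (SVR): binary, instance-specific checks ---

@dataclass
class RubricItem:
    name: str                      # e.g. "covers_all_subgoals" (completeness)
    check: Callable[[dict], bool]  # binary check over the candidate plan
    feedback: str                  # actionable message used for plan revision


def svr_screen(plan: dict, rubric: List[RubricItem]) -> Dict[str, object]:
    """Run every rubric item; return an auditable decision plus repair feedback."""
    failures = [item.feedback for item in rubric if not item.check(plan)]
    return {"accept": not failures, "feedback": failures}


# --- Dynamic Verification Policy (DVP): rulebook mapping step context to actions ---

ACTIONS = ("accept", "next", "alt", "skip", "backtrack")


@dataclass
class Rule:
    condition: Callable[[dict], bool]  # predicate over step context + tool output
    action: str                        # one of ACTIONS


@dataclass
class Rulebook:
    rules: List[Rule] = field(default_factory=list)

    def decide(self, context: dict) -> str:
        """Return the first matching symbolic action; default to accepting the step."""
        for rule in self.rules:
            if rule.condition(context):
                return rule.action
        return "accept"


# --- Toy usage (illustrative only) ---
if __name__ == "__main__":
    rubric = [
        RubricItem("has_steps", lambda p: len(p.get("steps", [])) > 0,
                   "Plan has no steps; decompose the task first."),
        RubricItem("within_budget", lambda p: p.get("cost", 0) <= p.get("budget", 0),
                   "Plan exceeds the stated budget; remove or replace costly steps."),
    ]
    plan = {"steps": ["search", "book"], "cost": 80, "budget": 100}
    print(svr_screen(plan, rubric))  # {'accept': True, 'feedback': []}

    rulebook = Rulebook([
        Rule(lambda c: c.get("tool_error", False), "alt"),         # switch tool on error
        Rule(lambda c: c.get("candidates", 1) == 0, "backtrack"),  # no options left
        Rule(lambda c: c.get("violates_policy", False), "skip"),   # drop the step
    ])
    print(rulebook.decide({"tool_error": True}))  # 'alt'
```

Under this reading, SVR's binary checklist is what makes decisions auditable and stable (each failure names a concrete defect), while DVP's rulebook is the discrete, prompt-optimizable object that the MCTS-style search would tune; how that search is performed is left to the paper.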
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20862