Unified Plan Verification with Static Rubrics and Dynamic Policies for Reliable LLM Planning

ICLR 2026 Conference Submission 20862 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Prompt Optimization, Large Language Models, Planning, Natural Language Processing
TL;DR: We propose a verify-then-control framework for tool-using LLM agents: task-specific rubrics statically screen and repair plans (SVR), and a learned rulebook (DVP) guides execution with symbolic actions (accept/next/alt/skip/backtrack) for auditable, reliable results.
Abstract: Large language model (LLM) agents can decompose tasks, call tools, and execute multi-step plans, yet they frequently fail for two reasons: (i) pre-execution plans look plausible but are incomplete, inconsistent, or ill-posed; and (ii) during execution, tool outputs reveal conflicts or policy violations that the agent neither detects nor repairs. Existing "LLM-as-judge" scoring is unstable and opaque, while reactive agents lack grounded, learnable control. We introduce a VERification-Aware planning infrastructure that inserts explicit checks both before and during execution. First, Static Verification via Rubrics (SVR) instantiates an instance-specific, binary checklist from a general taxonomy (completeness, correctness, executability), yielding auditable, stable decisions and actionable feedback for plan revision. Second, a Dynamic Verification Policy (DVP) enforces run-time control: a prompt-optimized rulebook (learned via MCTS-style discrete search, with no weight updates) consumes the step context and tool outputs to emit symbolic actions, e.g., browse more candidates, switch tool, skip, backtrack, or accept. The framework is representation-agnostic and applies to structured plans with schemas and tools, unstructured conversational plans, and natural-language plans without tools. Across these three regimes, it consistently improves task success and constraint satisfaction over strong prompting and agent baselines, reduces temporal, budget, and policy violations, and provides rubric-level diagnostics that localize errors. Ablations show that SVR (pre-execution screening) and DVP (execution-time control) are complementary, and that learned rulebooks outperform human-written heuristics with modest extra compute. We release prompts, rulebooks, and evaluation code to facilitate verification-aware agent research.
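To make the two components concrete, the sketch below illustrates one plausible reading of the abstract: SVR as a set of binary, instance-specific rubric checks that yield an auditable accept/revise decision with actionable feedback, and DVP as a rulebook that maps step context and tool outputs to one of the symbolic actions named above. All class names, rule conditions, and the toy plan are illustrative assumptions, not the paper's actual implementation, and the prompt-optimization search that learns the rulebook is not shown.

```python
# Hypothetical sketch of the two verification stages described in the abstract.
# Names, rules, and data here are illustrative assumptions, not the paper's code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# --- Static Verification via Rubrics (SVR): binary, instance-specific checks ---

@dataclass
class RubricItem:
    name: str                      # e.g. "covers_all_subgoals" (completeness)
    check: Callable[[dict], bool]  # binary check over the candidate plan
    feedback: str                  # actionable message used for plan revision


def svr_screen(plan: dict, rubric: List[RubricItem]) -> Dict[str, object]:
    """Run every rubric item; return an auditable decision plus repair feedback."""
    failures = [item.feedback for item in rubric if not item.check(plan)]
    return {"accept": not failures, "feedback": failures}


# --- Dynamic Verification Policy (DVP): rulebook mapping step context to actions ---

ACTIONS = ("accept", "next", "alt", "skip", "backtrack")


@dataclass
class Rule:
    condition: Callable[[dict], bool]  # predicate over step context + tool output
    action: str                        # one of ACTIONS


@dataclass
class Rulebook:
    rules: List[Rule] = field(default_factory=list)

    def decide(self, context: dict) -> str:
        """Return the first matching symbolic action; default to accepting the step."""
        for rule in self.rules:
            if rule.condition(context):
                return rule.action
        return "accept"


# --- Toy usage (illustrative only) ---
if __name__ == "__main__":
    rubric = [
        RubricItem("has_steps", lambda p: len(p.get("steps", [])) > 0,
                   "Plan has no steps; decompose the task first."),
        RubricItem("within_budget", lambda p: p.get("cost", 0) <= p.get("budget", 0),
                   "Plan exceeds the stated budget; remove or replace costly steps."),
    ]
    plan = {"steps": ["search", "book"], "cost": 80, "budget": 100}
    print(svr_screen(plan, rubric))  # {'accept': True, 'feedback': []}

    rulebook = Rulebook([
        Rule(lambda c: c.get("tool_error", False), "alt"),         # switch tool on error
        Rule(lambda c: c.get("candidates", 1) == 0, "backtrack"),  # no options left
        Rule(lambda c: c.get("violates_policy", False), "skip"),   # drop the step
    ])
    print(rulebook.decide({"tool_error": True}))  # 'alt'
```

Under this reading, SVR's binary checklist is what makes decisions auditable and stable (each failure names a concrete defect), while DVP's rulebook is the discrete, prompt-optimizable object that the MCTS-style search would tune; how that search is performed is left to the paper.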
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20862