The AI Barrister Flight Simulator: A Neuro-Symbolic Benchmark for Structured Legal Reasoning
Track: long paper (up to 10 pages)
Keywords: legal reasoning, neuro-symbolic benchmark, legal knowledge graph, KG-RAG, symbolic controller, jurisdiction constraints, temporal precedent, citation networks, doctrinal tests, multi-hop citation QA, multi-query consistency, hallucination measurement, constraint violation rate, path alignment, node coverage, structure-aware metrics, post-hoc consistency checking, retrieval orchestration, auditable LLM systems, legal NLP evaluation, knowledge graph QA, reasoning failure modes
TL;DR: Neuro-symbolic legal benchmark scoring LLMs on jurisdiction, temporal precedent, doctrine structure, and multi-query consistency using a Legal Knowledge Graph + controller; KG-RAG cuts hallucinations ~27× while boosting accuracy overall.
Abstract: Large Language Models (LLMs) deployed in legal settings produce fluent but structurally unreliable reasoning: they hallucinate authorities, violate jurisdictional boundaries, and ignore temporal precedent chains. We introduce the AI Barrister Flight Simulator, a neuro-symbolic benchmark that evaluates how an LLM reasons over legal structure rather than merely whether it reaches the correct answer. The benchmark couples a Legal Knowledge Graph (LKG) encoding statutes, case law, doctrinal tests, and citation networks with a symbolic controller that orchestrates retrieval, generation, and post-hoc consistency checking. Five task families (multi-hop citation, jurisdiction-constrained, temporal validity, doctrine-structure, and multi-query consistency) and four structure-aware metrics—Constraint Violation Rate (CVR), Hallucination Rate (HAR), Path Alignment (PA), and Node Coverage (NC)—expose failure modes invisible to accuracy alone. On a 50-scenario suite evaluated across three seeds, our KG-RAG pipeline achieves 98.0% accuracy with HAR = 0.005 and PA = 0.830, versus 77.3% accuracy and HAR = 0.138 for a baseline LLM. The full KG-RAG+Controller further reduces HAR to 0.003 and CVR to 0.289. Correlation analysis reveals that PA and NC are significant predictors of correctness (r=0.259 and r=0.302 respectively); a logistic model combining CVR, PA, and NC predicts answer correctness with 98.0% accuracy. Code, LKG, scenario library, and evaluation scripts will be released upon acceptance.
Presenter: ~David_Scott_Lewis1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other complementary reasons.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or they have sufficient alternate funding.
Submission Number: 141