The AI Barrister Flight Simulator: A Neuro-Symbolic Benchmark for Structured Legal Reasoning
Track: long paper (up to 10 pages)
Keywords: legal reasoning, neuro-symbolic benchmark, legal knowledge graph, KG-RAG, symbolic controller, jurisdiction constraints, temporal precedent, citation networks, doctrinal tests, multi-hop citation QA, multi-query consistency, hallucination measurement, constraint violation rate, path alignment, node coverage, structure-aware metrics, post-hoc consistency checking, retrieval orchestration, auditable LLM systems, legal NLP evaluation, knowledge graph QA, reasoning failure modes
TL;DR: Neuro-symbolic legal benchmark scoring LLMs on jurisdiction, temporal precedent, doctrine structure, and multi-query consistency using a Legal Knowledge Graph + controller; KG-RAG cuts hallucinations ~27× while boosting accuracy overall.
Abstract: Large Language Models (LLMs) deployed in legal settings produce fluent but structurally unreliable reasoning: they hallucinate authorities, violate jurisdictional boundaries, and ignore temporal precedent chains. We introduce the AI Barrister Flight Simulator, a neuro-symbolic benchmark that evaluates how an LLM reasons over legal structure rather than merely whether it reaches the correct answer. The benchmark couples a Legal Knowledge Graph (LKG) encoding statutes, case law, doctrinal tests, and citation networks with a symbolic controller that orchestrates retrieval, generation, and post-hoc consistency checking. Five task families (multi-hop citation, jurisdiction-constrained, temporal validity, doctrine-structure, and multi-query consistency) and four structure-aware metrics—Constraint Violation Rate (CVR), Hallucination Rate (HAR), Path Alignment (PA), and Node Coverage (NC)—expose failure modes invisible to accuracy alone. On a 50-scenario suite evaluated across three seeds, our KG-RAG pipeline achieves 98.0% accuracy with HAR = 0.005 and PA = 0.830, versus 77.3% accuracy and HAR = 0.138 for a baseline LLM. The full KG-RAG+Controller further reduces HAR to 0.003 and CVR to 0.289. Correlation analysis reveals that PA and NC are significant predictors of correctness (r=0.259 and r=0.302 respectively); a logistic model combining CVR, PA, and NC predicts answer correctness with 98.0% accuracy. Code, LKG, scenario library, and evaluation scripts will be released upon acceptance.
Presenter: ~David_Scott_Lewis1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other complementary reasons.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or they have sufficient alternate funding.
Submission Number: 141