From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Published: 29 Apr 2026, Last Modified: 29 Apr 2026 · Eval Eval @ ACL 2026 Poster · CC BY 4.0
Keywords: LLM, Graph, Evaluation, Test, Harness
TL;DR: We propose a graph-based evaluation harness that enables scalable, continuously refreshable benchmarking and reveals fine-grained capability gaps beyond static datasets.
Abstract: Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable, properties that static, manually curated datasets cannot guarantee. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation questions via graph traversal. This approach provides three key guarantees: (1) complete coverage, ensuring all guideline relationships are evaluated; (2) surface-form contamination resistance, achieved through combinatorial question generation with randomized attributes; and (3) graph-level validity, inherited from expert-authored knowledge structures. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps: models perform well on symptom recognition but show substantially lower accuracy on treatment protocols and clinical management decisions. Beyond static benchmarking, the proposed framework enables continuous regeneration of evaluation data as guidelines evolve, supporting robust and updatable assessment of domain-specific AI systems. The methodology generalizes to any domain with structured decision logic and provides a scalable foundation for evaluation infrastructure with formal guarantees.
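The core mechanism described in the abstract, instantiating multiple-choice questions by traversing edges of a guideline knowledge graph and randomizing distractors and option order, can be sketched as follows. This is an illustrative toy example, not the authors' implementation; the graph contents, node names, and function are invented for demonstration.

```python
import random

# Hypothetical miniature guideline graph: (source node, relation) -> target nodes.
# Real graphs would be extracted from expert-authored guidelines such as WHO IMCI.
GRAPH = {
    ("fast breathing", "indicates"): ["pneumonia"],
    ("chest indrawing", "indicates"): ["severe pneumonia"],
    ("pneumonia", "treated_with"): ["oral amoxicillin"],
    ("severe pneumonia", "treated_with"): ["urgent referral"],
}

def generate_mcq(source, relation, rng):
    """Instantiate one multiple-choice question from a single graph edge.

    Distractors are sampled from the targets of other edges and the option
    order is shuffled, so the surface form differs on each regeneration
    (the combinatorial variation behind contamination resistance).
    """
    correct = GRAPH[(source, relation)][0]
    distractors = [t for (s, r), targets in GRAPH.items()
                   for t in targets
                   if (s, r) != (source, relation) and t != correct]
    options = rng.sample(distractors, k=min(3, len(distractors))) + [correct]
    rng.shuffle(options)
    question = (f"According to the guideline, '{source}' "
                f"{relation.replace('_', ' ')} which of the following?")
    return {"question": question, "options": options, "answer": correct}

rng = random.Random(0)
mcq = generate_mcq("pneumonia", "treated_with", rng)
```

Iterating `generate_mcq` over every `(source, relation)` key in the graph is what yields the complete-coverage guarantee: each guideline relationship produces at least one question.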
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Archival
Submission Number: 78