Owl: Separating Generation from Evaluation to Detect Plausible Failures in Lifecycle Inventory Mapping
Keywords: Multi-agent systems, Entity matching, Reliable AI, Error detection, LLM-as-judge, Generation-evaluation separation, Lifecycle inventory mapping, Carbon accounting, Lifecycle assessment, Emissions factor recommendation
TL;DR: A two-agent architecture that separates generation from quality assessment to enable scalable corporate carbon accounting with human review concentrated where it matters.
Abstract: Deploying AI at scale requires detecting plausible-but-incorrect outputs before they compound into aggregate harm, especially in domains where wrong answers look defensible. We present a multi-agent architecture that separates generation from quality assessment, applied to the domain of corporate emissions accounting for climate change mitigation. Our system, Owl, consists of a domain-specialized mapper that generates candidates using tool-augmented retrieval, and an independent judge that evaluates quality through asymmetric information access (no visibility into the mapper's reasoning or search process). This separation yields both higher accuracy and explicit signals for failure detection. We evaluate Owl on lifecycle inventory mapping, an essential task for corporate climate action. Effective climate action requires companies to accurately measure supply chain emissions, which often constitute over 40% of corporate carbon footprints. Yet this demands mapping tens to hundreds of thousands of purchased items to emission factor databases, which is manually intractable. The problem is entity matching under uncertainty: inputs are variably noisy, database coverage is incomplete, and incorrect mappings produce realistic emissions values without obvious error signals. Across 1,039 items, the domain-specialized mapper achieves 90.7% accuracy, versus 68.3% for non-agentic baselines and 81.7% without domain-specific prompting; on a well-specified public benchmark with less noisy inputs, it reaches 98.9%. The judge enables efficient human oversight: at a 20% review budget, it captures 67% of errors, nearly double heuristic baselines. Our results demonstrate that multi-agent architectures with information barriers between generation and evaluation can improve reliability in domains where errors are plausible but consequential.
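The generation-evaluation separation described above can be sketched as follows. This is a minimal illustration, not Owl's actual implementation: the `judge` here is a hypothetical token-overlap scorer standing in for the LLM judge, and the `Mapping` fields and `flag_for_review` helper are assumptions made for the example. The key point is the information barrier: the judge scores only the (item, factor) pair and never sees the mapper's reasoning, and the lowest-confidence fraction of mappings is routed to human reviewers under a fixed budget.

```python
import re
from dataclasses import dataclass

@dataclass
class Mapping:
    item: str       # purchased-item description (mapper input)
    factor: str     # candidate emission-factor entry (mapper output)
    reasoning: str  # mapper's rationale -- deliberately hidden from the judge

def judge(item: str, factor: str) -> float:
    """Toy stand-in for the LLM judge (assumption, not the paper's model):
    scores the (item, factor) pair by token overlap, with no access to the
    mapper's reasoning or search trace -- the information barrier."""
    a = set(re.findall(r"\w+", item.lower()))
    b = set(re.findall(r"\w+", factor.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_for_review(mappings: list[Mapping], budget: float = 0.2) -> list[Mapping]:
    """Route the lowest-confidence fraction of mappings to human reviewers."""
    scored = sorted(mappings, key=lambda m: judge(m.item, m.factor))
    k = max(1, round(budget * len(mappings)))
    return scored[:k]

mappings = [
    Mapping("office paper, A4", "paper, woodfree, uncoated", "matched via search"),
    Mapping("laptop stand, aluminium", "aluminium, wrought alloy", "matched via search"),
    Mapping("cloud compute credits", "electricity, grid mix", "matched via search"),
]
print([m.item for m in flag_for_review(mappings, budget=0.34)])
# -> ['cloud compute credits']  (zero token overlap, so lowest judge score)
```

Concentrating human review on the judge's lowest-scoring mappings is what lets a fixed 20% budget capture a disproportionate share of errors, since plausible-but-wrong mappings tend to score lower than correct ones under an independent evaluator.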
Submission Number: 61