Owl: Separating Generation from Evaluation to Detect Plausible Failures in Lifecycle Inventory Mapping
Keywords: Multi-agent systems, Entity matching, Reliable AI, Error detection, LLM-as-judge, Generation-evaluation separation, Lifecycle inventory mapping, Carbon accounting, Lifecycle assessment, Emissions factor recommendation
TL;DR: A two-agent architecture that separates generation from quality assessment to enable scalable corporate carbon accounting with human review concentrated where it matters.
Abstract: Deploying AI at scale requires detecting plausible-but-incorrect outputs before they compound into aggregate harm, especially in domains where wrong answers look defensible. We present a multi-agent architecture that separates generation from quality assessment, applied to the domain of corporate emissions accounting for climate change mitigation. Our system, Owl, consists of a domain-specialized mapper that generates candidates using tool-augmented retrieval, and an independent judge that evaluates quality through asymmetric information access (no visibility into the mapper's reasoning or search process). This separation yields both higher accuracy and explicit signals for failure detection. We evaluate Owl on lifecycle inventory mapping, an essential task for corporate climate action. Effective climate action requires companies to accurately measure supply chain emissions, which often constitute over 40% of corporate carbon footprints. Yet this demands mapping tens to hundreds of thousands of purchased items to emission factor databases, which is manually intractable. The problem is entity matching under uncertainty: inputs are variably noisy, database coverage is incomplete, and incorrect mappings produce realistic emissions values without obvious error signals. Across 1,039 items, the domain-specialized mapper achieves 90.7% accuracy, versus 68.3% for non-agentic baselines and 81.7% without domain-specific prompting; on a well-specified public benchmark with less noisy inputs, it reaches 98.9%. The judge enables efficient human oversight: at a 20% review budget, it captures 67% of errors, nearly double heuristic baselines. Our results demonstrate that multi-agent architectures with information barriers between generation and evaluation can improve reliability in domains where errors are plausible but consequential.
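The generation-evaluation separation described above can be sketched as follows. This is a minimal illustration, not Owl's actual implementation: the `judge` here is a hypothetical token-overlap scorer standing in for the LLM judge, and the `Mapping` fields and `flag_for_review` helper are assumptions made for the example. The key point is the information barrier: the judge scores only the (item, factor) pair and never sees the mapper's reasoning, and the lowest-confidence fraction of mappings is routed to human reviewers under a fixed budget.

```python
import re
from dataclasses import dataclass

@dataclass
class Mapping:
    item: str       # purchased-item description (mapper input)
    factor: str     # candidate emission-factor entry (mapper output)
    reasoning: str  # mapper's rationale -- deliberately hidden from the judge

def judge(item: str, factor: str) -> float:
    """Toy stand-in for the LLM judge (assumption, not the paper's model):
    scores the (item, factor) pair by token overlap, with no access to the
    mapper's reasoning or search trace -- the information barrier."""
    a = set(re.findall(r"\w+", item.lower()))
    b = set(re.findall(r"\w+", factor.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_for_review(mappings: list[Mapping], budget: float = 0.2) -> list[Mapping]:
    """Route the lowest-confidence fraction of mappings to human reviewers."""
    scored = sorted(mappings, key=lambda m: judge(m.item, m.factor))
    k = max(1, round(budget * len(mappings)))
    return scored[:k]

mappings = [
    Mapping("office paper, A4", "paper, woodfree, uncoated", "matched via search"),
    Mapping("laptop stand, aluminium", "aluminium, wrought alloy", "matched via search"),
    Mapping("cloud compute credits", "electricity, grid mix", "matched via search"),
]
print([m.item for m in flag_for_review(mappings, budget=0.34)])
# -> ['cloud compute credits']  (zero token overlap, so lowest judge score)
```

Concentrating human review on the judge's lowest-scoring mappings is what lets a fixed 20% budget capture a disproportionate share of errors, since plausible-but-wrong mappings tend to score lower than correct ones under an independent evaluator.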
Submission Number: 61