RAFFLES: Reasoning-based Attribution of Faults for LLM Systems

Published: 06 Oct 2025, Last Modified: 04 Nov 2025. MTI-LLM @ NeurIPS 2025 Poster. License: CC BY-ND 4.0
Keywords: Automated fault detection, LLM agentic systems, multi-component evaluation, iterative reasoning, agentic evaluation, automated evaluation, LLM-as-a-judge, tool calling, long-horizon evaluation
TL;DR: RAFFLES is an evaluation framework that improves automated identification and diagnosis of failures in complex, multi-component LLM agent systems by using a specialized, iterative pipeline to reason about and pinpoint the exact point of breakdown.
Abstract: We have reached a critical roadblock in the development and enhancement of long-horizon, multi-component LLM agentic systems: it is remarkably difficult to identify where these systems break down and why. Existing evaluation capabilities (e.g., single-pass LLM-as-a-judge) are limited: they typically focus on individual metrics or capabilities or on end-to-end outcomes, and they are narrowly grounded in human preferences. We argue that to match these agentic capabilities, evaluation frameworks must also be able to reason, probe, iterate, and understand the complex logic passing through these systems over long horizons. In this paper, we present RAFFLES - an evaluation architecture that incorporates reasoning and iterative refinement. Specifically, RAFFLES operates as an iterative, multi-component pipeline, using a central Judge to systematically investigate faults and a set of specialized Evaluators to assess not only the system's components but also the quality of the Judge's own reasoning, thereby building a history of hypotheses. We tested RAFFLES against several baselines on the Who&When dataset, a benchmark designed to diagnose the "who" (agent) and "when" (step) of a system's failure. RAFFLES outperforms these baselines, achieving an agent-step fault pair accuracy of over 43% on the Algorithmically-Generated dataset (a substantial increase over the previously published best of 16.6%) and over 20% on the Hand-Crafted dataset (surpassing the previously published best of 8.8%). These results demonstrate a key step towards replacing labor-intensive manual human review with automated fault detection for autonomous systems.
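To make the described architecture concrete, below is a minimal sketch of how an iterative Judge-plus-Evaluators loop of this kind could be wired together. All names (Hypothesis, judge_propose, evaluate_component, evaluate_judge_reasoning, the acceptance threshold) are illustrative assumptions, not the authors' actual API; the LLM calls are replaced by random stubs so the loop runs end to end.

```python
# Hypothetical sketch of a RAFFLES-style fault-attribution loop: a central
# Judge proposes (agent, step) fault hypotheses, specialized Evaluators score
# both the accused component and the Judge's reasoning, and a history of
# hypotheses accumulates across iterations. LLM calls are stubbed with random
# scores purely for illustration.
import random
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    agent: str        # "who": the component suspected of the fault
    step: int         # "when": the step at which it allegedly failed
    rationale: str    # the Judge's reasoning for this attribution
    scores: dict = field(default_factory=dict)  # Evaluator feedback

def judge_propose(trace, history):
    # In RAFFLES this would be an LLM call reasoning over the full trace
    # and the history of prior hypotheses; here we pick a step at random.
    step = random.randrange(len(trace))
    agent, _output = trace[step]
    return Hypothesis(agent, step, f"Suspect {agent} at step {step}.")

def evaluate_component(trace, hyp):
    # Evaluator 1: does the accused component's output at the accused
    # step actually look faulty? (An LLM call in practice; stubbed here.)
    return random.random()

def evaluate_judge_reasoning(hyp):
    # Evaluator 2: is the Judge's own rationale sound? (Also stubbed.)
    return random.random()

def raffles(trace, max_iters=5, accept=0.8):
    history = []
    for _ in range(max_iters):
        hyp = judge_propose(trace, history)
        hyp.scores["component"] = evaluate_component(trace, hyp)
        hyp.scores["reasoning"] = evaluate_judge_reasoning(hyp)
        history.append(hyp)                      # growing hypothesis history
        if min(hyp.scores.values()) >= accept:   # both Evaluators agree
            return hyp                           # confident (agent, step) pair
    # No hypothesis was accepted: fall back to the best-supported one seen.
    return max(history, key=lambda h: min(h.scores.values()))

# Usage: a toy trace of (agent, output) pairs from a multi-agent run.
trace = [("planner", "..."), ("coder", "..."), ("verifier", "...")]
print(raffles(trace))
```

The design point this sketch tries to capture is that attribution is not a single judgment pass: each rejected hypothesis, together with its Evaluator scores, stays in the history the Judge can condition on next iteration.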
Submission Number: 75