Keywords: Diagnostic Failure Paradigm, Causal Identification, Plausibility Trap, Verified Agentic Pipeline (VAP), Climate Science, Closed-Loop Identification, System Identification, Frequency-Domain Analysis, Physics-Informed Machine Learning (PIML), Causal Benchmarking
TL;DR: AI systems in science often produce results that seem plausible but are physically and causally wrong. This paper introduces a method that analyzes how and why models fail in order to extract diagnostic information about a system's causal structure.
Abstract: Autonomous AI systems in the applied sciences frequently fall into the "plausibility trap," generating coherent outputs that violate fundamental physical laws or causal structures. This is particularly hazardous in complex, nonlinear domains such as climate science, where identifying causal relationships is critical for policy decisions. We introduce the Verified Agentic Pipeline (VAP), an architecture designed to enforce physical and causal constraints during autonomous discovery. Central to our approach is the "diagnostic failure paradigm": rather than simply discarding models that fail, we analyze how they fail in order to extract diagnostic information about the system's causal behavior. We demonstrate this using frequency-domain system identification on a closed-loop climate intervention simulation (NCAR GLENS). Our analysis reveals a striking paradox: strong frequency-domain coherence (phase-locking) coexists with catastrophic time-domain failure ($R^2 = -4.35 \times 10^4$). This specific failure mode—getting the timing right but the magnitude catastrophically wrong—diagnoses the system as exhibiting linear phase-locking masked by nonlinear amplitude modulation and controller interference. The result provides a rigorous, configuration-specific causal benchmark and motivates the use of advanced techniques (e.g., Koopman operators, Neural Operators) within a verified pipeline to handle such complex causal dynamics.
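To illustrate the paradox described above, the following minimal sketch uses synthetic data (not the GLENS simulation, whose details are not given here) to show how a prediction that is phase-locked to the truth but has a badly mis-modeled, time-varying amplitude can exhibit near-unity magnitude-squared coherence at the forcing frequency while scoring a hugely negative time-domain $R^2$. All signal parameters are illustrative assumptions.

```python
# Hedged sketch: phase-locked but amplitude-mismatched prediction.
# High spectral coherence can coexist with a strongly negative R^2.
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(0)
fs = 1.0                          # illustrative sampling rate (arbitrary units)
t = np.arange(2048) / fs

# "Truth": a sinusoidal response at a single forcing frequency plus noise.
truth = np.sin(2 * np.pi * 0.02 * t) + 0.1 * rng.standard_normal(t.size)

# "Prediction": same phase, but amplitude distorted by a slowly varying gain,
# standing in for nonlinear amplitude modulation / controller interference.
gain = 50.0 * (1.5 + np.sin(2 * np.pi * 0.001 * t))
pred = gain * np.sin(2 * np.pi * 0.02 * t)

# Frequency domain: magnitude-squared coherence is high near the forcing frequency,
# because the phase relationship is preserved.
f, Cxy = coherence(truth, pred, fs=fs, nperseg=512)
print("peak coherence:", Cxy.max())

# Time domain: coefficient of determination R^2 is strongly negative,
# because the predicted magnitudes are wildly wrong.
ss_res = np.sum((truth - pred) ** 2)
ss_tot = np.sum((truth - truth.mean()) ** 2)
print("R^2:", 1.0 - ss_res / ss_tot)
```

This is only a toy demonstration of the diagnostic logic, not the paper's pipeline: coherence isolates the timing (phase) relationship per frequency, while $R^2$ penalizes any amplitude error, so the two metrics can disagree dramatically for the failure mode the abstract describes.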
Submission Number: 54