LOOK BEFORE YOU LEAP: THERMODYNAMIC ARBITRATION OF PARAMETRIC AND NON-PARAMETRIC KNOWLEDGE IN LLM AGENTS VIA SELF-REGULATING MEMORY ARCHITECTURES

Published: 01 Mar 2026, Last Modified: 24 Apr 2026 · ICLR 2026 AIWILD · CC BY 4.0
Keywords: Agentic Robustness, Hallucination Mitigation, Adversarial Context Defense, Safety-Aware Retrieval, Context Poisoning, Metacognitive Regulation
TL;DR: MARTA acts as a thermodynamic firewall for autonomous agents, using internal uncertainty signals to detect and reject adversarial "context poisoning" attacks with 87.6% efficacy, enabling safe deployment in open-world environments.
Abstract: As Large Language Model (LLM) agents are deployed in "wild" environments, they face the critical threat of Context Poisoning, in which irrelevant or adversarial retrieval results induce hallucinations and derail reasoning. Current "Open-Loop" agents, which retrieve indiscriminately (P(act) ≈ 1), lack an immune system to reject these toxic inputs. In this work, we introduce MARTA (Metacognitive Adaptive Retrieval and Thought Architecture), a safety-critical control layer that establishes a Thermodynamic Firewall between the agent and its memory. We model the decision to ingest external context not as a default behavior, but as a risk-aware arbitration based on the frozen backbone's "Epistemic Signature" u(x). We introduce the Discriminative Cliff metric, which quantifies an agent's ability to distinguish between high-similarity distractors and high-utility evidence. Our evaluation on the Adversarial Alignment Protocol demonstrates that MARTA achieves a discriminative cliff of +87.4, rejecting 87.6% of adversarial traps that successfully poisoned baseline agents. By forcing the model to "Look (at its own uncertainty) Before It Leaps (into external data)," MARTA provides the epistemic regulation necessary for robust, safe, and reliable autonomous operation.
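The abstract's core loop can be sketched in a few lines: gate retrieval on the model's own uncertainty rather than retrieving by default, and score robustness as the gap between how often high-utility evidence is accepted versus high-similarity distractors. This is an illustrative reconstruction only; the paper does not specify the functional form of u(x) or the exact definition of the Discriminative Cliff, so the entropy-based signature, the threshold, and the percentage-point gap below are all assumptions.

```python
import math

def epistemic_signature(token_probs):
    """Stand-in for the paper's Epistemic Signature u(x): Shannon entropy of
    the frozen backbone's next-token distribution (assumed form, not the
    paper's definition)."""
    return -sum(p * math.log(p) for p in token_probs if p > 0)

def should_retrieve(token_probs, threshold=1.0):
    """Closed-loop arbitration: ingest external context only when parametric
    uncertainty is high, instead of the open-loop default P(act) ≈ 1.
    The threshold value is a hypothetical free parameter."""
    return epistemic_signature(token_probs) > threshold

def discriminative_cliff(evidence_accept_rate, distractor_accept_rate):
    """Illustrative cliff metric: gap, in percentage points, between the
    acceptance rate for high-utility evidence and for high-similarity
    distractors. Larger is better; a poisoned open-loop agent that accepts
    everything scores 0."""
    return 100.0 * (evidence_accept_rate - distractor_accept_rate)

# A confident (peaked) distribution skips retrieval; an uncertain one leaps.
print(should_retrieve([0.97, 0.01, 0.01, 0.01]))  # low entropy -> False
print(should_retrieve([0.25, 0.25, 0.25, 0.25]))  # max entropy  -> True
```

Under these assumptions, an agent that accepts 95% of genuine evidence while accepting only 7.6% of distractors would score a cliff of roughly +87.4, matching the magnitude reported in the abstract, though the authors' actual metric may be computed differently.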
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 94