Structured Hallucination in Tool-Using Agents: Measuring and Mitigating LLM Synthesis Corruption in Production

Published: 23 May 2026, Last Modified: 23 May 2026, Venue: ICML 2026 AIWILD, License: CC BY 4.0
Keywords: agent safety, structured output, hallucination, tool use, Server-Sent Events, LLM agents, structured generation, production deployment, reliability, evaluation
TL;DR: LLM agents silently corrupt 11–13% of fields when synthesizing structured tool outputs into text; routing tool outputs through typed Server-Sent Events bypasses synthesis entirely, achieving 100% fidelity at 3.8× lower latency.
Abstract: When LLM agents synthesize structured tool outputs into Markdown or JSON for the user, they can silently corrupt field values (wrong times, abbreviated names, truncated rows) with no error signal. We call this **structured hallucination** and measure it across $200$ trials ($50$ queries $\times$ $4$ conditions) in a production travel-agent deployment against frozen real API ground truth (SerpAPI, Seats.aero). To isolate the synthesis step from constraint-following, the agent receives the tool-output flight list and a fixed presentation prompt rather than the natural-language query; we therefore frame the study as a production-derived synthesis benchmark, not a full end-to-end production-task evaluation. In our benchmark, the three LLM-mediated formats (Markdown, JSON-block, and native `response_schema`) are indistinguishable in aggregate ($0.87$–$0.89$ mean field fidelity; $11$–$13\%$ cell-level corruption under the field-fidelity metric) but diverge on the largest tail payloads into three distinct failure modes: silent truncation with completeness mismatch, deterministic streaming-budget disconnect, and (only for `response_schema`) survival at $1.5\times$ the latency. We characterize **SSE Metadata Injection**, an established but unquantified pattern that delivers tool outputs as typed Server-Sent Events alongside the LLM token stream, bypassing synthesis entirely. In this deployment it achieves **$100\%$ fidelity by construction** at **$3.8\times$ lower latency than Markdown**, with no training, model, or dependency changes. The strongest evidence for divergence comes from the tail-payload regime; aggregate metrics conceal it.
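To make the SSE Metadata Injection pattern concrete, the following is a minimal sketch rather than the deployment's actual implementation. It assumes hypothetical event names (`tool_result`, `token`, `done`), a hypothetical `flight_list` payload type, and made-up row fields; none of these come from the paper. The point it illustrates is that the tool output is serialized directly into a typed event and never passes through the LLM, so the client can recover it unchanged.

```python
import json
from typing import Any, Iterable, Iterator


def sse_event(event: str, data: Any) -> str:
    """Serialize one Server-Sent Event frame: an 'event:' line, a 'data:' line, a blank line."""
    # json.dumps escapes newlines inside strings, so the payload stays on one data line.
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def stream_response(tool_rows: list[dict], llm_tokens: Iterable[str]) -> Iterator[str]:
    """Interleave a typed tool-output event with the LLM token stream (hypothetical event names)."""
    # The tool output is emitted verbatim; the model never re-serializes these fields.
    yield sse_event("tool_result", {"type": "flight_list", "rows": tool_rows})
    # The model's prose is streamed as separate token events.
    for tok in llm_tokens:
        yield sse_event("token", {"text": tok})
    yield sse_event("done", {})


def parse_sse(stream: str) -> list[tuple[str, Any]]:
    """Minimal client-side parser for the frames produced above (not a full SSE parser)."""
    events = []
    for frame in stream.strip().split("\n\n"):
        fields = dict(line.split(": ", 1) for line in frame.split("\n"))
        events.append((fields["event"], json.loads(fields["data"])))
    return events


if __name__ == "__main__":
    # Illustrative row only; not data from the study.
    rows = [{"flight": "XX 123", "depart": "2026-06-01T10:45", "price_usd": 812.4}]
    wire = "".join(stream_response(rows, ["Here ", "are ", "your ", "flights."]))
    for name, data in parse_sse(wire):
        print(name, data)
```

Under these assumptions, the rows a client reads from the `tool_result` event are equal to the rows the tool returned, which is the sense in which fidelity holds by construction: the only transformation applied is a JSON round trip, not model generation.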
Track: Short Paper (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 39