Second Thoughts: Revisable Trailing Context in LLM-Grounded Clinical Speech

Dave Makhervaks; Vinesh R Gudla; Kabir Mahal; Aditya Prabhakar; Josh Cowdy; Eric R. Hunter; Raizy Leizerowski; Ryan Schwers; Madeline Grade; Tejaswi Tenneti

Published: 10 Jun 2026, Last Modified: 16 Jun 2026. Venue: KDD 2026 Workshop SciSoc Agents & LLMs (Oral). License: CC BY 4.0

Keywords: LLM reasoning, clinical NLP, speech understanding, multi-source fusion, grounded generation, revisable context, LLM ensemble

TL;DR: Treat clinical speech as revisable evidence, not frozen text: fusing multiple ASR outputs over a rolling audio window with an LLM reconciler cuts the medication-name correction edit rate by 51% in deployed clinician use.

Abstract: Clinical dictation is the upstream signal for an expanding class of LLM-driven healthcare systems: note-generation pipelines, clinical-decision-support agents, and structured-data extractors. Errors in the transcript silently propagate through every downstream LLM call. We present a production pipeline whose main contribution is the combination of multi-source ASR ensembling with LLM reconciliation in a rolling-window architecture. A frontier LLM fuses parallel transcripts from heterogeneous fine-tuned ASR backends, grounded in structured patient context (medications, problems, clinician identity), through a sliding $K$-buffer audio re-pass. This gives the reconciler revisable trailing context: it can correct prior output as new audio arrives. Unlike other multi-ASR fusion work, our approach uses no word-level voting or confusion-network construction. A prompt-transfer ablation shows that the gain comes from the audio re-pass architecture rather than from how the reconciler is prompted. On a 226-encounter clinical dictation dataset, ensembling-plus-reconciliation is the dominant contribution: it reduces normalized keyword error by 40\% relative to a single fine-tuned backend with post-processing (reported with 95\% CIs). The rolling-window variant adds a further improvement across raw and normalized metrics ($16$--$26\%$ relative vs.\ a non-revisable mini-batch ensemble; all $p<10^{-3}$ under encounter-level bootstrap). In deployed clinician use, the medication-name correction edit rate dropped 51\% ($12.21\%\to6.04\%$) post-rollout. We additionally characterize what the LLM-in-the-loop reconciler is sensitive to (structured grounding, multi-source diversity) and what it is not (prompt engineering alone, single-backend configurations). Finally, we discuss implications for the broader design of LLM systems that operate on noisy upstream signals.
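The revisable trailing context described in the abstract can be illustrated with a minimal toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: audio chunks are stood in for by plain strings, `backend_a` and `backend_b` are hypothetical ASR stand-ins with scripted mistakes, and `reconcile` is a simple rule that prefers candidates found in the patient context. The paper's reconciler is a frontier LLM and explicitly does not use word-level voting, so the position-wise rule here only marks where the LLM call would sit. What the sketch does show is the control flow: the last $K$ chunks are re-passed on every step, and text is frozen only when a chunk leaves the window.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

# Hypothetical backend type: maps the chunks in the window to one text per chunk.
Backend = Callable[[Sequence[str]], List[str]]


def backend_a(window: Sequence[str]) -> List[str]:
    """Toy ASR: mishears 'two' as 'to' until 'milligrams' follows it."""
    out = []
    for i, chunk in enumerate(window):
        unit_follows = i + 1 < len(window) and window[i + 1] == "milligrams"
        out.append("to" if chunk == "two" and not unit_follows else chunk)
    return out


def backend_b(window: Sequence[str]) -> List[str]:
    """Toy ASR: same as backend_a, but always garbles the medication name."""
    return ["metropolol" if t == "metoprolol" else t for t in backend_a(window)]


def reconcile(hyps: List[List[str]], context: Sequence[str]) -> List[str]:
    """Stand-in for the LLM reconciler (assumption): prefer a candidate that
    appears in the structured patient context, else keep the first backend's."""
    vocab = {term.lower() for term in context}
    fused = []
    for candidates in zip(*hyps):
        grounded = [c for c in candidates if c.lower() in vocab]
        fused.append(grounded[0] if grounded else candidates[0])
    return fused


class RollingTranscriber:
    """Keeps the last k chunks revisable; older text is committed (frozen)."""

    def __init__(self, backends: Sequence[Backend], context: Sequence[str], k: int):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.backends = list(backends)
        self.context = list(context)
        self.window: Deque[str] = deque(maxlen=k)
        self.committed: List[str] = []  # frozen text, never revised again
        self.trailing: List[str] = []   # one revisable text per chunk in window

    def push(self, chunk: str) -> str:
        if len(self.window) == self.window.maxlen:
            # The oldest chunk is about to leave the window: freeze its text.
            self.committed.append(self.trailing[0])
        self.window.append(chunk)
        # Re-pass the whole window through every backend, then fuse.
        hyps = [backend(list(self.window)) for backend in self.backends]
        self.trailing = reconcile(hyps, self.context)
        return " ".join(self.committed + self.trailing)


def run_demo() -> List[str]:
    transcriber = RollingTranscriber(
        backends=[backend_a, backend_b],
        context=["metoprolol"],  # hypothetical patient medication list
        k=2,
    )
    chunks = ["metoprolol", "two", "milligrams", "daily"]
    return [transcriber.push(chunk) for chunk in chunks]


history = run_demo()
for step in history:
    print(step)
```

In this toy run the second step emits "metoprolol to", and the third step revises it to "metoprolol two milligrams" once the following chunk arrives, while the medication name is resolved by grounding in the context list. A non-revisable mini-batch variant would have frozen "to" at the second step.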

Submission Number: 17
