From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding

ACL ARR 2026 January Submission72 Authors

21 Dec 2025 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | CC BY 4.0
Keywords: ICD Coding, LLM
Abstract: ICD coding is a critical yet challenging task in healthcare. Recent LLM-based methods have demonstrated stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets cover only a limited portion of the ICD code space, restricting a model's ability to generalize to unseen codes. Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes. Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive. To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans. The key idea is that span-level classification improves an LLM's ability to perform document-level ICD coding. The framework combines a mixed training strategy with code-centric data expansion, which together improve coding performance on out-of-domain ICD codes while preserving interpretability. Using only the open-source Llama-3.1-8B model, our method outperforms or matches strong discriminative baselines and GPT-4.1-based generative methods, demonstrating its effectiveness and potential for fully automated ICD coding. Code is available at https://anonymous.4open.science/r/CCL-ICD.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: clinical coding
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 72