Keywords: ICD Coding, LLM
Abstract: ICD coding is a critical yet challenging task in healthcare.
Recently, LLM-based methods have demonstrated stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets provide limited coverage of the ICD code space, restricting a model’s ability to generalize to unseen codes.
Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes.
Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive.
To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans.
The key idea of this framework is that span-level classification improves LLMs' ability to perform document-level ICD coding.
Our proposed framework consists of a mixed training strategy and code-centric data expansion, which together improve coding performance on out-of-domain ICD codes while preserving interpretability.
With only the open-source Llama-3.1-8B model, our method outperforms or matches strong discriminative baselines and GPT-4.1–based generative methods, demonstrating its effectiveness and potential for fully automated ICD coding.
Code is available at https://anonymous.4open.science/r/CCL-ICD.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: clinical coding
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 72