Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

ACL ARR 2026 May Submission 15830 Authors

26 May 2026 (modified: 12 Jun 2026) · ACL ARR 2026 May Submission · CC BY 4.0

Keywords: ICD coding, medical coding, post-training, reinforcement learning

Abstract: Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported to be weak medical coders, but this finding comes mainly from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning (SFT), and reinforcement learning (RL) under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends Group Relative Policy Optimization (GRPO) to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding: SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains in macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at https://anonymous.4open.science/r/LLM4ICD.

Paper Type: Long

Research Area: Clinical and Biomedical Applications

Research Area Keywords: clinical coding, clinical and biomedical language models

Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: yes

Submission Number: 15830
