Analysing Representations Through Layers: Token-Level Semantic Evolution in a Clinical Language Model

ICLR 2026 Conference Submission 21082 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: natural language processing, large language models, interpretability, transformers, transparency, healthcare
Abstract: Generative AI has significantly enhanced clinical decision-making and support for medical diagnosis. However, the black-box nature of Large Language Models (LLMs) and their lack of interpretability constrain their widespread use in clinical settings. This study develops and demonstrates a novel methodology that combines sparse autoencoders with token-level activation analysis to uncover and interpret layer-wise semantic evolution in a clinical language model, enabling interpretable decision support in cancer text classification. The approach provides a representation-level interpretability technique for understanding the underlying mechanisms of a domain-specific ClinicalBERT transformer, bridging the gap between the opaque nature of LLMs and human understanding. Sparse Autoencoders (SAEs) are employed to extract activation vectors and visualize the hidden embedding layers, offering deeper insight into how clinical concepts are encoded and transformed within the model. Experiments were conducted on publicly available cancer text data as a case study, focusing on the first four and last four layers of ClinicalBERT. We observe a steady progression in feature adaptation: the last layers contain task-specific embeddings, whereas the early layers capture more general features. Lower layers capture syntactic and lexical patterns, while upper layers encode high-level clinical semantics; the middle layers produce mixed, entangled representations that make them unsuitable for stable token-level analysis. We therefore conducted a classification task using representations from the first four and last four transformer layers to assess the interpretability of ClinicalBERT across its architecture. The model achieved 94\% classification accuracy with the deeper layers, indicating that they capture highly discriminative features crucial for decision-level tasks. In contrast, the early layers yielded only 24\% accuracy, indicating limited representational capacity for this clinical classification task. These layer-level insights also demonstrate strong token-level interpretability, reinforcing the empirical robustness of the approach in clinical applications.
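The abstract describes a pipeline of extracting token-level activations from selected ClinicalBERT layers and fitting a sparse autoencoder on them. The following is a minimal, illustrative sketch of that general recipe, not the authors' implementation: the Hugging Face checkpoint name (emilyalsentzer/Bio_ClinicalBERT), the overcomplete latent width, the L1 sparsity coefficient, and the toy training loop are all assumptions chosen for demonstration.

```python
# Minimal sketch (not the authors' code): extract per-layer token activations from a
# ClinicalBERT checkpoint and fit a small sparse autoencoder on one layer's activations.
# Assumptions: checkpoint name, latent width (4x), and L1 coefficient are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",
                                  output_hidden_states=True).eval()

texts = ["Patient presents with a suspicious pulmonary nodule."]  # placeholder clinical text
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    out = model(**batch)

# hidden_states is a tuple of (num_layers + 1) tensors, each (batch, seq_len, hidden_dim);
# index 1..4 would give the early layers, -4..-1 the last layers discussed in the abstract.
layer_idx = -1
acts = out.hidden_states[layer_idx]           # (batch, seq_len, hidden_dim)
tokens = acts.reshape(-1, acts.shape[-1])     # flatten to token-level activation vectors


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 penalty applied to the latent code."""

    def __init__(self, d_in: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_latent)
        self.decoder = nn.Linear(d_latent, d_in)

    def forward(self, x):
        z = torch.relu(self.encoder(x))        # sparse latent features
        return self.decoder(z), z


sae = SparseAutoencoder(d_in=tokens.shape[-1], d_latent=4 * tokens.shape[-1])
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                # sparsity strength; a hyperparameter, not from the paper

for step in range(200):                        # toy loop over one batch of token activations
    recon, z = sae(tokens)
    loss = nn.functional.mse_loss(recon, tokens) + l1_coeff * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Under this setup, the learned latent features can be inspected per token (e.g., which tokens most strongly activate a given SAE unit) to compare how early versus late layers encode clinical concepts; the layer-wise classification comparison reported in the abstract would use the same extracted representations as classifier inputs.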
Primary Area: interpretability and explainable AI
Submission Number: 21082