Keywords: Applications of interpretability, Probing, Understanding high-level properties of models
TL;DR: We analyze how continued domain pretraining affects large language models’ internal representations, showing that it reorganizes embedding geometry, increasing isotropy and destabilizing error-detection signals.
Abstract: Large language models (LLMs) often know the correct answer internally even when their expressed output is wrong, which raises questions about how this knowledge is represented and whether domain adaptation changes it. We study how continued pretraining on domain corpora affects what a model knows and how reliably it can use this knowledge, with a focus on biomedical data. Comparing a general-purpose LLM with a clinical LLM obtained through continued pretraining on clinical text, we find that both retain similar levels of probe-accessible factual knowledge, yet the stability of self-monitoring signals is substantially reduced after domain pretraining. For example, the variance of error-detection performance nearly doubles in the biomedical model. An analysis of embedding geometry suggests that this reduced stability is associated with representations becoming more isotropic, with anisotropy decreasing from about 0.47 to 0.37. These results indicate that continued domain pretraining tends to reorganize rather than expand what the model knows, and can unintentionally weaken the consistency of error-detection signals, with implications for building reliable domain-adapted LLMs.
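The abstract does not specify the exact anisotropy metric; a common proxy is the mean pairwise cosine similarity between sampled hidden-state vectors (higher means more anisotropic). The sketch below illustrates that proxy under this assumption; the tensor names and sampling scheme are hypothetical, not taken from the paper.

```python
# Minimal sketch (assumed metric): anisotropy as the average pairwise cosine
# similarity of sampled hidden states; lower values indicate more isotropic
# (more uniformly spread) representations.
import torch

def anisotropy(embeddings: torch.Tensor, n_pairs: int = 10_000) -> float:
    """embeddings: (N, d) tensor of hidden states sampled from a corpus."""
    n = embeddings.size(0)
    idx_a = torch.randint(0, n, (n_pairs,))
    idx_b = torch.randint(0, n, (n_pairs,))
    mask = idx_a != idx_b                                 # drop self-pairs
    a = torch.nn.functional.normalize(embeddings[idx_a[mask]], dim=-1)
    b = torch.nn.functional.normalize(embeddings[idx_b[mask]], dim=-1)
    return (a * b).sum(dim=-1).mean().item()              # mean cosine similarity

# Usage (hypothetical variable names): compare checkpoints on the same inputs.
# anisotropy(hidden_states_general)   # e.g. ~0.47 reported for the general model
# anisotropy(hidden_states_clinical)  # e.g. ~0.37 reported after domain pretraining
```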
Submission Number: 306