Comparing Clinical and General LLMs on Knowledge Boundaries and Robustness

Published: 30 Sept 2025, Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Keywords: Applications of interpretability, Probing, Understanding high-level properties of models
TL;DR: We analyze how fine-tuning affects large language models’ internal representations, showing that domain-specific tuning reorganizes geometry, increasing isotropy and destabilizing error-detection signals.
Abstract: Recent studies demonstrate that large language models often encode correct answers internally even when their outputs are incorrect, and that lightweight probes can recover these latent signals. This work extends such analyses to compare general-purpose and biomedical domain–specialized models. Across circular, logistic, and MLP probes, both models exhibit low probe accuracy on internal and external knowledge, but strong error-detection performance in deeper layers. The key difference lies in stability: probe performance in the biomedical model is markedly more variable, with nearly double the standard deviation in error-detector F1 compared to the general model (e.g., 0.0742 vs. 0.0510 for circular probes). An isotropy analysis provides a complementary explanation. The general model displays higher anisotropy (baseline similarity = 0.4667), producing stable, linearly separable correctness signals, whereas the biomedical model exhibits greater isotropy (baseline similarity = 0.3737), coinciding with noisier probe behavior. These findings suggest that domain-specific fine-tuning neither destroys nor adds probe-accessible knowledge, but rather reorganizes representational geometry in ways that reduce the stability of error-detection signals. These results indicate that increased isotropy may undermine the robustness of self-monitoring.
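
As a minimal sketch of the two analyses the abstract describes, the snippet below fits a logistic error-detection probe on per-layer hidden states and computes an isotropy "baseline similarity" as the mean pairwise cosine similarity of those states. This is not the authors' code: the function names, layer choice, synthetic data, and train/test split are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): a linear error-detection probe
# over one layer's hidden states, plus a "baseline similarity" isotropy measure
# computed as the mean pairwise cosine similarity of those hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def probe_error_detection(hidden_states: np.ndarray, is_correct: np.ndarray) -> float:
    """Fit a logistic probe predicting answer correctness from hidden states
    at a single layer and return its F1 score on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, is_correct, test_size=0.3, random_state=0, stratify=is_correct
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return f1_score(y_test, probe.predict(X_test))


def baseline_similarity(hidden_states: np.ndarray) -> float:
    """Mean pairwise cosine similarity of hidden states: higher values indicate
    stronger anisotropy (a dominant shared direction), lower values more isotropy."""
    normed = hidden_states / np.linalg.norm(hidden_states, axis=1, keepdims=True)
    sims = normed @ normed.T
    off_diag = sims[~np.eye(len(sims), dtype=bool)]  # drop self-similarities
    return float(off_diag.mean())


if __name__ == "__main__":
    # Synthetic stand-in for one layer's hidden states (n_examples x hidden_dim)
    # and binary correctness labels; real usage would extract both from a model.
    rng = np.random.default_rng(0)
    H = rng.normal(size=(500, 64))
    y = (H[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
    print(f"error-detector F1:   {probe_error_detection(H, y):.3f}")
    print(f"baseline similarity: {baseline_similarity(H):.3f}")
```

In practice such a probe would be trained per layer, with the per-layer F1 variance across runs or checkpoints giving the stability comparison reported in the abstract.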
Submission Number: 306