Tracing the Latent Threads: A Mechanistic Study of How LLMs Encode and Operationalize Race and Ethnicity
Keywords: mechanistic interpretability, racial bias, large language models, probing, neuron analysis, clinical NLP, bias mitigation
Abstract: Large language models (LLMs) increasingly operate in high-stakes settings where demographic attributes such as race and ethnicity may be explicitly stated or implicitly inferred from text. However, existing studies primarily document outcome-level disparities, offering limited insight into the internal mechanisms that produce these effects. We present a mechanistic study of how race and ethnicity are represented and operationalized within LLMs. Using two publicly available datasets spanning toxicity-related generation and clinical narrative understanding, we analyze three open-source models with a reproducible interpretability pipeline that combines probing, neuron-level attribution, and targeted intervention. We find that demographic information is distributed across internal units, with substantial cross-model variation. Although some units encode sensitive or stereotype-related associations acquired during pretraining, identical demographic cues can induce qualitatively different behaviors. Interventions that steer these neurons reduce bias but leave substantial residual effects, suggesting that steering changes behavior rather than the underlying representations and motivating more systematic mitigation.
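For concreteness, the sketch below illustrates the kind of probing-plus-steering loop the abstract describes: a linear probe is fit on hidden states to recover a demographic label, and the resulting probe direction is then ablated at inference time via a forward hook. This is a minimal sketch under stated assumptions, not the paper's actual pipeline: the model (gpt2, standing in for the unnamed open-source models), the layer index, the toy texts and labels, and the steering coefficient are all illustrative choices.

```python
# Minimal sketch of probing + targeted intervention on a causal LM.
# Model, layer, toy data, and steering coefficient are illustrative
# assumptions, not the paper's actual configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"  # stand-in for the (unspecified) open-source models studied
LAYER = 6       # hypothetical probe layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_state(text: str, layer: int = LAYER) -> torch.Tensor:
    """Hidden state of the final token at the given layer."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[layer][0, -1]

# Toy probing setup: texts paired with a demographic label (illustrative only).
texts = ["The patient, a Black man, reports chest pain.",
         "The patient, a white man, reports chest pain."]
labels = [1, 0]
X = torch.stack([last_token_state(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# Targeted intervention: subtract the probe direction from the block output
# with a forward hook (one common steering recipe; 4.0 is a guessed strength).
direction = torch.tensor(probe.coef_[0], dtype=torch.float32)
direction = direction / direction.norm()

def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - 4.0 * direction  # ablate along the probed direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer)
# ... run generation here with the steering hook active ...
handle.remove()
```

A usage note on the design: fitting the probe on last-token states and ablating its weight vector is one standard recipe for turning a probe into an intervention; the abstract's residual-bias finding corresponds to behavior that persists even after such a direction is removed.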
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Ethics, Bias, and Fairness, NLP Applications, Clinical and Biomedical Applications
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6770