Encoding Without Influence: Dissociating Demographic Representation from Causal Effect in Large Language Models
Abstract: Large language models are increasingly deployed in settings that require normative judgment, yet the internal pathway by which demographic context shapes their outputs remains uncharacterized. We apply sparse autoencoder feature extraction and causal interventions (activation patching, feature steering, and targeted ablation) to Gemma 2 9B, Qwen 2.5 7B, and Llama 3.1 8B, tracing how demographic information is represented and used during survey responses across five policy domains. We find that demographic representations and demographic influence are localized in different parts of the network: early layers encode demographic identity but exert no measurable effect on outputs, while interventions on late-layer features recover 68.7–75.8% of behavioral effects across architectures. Variance-matched null baselines confirm that these effects are specific to demographic features rather than a generic consequence of perturbation. We further show that demographic influence is domain-modulated, with the ranking of influential demographics shifting across policy areas. The dissociation is demonstrated in two architectures (Gemma 2 9B, Qwen 2.5 7B), with partial replication in a third (Llama 3.1 8B); the three models differ in encoding profiles and alignment procedures. These results suggest that representational detection alone is insufficient for bias auditing, because the most detectable demographic encodings are not the ones driving outputs, and that fairness evaluation must be both causally validated and domain-specific.
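As an illustration of the variance-matched null baselines mentioned in the abstract, the following is a minimal sketch under stated assumptions, not the authors' procedure. It assumes that "variance-matched" can be read as "a random perturbation with the same L2 norm as the steering perturbation"; the function names `variance_matched_null` and `specificity`, and the toy linear readout used as `effect_fn`, are hypothetical.

```python
import numpy as np

def variance_matched_null(delta, rng):
    """Return a random perturbation with the same L2 norm as `delta`.

    Assumption: matching the norm of a zero-mean random vector is used
    here as a stand-in for the paper's variance matching.
    """
    noise = rng.standard_normal(delta.shape)
    return noise * (np.linalg.norm(delta) / np.linalg.norm(noise))

def specificity(effect_fn, hidden, delta, n_null=100, seed=0):
    """Compare the output shift caused by `delta` with the mean shift
    caused by norm-matched random perturbations.

    `effect_fn` is a hypothetical scalar readout of a hidden state,
    e.g. a logit difference between two survey answers.
    """
    rng = np.random.default_rng(seed)
    base = effect_fn(hidden)
    target = abs(effect_fn(hidden + delta) - base)
    null = np.mean([
        abs(effect_fn(hidden + variance_matched_null(delta, rng)) - base)
        for _ in range(n_null)
    ])
    return target, null

# Toy usage: a linear readout along the first coordinate.
w = np.zeros(64)
w[0] = 1.0
target_shift, null_shift = specificity(
    lambda h: float(h @ w), np.zeros(64), 2.0 * w
)
```

In this toy setting the steering perturbation is aligned with the readout, so its shift is larger than the average shift from random directions of equal norm. A feature-specific effect in a real model would show the same pattern, whereas a generic perturbation effect would yield comparable shifts for both.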
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: All changes in the revised manuscript are highlighted in blue for ease of review.
Revision in response to reviewer comments. Substantive changes:
Cross-layer transplantation analysis added (new Appendix E) addressing Reviewer BxgT's depth-proximity confound concern (C1). Reports raw and norm-matched specificity across six layer-pair combinations in Gemma 2 9B; the analogous test for Qwen was not run for this revision.
Bootstrapped orthogonalisation analysis added (new Appendix F) addressing Reviewer BxgT's polysemantic bundling concern (C2). On-target features recover ~96% (Gemma) and ~85% (Qwen) of the behavioural effect after off-target demographic directions are removed.
Cross-architecture framing softened throughout (Abstract, §2.3, §5.2, §5.7, §6.3, Conclusion) addressing Reviewer BxgT's C3. Llama is now described as a partial replication rather than as part of a three-way convergence on causal mediation.
Composite social axes analysis promoted from Appendix D.5 to a new main-paper subsection §5.6, with Figure 6 (Gemma encoding matrix) in the main text. Original Appendix D.5 retained with per-architecture detail.
Sequence-level scoring sensitivity analysis added (Appendix D.2) addressing Reviewer 3QWG's Point 3. Encoding–causal dissociation robust in aggregate (pooled shift −1.9pp); vote × values cell flagged as scoring-method-sensitive.
Encoding-presence / causal-mediation decomposition added (§1, §6.1) addressing Reviewer aEwm's discussion question, with concept-erasure framing introduced in §2.2 and §6.2.
Editorial fixes: PT–IT spacing in §4.1; Table 8 / §5.2 metric clarification moved adjacent to first mention; Bouchaud & Ramaciotti causal-layer clarification in §2.1; Cohen (1988) and Benjamini & Hochberg (1995) citations added; Figure 3A legend repositioned; compression pass across §1, §5, §6.
Detailed per-reviewer responses posted as replies to each review.
Assigned Action Editor: ~Shangtong_Zhang1
Submission Number: 8003