Debiasing Through Circuits: A Reproducibility Study in Mechanistic Interpretability

TMLR Paper4355 Authors

25 Feb 2025 (modified: 12 Apr 2025), Under review for TMLR, CC BY 4.0
Abstract: Large language models (LLMs) achieve remarkable performance yet remain vulnerable to adversarial attacks. Mechanistic interpretability offers a promising avenue for diagnosing these weaknesses by identifying the circuits that drive model behavior. We reproduce and critically assess the pipeline introduced by García-Carrasco et al. (2024), which uses activation patching, gradient-based adversarial attacks, and logit attribution to locate vulnerabilities in a synthetic acronym prediction task for GPT-2 small. While their approach provides an interesting toy example, we find incomplete circuit identification and limited adversarial effectiveness. To address these shortcomings, we apply edge attribution patching (EAP) for more faithful circuit discovery, generalize their adversarial approach to multi-token inputs, and scale the analysis to a larger model, Llama-3.2-1B-Instruct, on a more complex and socially relevant task: toxicity detection with a focus on name-related biases. We further introduce Differential Circuit Editing (DICE) to demonstrate how targeted interventions in the identified circuits can mitigate harmful behavior without compromising task accuracy, reducing bias by 12.6% while slightly improving accuracy, by 3.4%.
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=a1xgiRlpXr&noteId=a1xgiRlpXr
Changes Since Last Submission: Thanks to the reviewers’ feedback, we made the following changes:
* Section 1: Justified the importance of reproducing the original paper.
* Section 3.2: Added explanation for the derivation of the binary score.
* Section 3.3:
  * Clarified threshold selection.
  * Added details on the circuit discovery process.
  * Provided background used later for head score calculation (Section 5.2).
  * Added Appendix A: Mechanistic Interpretability for further explanation.
* Section 3.4: Highlighted changes from the original method.
* Section 3.6: Added explanation for z-normalization.
* Section 5.2:
  * Clarified the difference between activation patching and EAP.
  * Reported number of splits per experiment.
  * Expanded gradient flow discussion with numerical values.
* Section 5.3: Fixed phrasing ambiguity in “Finding Vulnerable Components.”
* Styling: Fixed minor formatting issues throughout.
Assigned Action Editor: ~Amit_Sharma3
Submission Number: 4355