From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits

ACL ARR 2025 July Submission1175 Authors

29 Jul 2025 (modified: 03 Sept 2025)ACL ARR 2025 July SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks like Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, such as "Statement A is true. Statement B matches statement A. Statement B is", which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2’s logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token that does not appear in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90\% of the original model’s performance. By relating our findings to IOI analysis, we provide new insights into the roles of certain attention heads and MLPs in LMs. We believe these insights contribute to a broader understanding of model reasoning and benefit future research in mechanistic interpretability.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Mechanistic Interpretability, Circuit Analysis, Explainable AI
Contribution Types: Model analysis & interpretability
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: 8
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Acknowledgements
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: Dataset is extremely simple logical premises and conclusions solely containing letters and truth values
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 3 and appendix A
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 3
C2 Experimental Setup And Hyperparameters: N/A
C3 Descriptive Statistics: Yes
C3 Elaboration: 3
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 1175
Loading