Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: mechanistic interpretability, chain-of-thought prompting, sparse autoencoders, large language models
TL;DR: We describe a methodology to uncover structured representations of reasoning errors in CoT prompting using Sparse Autoencoders.
Abstract: Current large language models often suffer from subtle, hard-to-detect reasoning
errors in their intermediate chain-of-thought (CoT) steps. These errors include
logical inconsistencies, factual hallucinations, and arithmetic mistakes, which
compromise trust and reliability. While previous research on mechanistic
interpretability has focused largely on final model outputs, understanding and
categorizing internal reasoning errors remains challenging. The complexity and
non-linear nature of these CoT sequences call for methods that uncover the
structured patterns hidden within them. As an initial step, we analyze Sparse
Autoencoder (SAE) activations over a model's hidden states to investigate how
specific neurons contribute to different types of errors.
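The sketch below gives a rough, non-authoritative illustration of this kind of analysis: a toy SAE is trained on stand-in hidden states and its feature activations are contrasted between hypothetically labeled erroneous and correct CoT steps. The model, layer, dictionary size, training schedule, and error labels are all placeholder assumptions, not the submission's actual setup.

```python
# Minimal sketch (assumes PyTorch and synthetic data standing in for
# CoT-step hidden states collected from an LLM layer).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Overcomplete ReLU autoencoder with an L1 sparsity penalty on features."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))   # sparse feature activations
        h_hat = self.decoder(f)           # reconstruction of the hidden state
        return h_hat, f


def train_step(sae, h, optimizer, l1_coeff=1e-3):
    """One optimization step: reconstruction loss plus L1 sparsity penalty."""
    h_hat, f = sae(h)
    loss = ((h_hat - h) ** 2).mean() + l1_coeff * f.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, d_features = 64, 256         # hypothetical sizes
    sae = SparseAutoencoder(d_model, d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

    # Placeholder for hidden states extracted at CoT steps.
    hidden_states = torch.randn(1024, d_model)
    for _ in range(200):
        train_step(sae, hidden_states, opt)

    # Compare mean feature activation on (hypothetically labeled) erroneous
    # vs. correct CoT steps to surface features associated with errors.
    error_mask = torch.rand(1024) < 0.2   # placeholder error labels
    with torch.no_grad():
        _, feats = sae(hidden_states)
        diff = feats[error_mask].mean(0) - feats[~error_mask].mean(0)
        top = torch.topk(diff, k=5).indices
    print("Features most active on erroneous steps:", top.tolist())
```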
Submission Number: 152