Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: mechanistic interpretability, chain-of-thought prompting, sparse autoencoders, large language models
TL;DR: We describe a methodology to uncover structured representations of reasoning errors in CoT prompting using Sparse Autoencoders.
Abstract: Current large language models often suffer from subtle, hard-to-detect reasoning
errors in their intermediate chain-of-thought (CoT) steps. These errors include
logical inconsistencies, factual hallucinations, and arithmetic mistakes, which
compromise trust and reliability. While previous research on mechanistic
interpretability has focused largely on final model outputs, understanding and
categorizing internal reasoning errors remains challenging. The complexity and
non-linear nature of these CoT sequences call for methods that uncover the
structured patterns hidden within them. As an initial step, we analyze Sparse
Autoencoder (SAE) activations over a model's hidden states to investigate how
specific neurons contribute to different types of errors.
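The sketch below gives a rough, non-authoritative illustration of this kind of analysis: a toy SAE is trained on stand-in hidden states and its feature activations are contrasted between hypothetically labeled erroneous and correct CoT steps. The model, layer, dictionary size, training schedule, and error labels are all placeholder assumptions, not the submission's actual setup.

```python
# Minimal sketch (assumes PyTorch and synthetic data standing in for
# CoT-step hidden states collected from an LLM layer).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Overcomplete ReLU autoencoder with an L1 sparsity penalty on features."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))   # sparse feature activations
        h_hat = self.decoder(f)           # reconstruction of the hidden state
        return h_hat, f


def train_step(sae, h, optimizer, l1_coeff=1e-3):
    """One optimization step: reconstruction loss plus L1 sparsity penalty."""
    h_hat, f = sae(h)
    loss = ((h_hat - h) ** 2).mean() + l1_coeff * f.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, d_features = 64, 256         # hypothetical sizes
    sae = SparseAutoencoder(d_model, d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

    # Placeholder for hidden states extracted at CoT steps.
    hidden_states = torch.randn(1024, d_model)
    for _ in range(200):
        train_step(sae, hidden_states, opt)

    # Compare mean feature activation on (hypothetically labeled) erroneous
    # vs. correct CoT steps to surface features associated with errors.
    error_mask = torch.rand(1024) < 0.2   # placeholder error labels
    with torch.no_grad():
        _, feats = sae(hidden_states)
        diff = feats[error_mask].mean(0) - feats[~error_mask].mean(0)
        top = torch.topk(diff, k=5).indices
    print("Features most active on erroneous steps:", top.tolist())
```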
Submission Number: 152