NeurIPS 2025 Workshop MechInterp Submissions
ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
Chung-En Sun, Ge Yan, Akshay R. Kulkarni, Tsui-Wei Weng
Published: 30 Sept 2025, Last Modified: 15 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Narrow Finetuning Leaves Clearly Readable Traces in the Activation Differences
Julian Minder, Clément Dumas, Stewart Slocum, Neel Nanda
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers:
Everyone
Localizing Reasoning Training-Induced Changes in Large Language Models
Max Klabunde, Florian Lemmerich
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Rank-1 Reasoning: Minimal Parameter Diffs Encode Interpretable Reasoning Signals
Jake Ward, Paul M. Riechers, Adam Shai
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Unsupervised decoding of encoded reasoning using language model interpretability
Ching Fang, Samuel Marks
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers:
Everyone
Just-in-time and distributed task representations in language models
Yuxuan Li, Declan Iain Campbell, Stephanie C.Y. Chan, Andrew Kyle Lampinen
Published: 30 Sept 2025, Last Modified: 16 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers:
Everyone
Faithfulness through Causal Abstraction: Aligning explanations of how models reason
Mette Friis Andersen, Ana Lucic, Maria Heuss
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Token Entanglement in Subliminal Learning
Amir Zur, Zhuofan Ying, Alexander Russell Loftus, Kerem Şahin, Steven Yu, Lucia Quirke, Tamar Rott Shaham, Natalie Shapira, Hadas Orgad, David Bau
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Eliciting Secret Knowledge from Language Models
Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks
Published: 30 Sept 2025, Last Modified: 09 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition
Qinyuan Ye, Robin Jia, Xiang Ren
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Detecting and Characterizing Planning in Language Models
Jatin Nainani, Sankaran Vaidyanathan, Connor Watts, Andre N. Assis, Alice Rigg
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Dense SAE Latents Are Features, Not Bugs
Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Peng Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers:
Everyone
Reverse-Engineering Memory in DreamerV3: From Sparse Representations to Functional Circuits
Jan Sobotka, Auke Ijspeert, Guillaume Bellegarda
Published: 30 Sept 2025, Last Modified: 21 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers:
Everyone
From Black-box to Causal-box: Towards Building More Interpretable Models
Inwoo Hwang, Yushu Pan, Elias Bareinboim
Published: 30 Sept 2025, Last Modified: 24 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
From Local to Contextually-Enriched Local Representations: A Mechanism for Holistic Processing in DINOv2 ViTs
Fenil R. Doshi, Thomas Fel, Talia Konkle, George A. Alvarez
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS
Stefan F. Schouten, Peter Bloem
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
OpenMAIA: a Multimodal Automated Interpretability Agent based on open-source models
Josep Lopez Camuñas, Christy Li, Tamar Rott Shaham, Antonio Torralba, Agata Lapedriza
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Understanding sparse autoencoder scaling in the presence of feature manifolds
Eric J Michaud, Liv Gorton, Tom McGrath
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs
Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna, Sattvik Sahai, Prasoon Goyal, Kai-Wei Chang, Tao Zhang, Rahul Gupta
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Better Hessians Matter: Studying the Impact of Curvature Approximations in Influence Functions
Dat Minh Hong, Bruno Kacper Mlodozeniec, Runa Eschenhagen, Richard E. Turner
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers:
Everyone
Adaptive Task Vectors for Large Language Models
Joonseong Kang, Soojeong Lee, Subeen Park, Sumin Park, Taero Kim, Jihee Kim, Ryunyi LEE, Kyungwoo Song
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Looking into Black Box Code Language Models
Muhammad Umair Haider, Umar Farooq, A.B. Siddique, Mark Marron
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
Transformers Don’t Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability
Luca Baroni, Galvin Khara, Joachim Schaeffer, Marat Subkhankulov, Stefan Heimersheim
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone
The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
Jeremias Lino Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo
Published: 30 Sept 2025, Last Modified: 27 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers:
Everyone
Controlling Vision–Language–Action Policies through Sparse Latent Directions
Momin Ahmad Khan, Novak Boskov, Fatima M. Anwar, Manzoor A. Khan
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers:
Everyone