NeurIPS 2025 Workshop MechInterp Submissions
Representation Similarity Reveals Implicit Layer Grouping in Neural Networks
Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Dennis Wei
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Detecting Motivated Reasoning in the Internal Representations of Language Models
Parsa Mirtaheri, Mikhail Belkin
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Shared Memorization Structures in Transformers Revealed by Loss Curvature
Jack Merullo, Srihita Vatsavaya, Owen Lewis
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Correlations in the Data Lead to Semantically Rich Feature Geometry Under Superposition
Lucas Prieto, Edward Stevinson, Melih Barsbey, Tolga Birdal, Pedro A. M. Mediano
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Enforcing Orderedness in SAEs to Improve Feature Consistency
Sophie L. Wang, Alex Quach, Nithin Parsan, John Jingxuan Yang
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Towards a Mechanistic Understanding of Robustness in Finetuned Reasoning Models
Aashiq Muhamed, Xuandong Zhao, Mona T. Diab, Virginia Smith, Dawn Song
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Attention Pattern Discovery at Scale
Jonathan Katzy, Razvan Mihai Popescu, Erik Mekkes, Arie van Deursen, Maliheh Izadi
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Composable Sparse Subnetworks via Maximum-Entropy Principle
Francesco Caso, Samuele Fonio, Nicola Saccomanno, Simone Monaco, Fabrizio Silvestri
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Spectral Dynamics in Neural Network Training: Mathematical Foundations for Understanding Representational Development
Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He, Junxuan Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junping Zhang, Xipeng Qiu
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Evaluating Explanatory Evaluations: An Explanatory Virtues Framework for Mechanistic Interpretability
Kola Ayonrinde, Louis Jaburi
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Model Diffing without Borders: Unlocking Cross-Architecture Model Diffing to Reveal Hidden Ideological Alignment in Llama and Qwen
Thomas Jiralerspong, Trenton Bricken
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Dual Mechanisms of Value Expression: Decomposing Intrinsic and Prompted Values in Language Models
Jongwook Han, Jongwon Lim, InJin Kong, Yohan Jo
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda
Published: 30 Sept 2025, Last Modified: 24 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
From Tokens to Semantics: The Emergence and Stabilization of Polysemanticity in Language Models
Jonas Rohweder, Aiden Zhou, Aniruddhan Ramesh, Sharvil Limaye, Akshay Bhaskar, Ashwinee Panda, Vasu Sharma
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts
Xinyuan Yan, Shusen Liu, Kowshik Thopalli, Bei Wang Phillips
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Emergent World Beliefs: Exploring Transformers in Stochastic Games
Adam Kamel, Tanish Rastogi, Michael Ma, Kailash Ranganathan
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Mitigating Emergent Misalignment with Data Attribution
Louis Jaburi, Gonçalo Paulo, Stepan Shabalin, Lucia Quirke, Nora Belrose
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
David Chanin, Tomáš Dulka, Adrià Garriga-Alonso
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality
Lingjing Kong, Shaoan Xie, Guangyi Chen, Yuewen Sun, Xiangchen Song, Eric P. Xing, Kun Zhang
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Rethinking Crowd-Sourced Evaluation of Neuron Explanations
Tuomas Oikarinen, Ge Yan, Akshay R. Kulkarni, Tsui-Wei Weng
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Mitigating Sycophancy in Language Models via Sparse Activation Fusion and Multi-Layer Activation Steering
Pyae Phoo Min, Avigya Paudel, Naufal Adityo, Arthur Zhu, Andrew Rufail, Cole Blondin, Kevin Zhu, Sunishchal Dev, Sean O'Brien
Published: 30 Sept 2025, Last Modified: 29 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Attention Layers Add Into Low-Dimensional Residual Subspaces
Junxuan Wang, Xuyang Ge, Wentao Shu, Zhengfu He, Xipeng Qiu
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models
Brad Peters, Sayam Goyal, María Emilia Granda, Akshath Vijayakumar Narmadha, Dharunish Yugeswardeenoo, Callum Stuart McDougall, Sean O'Brien, Ashwinee Panda, Kevin Zhu, Cole Blondin
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Causal Discovery and Inference through Next-Token Prediction
Eivinas Butkus, Nikolaus Kriegeskorte
Published: 30 Sept 2025, Last Modified: 22 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone