NeurIPS 2025 Workshop MechInterp Submissions
Uncovering Object Localization Mechanisms in VLMs
Timothy Schaumlöffel, Martina G. Vilas, Gemma Roig
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Feature interactions in sparse crosscoders from compact proofs
Dmitry Manning-Coe, Thomas Read, Anna Soligo, Oliver Clive-Griffin, Chun Hei Yip, Alex Gibson, Rajashree Agrawal, Jason Gross
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Dissecting Role Conflicts in Instruction Following
Siqi Zeng
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Angular Steering: Behavior Control via Rotation in Activation Space
Hieu M. Vu, Tan Minh Nguyen
Published: 30 Sept 2025, Last Modified: 15 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Training Reliable Activation Probes With a Handful of Positive Examples
Riya Tyagi, Stefan Heimersheim
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Probing by Analogy: Decomposing Probes into Activations for Better Interpretability and Inter-Model Generalization
Patrick Leask, Noura Al Moubayed
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
ContextBench: Modifying Contexts for Targeted Latent Activation and Behaviour Elicitation
Robert Graham, Edward Stevinson, Leo Richter, Alexander Chia, Joseph Miller, Joseph Isaac Bloom
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Towards Mechanistic Defenses Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders
Ege Erdogan, Ana Lucic
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Equivalent Linear Mappings of LLMs
James Robert Golden
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models
Ej Zhou, Caiqi Zhang, Tiancheng Hu, Chengzu Li, Nigel Collier, Ivan Vulić, Anna Korhonen
Published: 30 Sept 2025, Last Modified: 11 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Compressed Computation: Dense Circuits in a Toy Model of the Universal-AND Problem
Adam Newgas
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts
Samaksh Bhargav, Zining Zhu
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
The Geometry of Self-Verification in a Task-Specific Reasoning Model
Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Circuit-Tracer: A New Library for Finding Feature Circuits
Michael Hanna, Mateusz Piotrowski, Jack Lindsey, Emmanuel Ameisen
Published: 30 Sept 2025, Last Modified: 27 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Centroid Affinity: How Deep Networks Represent Features
Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk
Published: 30 Sept 2025, Last Modified: 20 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task
Brady Bhalla, Honglu Fan, Nancy Chen, Tony Yue YU
Published: 30 Sept 2025, Last Modified: 01 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Towards Trustworthy Neuron Identification: Faithfulness and Stability
Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Decomposing Attention To Find Context-Sensitive Neurons
Alex Gibson
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Emergence of Linear Truth Encodings in Language Models
Shauli Ravfogel, Gilad Yehudai, Tal Linzen, Joan Bruna, Alberto Bietti
Published: 30 Sept 2025, Last Modified: 17 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Clément Dumas, Julian Minder, Caden Juang, Bilal Chughtai, Neel Nanda
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Sparse Autoencoders Trained on the Same Data Learn Different Features
Gonçalo Paulo, Nora Belrose
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Evaluating SAE interpretability without explanations
Gonçalo Paulo, Nora Belrose
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Quiet Feature Learning in Algorithmic Tasks
Prudhviraj Naidu, Zixian Wang, Leon Bergen, Ramamohan Paturi
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone