NeurIPS 2025 Workshop MechInterp Submissions
Towards Understanding Multimodal Fine-Tuning: A Case Study into Spatial Features
Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers
Clément Dumas
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Emergent Specialization: Rare Token Neurons in Language Models
Jing Liu, Yueheng Li, Haozheng Wang
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
TopKLoRA
Marek Masiak, Lukas Vierling, Constantin Venhoff, Nicola Cancedda, Christian Schroeder de Witt
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios
Isha Agarwal, Saharsha Navani, Fazl Barez
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
High-order Component Attribution via Kolmogorov-Arnold Networks
Samy Mammeri, Christian Gagné
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors
Viacheslav Sinii, Nikita Balagansky, Yaroslav Aksenov, Vadim Kurochkin, Daniil Laptev, Alexey Gorbatovski, Boris Shaposhnikov, Daniil Gavrilov
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Interpreting Vision Grounding in Vision-Language Models: A Case Study in Coordinate Prediction
Clement Neo, Yongsen Zheng, Kwok-Yan Lam, Luke Ong
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Reverse Engineering a Stateful Reasoning circuit
Akshit Kumar, Dipti Sharma, Parameswari Krishnamurthy
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Comparing Clinical and General LLMs on Knowledge Boundaries and Robustness
Xingmeng Zhao, Ke Yang, Anthony Rios
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Chain-of-Thought Resampling for Interpreting LLM Decision-Making
Uzay Macar, Paul C. Bogdan, Senthooran Rajamanoharan, Neel Nanda
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework
Hao Gu, Vibhas Nair, Amrithaa Ashok Kumar, Ryan Lagasse, Kevin Zhu, Sean O'Brien, Ashwinee Panda
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Symbolic vs. Continuous Features in Transformers: A Digital Communication System's Explanation
Kan Deng
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Krishna Kanth Nakka, Xue Jiang, Dmitrii Usynin, Xuebing Zhou
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs
Hanqi Yan, Hainiu Xu, Yulan He
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Iterative Inference in a Chess-Playing Neural Network
Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Activation Transport Operators
Andrzej Szablewski, Marek Masiak
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Superposition in Mixture of Experts
Marmik Chaudhari, Jeremi Nuer, Rome Thorstenson
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing
Saleena Angeline Sartawita, McNair Shah, Adhitya Rajendra Kumar, Naitik Chheda, Will Cai, Kevin Zhu, Sean O'Brien, Vasu Sharma
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Hallucination Reduction with CASAL: Contrastive Activation Steering for Amortized Learning
Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators
Dani Roytburg, Matthew Nguyen, Matthew Bozoukov, Jou Barzdukas, Hongyu Fu, Narmeen Fatimah Oozeer
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos
Published: 30 Sept 2025, Last Modified: 02 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
On the Geometry and Topology of Neural Circuits for Modular Addition
Gabriela Moisescu-Pareja, Gavin McCracken, Harley Wiltzer, Colin Daniels, Vincent Létourneau, Jonathan Love
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Emerging Human-like Strategies for Semantic Memory Foraging in Large Language Models
Eric Lacosse, Mariana Duarte, Peter Todd, Daniel C McNamee
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone