NeurIPS 2025 Workshop MechInterp Submissions
Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
David Chanin, Adrià Garriga-Alonso
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Attributing Response to Context: A Jensen–Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Published: 30 Sept 2025, Last Modified: 02 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Unveiling the Latent Directions of Reflection in Large Language Models
Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu
Published: 30 Sept 2025, Last Modified: 24 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Multimodal Concept Bottleneck Models
Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Robustly Improving LLM Fairness in Realistic Settings via Interpretability
Adam Karvonen, Samuel Marks
Published: 30 Sept 2025, Last Modified: 23 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Convergent Linear Representations of Emergent Misalignment
Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models
Francesco Ortu, Zhijing Jin, Diego Doimo, Alberto Cazzaniga
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Mapping Faithful Reasoning in Language Models
Jiazheng Li, Andreas Damianou, J Rosser, Jose Luis Redondo Garcia, Konstantina Palla
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention
J Rosser, Jose Luis Redondo Garcia, Gustavo Penha, Konstantina Palla, Hugues Bouchard
Published: 30 Sept 2025, Last Modified: 23 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
LLM Pretraining with Continuous Concepts
Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason E Weston, Xian Li
Published: 30 Sept 2025, Last Modified: 28 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Latent Crystallographic Microscope: Probing the Emergent Crystallographic Knowledge in Large Language Models
Jingru Gan, Yanqiao Zhu, Wei Wang
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone