NeurIPS 2025 Workshop MechInterp Submissions
Open-Vocabulary Natural-Language Explanations of LLM Activations via Soft Prompts
Bart Bussmann
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Learning to Steer: Input-dependent Steering for Multimodal LLMs
Jayneel Parekh, Pegah KHAYATAN, Mustafa Shukor, Arnaud Dapogny, Alasdair Newson, Matthieu Cord
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Trilemma of Truth in Large Language Models
Germans Savcisens, Tina Eliassi-Rad
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A.B. Siddique
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Finding Manifolds With Bilinear Autoencoders
Thomas Dooms, Ward Gauderis
Published: 30 Sept 2025, Last Modified: 13 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C Wallace
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
Ziqian Zhong, Aditi Raghunathan
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Fluid Reasoning Representations
Dmitrii Kharlapenko, Alessandro Stolfo, Arthur Conmy, Mrinmaya Sachan, Zhijing Jin
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
WASP: A Weight-Space Approach to Detecting Learned Spuriousness
Cristian Daniel Paduraru, Antonio Barbalau, Radu Filipescu, Andrei Liviu Nicolicioiu, Elena Burceanu
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
Cheng Wang, Zeming Wei, Qin Liu, Wenxuan Zhou, Muhao Chen
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
Antonio Barbalau, Cristian Daniel Paduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Adversarial Attacks Leverage Interference Between Features in Superposition
Edward Stevinson, Lucas Prieto, Melih Barsbey, Tolga Birdal
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research
Sean Trott
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Compressed Computation is (probably) not Computation in Superposition
Jai Bhagat, Sara Molas-Medina, Giorgi Giglemiani, Stefan Heimersheim
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban
Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching
Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda
Published: 30 Sept 2025, Last Modified: 28 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Do We Always Need Sampling? Eliciting Numerical Predictive Distributions of LLMs Without Auto-Regression
Julianna Piskorz, Kasia Kobalczyk, Mihaela van der Schaar
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
The Impossibility of Inverse Permutation Learning in Transformer Models
Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Interpretability for Time Series Transformers using A Concept Bottleneck Framework
Angela van Sprang, Erman Acar, Willem Zuidema
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation
Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu
Published: 30 Sept 2025, Last Modified: 08 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Some Attention is All You Need for Retrieval
Felix Michalak, Steven Abreu
Published: 30 Sept 2025, Last Modified: 21 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Head Pursuit: Probing Attention Specialization in Multimodal Transformers
Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG
Maxime Méloux, François Portet, Maxime Peyrard
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Mechanistic evidence that motif-gated domain recognition drives contact prediction in protein language models
Jatin Nainani, Bryn Marie Reimer, Connor Watts, David Jensen, Anna G. Green
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone