NeurIPS 2025 Workshop MechInterp Submissions
Probing the Vulnerability of Large Language Models to Polysemantic Interventions
Bofan Gong, Shiyang Lai, Dawn Song
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Bilinear Convolution Decomposition for Causal RL Interpretability
Sinem Erisken, Alice Rigg, Narmeen Fatimah Oozeer
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Base Models Know How to Reason, Thinking Models Learn When
Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
Published: 30 Sept 2025, Last Modified: 10 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
What Do Refusal Tokens Learn? Fine-Grained Representations and Evidence for Downstream Steering
Rishab Alagharu, Ishneet Sukhvinder Singh, Anjali Batta, Jaelyn S. Liang, Shaibi Shamsudeen, Arnav Sheth, Kevin Zhu, Ashwinee Panda, Zhen Wu
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Control and Predictivity in Neural Interpretability
Satchel Grant, Alexa R. Tartaglini
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Interpretability at the Network Level: Prior-Guided Drift Diffusion for Neural Circuit Analysis
Tahereh Toosi
Published: 30 Sept 2025, Last Modified: 20 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Measuring Sparse Autoencoder Feature Sensitivity
Claire Tian, Katherine Tian, Nathan Zixia Hu
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Better World Models Can Lead to Better Post-Training Performance
Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, Andrew Lee
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Symbolic Policy Distillation for Interpretable Reinforcement Learning
Peilang Li, Umer Siddique, Yongcan Cao
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
On the Limits of Linear Representation Hypotheses in Large Language Models: A Dynamical Systems Analysis
Abhinav Muraleedharan
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Model Organisms for Emergent Misalignment
Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Instruction Following by Boosting Attention of Large Language Models
Vitoria Guardieiro, Adam Stein, Avishree Khare, Eric Wong
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
From Vortices to Spirals: Physics Foundation Models Learn Cross-Domain Concepts
Rio Alexa Fear, Miles Cranmer, Payel Mukhopadhyay, Michael McCabe
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Where's the Bug? Attention Probing for Scalable Fault Localization
Adam Stein, Arthur Wayne, Aaditya Naik, Mayur Naik, Eric Wong
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Automatically Finding Rule-Based Neurons in OthelloGPT
Aditya Singh, Zihang Wen, Srujananjali Medicherla, Adam Karvonen, Can Rager
Published: 30 Sept 2025, Last Modified: 28 Oct 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Entity Multiplexing Through Activation Strength: Understanding goals in A Maze Solving Agent
Benjamin Sturgeon, Jonathan P. Shock
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Towards Decomposition of Transformer Models
Casper L. Christensen, Logan Riggs Smith
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Can Interpretation Predict Behavior on Unseen Data?
Victoria R Li, Jenny Kaufmann, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Extracting Reliable Concept Signals from Just a Handful of Superdetector Tokens
Cassandra Goldberg, Chaehyeon Kim, Adam Stein, Eric Wong
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Learned Structure in CARTRIDGES: Keys as Shareable Routers in Self-Studied Representations
Mauri Diaz
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
How does Mamba Perform Associative Recall? A Mechanistic Study
Grégoire LE CORRE, Ningyuan Huang, Alberto Bietti
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Predicting Weak-to-Strong Generalization from Latent Representations
Ben Wilop, Christian Schroeder de Witt, Yarin Gal, Philip Torr, Constantin Venhoff
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone
Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behaviour
Daniel Aarao Reis Arturi, Eric Zhang, Andrew Adrian Ansah, Kevin Zhu, Ashwinee Panda, Aishwarya Balwani
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Spotlight
Readers: Everyone
Demystifying Cipher-Following in Large Language Models via Activation Analysis
Megan Gross, Yigitcan Kaya, Christopher Kruegel, Giovanni Vigna
Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
Readers: Everyone