ICML 2024 Workshop MI Submissions
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Automatically Identifying Local and Global Circuits with Linear Computation Graphs
Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, Xipeng Qiu
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Difan Zou, Yisong Yue, Ziniu Hu
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Segmentation CNNs are denoising models
Luis A. Zavala-Mondragón, Ruud Van Sloun, Peter H.N. de With, Fons van der Sommen
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Delay Embedding Theory of Neural Sequence Models
Mitchell Ostrow, Adam Joseph Eisen, Ila R Fiete
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task
Javier Ferrando, Marta R. Costa-jussà
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Spotlight
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Oral
Adversarial Circuit Evaluation
Niels uit de Bos, Adrià Garriga-Alonso
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Samyak Jain, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip Torr, Amartya Sanyal, Puneet K. Dokania
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Spotlight
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Boshi Wang, Xiang Yue, Yu Su, Huan Sun
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Spotlight
Investigating the Indirect Object Identification circuit in Mamba
Danielle Ensign, Adrià Garriga-Alonso
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Logical Distillation of Graph Neural Networks
Alexander Pluska, Pascal Welke, Thomas Gärtner, Sagar Malhotra
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Benchmarking Mental State Representations in Language Models
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien, Eric Winsor
Published: 24 Jun 2024, Last Modified: 24 Jun 2024
ICML 2024 MI Workshop Spotlight
Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
Yoann Poupart
Published: 24 Jun 2024, Last Modified: 24 Jun 2024
ICML 2024 MI Workshop Poster
Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models
Nicholas Bai, Rahul Ajay Iyer, Tuomas Oikarinen, Tsui-Wei Weng
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Spotlight
How do Llamas process multilingual text? A latent exploration through activation patching
Clément Dumas, Veniamin Veselovsky, Giovanni Monea, Robert West, Chris Wendler
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Spotlight
Iteration Head: A Mechanistic Study of Chain-of-Thought
Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Xingyu Alice Yang, Francois Charton, Julia Kempe
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Hypothesis Testing the Circuit Hypothesis in LLMs
Claudia Shi, Nicolas Beltran-Velez, Achille Nazaret, Carolina Zheng, Adrià Garriga-Alonso, Andrew Jesson, Maggie Makar, David Blei
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Oral
Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks
Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Spotlight
Mathematical Models of Computation in Superposition
Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Transcoders find interpretable LLM feature circuits
Jacob Dunefsky, Philippe Chlenski, Neel Nanda
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Spotlight
TracrBench: Generating Interpretability Testbeds with Large Language Models
Hannes Thurnherr, Jérémy Scheurer
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster
Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language Models
Sunny Duan, Mikail Khona, Abhiram Iyer, Rylan Schaeffer, Ila R Fiete
Published: 24 Jun 2024, Last Modified: 31 Jul 2024
ICML 2024 MI Workshop Poster