Understanding task representations in neural networks via Bayesian ablation

Andrew Joohun Nam; Declan Iain Campbell; Thomas L. Griffiths; Jonathan D. Cohen; Sarah-Jane Leslie

Published: 06 Mar 2025, Last Modified: 06 Mar 2025

Venue: ICLR 2025 Re-Align Workshop Poster

License: CC BY 4.0

Track: long paper (up to 10 pages)

Domain: cognitive science

Abstract: Neural networks are powerful tools for cognitive modeling due to their flexibility and emergent properties. However, interpreting their learned representations remains challenging due to their sub-symbolic semantics. In this work, we introduce a novel probabilistic framework for interpreting latent task representations in neural networks. Inspired by Bayesian inference, our approach defines a distribution over representational units to infer their causal contributions to task performance. Using ideas from information theory, we propose a suite of tools and metrics to illuminate key model properties, including representational distributedness, manifold complexity, and polysemanticity.

Submission Number: 37