Disentangling Transformer Language Models as Superposed Topic Models

Jia Peng Lim; Hady W. Lauw

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, Venue: EMNLP 2023 Main

Submission Type: Regular Long Paper

Submission Track: Interpretability, Interactivity, and Analysis of Models for NLP

Submission Track 2: Language Modeling and Analysis of Language Models

Keywords: Topic Modelling, Mechanistic Interpretability, Pre-trained Language Models, Transformers

TL;DR: We interpret Transformer Language Models as Topic Models with topics in superposition and propose a novel approach to disentangle superposed topics.

Abstract: Topic Modelling is an established research area where the quality of a given topic is measured using coherence metrics. Often, we infer topics from Neural Topic Models (NTM) by interpreting their decoder weights, with each topic consisting of the top-activated words projected from an individual neuron. Transformer-based Language Models (TLM) similarly contain decoder weights. However, due to their hypothesised superposition properties, the final logits originating from the residual path are considered uninterpretable. Therefore, we posit that we can interpret TLM as superposed NTM, and we propose a novel weight-based, model-agnostic and corpus-agnostic approach to search and disentangle decoder-only TLM, potentially mapping individual neurons to multiple coherent topics. Our results show that it is empirically feasible to disentangle coherent topics from GPT-2 models using the Wikipedia corpus. We validate this approach for GPT-2 models using Zero-Shot Topic Modelling. Finally, we extend the proposed approach to disentangle and analyse LLaMA models.
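
To make the notion of "top-activated words projected from individual neurons" concrete, here is a minimal sketch of reading topics off a decoder weight matrix. It is an illustration only, not the paper's disentangling method: the function name `top_words_per_neuron`, the toy vocabulary and the weight values are all hypothetical, and the matrix is assumed to have one row per neuron and one column per vocabulary word.

```python
def top_words_per_neuron(decoder_weights, vocab, k=3):
    """Return the k highest-weighted vocabulary words for each neuron.

    decoder_weights: list of rows, one row per neuron, one weight per word
                     (assumed layout, chosen for this illustration).
    vocab:           list of words aligned with the columns of each row.
    """
    topics = []
    for row in decoder_weights:
        # Rank vocabulary indices by this neuron's weight, largest first.
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        topics.append([vocab[j] for j in ranked[:k]])
    return topics


# Toy data (hypothetical): two neurons over a six-word vocabulary.
vocab = ["goal", "match", "team", "stock", "market", "bank"]
decoder_weights = [
    [2.0, 1.5, 1.0, -0.5, -1.0, -0.8],   # neuron 0 favours sports words
    [-1.0, -0.7, -0.2, 1.8, 2.2, 1.1],   # neuron 1 favours finance words
]

topics = top_words_per_neuron(decoder_weights, vocab, k=3)
for i, words in enumerate(topics):
    print(f"neuron {i}: {words}")
```

In this toy case each neuron yields exactly one topic. The setting the abstract describes is harder: under superposition a single neuron may carry several topics at once, so a plain top-k reading like this one would mix them together.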

Submission Number: 1303
