Keywords: Transformers, interpretability
TL;DR: We analyze why Transformers learn to compute features that may not be immediately useful for predicting the next token.
Abstract: Why do models trained for next token prediction (NTP) learn to compute abstract features that appear to be useless for this task? We formalize three mechanisms of feature development in Transformers, differing in their role in NTP, and propose a method to estimate the influence of each mechanism on the emergence of specific features. We study these mechanisms experimentally by analyzing the representations of models trained on synthetic tasks, as well as those of an LLM. Our findings shed light on how Transformers develop and use hidden features, and on how the NTP objective shapes the training outcome.
Submission Number: 22