Is Transformer a Stochastic Parrot? A Case Study in Simple Arithmetic Task

Published: 24 Jun 2024, Last Modified: 31 Jul 2024 · ICML 2024 MI Workshop Poster · CC BY 4.0
Keywords: interpretability, natural language processing
Abstract: Large pretrained language models have demonstrated impressive capabilities, but much remains unknown about how they operate mechanistically. In this study, we conduct a multifaceted investigation of an autoregressive transformer's ability to perform basic addition. Specifically, we use causal tracing to locate the flow of information between the attention and fully-connected layers. We find that the attention layers exploit fixed patterns at an intermediate stage to transfer carry and numeric information. They project the input onto a small set of neurons in the later fully-connected layers, and these neurons activate vocabulary distributions stored in the parameter space to implement the input-output mapping. Our approach can be further extended to the interpretability of general classification tasks such as sentiment analysis. The findings suggest that, although the model appears to have learned some arithmetic rules, most of its reasoning still relies on statistical patterns.
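The causal tracing procedure the abstract mentions can be sketched in miniature: run the model on a clean and a corrupted input, then patch individual hidden activations from the clean run into the corrupted run and score how much of the clean output each patch restores. The snippet below is an illustrative sketch on a synthetic two-layer network, not the paper's transformer; the weights, inputs, and scoring rule are assumptions made for demonstration.

```python
import numpy as np

# Toy 2-layer network standing in for one transformer sublayer stack.
# All weights and inputs are synthetic, chosen only to illustrate patching.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 4))
W2 = rng.standard_normal((4, 4))

def forward(x, patch=None):
    """Forward pass; patch=(i, v) overwrites hidden unit i with value v."""
    h1 = np.tanh(W1 @ x)
    if patch is not None:
        i, v = patch
        h1 = h1.copy()
        h1[i] = v  # causal intervention: restore one clean activation
    return W2 @ h1, h1

clean_x = np.array([1.0, 2.0, 3.0, 4.0])
corrupt_x = clean_x + rng.standard_normal(4)  # corrupted counterpart

clean_out, clean_h1 = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)
base_gap = np.linalg.norm(corrupt_out - clean_out)

# Patch each hidden unit of the corrupted run with its clean activation.
# Units whose patch recovers most of the clean output (score near 1) are
# where the task-relevant information flows.
scores = []
for i in range(4):
    patched_out, _ = forward(corrupt_x, patch=(i, clean_h1[i]))
    scores.append(1.0 - np.linalg.norm(patched_out - clean_out) / base_gap)

print(scores)
```

In the paper's setting the same loop would run over layers and token positions of the transformer, with the score measured on the probability of the correct sum digit rather than an output-vector distance.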
Submission Number: 66