Transformer-based language models have advanced natural language processing with their ability to generate fluent text. However, these models exhibit and amplify toxicity and bias learned from their training data, posing new ethical challenges. This work builds on the AttentionLens framework to enable scalable decoding of the information carried by the attention mechanism. We then use this decoded information to implement a pipeline that localizes and removes toxic memories from pre-trained language models in a way that is human-interpretable and effective while preserving model performance.
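To make the decoding step concrete, the sketch below projects each attention head's output into vocabulary space and flags heads whose top decoded tokens fall in a toxic word list. This is a minimal illustration, not the paper's method: AttentionLens trains dedicated per-head lenses, whereas this sketch substitutes the model's own unembedding matrix as a logit-lens-style stand-in. The choice of GPT-2, the hook placement, and the `TOXIC_WORDS` list are all illustrative assumptions.

```python
# Sketch: decode attention-head outputs into vocabulary space (logit-lens-style
# approximation of AttentionLens; the real framework learns per-head lenses).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

TOXIC_WORDS = {"hate", "stupid"}  # illustrative placeholder list

prompt = "I think those people are"
inputs = tokenizer(prompt, return_tensors="pt")

head_outputs = {}  # (layer, head) -> that head's contribution at the last position

def make_hook(layer_idx, n_head, head_dim):
    def hook(module, inp, out):
        # inp[0]: concatenated head outputs entering the attention output
        # projection (c_proj); split it and apply each head's slice of the
        # projection separately to recover per-head contributions.
        x = inp[0][0, -1]                              # last token, (n_embd,)
        heads = x.view(n_head, head_dim)               # per-head slices
        w = module.weight.view(n_head, head_dim, -1)   # per-head proj weights
        for h in range(n_head):
            head_outputs[(layer_idx, h)] = heads[h] @ w[h]
    return hook

n_head = model.config.n_head
head_dim = model.config.n_embd // n_head
hooks = [
    block.attn.c_proj.register_forward_hook(make_hook(i, n_head, head_dim))
    for i, block in enumerate(model.transformer.h)
]

with torch.no_grad():
    model(**inputs)
for h in hooks:
    h.remove()

# Read each head's contribution as vocabulary logits via the unembedding
# matrix, and flag heads whose top tokens overlap the toxic list.
W_U = model.lm_head.weight  # (vocab_size, n_embd)
for (layer, head), vec in head_outputs.items():
    logits = W_U @ model.transformer.ln_f(vec)
    top = [tokenizer.decode([int(t)]).strip().lower()
           for t in logits.topk(5).indices]
    if TOXIC_WORDS & set(top):
        print(f"layer {layer}, head {head} promotes: {top}")
```

Under these assumptions, the flagged (layer, head) locations would be the candidates that a localize-and-remove pipeline could then target, for example by ablating or editing those heads and re-evaluating toxicity and perplexity to confirm model performance is retained.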