Mind Your Manners: Detoxifying Language Models via Attention Head Intervention

Published: 21 Sept 2024, Last Modified: 06 Oct 2024
Venue: BlackboxNLP 2024
License: CC BY 4.0
Track: Extended abstract
Keywords: Language models, bias, toxicity, interpretability, transformers, attention, ethical AI
TL;DR: We use AttentionLens to identify and mitigate toxic behavior in language models by targeting and removing toxic memories from specific attention heads, achieving significant toxicity reduction with minimal impact on model performance.
Abstract: Transformer-based language models have advanced natural language processing with their ability to generate fluent text. However, these models exhibit and amplify toxicity and bias learned from their training data, posing new ethical challenges. This work builds upon the AttentionLens framework to enable scalable decoding of the information carried by individual attention heads. We then use this decoded information to implement a pipeline that localizes and removes toxic memories from pre-trained language models in a way that is human-interpretable and effective while retaining model performance.
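To make the intervention concrete, below is a minimal sketch of one way to ablate specific attention heads in a pre-trained language model. This is not the authors' released pipeline: it assumes the Hugging Face `transformers` GPT-2 implementation, and the `(layer, head)` pairs in `TOXIC_HEADS` are hypothetical stand-ins for heads that an AttentionLens-style analysis would flag as toxic.

```python
# Sketch: hard-ablating selected attention heads in GPT-2 by zeroing their
# slice of the attention output projection. Assumes Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

# Hypothetical (layer, head) indices; replace with heads identified by
# decoding head outputs into vocabulary space and screening for toxic tokens.
TOXIC_HEADS = [(9, 3), (11, 7)]
head_dim = model.config.n_embd // model.config.n_head

with torch.no_grad():
    for layer, head in TOXIC_HEADS:
        # GPT-2's attention output projection (a Conv1D) has weight shape
        # (n_embd, n_embd); its input dimension indexes the concatenated
        # per-head outputs, so zeroing one head's rows removes that head's
        # entire contribution to the residual stream.
        proj = model.transformer.h[layer].attn.c_proj
        proj.weight[head * head_dim : (head + 1) * head_dim, :] = 0.0

prompt = "The new neighbors are"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Zeroing weights is a blunt, permanent form of intervention chosen here for brevity; saving the original rows before overwriting them makes the edit reversible, and softer edits (e.g., scaling a head down rather than removing it) follow the same pattern.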
Submission Number: 69