TL;DR: A novel and generalized approach for detecting misbehavior in large language models based on their internal "Brain Scanning".
Abstract: Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's `brain' behaves differently when generating harmful or untruthful responses. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
Lay Summary: Large language models (LLMs) can generate fluent, human-like text across many domains. But sometimes, they also produce responses that are untruthful, harmful, or even manipulated, raising risks in sensitive settings like healthcare, finance, or education. How can we tell when a model is misbehaving?
In this work, we introduce LLMScan, a tool designed to monitor and detect such misbehavior by analyzing how the model "thinks" internally. Instead of just looking at what the model says, LLMScan examines how different parts of the model contribute to the answer it generates by causality analysis. With the generated causal map, we can detect patterns that distinguish normal behavior from harmful or deceptive ones.
Through extensive experiments across tasks and models, LLMScan shows that it can reliably identify when a model is going off track. This opens the door to safer and more trustworthy language models that can be monitored and corrected in real time.
Primary Area: General Machine Learning->Causality
Keywords: Large Language Model, Causality Inference, LLM Detection
Submission Number: 3580
Loading