Steering LLMs' Behavior with Concept Activation Vectors

Ruixuan Huang; Shuai Wang

Published: 26 Feb 2025, Last Modified: 26 Feb 2025. Venue: ICLR 2025 Blogpost Track. License: CC BY 4.0

Blogpost Url: https://d2jud02ci9yv69.cloudfront.net/2025-05-07-steering-llms-behavior-40/blog/steering-llms-behavior/

Abstract: Concept activation vectors have been shown to work on safety concepts, efficiently and effectively steering a considerable number of open-source large language models (LLMs) to comply with malicious instructions. In this blog post, we explore the capability boundaries of concept activation vectors in steering various LLM behaviors through more extensive experiments. Our experiments demonstrate that this technique can transfer text styles at low cost and improve performance on specific tasks such as code generation.

Conflict Of Interest: NA

Submission Number: 17
