MASteer: An End-to-End Multi-Strategy Adaptive Steering Framework for Trustworthiness Alignment of LLMs
Keywords: Trustworthy Large Language Models, Model Steering, Adaptive Strategy Selection, Representation Engineering, End-to-End Framework
Abstract: Large Language Models (LLMs) exhibit persistent and evolving trustworthiness issues, motivating automated, flexible repair methods that can be deployed reliably across diverse scenarios. Representation Engineering (RE) steers model behavior by injecting concept-specific vectors at inference time. However, existing RE approaches rely on static steering strategies that apply a fixed steering vector uniformly to all samples, which limits flexibility across deployment scenarios; moreover, computing steering vectors from outdated datasets hinders adaptation to evolving trustworthiness issues.
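The core RE operation the abstract refers to, injecting a concept-specific vector into a hidden state at inference time, can be sketched minimally as follows. This is an illustrative sketch only: the function name, vector dimensions, and scaling coefficient `alpha` are hypothetical and not taken from the paper.

```python
import numpy as np

def apply_steering(hidden, steering_vec, alpha=1.0):
    """Inference-time steering: shift a hidden state along a concept
    direction, i.e. h' = h + alpha * v (a common RE formulation)."""
    return hidden + alpha * steering_vec

# Toy example with random vectors standing in for real activations.
rng = np.random.default_rng(0)
h = rng.normal(size=4096)            # hidden state at some layer
v = rng.normal(size=4096)            # concept-specific steering vector
v /= np.linalg.norm(v)               # unit-normalize the direction
h_steered = apply_steering(h, v, alpha=5.0)
```

In practice such an update is typically applied inside the model (e.g. via a forward hook on a chosen layer) rather than to a standalone array; the sketch isolates only the vector arithmetic.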
To address these limitations, we analyze the applicability differences across RE algorithms and introduce anchor vectors to explicitly encode each algorithm’s sample-level applicability, enabling an anchor-matching mechanism that adaptively selects appropriate steering vectors during inference.
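The anchor-matching idea described above can be sketched as a nearest-anchor lookup: each algorithm's anchor vector encodes its sample-level applicability, and at inference time the sample's representation is matched against the anchors to pick a steering vector. This is a minimal sketch assuming cosine-similarity matching; the function name, the `threshold` fallback, and the toy 2-D vectors are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def select_steering_vector(hidden, anchors, steering_vecs, threshold=0.0):
    """Adaptively choose a steering vector for one sample.

    anchors[i] is assumed to encode algorithm i's sample-level
    applicability; the best cosine match wins. Returns None when no
    anchor clears the threshold (i.e. leave the sample unsteered).
    """
    hn = hidden / np.linalg.norm(hidden)
    sims = np.array([a @ hn / np.linalg.norm(a) for a in anchors])
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None
    return steering_vecs[best]

# Toy usage: two algorithms with orthogonal anchors; the sample's
# representation lies closer to the second anchor.
anchors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
vecs = [np.array([9.0, 0.0]), np.array([0.0, 9.0])]
sample = np.array([0.2, 0.9])
chosen = select_steering_vector(sample, anchors, vecs)
```

The per-sample selection is what distinguishes this from static steering, where a single fixed vector would be applied regardless of the input.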
Further, we propose ***MASteer***, the first end-to-end RE-based multi-strategy adaptive steering framework, which constructs up-to-date steering samples from natural-language issue descriptions and maintains an evolving algorithm library for strategy generation, enabling continual updates for lifelong trustworthiness alignment.
Experiments show that ***MASteer*** improves trustworthiness metrics by 19.29\% on LLaMA-3.1-8B-Chat while preserving general model capabilities, and further demonstrate its practical value for customized trustworthiness alignment.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: (Language Modeling): safety and alignment, (Machine Learning for NLP): representation learning, (Language Modeling): continual learning, (Language Modeling): LLM/AI agents
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 7658