MASteer: An End-to-End Multi-Strategy Adaptive Steering Framework for Trustworthiness Alignment of LLMs
Keywords: Trustworthy Large Language Models, Model Steering, Adaptive Strategy Selection, Representation Engineering, End-to-End Framework
Abstract: Large Language Models (LLMs) exhibit persistent and evolving trustworthiness issues, motivating automated, flexible repair methods that can be deployed reliably across diverse scenarios. Representation Engineering (RE) steers model behavior by injecting concept-specific vectors at inference time. However, existing RE approaches rely on static steering strategies that apply a fixed steering vector uniformly to all samples, which limits flexibility across deployment scenarios; moreover, computing steering vectors from outdated datasets hinders adaptation to evolving trustworthiness issues.
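The core RE operation the abstract refers to, injecting a concept-specific vector into a hidden state at inference time, can be sketched minimally as follows. This is an illustrative sketch only: the function name, vector dimensions, and scaling coefficient `alpha` are hypothetical and not taken from the paper.

```python
import numpy as np

def apply_steering(hidden, steering_vec, alpha=1.0):
    """Inference-time steering: shift a hidden state along a concept
    direction, i.e. h' = h + alpha * v (a common RE formulation)."""
    return hidden + alpha * steering_vec

# Toy example with random vectors standing in for real activations.
rng = np.random.default_rng(0)
h = rng.normal(size=4096)            # hidden state at some layer
v = rng.normal(size=4096)            # concept-specific steering vector
v /= np.linalg.norm(v)               # unit-normalize the direction
h_steered = apply_steering(h, v, alpha=5.0)
```

In practice such an update is typically applied inside the model (e.g. via a forward hook on a chosen layer) rather than to a standalone array; the sketch isolates only the vector arithmetic.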
To address these limitations, we analyze the applicability differences across RE algorithms and introduce anchor vectors to explicitly encode each algorithm’s sample-level applicability, enabling an anchor-matching mechanism that adaptively selects appropriate steering vectors during inference.
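The anchor-matching idea described above can be sketched as a nearest-anchor lookup: each algorithm's anchor vector encodes its sample-level applicability, and at inference time the sample's representation is matched against the anchors to pick a steering vector. This is a minimal sketch assuming cosine-similarity matching; the function name, the `threshold` fallback, and the toy 2-D vectors are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def select_steering_vector(hidden, anchors, steering_vecs, threshold=0.0):
    """Adaptively choose a steering vector for one sample.

    anchors[i] is assumed to encode algorithm i's sample-level
    applicability; the best cosine match wins. Returns None when no
    anchor clears the threshold (i.e. leave the sample unsteered).
    """
    hn = hidden / np.linalg.norm(hidden)
    sims = np.array([a @ hn / np.linalg.norm(a) for a in anchors])
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None
    return steering_vecs[best]

# Toy usage: two algorithms with orthogonal anchors; the sample's
# representation lies closer to the second anchor.
anchors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
vecs = [np.array([9.0, 0.0]), np.array([0.0, 9.0])]
sample = np.array([0.2, 0.9])
chosen = select_steering_vector(sample, anchors, vecs)
```

The per-sample selection is what distinguishes this from static steering, where a single fixed vector would be applied regardless of the input.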
Further, we propose ***MASteer***, the first end-to-end RE-based multi-strategy adaptive steering framework, which constructs up-to-date steering samples from natural-language issue descriptions and maintains an evolving algorithm library for strategy generation, enabling continual updates for lifelong trustworthiness alignment.
Experiments show that ***MASteer*** improves trustworthiness metrics by 19.29\% on LLaMA-3.1-8B-Chat while preserving general model capabilities, and further demonstrate its practical value for customized trustworthiness alignment.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: (Language Modeling): safety and alignment, (Machine Learning for NLP): representation learning, (Language Modeling): continual learning, (Language Modeling): LLM/AI agents
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 7658