FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

ACL ARR 2026 January Submission 1458 Authors

30 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Safety, Hallucination, Activation Steering
Abstract: Large language models (LLMs) often exhibit undesirable behaviors, such as safety violations and hallucinations. Although inference-time steering offers a cost-effective way to adjust model behavior without updating its parameters, existing methods often fail to be simultaneously effective, utility-preserving, and training-efficient due to their rigid, one-size-fits-all designs and limited adaptability. In this work, we present FineSteer, a novel steering framework that decomposes inference-time steering into two complementary stages, conditional steering and fine-grained vector synthesis, allowing fine-grained control over when and how to steer internal representations. In the first stage, we introduce the Subspace-guided Conditional Steering (SCS) mechanism, which preserves model utility by avoiding unnecessary steering. In the second stage, we propose the Mixture-of-Steering-Experts (MoSE) mechanism, which captures the multi-modal nature of desired steering and synthesizes query-specific steering vectors for improved effectiveness. Through tailored designs in both SCS and MoSE, FineSteer maintains robust performance on general queries while adaptively optimizing steering vectors for targeted inputs in a training-efficient manner. Extensive experiments on safety and truthfulness benchmarks show that FineSteer outperforms state-of-the-art methods in overall performance metrics (e.g., a 7.6% improvement on TruthfulQA over Llama3), achieving stronger steering performance with minimal utility loss.
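The abstract's two-stage decomposition can be illustrated with a minimal sketch of conditional, mixture-based activation steering. This is not the authors' implementation; all names (`steer_hidden_state`, `subspace`, `experts`, `gate_w`), the projection-norm gating rule, and the softmax mixture are illustrative assumptions standing in for the paper's SCS and MoSE mechanisms.

```python
import numpy as np

def steer_hidden_state(h, subspace, experts, gate_w, tau=0.5, alpha=4.0):
    """Illustrative sketch: steer a hidden state only when it falls in a
    target subspace (cf. SCS), using a gated mixture of steering vectors
    (cf. MoSE). All design choices here are assumptions, not the paper's.

    h        : (d,)   hidden state at some layer
    subspace : (k, d) orthonormal basis spanning the target-behavior subspace
    experts  : (m, d) candidate steering vectors ("experts")
    gate_w   : (m, d) gating weights scoring each expert against the query
    """
    # Conditional stage: skip steering when h barely projects onto the
    # subspace, leaving general queries (and thus utility) untouched.
    proj = subspace @ h                      # (k,) subspace coordinates
    if np.linalg.norm(proj) / (np.linalg.norm(h) + 1e-8) < tau:
        return h

    # Synthesis stage: softmax-gated combination of expert vectors yields
    # a query-specific steering direction.
    scores = gate_w @ h                      # (m,) per-expert relevance
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    v = weights @ experts                    # (d,) synthesized steering vector
    return h + alpha * v
```

In this sketch, a hidden state orthogonal to the subspace passes through unchanged, while one inside the subspace is shifted by a per-query blend of the expert vectors; the actual gating and synthesis in FineSteer are learned rather than fixed as here.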
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1458