Abstract: Embedding-based information retrieval models can suffer from false retrievals due to issues such as training data bias and polysemy. To address this problem, we propose a novel method for controlling embedding models through an interpretable steering technique based on Sparse Autoencoders (SAEs). SAEs decompose embeddings into semantically disentangled features, and a steering vector selectively enhances or suppresses the features that contribute to false retrievals, thereby correcting the search results. Experimental results demonstrate that the proposed method effectively rectifies false retrievals within a limited scope while maintaining the model's generalization performance. However, limitations of the SAE, including potential performance degradation, side effects from polysemantic features, and the difficulty of determining optimal correction values, indicate the need for further research. Future work should focus on overcoming these limitations and expanding the scope of the interpretable steering technique to build a more sophisticated search-result correction system.
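For illustration, the sketch below shows one plausible form of the steering operation the abstract describes: encode an embedding into sparse SAE features, add a steering vector over selected features, and decode back. This is a minimal sketch assuming a standard ReLU sparse autoencoder; the dimensions, `feature_ids`, and `alpha` are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical SAE components; in practice these would be trained on the
# retrieval model's embeddings. Dimensions are illustrative.
d_model, d_sae = 768, 16384
encoder = nn.Sequential(nn.Linear(d_model, d_sae), nn.ReLU())
decoder = nn.Linear(d_sae, d_model)

def steer(embedding: torch.Tensor, feature_ids: list[int], alpha: float) -> torch.Tensor:
    """Enhance (alpha > 0) or suppress (alpha < 0) selected SAE features."""
    feats = encoder(embedding)           # sparse, interpretable feature activations
    steering = torch.zeros_like(feats)
    steering[feature_ids] = alpha        # steering vector over chosen features
    return decoder(feats + steering)     # corrected dense embedding

# Usage: suppress a feature suspected of causing a false retrieval.
query_emb = torch.randn(d_model)
corrected = steer(query_emb, feature_ids=[1234], alpha=-2.0)
```

The corrected embedding would then replace the original query (or document) embedding at retrieval time; how the offending features and the correction value `alpha` are chosen is exactly the open question the abstract flags.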
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Dialogue and Interactive Systems, Information Retrieval and Text Mining
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3276