Abstract: Embedding-based information retrieval models can suffer from false retrievals due to issues such as training data bias and polysemy. To address this problem, we propose a novel method for controlling embedding models through an interpretable steering technique based on Sparse Autoencoders (SAEs). SAEs decompose embeddings into semantically disentangled features, and a steering vector selectively enhances or suppresses the features that contribute to false retrievals, thereby correcting the search results. Experimental results demonstrate that the proposed method effectively rectifies false retrievals within a limited scope while maintaining the model's generalization performance. However, limitations of the SAE, including potential performance degradation, side effects from polysemantic features, and the difficulty of determining optimal correction values, indicate the need for further research. Future work should focus on overcoming these limitations and expanding the scope of the interpretable steering technique to build a more sophisticated search-result correction system.
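For illustration, the sketch below shows one plausible form of the steering operation the abstract describes: encode an embedding into sparse SAE features, add a steering vector over selected features, and decode back. This is a minimal sketch assuming a standard ReLU sparse autoencoder; the dimensions, `feature_ids`, and `alpha` are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical SAE components; in practice these would be trained on the
# retrieval model's embeddings. Dimensions are illustrative.
d_model, d_sae = 768, 16384
encoder = nn.Sequential(nn.Linear(d_model, d_sae), nn.ReLU())
decoder = nn.Linear(d_sae, d_model)

def steer(embedding: torch.Tensor, feature_ids: list[int], alpha: float) -> torch.Tensor:
    """Enhance (alpha > 0) or suppress (alpha < 0) selected SAE features."""
    feats = encoder(embedding)           # sparse, interpretable feature activations
    steering = torch.zeros_like(feats)
    steering[feature_ids] = alpha        # steering vector over chosen features
    return decoder(feats + steering)     # corrected dense embedding

# Usage: suppress a feature suspected of causing a false retrieval.
query_emb = torch.randn(d_model)
corrected = steer(query_emb, feature_ids=[1234], alpha=-2.0)
```

The corrected embedding would then replace the original query (or document) embedding at retrieval time; how the offending features and the correction value `alpha` are chosen is exactly the open question the abstract flags.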
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Dialogue and Interactive Systems, Information Retrieval and Text Mining
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3276