SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization
Abstract: In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant.
Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8\% / 9.4\% / 7.0\% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88\% Pass@1 on HumanEval).
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: code models, retrieval-augmented generation, code generation and understanding, retrieval
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 1664
Loading