MD3R: Minimizing Data Distribution Discrepancies to Tackle Inconsistencies in Multilingual Query-Code Retrieval

Published: 07 Jul 2025, Last Modified: 07 Jul 2025, KnowFM @ ACL 2025, License: CC BY 4.0
Keywords: Multilingual, Query-Code Retrieval, Data Distribution Discrepancies
Abstract: Multilingual Code Retrieval (MLCR) is a critical task for supporting modern software development workflows that increasingly involve multiple programming languages. While existing methods have shown progress, MLCR still faces two core challenges: first, the data distribution discrepancy caused by training on single query-monolingual code pairs leads to inconsistency in cross-lingual retrieval; second, the scarcity of data for certain languages in specific domains limits the effectiveness of consistent representation learning. To address these issues, we first analyze the inconsistency from two perspectives: modality alignment direction error and model weight error. We derive an upper bound on the weight error to quantify the impact of inconsistency and find that this bound primarily stems from data distribution discrepancies during training. Based on this theoretical analysis, we propose a novel cross-lingual consistent MLCR scheme called MD3R (Minimizing Data Distribution Discrepancies). Our scheme employs tailored contrastive learning strategies, namely co-anchor contrastive learning (CACL) and 1-to-k contrastive learning (KCL), to mitigate the impact of data distribution bias, thereby enhancing cross-lingual embedding alignment and retrieval consistency. On the widely used CodeSearchNet benchmark, our method improves both retrieval recall and consistency metrics across six mainstream programming languages, including Python, attaining state-of-the-art performance.
Archival Status: Non-archival (not included in proceedings)
Submission Number: 44
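The abstract does not give the exact formulation of the 1-to-k contrastive learning (KCL) objective, but the idea of anchoring one query to k code positives across languages can be illustrated with a minimal InfoNCE-style sketch. All names, the cosine-similarity choice, the temperature value, and the averaging over the k positives are assumptions for illustration, not the paper's actual definition.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def one_to_k_contrastive_loss(query, positives, negatives, tau=0.07):
    """Hypothetical 1-to-k contrastive loss (illustrative, not the paper's KCL):
    one query embedding is pulled toward k positive code embeddings (the same
    function implemented in k languages) and pushed away from negative code
    embeddings, averaging an InfoNCE term over the k positives.

    query:     (d,)   query embedding
    positives: (k, d) code embeddings of the same function in k languages
    negatives: (n, d) embeddings of unrelated code snippets
    """
    q = l2_normalize(query)
    pos = l2_normalize(positives)
    neg = l2_normalize(negatives)
    pos_sims = pos @ q / tau          # (k,) temperature-scaled cosine sims
    neg_sims = neg @ q / tau          # (n,)
    # Shared softmax denominator over all k positives and n negatives.
    denom = np.sum(np.exp(pos_sims)) + np.sum(np.exp(neg_sims))
    # -log p(positive_i | query), averaged over the k cross-lingual positives.
    losses = -(pos_sims - np.log(denom))
    return float(np.mean(losses))
```

Because every positive shares one softmax denominator, lowering the loss requires the query to sit close to all k language variants simultaneously, which is one plausible way such an objective could encourage cross-lingual consistency.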