Keywords: Code Retrieval, Code Similarity, Optimal Module Matching, Structural Code Alignment
Abstract: Code retrieval for programs remains challenging because fine-grained similarity metrics suffer from structural misalignment. In this paper, we propose the Length-Aware Optimal Module Matching (LOMM) framework, which aligns code modules across programs and aggregates their similarity scores in a globally optimal manner. The proposed method is model-agnostic and can be applied to both embedding-based retrieval and structural metrics.
We evaluate our approach on two datasets, including LongCCD, a new long-code retrieval dataset designed with a large corpus-to-query ratio to encourage model-specific optimal matching. Across diverse embedding models, our method consistently improves retrieval performance, yielding relative gains of approximately 15-20% in NDCG@5 compared to monolithic baselines.
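To make the idea of globally optimal, length-aware module matching concrete, the following is a minimal sketch and not the paper's actual formulation, which the abstract does not specify. It assumes that a pairwise similarity matrix between query modules and candidate modules is already available (for example from an embedding model), that module length is measured as a token count, and that each query module is matched to at most one candidate module. The function name `optimal_module_score` and the length-weighted average are illustrative assumptions. The search is brute force, which is only practical for a handful of modules; a real system would likely use an assignment solver.

```python
from itertools import permutations


def optimal_module_score(sim, query_lengths):
    """Length-weighted one-to-one module matching (illustrative sketch).

    sim[i][j]        -- similarity between query module i and candidate module j
    query_lengths[i] -- length (e.g. token count) of query module i

    Returns (score, pairs): the best length-weighted average similarity and
    the list of matched (query_index, candidate_index) pairs.
    """
    n_q = len(sim)
    n_c = len(sim[0]) if sim else 0
    total_len = sum(query_lengths)
    if n_q == 0 or n_c == 0 or total_len == 0:
        return 0.0, []

    # Pad with None so that surplus query modules can stay unmatched
    # when the candidate has fewer modules than the query.
    columns = list(range(n_c)) + [None] * max(0, n_q - n_c)

    best_score, best_pairs = float("-inf"), []
    # Exhaustive search over assignments: globally optimal, but exponential.
    for perm in permutations(columns, n_q):
        pairs = [(i, j) for i, j in enumerate(perm) if j is not None]
        weighted = sum(query_lengths[i] * sim[i][j] for i, j in pairs)
        score = weighted / total_len
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, best_pairs


# Greedy matching would pair query module 0 with candidate 0 (0.9) and leave
# module 1 with a poor match (0.1); the global optimum swaps both.
score, pairs = optimal_module_score([[0.9, 0.8], [0.85, 0.1]], [1, 1])
print(round(score, 3), pairs)
```

Weighting by query module length is one plausible reading of "length-aware": longer modules contribute more to the program-level score than short helper functions. Candidate programs in a corpus could then be ranked by this aggregated score instead of a single whole-program similarity.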
Paper Type: Short
Research Area: Code Models
Research Area Keywords: code retrieval, code search, formal methods with LLMs
Contribution Types: NLP engineering experiment
Languages Studied: Python, C++
Submission Number: 10293