Revisiting Code Similarity through Optimal Module Alignment

ACL ARR 2026 January Submission10293 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Code Retrieval, Code Similarity, Optimal Module Matching, Structural Code Alignment

Abstract: Code retrieval remains challenging because fine-grained similarity metrics suffer from structural misalignment between programs. In this paper, we propose the Length-Aware Optimal Module Matching (LOMM) framework, which aligns code modules across programs and aggregates their similarity scores in a globally optimal manner. The proposed method is model-agnostic and can be applied to both embedding-based retrieval and structural metrics. We evaluate our approach on two datasets, including LongCCD, a new long-code retrieval dataset designed with a large corpus-to-query ratio to encourage model-specific optimal matching. Across diverse embedding models, our method consistently improves retrieval performance, yielding relative gains of approximately 15-20% in NDCG@5 compared to monolithic baselines.
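The abstract does not spell out the matching procedure, so the following is only a minimal sketch of what a length-weighted, globally optimal module alignment could look like, not the authors' implementation. It makes several assumptions: each module is a hypothetical `(embedding, length)` pair, pairwise similarity is cosine similarity, the matching is one-to-one, and each matched pair is weighted by the query module's length. The function names `cosine` and `matched_score` are illustrative.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matched_score(query, candidate):
    """Best length-weighted one-to-one matching score between two programs.

    Assumption (not from the paper): `query` and `candidate` are lists of
    (embedding, length) pairs, one per module. Unmatched query modules
    contribute zero, and the score is normalised by total query length.
    """
    n, m = len(query), len(candidate)
    total_len = sum(length for _, length in query)
    if n == 0 or m == 0 or total_len == 0:
        return 0.0

    # Pairwise similarity matrix: rows are query modules, columns are candidates.
    sim = [[cosine(q, c) for c, _ in candidate] for q, _ in query]

    # Pad with None so that surplus query modules can stay unmatched.
    slots = list(range(m)) + [None] * max(0, n - m)

    # Exhaustive search over assignments: exact but exponential, so this is
    # only suitable for a handful of modules per program.
    best = float("-inf")
    for assignment in permutations(slots, n):
        score = sum(
            query[i][1] * sim[i][j]
            for i, j in enumerate(assignment)
            if j is not None
        )
        best = max(best, score)
    return best / total_len


# Two programs whose modules appear in a different order.
query = [([1.0, 0.0], 10), ([0.0, 1.0], 30)]
reordered = [([0.0, 1.0], 5), ([1.0, 0.0], 5)]
partial = [([1.0, 0.0], 5)]

full_score = matched_score(query, reordered)
partial_score = matched_score(query, partial)
```

The exhaustive search keeps the sketch dependency-free. For realistic module counts, a polynomial-time assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` (with `maximize=True`) would be the usual substitute.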

Paper Type: Short

Research Area: Code Models

Research Area Keywords: code retrieval, code search, formal methods with LLMs

Contribution Types: NLP engineering experiment

Languages Studied: Python, C++

Submission Number: 10293
