ODASim: Ordered, Distinctive and Absolute Semantic Similarity for Code Explanation Evaluation

ACL ARR 2026 January Submission 9448 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission. License: CC BY 4.0
Keywords: Code generation and understanding, Evaluation, Semantic Similarity
Abstract: Code explanations are increasingly generated by large language models and used in software engineering workflows, making reliable evaluation essential. However, existing model-based and embedding-based methods often fail to distinguish correct explanations from partially or fully incorrect ones, and their similarity scores are poorly calibrated and do not reflect meaningful differences in explanation quality. To address this, we propose ODASim (Ordered, Distinctive, and Absolute Similarity), a model-agnostic graded fine-tuning framework for embedding models that learns calibrated similarity representations between code and explanations. To support fine-grained supervision and evaluation, we also introduce ODA-X, a novel benchmark for code-to-explanation quality grading, comprising code–explanation pairs with graded similarity labels derived from strategic perturbations of gold explanations. We apply ODASim to multiple embedding models and evaluate it on two benchmarks, the widely used CodeXGLUE and our proposed ODA-X, spanning four programming languages: Python, Java, JavaScript, and Go. Results show that our method achieves up to a 35% improvement in F1 score and an 85% reduction in Expected Calibration Error (ECE), enabling reliable evaluation of code-to-explanation quality.
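The abstract reports an 85% reduction in Expected Calibration Error (ECE). As a reference point, a minimal sketch of the standard binned ECE computation is shown below; the exact binning protocol used in the paper is not stated here, so the bin count and equal-width binning are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: the weighted average gap between mean
    confidence and empirical accuracy within each confidence bin.
    Assumes equal-width bins over (0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue  # skip empty bins
        acc = correct[mask].mean()   # empirical accuracy in the bin
        conf = confidences[mask].mean()  # mean predicted confidence
        ece += mask.mean() * abs(acc - conf)  # weight by bin population
    return ece
```

A perfectly calibrated predictor (e.g. 95% confidence on examples that are correct 95% of the time) yields an ECE near zero, while systematic over- or under-confidence inflates it.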
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Code generation and understanding, Evaluation, Summarization
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Java, JavaScript, Go, and Python
Submission Number: 9448