Semantic-Aligned Code Summarization: Bridging the Gap Between Code and Natural Language Through Data Flow Analysis

Published: 2025, Last Modified: 23 Jan 2026. IEEE Trans. Neural Networks Learn. Syst. 2025. License: CC BY-SA 4.0
Abstract: Code summarization aims to generate natural language descriptions of code snippets, helping developers understand code and work more productively. Previous research often overlooks the semantic connection between code and its natural language description, resulting in a noticeable gap and suboptimal solutions. To address this issue, we introduce a semantic-aligned code summarization framework that leverages data flow information from code for semantic analysis, ensuring alignment between code and summaries. Specifically, we use a semantic extraction module (SEM) to capture the meaning of code and a semantic alignment module to align it with natural language. In the SEM, we construct a code graph that includes data flow edges using static program analysis techniques. On this code graph, we adopt a walking algorithm guided by data flow to extract the semantics of the code. This walking algorithm captures code semantics by analyzing how information is transferred between variables during program execution. In the semantic alignment module, we integrate a contrastive learning loss for semantic alignment, which maps the semantic domains of code and natural language into a unified vector space. We further show theoretically that the data-flow-guided walking algorithm captures semantically highly related nodes in shorter paths. Extensive experiments on two benchmark datasets demonstrate the efficacy and broad applicability of the framework.
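To make the idea of a walk guided by data flow concrete, here is a minimal sketch in Python. It is not the paper's SEM: the abstract does not specify the graph construction or the walking rule, so everything below is an assumption chosen for illustration. The hypothetical helper `data_flow_edges` handles only straight-line, top-level statements and links each variable read to its most recent assignment. The hypothetical `guided_walk` then follows those def-use edges forward, choosing uniformly at random among successors.

```python
import ast
import random
from collections import defaultdict


def data_flow_edges(source):
    """Return def-use edges (def_line, use_line, variable) for straight-line code.

    Assumption: only top-level statements are analyzed; branches, loops,
    and function bodies are ignored to keep the sketch small.
    """
    tree = ast.parse(source)
    last_def = {}  # variable name -> line of its most recent assignment
    edges = []
    for stmt in tree.body:
        # Resolve reads first, so `x = x + 1` links to the previous def of x.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id in last_def:
                    edges.append((last_def[node.id], stmt.lineno, node.id))
        # Then record the variables this statement assigns.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                last_def[node.id] = stmt.lineno
    return edges


def guided_walk(edges, start, max_steps=10, seed=0):
    """Walk forward along data flow edges from `start`.

    Assumption: successors are chosen uniformly at random; the paper's
    actual transition rule is not given in the abstract.
    """
    rng = random.Random(seed)
    successors = defaultdict(list)
    for src, dst, _ in edges:
        successors[src].append(dst)
    path = [start]
    # Stop at a statement with no outgoing data flow, or after max_steps moves.
    while len(path) <= max_steps and successors.get(path[-1]):
        path.append(rng.choice(successors[path[-1]]))
    return path


if __name__ == "__main__":
    snippet = "a = 1\nb = a + 2\nc = a * b\nprint(c)\n"
    flow = data_flow_edges(snippet)
    print(flow)
    print(guided_walk(flow, start=1))
```

In this toy snippet every walk that starts at line 1 ends at the `print(c)` statement on line 4, because each step follows a value from its definition to a use. That loosely illustrates why following data flow can reach semantically related statements in few steps, though the paper's theoretical analysis concerns its own algorithm, not this sketch.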