Beyond syntax trees: learning embeddings of code edits by combining multiple source representations

Anonymous

04 Mar 2022 (modified: 05 May 2023), ICLR 2022 Workshop DL4C Blind Submission, Readers: Everyone
Keywords: code2vec, code2seq, commit2vec, code edit representation, commit representation, abstract syntax tree, control flow graph, data flow graph, code embedding
TL;DR: A model to learn embeddings of code edits that combines multiple path-based representations derived from the Abstract Syntax Tree, the Control Flow Graph and the Data Flow Graph of the code.
Abstract: Learning efficient distributed representations of code edits is fundamental for various software engineering tasks, such as the automatic identification of commits that introduce or correct vulnerabilities. Some successful models, including commit2vec and edit2vec, represent code changes using paths extracted from the Abstract Syntax Tree (AST). Other works have shown that, in addition to the AST, considering graph structures that encode the control flow and the data flow of a program can lead to more effective code embeddings. In our work, we introduce a new model to represent code edits that leverages different paths derived from the AST, the Control Flow Graph (CFG) and the Data Flow Graph (DFG). Our preliminary evaluation on the task of classifying security-relevant commits yielded encouraging results that call for further investigation.
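To make the idea of a path-based representation of a code edit more concrete, the following is a minimal sketch and not the model described in the abstract. It assumes a code2vec-style notion of a path context (two leaf tokens joined by the AST path between them) and assumes that an edit can be summarized by the contexts that appear only before or only after the change. The function names `leaf_path_contexts` and `edit_contexts` are hypothetical, Python's standard `ast` module is used purely for convenience, and CFG and DFG paths are left out because they would need their own graph construction.

```python
import ast
from itertools import combinations


def leaf_path_contexts(source):
    """Return a set of (leaf, path, leaf) triples for a Python snippet.

    Leaves are identifiers (ast.Name) and literals (ast.Constant); the path
    is the chain of AST node types connecting the two leaves.
    """
    tree = ast.parse(source)
    leaves = []  # (token, chain of nodes from the root down to the leaf)

    def visit(node, ancestors):
        chain = ancestors + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, chain))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), chain))
        for child in ast.iter_child_nodes(node):
            visit(child, chain)

    visit(tree, [])

    contexts = set()
    for (tok_a, chain_a), (tok_b, chain_b) in combinations(leaves, 2):
        # Length of the shared prefix; chain[i - 1] is the lowest common ancestor.
        i = 0
        while i < min(len(chain_a), len(chain_b)) and chain_a[i] is chain_b[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(chain_a[i:])]
        top = type(chain_a[i - 1]).__name__
        down = [type(n).__name__ for n in chain_b[i:]]
        path = "^".join(up + [top]) + "_" + "_".join(down)
        contexts.add((tok_a, path, tok_b))
    return contexts


def edit_contexts(before, after):
    """Assumed edit summary: contexts removed by the edit and contexts added by it."""
    old = leaf_path_contexts(before)
    new = leaf_path_contexts(after)
    return old - new, new - old


if __name__ == "__main__":
    removed, added = edit_contexts("x = a + 1", "x = a + 2")
    print(sorted(removed))
    print(sorted(added))
```

In a learned model, each triple would typically be mapped to embedding vectors and aggregated (for example with attention) into a single vector for the edit; that step is omitted here.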