Learning Scalable Representation for Source Code

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: This paper presents a scalable distributed code representation (SDCR) learning technique, which addresses the common sparsity and out-of-vocabulary (OOV) problems simultaneously. We introduce the abstract syntax tree (AST) to reflect the structural information of a code snippet and adopt the well-recognized 'bag of AST paths' as its intermediate representation, so that the unique structural and syntactic information of programs can be captured. Our proposed SDCR rests on two core pillars. First, we provide a comprehensive empirical study showing that only 1% of the AST paths account for approximately 75% of all AST path occurrences; dropping most of the remaining, rarely used AST paths therefore still allows SDCR to perform well. Second, all AST paths (excluding the leaf nodes of the AST) are composed of a limited number of descriptive path elements, so a lightweight encoder can produce a good embedding of any AST path. Together, these two pillars enable us to represent code snippets with better generalizability and scalability. Extensive experiments on two real-world datasets show that SDCR achieves superior performance over the state of the art with nearly 40% fewer model parameters.
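
The sketch below is a minimal, illustrative rendering of the two ideas summarized in the abstract: extracting a 'bag of AST paths' from a code snippet, and embedding any path from the small, closed vocabulary of non-leaf path elements. It is not the paper's pipeline; names such as `extract_paths` and `PathElementEncoder` are hypothetical, and Python's built-in `ast` module stands in for whatever parser the authors use.

```python
# Hypothetical sketch of (1) a bag of AST paths and (2) a lightweight
# path-element encoder; not the SDCR implementation described in the paper.
import ast
from collections import Counter
import numpy as np


def extract_paths(code, max_leaves=50):
    """Return a bag (Counter) of leaf-to-leaf AST paths.

    Each path is a tuple of node-type names along the route
    leaf -> ... -> lowest common ancestor -> ... -> leaf.
    """
    tree = ast.parse(code)
    leaf_chains = []  # root-to-leaf chains of node-type names

    def walk(node, ancestors):
        chain = ancestors + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:          # a leaf of the AST
            leaf_chains.append(chain)
        for child in children:
            walk(child, chain)

    walk(tree, [])
    bag = Counter()
    n = min(len(leaf_chains), max_leaves)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = leaf_chains[i], leaf_chains[j]
            # Lowest common ancestor = end of the longest shared prefix.
            k = 0
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            # Go up from leaf a to the LCA, then down to leaf b.
            path = tuple(a[k - 1:][::-1] + b[k:])
            bag[path] += 1
    return bag


class PathElementEncoder:
    """Embed any AST path by averaging embeddings of its path elements.

    Because the set of node types is small and closed, a path built from
    them has no out-of-vocabulary problem at the element level.
    """

    def __init__(self, element_vocab, dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.table = {e: rng.normal(size=dim) for e in sorted(element_vocab)}

    def encode(self, path):
        return np.mean([self.table[e] for e in path], axis=0)


if __name__ == "__main__":
    bag = extract_paths("def add(a, b):\n    return a + b\n")
    vocab = {e for path in bag for e in path}
    enc = PathElementEncoder(vocab)
    # Embed the most frequent path; rare paths could be dropped, in the
    # spirit of the abstract's frequency analysis, with little loss.
    top_path, count = bag.most_common(1)[0]
    print(top_path, count, enc.encode(top_path).shape)
```

In this toy setting, averaging per-element embeddings keeps the encoder lightweight: its parameter count grows with the number of distinct node types rather than with the (much larger) number of distinct paths, which is one plausible way to read the abstract's scalability claim.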