Abstract: Code representation learning is an important way to encode the semantics of source code through pre-training. The learned representations support a variety of downstream tasks, such as natural language code search and code defect detection. Inspired by pre-trained models for natural language representation learning, existing approaches often treat the source code or its structural information (e.g., the Abstract Syntax Tree, or AST) as a plain token sequence. Unlike natural language, however, programming languages have unique code unit information (e.g., identifiers and expressions) and logic information (e.g., the functionality of a code snippet). To further exploit these properties, we propose Abstract Code Embedding (AbCE), a self-supervised learning method that considers the abstract semantics of code logic. Instead of scattered tokens, AbCE treats an entire node or subtree of an AST as a basic code unit during pre-training, which preserves the integrity of each code unit. Moreover, AbCE learns the abstract semantics of AST nodes via self-distillation. Experimental results show that AbCE achieves significant improvements over state-of-the-art baselines on code search tasks and comparable performance on code clone detection and defect detection tasks, even without using contrastive learning or curriculum learning.