Abstract: Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary, due, e.g., to the coinage of new variable and method names. Most NLP methods are not designed to reason over such a vocabulary. We introduce a Graph-Structured Cache to address this problem: the cache contains a node for each new word the model encounters, with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models' performance on a code completion task and a variable naming task (with over 100% relative improvement on the latter) at the cost of a moderate increase in computation time.
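To make the cache construction concrete, here is a minimal sketch of the idea as described in the abstract. It assumes the source code has already been parsed into a graph whose token nodes carry word labels; the function name `build_graph_cache` and the `word_use` edge labels are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a Graph-Structured Cache: one node per distinct
# word, with edges linking that node to every occurrence of the word in
# the code graph. Names and edge labels here are assumptions.
import networkx as nx

def build_graph_cache(ast_graph, token_nodes):
    """Add one cache node per distinct word, connected to every
    AST node where that word occurs."""
    g = ast_graph.copy()
    for node_id, word in token_nodes:   # e.g. [(1, "count"), (2, "count")]
        cache_node = ("CACHE", word)    # one shared node per word
        if cache_node not in g:
            g.add_node(cache_node, kind="word_cache")
        # connect the cached word to this occurrence in the code
        g.add_edge(cache_node, node_id, label="word_use")
        g.add_edge(node_id, cache_node, label="word_use_reverse")
    return g

# Toy usage: two identifier nodes that share the name "count"
ast = nx.MultiDiGraph()
ast.add_nodes_from([1, 2], kind="identifier")
cached = build_graph_cache(ast, [(1, "count"), (2, "count")])
print(cached.number_of_nodes())  # 3: two identifiers + one shared cache node
```

Because every usage of a word attaches to the same cache node, information can flow between far-apart occurrences of an out-of-vocabulary identifier without that word ever appearing in a fixed vocabulary.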
Keywords: deep learning, graph neural network, open vocabulary, natural language processing, source code, abstract syntax tree, code completion, variable naming
TL;DR: We show that caching out-of-vocabulary words in a graph, with edges connecting each word to its usages, and processing the resulting graph with a graph neural network improves performance on supervised learning tasks on computer source code.
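For intuition on the "processing with a graph neural network" step, below is a toy, untrained round of message passing over such a graph. The weights and dimensions are random stand-ins; the paper's actual models are trained Graph-Neural-Network architectures, not this sketch.

```python
# One illustrative neighborhood-aggregation step over the cached graph.
# Weights are random placeholders, not a trained model.
import numpy as np

def message_passing_step(node_feats, edges, w_msg, w_self):
    """Each node sums transformed messages from its in-neighbors,
    then mixes them with its own state."""
    n, d = node_feats.shape
    agg = np.zeros((n, d))
    for src, dst in edges:  # directed edges, including cache-word links
        agg[dst] += node_feats[src] @ w_msg
    return np.tanh(node_feats @ w_self + agg)

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))            # 2 identifier nodes + 1 cache node
edges = [(2, 0), (2, 1), (0, 2), (1, 2)]   # cache node 2 linked to its usages
w_msg, w_self = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
feats = message_passing_step(feats, edges, w_msg, w_self)
```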
Code: [mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary--Code_Preprocessor](https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary--Code_Preprocessor) + [2 community implementations on Papers with Code](https://paperswithcode.com/paper/?openreview=SkNSehA9FQ)
Community Implementations: [3 code implementations on CatalyzeX](https://www.catalyzex.com/paper/open-vocabulary-learning-on-source-code-with/code)