Code Means More Than Plain Language: Bringing Syntax Structure Awareness To Algorithmic Problem Solution Generation
Keywords: program synthesis, transformer, syntax structure
TL;DR: The first work to introduce syntax tree structure into program synthesis
Abstract: Program Synthesis (PS) is the task of building computer programs that satisfy problem specifications. Large-scale pre-trained language models treat PS as a sequence prediction task, an approach that has recently become popular. However, these methods rely heavily on conventional Natural Language Processing (NLP) tokenizers, which overlook the rich structural and syntactic information in code. In this work, we posit that syntax structures help generate programs that are free of syntax errors and algorithmically correct. If program syntax structures can be integrated into the tokenizer, the program representation space can be significantly simplified. To this end, we propose a new end-to-end framework named ASTer, coupled with our novel syntax-aware tokenization toolkit. More specifically, our tokenizer encodes and decodes a program by its syntax roles and contents, not by the surface form of its strings. ASTer incorporates a novel sample-wise and token-wise attention mechanism, and benefits from training on the syntactically aligned samples produced by our tokenization toolkit. Extensive evaluations show superior performance over state-of-the-art methods, which confirms that bringing syntax knowledge into the language model helps it better capture the data structure and simplifies the search space. All of our code will be made publicly available upon acceptance.
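As a rough illustration of what tokenizing by syntax role and content could look like, the sketch below walks a Python abstract syntax tree with the standard `ast` module and emits (role, content) pairs. It is not the ASTer tokenizer, whose design the abstract does not specify; the function name `syntax_tokens` and the pair format are assumptions made for this example only.

```python
import ast

def syntax_tokens(source):
    """Hypothetical helper: pre-order AST walk yielding (role, content) pairs.

    role    = the AST node type name (e.g. "Assign", "BinOp", "Name")
    content = the identifier or literal carried by the node, else None
    """
    tokens = []

    def visit(node):
        if isinstance(node, ast.Name):
            content = node.id
        elif isinstance(node, ast.Constant):
            content = repr(node.value)
        else:
            content = None
        tokens.append((type(node).__name__, content))
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens

# Two strings that differ only in whitespace share one syntax tree,
# so they map to the same token sequence.
compact = syntax_tokens("x=a+1")
spaced = syntax_tokens("x   =   a + 1")
print(compact == spaced)
```

A string-level tokenizer would see two different character sequences here, whereas the tree-based view collapses them into one representation. This is one way to read the abstract's claim that syntax awareness can simplify the representation space.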
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning