Structured Fine-Tuning Enables Data-Efficient Adaptation of Code Language Models

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Code Representations, Parse Trees, Structured Fine-Tuning, Pre-Trained Code Models
TL;DR: We explore a data-efficient adaptation of pre-trained code language models by further training and fine-tuning them with program structures.
Abstract: Current models tailored for code tasks often adopt the successful pre-training-then-fine-tuning paradigm from natural language processing, treating source code as plain text, as in natural language. This approach, however, overlooks the well-defined and unambiguous structures inherent in programming languages. In this work, we explore a data-efficient adaptation of pre-trained code language models by further training and fine-tuning them with program structures, which significantly improves performance on downstream coding tasks. Specifically, we represent programs as parse trees, also known as concrete syntax trees (CSTs), and refine a model with serialized CSTs. Fine-tuning with structures encourages the model to learn not only the associations of code text across languages but also the mappings between their structures and grammars, using only a small amount of data (e.g., 100 examples). With a focus on generation, we design training objectives for both encoder-decoder and decoder-only architectures. We rigorously evaluate the proposed approach on various coding tasks and demonstrate that integrating parse structures with the plain-text representation of source code offers notable advantages, particularly in low-data code translation scenarios.
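As a concrete illustration of the kind of CST serialization the abstract describes (not taken from the submission itself), the sketch below linearizes a parse tree into a bracketed token sequence that could be fed to a language model. It assumes the tree_sitter Python bindings (~0.20 API) with a pre-compiled grammar at the hypothetical path build/langs.so; the paper's actual serialization format and training pipeline are not specified here.

```python
# Minimal sketch: parse source code into a concrete syntax tree (CST) and
# serialize it as a bracketed S-expression. Grammar path and serialization
# format are illustrative assumptions, not the paper's exact setup.
from tree_sitter import Language, Parser

PY_LANGUAGE = Language("build/langs.so", "python")  # assumed pre-built grammar
parser = Parser()
parser.set_language(PY_LANGUAGE)


def serialize_cst(node, source_bytes):
    """Recursively linearize a CST node into a bracketed string."""
    if node.child_count == 0:
        # Leaf node: emit its grammar type together with the surface token.
        token = source_bytes[node.start_byte:node.end_byte].decode("utf8")
        return f"({node.type} {token})"
    children = " ".join(serialize_cst(c, source_bytes) for c in node.children)
    return f"({node.type} {children})"


source = b"def add(a, b):\n    return a + b\n"
tree = parser.parse(source)
print(serialize_cst(tree.root_node, source))
# e.g. "(module (function_definition (def def) (identifier add) ...))"
```

Such a serialized string can be paired with the plain-text program as an additional training target or input, which is one plausible way to realize the structure-aware fine-tuning objectives the abstract mentions for encoder-decoder and decoder-only models.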
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6558