CodeXData: Do Code Generating Language Models Understand Data?

Anonymous

16 Dec 2022 (modified: 05 May 2023)
ACL ARR 2022 December Blind Submission
Readers: Everyone
Abstract: Large language models (LLMs) are effective at code generation. Certain code tasks, such as data wrangling or data analysis, can be data-dependent: the correct program depends on the contents of the input data. To study the extent to which code-generating models condition on input data, we define two novel data-centric taxonomies that characterise (1) the data required to complete a task and (2) the data available for a given task. Our system, CodeXData, generates Python code under various taxonomy configurations given an underlying LLM such as Codex or InCoder. To evaluate CodeXData, we curate two new datasets for Python code generation from natural language for data-centric tasks. We evaluate on these datasets while varying configurations over our taxonomies, and find that performance depends on the task class, the degree of data access, and the prompting strategy. This is the first empirical measurement of the impact of data on LLM-based NL-to-code generation for data-centric tasks.
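To make the "data-dependent" framing concrete, the following is a minimal hypothetical sketch, not drawn from the paper, its taxonomies, or its datasets: all names (the dataframe, its columns, the prompt variants) are invented for illustration. It shows how the same natural-language request is only resolvable when the prompt exposes the actual schema of the input data.

```python
# Hypothetical illustration (not from the paper): a data-dependent
# NL-to-code task where the correct program depends on the input data.
import pandas as pd

df = pd.DataFrame({
    "EmpName": ["Ada", "Grace", "Alan"],
    "Salary ($)": [120000, 115000, 98000],
})

# Prompt variant 1: task description only. The model must guess
# column names, so "sort by salary" is ambiguous.
prompt_no_data = "# Sort the dataframe df by salary, descending.\n"

# Prompt variant 2: task description plus a data preview, so the
# model can condition on the real schema ("Salary ($)", not "salary").
prompt_with_data = (
    "# df.head():\n"
    + "\n".join("# " + line for line in df.head().to_string().splitlines())
    + "\n# Sort the dataframe df by salary, descending.\n"
)

print(prompt_no_data)
print(prompt_with_data)

# A correct completion needs the column name taken from the data itself:
print(df.sort_values("Salary ($)", ascending=False))
```

Under this reading, varying whether the data preview appears in the prompt corresponds to varying "data available", while the task's reliance on the real column name corresponds to "data required".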
Paper Type: long
Research Area: Generation