Codex hacks HackerRank: Benefits and Risks of Large-Scale Source Code Models

Anonymous

04 Mar 2022 (modified: 05 May 2023) · ICLR 2022 Workshop DL4C Blind Submission
Keywords: machine learning for source code, evaluation of code models, software engineering
TL;DR: Codex hacks HackerRank, solves 96% of the problems in a zero-shot setting, but it appears to be parroting memorized code.
Abstract: The Codex model has demonstrated extraordinary competence in synthesizing working code from natural language problem descriptions (Chen et al. 2021). However, in order to reveal unknown failure modes and uncover hidden biases, such large-scale models must be systematically subjected to multiple evaluations. In this work, we evaluate the code synthesis capabilities of the Codex model on a set of 115 Python problems from a popular competitive programming portal: HackerRank. Our evaluation shows that Codex is indeed proficient in Python---solving 96% of the problems in a zero-shot setting and 100% of the problems in a few-shot setting. However, Codex shows signs of producing memorized code, which is alarming given that the adoption of such models will directly shape how code is written and produced in the foreseeable future. With this in mind, we further discuss and highlight some of the prominent benefits and risks associated with large-scale language models such as Codex.
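
As a rough illustration of the zero-shot setting described in the abstract, the sketch below passes a HackerRank-style problem statement to Codex via the OpenAI completions API and treats the returned completion as the candidate solution. The engine name, prompt framing, and sampling parameters here are assumptions for illustration, not necessarily the authors' exact configuration.

```python
# Minimal sketch of zero-shot code synthesis with Codex.
# Engine name, prompt format, and decoding parameters are assumed, not taken
# from the paper.
import openai

openai.api_key = "YOUR_API_KEY"  # assumed to be supplied by the reader


def synthesize_solution(problem_statement: str) -> str:
    """Ask Codex for a Python solution to a HackerRank-style problem."""
    prompt = (
        '"""\n'
        f"{problem_statement}\n"
        "Write a Python 3 program that reads from stdin and prints the answer.\n"
        '"""\n'
    )
    response = openai.Completion.create(
        engine="code-davinci-002",  # assumed Codex engine name
        prompt=prompt,
        max_tokens=512,
        temperature=0.0,            # greedy decoding for reproducibility
        stop=['"""'],
    )
    return response["choices"][0]["text"]
```

In a few-shot variant, one would simply prepend a small number of solved problem/solution pairs to the prompt before the target problem statement.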