CodeFusion: A Pre-trained Diffusion Model for Code Generation

Mukul Singh; José Cambronero; Sumit Gulwani; Vu Le; Carina Suzana Negreanu; Gust Verbruggen

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, Venue: EMNLP 2023 Main

Submission Type: Regular Short Paper

Submission Track: Natural Language Generation

Submission Track 2: Machine Translation

Keywords: Text-to-code generation, Diffusion models, Program synthesis, Language models

TL;DR: A pre-trained diffusion code generation model that employs an attention-based decoding strategy and a mask-denoising objective to generate code.

Abstract: Imagine a developer who can only change their last line of code—how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance in diversity versus quality.
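The abstract attributes the top-3 and top-5 gains to a better balance of diversity versus quality. The sketch below illustrates why diversity matters for top-k accuracy. It is an illustration only, not the paper's evaluation code: the function `top_k_accuracy`, the exact-string-match correctness check, and the toy candidate lists are all assumptions introduced here, and the abstract does not state how correctness is judged.

```python
from typing import Callable, Sequence


def exact_match(candidate: str, reference: str) -> bool:
    # Assumed correctness check: whitespace-trimmed string equality.
    # A real evaluation might use execution results or another metric.
    return candidate.strip() == reference.strip()


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
    is_correct: Callable[[str, str], bool] = exact_match,
) -> float:
    """Fraction of tasks where any of the first k ranked candidates is correct."""
    if len(ranked_candidates) != len(references):
        raise ValueError("need one candidate list per reference")
    if not references:
        return 0.0
    hits = sum(
        any(is_correct(cand, ref) for cand in cands[:k])
        for cands, ref in zip(ranked_candidates, references)
    )
    return hits / len(references)


# Hypothetical toy data: two tasks, three ranked candidates each.
references = ["ls -la", "sorted(xs)"]

# A generator that repeats near-identical outputs.
low_diversity = [
    ["ls -l", "ls -l", "ls -l"],
    ["sorted(xs)", "sorted(xs)", "sorted(xs)"],
]

# A generator with the same top-1 outputs but varied follow-ups.
diverse = [
    ["ls -l", "ls -la", "ls -a"],
    ["sorted(xs)", "xs.sort()", "list(sorted(xs))"],
]

for name, cands in [("low_diversity", low_diversity), ("diverse", diverse)]:
    print(name, top_k_accuracy(cands, references, 1), top_k_accuracy(cands, references, 3))
```

In this toy setup both generators have the same top-1 accuracy, but only the diverse one improves at k=3, because repeating a wrong first guess adds nothing to the later ranks.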

Submission Number: 4394
