EpiCoder: Encompassing Diversity and Complexity in Code Generation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper introduces a novel feature tree-based synthesis framework for generating diverse and complex code instruction data, and presents the EpiCoder series, which achieves state-of-the-art performance on both function-level and file-level code generation tasks.
Abstract: Existing methods for code generation use code snippets as seed data, restricting the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, enabling it to capture more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios. We fine-tuned widely used base models to obtain the EpiCoder series, achieving state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential for synthesizing repository-level code data. Our code and data are publicly available.
Lay Summary: AI models can generate computer code, but they often rely on small pieces of existing code to learn from, which limits how complex or realistic the generated code can be. Our research introduces a new approach that learns from the structure and patterns of code in a more organized way—like building a tree of features that represent how code works at different levels. This "feature tree" helps the AI understand both simple and complex parts of programming, and lets us control how detailed or large the generated code should be—ranging from a single function to entire software projects. We also created a series of improved AI models, called EpiCoder, which outperform existing tools on standard tests. Our results suggest that this method could help generate high-quality code, potentially benefiting developers, educators, and AI models.
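To make the subtree-sampling idea concrete, here is a minimal sketch of depth- and breadth-bounded sampling from a feature tree. All names and the tree fragment below are hypothetical illustrations, not the authors' actual pipeline; the paper's framework operates on features extracted from real code at much larger scale.

```python
import random

def sample_subtree(tree, max_depth, max_breadth, rng):
    """Sample a subtree of a feature tree, bounding depth and breadth.

    `tree` is a nested dict: {feature_name: {child_feature: {...}, ...}}.
    Deeper and wider sampled subtrees correspond to more complex
    synthesized code tasks (function-level up to multi-file scenarios).
    """
    if max_depth == 0 or not tree:
        return {}
    names = list(tree)
    chosen = rng.sample(names, min(max_breadth, len(names)))
    return {
        name: sample_subtree(tree[name], max_depth - 1, max_breadth, rng)
        for name in chosen
    }

# Hypothetical feature-tree fragment (illustrative only).
feature_tree = {
    "file handling": {
        "reading": {"line iteration": {}, "binary mode": {}},
        "writing": {"appending": {}, "atomic replace": {}},
    },
    "error handling": {"retry logic": {}, "custom exceptions": {}},
}

rng = random.Random(0)
# A shallow, narrow subtree -> a simple function-level task.
shallow = sample_subtree(feature_tree, max_depth=1, max_breadth=1, rng=rng)
# A deeper, wider subtree -> a more complex, multi-feature task.
deep = sample_subtree(feature_tree, max_depth=3, max_breadth=2, rng=rng)
```

The two knobs (`max_depth`, `max_breadth`) mirror the paper's claim that subtree size gives precise control over the complexity of the generated instruction data.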
Link To Code: https://github.com/microsoft/EpiCoder
Primary Area: Deep Learning->Large Language Models
Keywords: large language model, code generation, code, data synthesis
Submission Number: 2209