Title: Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation

Abstract: Despite recent progress made by large language models in code generation, they still struggle with programs that meet complex requirements. Recent work utilizes plan-and-solve decomposition to decrease the complexity and leverages self-tests to refine the generated program. Yet, planning deep-inside requirements in advance can be challenging, and the tests need to be accurate to accomplish self-improvement. To this end, we propose FUNCODER, a code generation framework incorporating the divide-and-conquer strategy with functional consensus. Specifically, FUNCODER recursively branches off sub-functions as smaller goals during code generation, represented by a tree hierarchy. These sub-functions are then composited to attain more complex objectives. Additionally, we designate functions via a consensus formed by identifying similarities in program behavior, mitigating error propagation. FUNCODER outperforms state-of-the-art methods by +9.8% on average in HumanEval, MBPP, xCodeEval and MATH with GPT-3.5 and GPT-4. Moreover, our method demonstrates superiority on smaller models: With FUNCODER, StableCode 3b surpasses GPT-3.5 by +18.6% and achieves 97.7% of GPT-4's performance on HumanEval. Further analysis reveals that our proposed dynamic function decomposition is capable of handling complex requirements, and the functional consensus prevails over self-testing in correctness evaluation.

Section: Introduction
The landscape of code generation has been dramatically reshaped by the advent of large language models (LLMs) (OpenAI, 2023; Touvron et al., 2023), alongside specialized coding models (Rozière et al., 2023; Lozhkov et al., 2024; Pinnaparaju et al., 2024). While these models excel at generating straightforward code snippets, their performance significantly degrades when confronted with intricate and multifaceted programming requirements. This performance gap highlights a critical challenge: enabling LLMs to reliably produce complex, correct, and robust code.

To address this challenge, various strategies have been explored. Two-stage methods (Jiang et al., 2023; Zelikman et al., 2023) adopt a plan-and-solve paradigm, initially generating a high-level outline before proceeding to detailed code implementation. Multi-agent development frameworks (Hong et al., 2024; Qian et al., 2023) simulate real-world software engineering processes by assigning distinct roles to LLMs that collaborate towards a common goal. Furthermore, self-improvement techniques (Shinn et al., 2023; Chen et al., 2024) aim to refine generated programs using execution feedback derived from automatically generated unit tests.

Despite these advancements, persistent limitations hinder the robust generation of complex code. First, two-stage planning approaches often necessitate a complete upfront design, lacking the flexibility to dynamically adapt the top-level plan during the iterative implementation process, which can lead to suboptimal problem decomposition. Second, multi-agent collaboration frameworks are inherently resource-intensive and heavily dependent on the capabilities of powerful LLMs, limiting their generalizability and applicability to smaller, open-source models. Third, and critically, code refinement through self-tests is fundamentally constrained by the accuracy and reliability of the generated unit tests. Our preliminary investigations (§3.1.3) reveal that LLMs frequently produce an abundance of unreliable self-tests, which can mislead the self-improvement process and, in severe cases, exacerbate existing program errors.

To overcome these critical limitations, we introduce FUNCODER, a novel code generation framework built upon a dynamic divide-and-conquer strategy coupled with an innovative functional consensus mechanism. FUNCODER addresses complex problems by recursively decomposing them into a hierarchical tree of interdependent sub-functions. This top-down decomposition reduces the complexity of individual generation steps, while a bottom-up composition phase aggregates these solved sub-functions to achieve the overall objective. While decomposition simplifies tasks, it also introduces the risk of cascading errors where faults in sub-functions propagate upwards. To mitigate this, FUNCODER employs functional consensus, a mechanism that samples multiple implementations for each function and selects the most robust one by evaluating the aggregated behavioral similarity among candidates. This consensus-driven selection significantly reduces inconsistencies in program behavior and prevents the propagation of errors, thereby enhancing the overall reliability and correctness of the generated code.
Algorithm 1 FUNCODER procedure
Require: Entry func, f_root = {h_root, d_root, ϕ}
Require: Large language model, LLM
    function FUNCODER(f_cur)
        - Divide -
        f′_cur, {f_i} ← EXTRACT(LLM(f_cur))
        for f_i ∈ {f_i} do
            if b_i is NOTIMPLEMENTED then
                f*_i ← FUNCODER(f_i)    ▷ recursion
            end if
            ADDCHILD(f_cur, f*_i)
        end for
        - Conquer -

    def find_factors(x):
        """return factors of x"""
        raise NotImplementedError()

    def sum_common_factors(a, b):
        """sum the common factors"""
        fa = find_factors(a)
        fb = find_factors(b)
        return sum_common(fa, fb)

    def sum_common(a: list, b: list) -> int:
        """sum of common elements"""
        raise NotImplementedError()

Figure 2: Left: Algorithm for FUNCODER, explained in detail in Appendix A.6. Right: Comparison between decomposition by planning and our approach. FUNCODER introduces new functions to describe sub-goals solely with code, achieving a more natural way of requirement decomposition.

Section: FUNCODER: Divide-and-Conquer Meets Consensus

Section: Divide-and-Conquer for Iterative Programming
A function is defined as a relation between a set of inputs and outputs where each input is assigned exactly one output (Halmos, 1998), denoted as $y = f(x)$. In computer programming, a function is identified by its header $h_f$ together with its body $b_f$, and is commonly accompanied by documentation $d_f$ to improve readability. Functions can be invoked from other procedures, allowing large and complicated requirements to be decomposed into smaller structures that exhibit high comprehensibility and quality (Dahl et al., 1972). Human programmers generally decompose tasks into clearly defined sub-functions and then implement them recursively, which makes functions reusable and takes advantage of the divide-and-conquer principle. Inspired by this, FUNCODER recursively divides the requirement and conquers functions to formulate a sophisticated solution, unleashing the potential of LLMs in code generation.
Divide is a top-down process that iteratively breaks down problems. Given a code generation problem, the process begins from the entry function $f_{\text{root}}$. We instruct the model to introduce new functions $f_i \in \text{CHILD}(f_{\text{cur}})$ that solve certain sub-goals while writing the current function $f_{\text{cur}}$. To reduce the complexity involved in each generation, we only require the headers $h_{f_i}$ and documentation $d_{f_i}$ of new functions to be generated, while their implementations $b_{f_i}$ can be postponed. After completing the current function, the model starts to address those unimplemented sub-functions and completes $b_{f_i}$ into $f'_i$. This process stops when the model deems functions too simple to be further divided, finally forming a dependency tree $T = \text{TREE}(f_{\text{root}}, \text{CHILD}(f_{\text{root}}))$. The divide process is similar to a search starting from the entry function, gradually involving new sub-functions while writing the current one, and implementing them recursively. We guide the entire process through a depth-first search.
Conquer is a process of achieving complex objectives through aggregating smaller functions. We notice that child functions are not yet implemented during the top-down process of writing parent functions. As a result, these parent functions may not be able to effectively utilize the child functions, or misuse them at worst. FUNCODER deals with this issue by re-generating functions in inverse topological order on the dependency tree $T$: starting from the leaves, complex goals are handled by compositing solved children as

$$f^*_{\text{cur}} \leftarrow \mathcal{F}\left(f'_{\text{cur}}, \{f^*_1, f^*_2, \dots\}\right) \mid f^*_i \in \text{CHILD}(f_{\text{cur}}).$$

Divide and conquer naturally achieve both decomposition and composition during code generation. Unlike two-stage and agent-based methods, our approach dynamically introduces new functions along the process, making it less burdensome than producing a complete plan at the very beginning. Moreover, while planning or agents require chat capabilities, FUNCODER represents sub-tasks through functions (Figure 2), making it more applicable to specialized code generation models.
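The recursion described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: `FuncNode`, `funcoder`, `toy_divide` and `toy_conquer` are hypothetical names, and the two callbacks stand in for the LLM calls so that the example runs without any model. It shows that sub-functions are introduced top-down while each function is rewritten only after its children are solved.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class FuncNode:
    """One node of the dependency tree: header, documentation, optional body."""
    header: str
    doc: str
    body: Optional[str] = None  # None means the body is not implemented yet
    children: List["FuncNode"] = field(default_factory=list)


# Hypothetical stand-ins for the two LLM calls (assumed, not the authors' API).
DivideFn = Callable[[FuncNode], Tuple[str, List[FuncNode]]]
ConquerFn = Callable[[FuncNode, List[FuncNode]], str]


def funcoder(cur: FuncNode, divide: DivideFn, conquer: ConquerFn) -> FuncNode:
    # Divide: draft the current body and collect newly introduced sub-functions.
    cur.body, new_funcs = divide(cur)
    for sub in new_funcs:
        if sub.body is None:  # unimplemented stub: solve it first (depth-first)
            sub = funcoder(sub, divide, conquer)
        cur.children.append(sub)
    # Conquer: rewrite the current function now that its children are solved.
    cur.body = conquer(cur, cur.children)
    return cur


# Toy callbacks that mimic the Figure 2 example without calling any model.
conquer_order: List[str] = []


def toy_divide(node: FuncNode) -> Tuple[str, List[FuncNode]]:
    if node.header == "def sum_common_factors(a, b)":
        subs = [FuncNode("def find_factors(x)", "return factors of x"),
                FuncNode("def sum_common(a, b)", "sum of common elements")]
        return "return sum_common(find_factors(a), find_factors(b))", subs
    return "raise NotImplementedError()", []  # leaves introduce nothing new


def toy_conquer(node: FuncNode, children: List[FuncNode]) -> str:
    conquer_order.append(node.header)
    return f"<final body built on {len(children)} solved children>"


root = funcoder(FuncNode("def sum_common_factors(a, b)", "sum the common factors"),
                toy_divide, toy_conquer)
print(conquer_order)  # leaves are conquered before the root
```

In a real setting the conquer callback would sample several implementations and pick one by functional consensus; here it only records the order in which functions are finalized.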

Section: Functionality Similarity as a Consensus
The decomposition of complex tasks benefits from solving easier sub-goals, but might introduce the risk of cascading errors, that is, errors in sub-functions that lead to errors in ancestral functions. To mitigate this, we introduce Functional Consensus, which aims at reducing inconsistencies in program behavior. This is achieved by sampling multiple functions and selecting the one that exhibits consensus, as measured by the aggregated similarity of functionality between candidates, thus abating outlier functionalities.
Functionality Similarity A program specifies its functionality (or behavior) through the control flow and logic defined by its code semantics. However, comparing the functionalities between two programs based on their semantics is somewhat challenging. By decomposing the requirement into functions, FUNCODER is able to view the function behavior as a black box that maps arguments into return values. Considering two functions $f$ and $g$ with the same input domain $D(f) = D(g)$, we define the similarity between them, $\text{sim}(f, g)$, as the identicalness of outputs when given the same input values.

$$\text{sim}(f, g) = \frac{\sum_{x \in D(f)} \mathbb{1}[f(x) = g(x)]}{|D(f)|} \approx \frac{\sum_{x \in X \mid X \sim D(f)} \mathbb{1}[f(x) = g(x)]}{|X|} \tag{1}$$

The similarity becomes 1 if and only if two functions output consistent values for all inputs: $\forall x \in D(f): f(x) = g(x) \Leftrightarrow \text{sim}(f, g) = 1$.
We notice that the input domain $D(f)$ is unbounded in most cases, making its measurement barely feasible in practice. Thus, we approximate it by sampling a subset of possible inputs $X \sim D(f)$ with an LLM.
Consensus is reached by selecting the candidate $f^*$ holding maximal similarity with others after sampling multiple function implementations $F = \{f^{(i)}\}$ for the same requirements.

$$f^* = \text{FUNCONSENSUS}(F) = \arg\max_{f^{(i)} \in F} \sum_{f^{(j)} \in F \setminus \{f^{(i)}\}} \text{sim}\left(f^{(i)}, f^{(j)}\right) \tag{2}$$
By introducing functional consensus, FUNCODER produces functions that are more consistent and common in functionality, while omitting abnormal samples. The process is applied to not just the final program, but also to every sub-tree during the bottom-up conquering stage, resulting in step-by-step, thorough verification from the most fundamental functions all the way up to the whole program.
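As a rough illustration of Eq. (1) and Eq. (2), the sketch below computes the sampled similarity between candidates and selects the one with the highest aggregated similarity. The names `sim` and `fun_consensus`, the toy candidates, and the hand-written input list are assumptions made for this example; in the paper the inputs are sampled with an LLM. Exceptions are not handled here, and ties are resolved in favor of the earliest candidate, which is a choice of this sketch.

```python
from typing import Any, Callable, Sequence

Func = Callable[[Any], Any]


def sim(f: Func, g: Func, inputs: Sequence[Any]) -> float:
    """Sampled form of Eq. (1): fraction of inputs on which f and g agree."""
    return sum(f(x) == g(x) for x in inputs) / len(inputs)


def fun_consensus(candidates: Sequence[Func], inputs: Sequence[Any]) -> Func:
    """Eq. (2): return the candidate with maximal summed similarity to the others."""
    def total_similarity(i: int) -> float:
        return sum(sim(candidates[i], candidates[j], inputs)
                   for j in range(len(candidates)) if j != i)

    best = max(range(len(candidates)), key=total_similarity)
    return candidates[best]


# Three toy candidates for "absolute value"; the last one is wrong for x < 0.
def abs_builtin(x): return abs(x)
def abs_branch(x): return x if x >= 0 else -x
def abs_buggy(x): return x

candidates = [abs_builtin, abs_branch, abs_buggy]
sampled_inputs = [-2, -1, 0, 1, 2]
chosen = fun_consensus(candidates, sampled_inputs)
print(chosen.__name__)
```

The two agreeing candidates each accumulate a similarity of 1.0 + 0.6, while the outlier only reaches 0.6 + 0.6, so the outlier is never selected even though no reference output is available.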

Section: FUNCODER is a Function Coder
We design FUNCODER as a procedure that takes a problem in the form of a function signature $f(x)$, and produces a final solution $f^*(x)$, as exemplified in Figure 1. Given a problem $f(x)$, FUNCODER partially implements the function as $f'(x)$, referring to unimplemented sub-functions $g(y)$ and $h(z)$. These sub-functions are then fed into FUNCODER to be solved recursively. We then sample $k$ implementations $f'^{(i)}(x)$ based on the solved children $g^*(y)$ and $h^*(z)$. Functional consensus is calculated by evaluating candidates on possible inputs. The function sharing maximal behavioral similarity is combined with the solved children to formulate the final solution.

Section: Experiments
We conduct experiments on competition-level code generation and mathematical reasoning benchmarks with state-of-the-art LLMs, which are covered in §3.1 and §3.2, respectively. In addition to GPT models (Ouyang et al., 2022; OpenAI, 2023), we also conduct experiments with community models like Llama3 8b (Meta AI, 2024), StableCode 3b (Pinnaparaju et al., 2024), and CodeLlama 34b (Rozière et al., 2023). We use the instruct variants of these models and run inference on a single A100-80G under BF16 precision with vLLM (Kwon et al., 2023).

Section: Code Generation
We choose three benchmarks for code generation evaluation: (a) HumanEval (Chen et al., 2021) includes entry-level coding questions; (b) MBPP (Austin et al., 2021) contains questions of standard library invocation and programming basics; and (c) xCodeEval (Khan et al., 2023) consists of algorithmic challenges sourced from the competitive programming platform CodeForces.

Section: Experiment Setup
Benchmarks We adopt the full test set (164 problems) for HumanEval, and sample 200 for MBPP and 500 for xCodeEval, respectively. Following EbTech (2024), we split xCodeEval into 4 subsets based on problem difficulty: Easy (≤ 1200), Mid (1200-1599), Hard (1600-1999) and Expert (≥ 2000). The evaluation metric for code generation is Pass@1 unless specified.
Baselines We compare FUNCODER with standard prompting (Brown et al., 2020), the two-stage decomposition method Parsel (Zelikman et al., 2023), the self-testing method CodeT (Chen et al., 2023a), the self-improvement methods Reflexion and LDB (Shinn et al., 2023; Zhong et al., 2024), and the multi-agent development framework MetaGPT (Hong et al., 2024). We implement standard prompting with a 1-shot demonstration. CodeT samples 11 solutions with standard prompting and evaluates them on model-generated tests. The results for Reflexion are reproduced from the original code.
Implementation Details FUNCODER uses a 2-shot prompt in the divide stage and 1-shot for conquering sub-functions. The number of sampled implementations in the functional consensus is set to 11 for code generation tasks. For further implementation details, please refer to Appendix A.1.

Section: Results
Table 1 shows the code generation performance on advanced proprietary models, GPT-3.5 (Ouyang et al., 2022) and GPT-4 (OpenAI, 2023). For basic programming questions, HumanEval and MBPP, FUNCODER surpasses previous SOTA methods by +3.3% in Pass@1 and reduces the error rate by 18.6%. Furthermore, FUNCODER demonstrates a substantial improvement on competition-level problems, outperforming others by 10.4% with GPT-4 and 35.3% with GPT-3.5. We observe that FUNCODER can  11.
Experiment results demonstrate that our method achieves state-of-the-art performance on various models, ranging from basic programming to competition contests.

Section: Analysis
FUNCODER Democratizes to Smaller LLMs Limited by LLM capabilities, applying self-improvement or multi-agent methods to smaller models is not easy. By keeping decomposition and composition within the code generation process, our approach exhibits better generalization. As shown in Table 1, with FUNCODER, StableCode 3b achieves around 118.6% relative performance to standard GPT-3.5, and also aligns closely with GPT-4 by about 97.7% on HumanEval.
Preliminary Study on Self-Testing Method We conduct a preliminary study targeting the self-testing method on HumanEval; results are shown in Figure 3.a, with further details in Appendix A.5. We first verify whether model-generated programs can also pass model-generated self-tests: (a) When a program passes its self-tests, most programs from GPT-3.5 also pass the system tests, whereas as many as 19.5% / 64% ≈ 30.5% of the programs from StableCode are rejected, indicating that smaller models like StableCode may not effectively self-test and detect program errors on their own. (b) In the event of failed self-tests, a large portion of failures are attributed to issues in the self-tests instead of the programs, on both GPT-3.5 and StableCode. These phenomena indicate that self-testing methods have limitations in generating correct and reliable unit tests. As a result, we design functional consensus to not require any assertion, but to perform mutual verification between solutions instead, as opposed to self-testing.
Effectiveness of Functional Consensus Functional consensus or self-testing may be viewed as ranking algorithms for selecting functions. To measure ranking effectiveness, we conduct an analysis on HumanEval with GPT-3.5. For each problem, 11 candidates are ranked with 3 strategies: consensus, self-test, and random shuffle (as a baseline). Effectiveness is measured via Pass@k, i.e., whether any of the top-k ranked programs passes the system test. Figure 3.b shows that functional consensus achieves 94.7% of the upper-bound (Pass@11) performance by selecting a single function (Pass@1), and is close to that of self-test on Pass@4. This clearly demonstrates that functional consensus can effectively evaluate correctness and pick the most promising implementation on the first attempt.
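The ranking-based Pass@k used in this analysis can be computed as in the following sketch. The function name `pass_at_k_ranked` and the toy outcome matrix are assumptions for illustration and are not taken from the paper's data; each row lists, in ranked order, whether a candidate passes the system test.

```python
from typing import List


def pass_at_k_ranked(ranked_results: List[List[bool]], k: int) -> float:
    """Fraction of problems where any of the top-k ranked candidates passes.

    ranked_results[p][r] is True if the candidate ranked r-th for problem p
    passes the system test.
    """
    solved = sum(any(results[:k]) for results in ranked_results)
    return solved / len(ranked_results)


# Toy data: 3 problems, 3 ranked candidates each (made up for illustration).
toy_results = [
    [True, False, False],   # solved by the top-ranked candidate
    [False, True, False],   # solved only from rank 2 onward
    [False, False, False],  # no candidate is correct
]
pass_at_1 = pass_at_k_ranked(toy_results, 1)
upper_bound = pass_at_k_ranked(toy_results, 3)
print(pass_at_1, upper_bound, pass_at_1 / upper_bound)
```

With all candidates included (k equal to the number of samples), the value is independent of the ranking and serves as the upper bound against which Pass@1 is compared.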
Ablation and Token Usage To analyze the impact of dividing, conquering, and functional consensus in FUNCODER, we carry out an ablation study with different settings. Studies that replace consensus with self-testing, or with AlphaCode-like (Li et al., 2022) clustering, are also included. The ablation is constructed on HumanEval with GPT-3.5, as shown in 

Section: Mathematical Reasoning
Code can be viewed as a tool for augmenting the reasoning capabilities of LLMs (Chen et al., 2023b). As an alternative to text-based reasoning like Chain-of-Thought (Wei et al., 2022), programs can offer unique advantages in terms of iteration and calculations. To test the generalizability of FUNCODER beyond algorithm challenges, we conduct an experiment on MATH (Hendrycks et al., 2021b), a competition-level mathematical reasoning benchmark.

Section: Experiment Setup
Benchmark The experiment is conducted on a subset of the MATH test set, including 500 randomly sampled problems that can be classified into 7 disjoint subjects or 5 difficulty levels. Labels in MATH are formatted in LaTeX, rendering exact-match verdicts impractical. We therefore follow previous work (Zhang et al., 2024) and adopt GPT-4 to determine the correspondence between predictions and labels, with further details provided in Appendix A.4.
Baselines We compare FUNCODER with the text-based baselines Standard Prompting and Chain-of-Thought (Wei et al., 2022), and the program-aided baselines Program-of-Thought (Chen et al., 2023b), Self-Refine (Madaan et al., 2023), and Cumulative Reasoning (Zhang et al., 2024). The results of Cumulative Reasoning are taken from the original paper. Standard prompting and chain-of-thought reasoning use 7-shot demonstrations constructed from the train set. Program-of-Thought and Self-Refine prompt the model with a 1-shot demonstration to generate a solution() function that solves the problem. Additionally, Self-Refine iteratively refines programs based on runtime feedback. All baseline methods are run with self-consistency (Wang et al., 2023) at 5.
Implementation Details FUNCODER adopts a program-aided reasoning setting that writes a solution() function and obtains the final prediction by running this program. The number of sampled implementations |F| in functional consensus is set to 5 to match baseline methods.

Section: Results
The experimental results on MATH are shown in Table 3. It shows that program-aided reasoning generally outperforms text-based reasoning. With GPT-4 as the backbone, FUNCODER outperforms the strongest baseline Cumulative Reasoning (Zhang et al., 2024) by (6.0 / 8.3%) and surpasses the vanilla program-aided baseline PoT (Chen et al., 2023b) by (10.0 / 14.7%). When using GPT-3.5-turbo as the backbone, FUNCODER exceeds the strongest baseline by (6.2 / 11.1%) and outperforms PoT by as much as (13.0 / 31.7%), which indicates that our approach has a strong advantage over both text-based reasoning and other program-aided reasoning methods. On open-source models, FUNCODER with Llama3 outperforms PoT by (12.4 / 38.0%). It even reaches competitive performance against the state-of-the-art method based on GPT-3.5 (45.0 vs. 48.6). When employing StableCode and CodeLlama as the backbone, our approach achieves significant improvements of (12.2 / 84.7%) and (9.2 / 60.5%), respectively. This improvement demonstrates that our approach can significantly boost smaller LLMs, democratizing the complex reasoning capabilities of open-source LLMs through programming.
This suggests that text-based reasoning is inadequate for tackling challenging mathematical reasoning problems. The same situation is also observed in PoT. In contrast, our method consistently demonstrates high performance even on challenging problems, particularly excelling on level 5 difficulty with nearly double the performance compared to PoT and CoT. This reflects that our method, with divide-and-conquer applied, can effectively cope with complex problems.

Section: Analysis
Decomposed Functions are Domain-Specific We hypothesize that questions from the same subject require similar knowledge reserves, which should be reflected in the functionality of the sub-functions. To verify this hypothesis, we collect statistics on the common sub-functions of FUNCODER in each MATH subject, as shown in Table 4. It is apparent that different subjects require different abilities, each with its own set of sub-functions closely associated with the domain knowledge. In addition, these common sub-functions are fundamentally basic and straightforward. As exemplified in Appendix B.2, our method is able to leverage and combine these basic sub-functions to achieve more complex goals, thereby reducing the complexity of reasoning and enhancing performance.

Section: Related Work
Large Language Model for Code Code pre-training has received widespread attention, with early models based on small language models (SLM) (Feng et al., 2020; Lu et al., 2021; Wang et al., 2021). In recent years, with the development of large-scale pre-training techniques, code LLMs have emerged, showing remarkable performance in downstream code tasks (Chen et al., 2021; Nijkamp et al., 2023; Li et al., 2022; Rozière et al., 2023; Li et al., 2023b; Guo et al., 2024). Tasks between code and natural language (NL) can be generally divided into three major categories: NL2Code tasks such as code generation (Austin et al., 2021; Chen et al., 2021; Hendrycks et al., 2021a; Khan et al., 2023) and code search (Husain et al., 2019); Code2Code tasks including code completion (Lu et al., 2021; Zhang et al., 2023; Liu et al., 2024), code translation (Ahmad et al., 2023; Zhu et al., 2022; Yan et al., 2023), and test generation (Siddiq et al., 2023; Schäfer et al., 2024); Code2NL tasks like code summarization (Jin et al., 2023). This paper focuses on code generation tasks, ranging from basic to competition level.
Code Refinement and Self-Testing Code doesn't always run as expected; it could contain syntax errors, dead loops, or bugs. It's essential to debug and refine the code to ensure better quality. CodeT (Chen et al., 2023a) generates unit tests to score the implementation. AlphaCode (Li et al., 2022) clusters programs based on whether generated program outputs were identical or not. Self-improvement methods (Madaan et al., 2023; Shinn et al., 2023; Chen et al., 2024; Zhong et al., 2024) design closed-loop procedures that repeatedly refine the code based on the feedback. Like real-life software development processes, multi-agent frameworks (Hong et al., 2024; Qian et al., 2023) construct specific LLM roles, such as Tester or QA, to generate tests. These studies adopt a shared paradigm wherein self-tests are generated through LLMs. However, Olausson et al. (2024) point out that LLMs have certain shortcomings in self-repairing their code. This paper avoids these shortcomings by proposing functional consensus as a reliable method of evaluation.
Program-Aided Reasoning and Agents Aside from code generation tasks, the program can be a tool that augments LLMs to solve complex reasoning questions or interact with external environments. Program-of-Thought (Chen et al., 2023b) and PAL (Gao et al., 2023)
Decompose for Complex Problems Several recent works employ decomposition to reduce the complexity of hard problems. Least-to-Most (Zhou et al., 2023) adopts a two-stage approach, which first decomposes complex problems, and then solves each sub-problem individually to tackle complex reasoning tasks. Successive Prompting (Dua et al., 2022) adopts a dynamic decomposition, iteratively breaking down problems and addressing sub-problems. Tree-of-Thought (Yao et al., 2023) breaks down complex problems into state spaces and uses tree search to solve them. Parsel (Zelikman et al., 2023) introduces decomposition to code generation tasks, taking a three-stage approach to break down requirements into draft and intermediate Parsel programs. RepoCoder (Zhang et al., 2023) performs retrieval in repositories to complete unfinished code one by one. Unlike these methods, FUNCODER recursively decomposes problems into a tree structure, thereby gradually reducing their complexity.

Section: Discussion
Limitations Our approach unleashes the potential power of functions in programming, which is advantageous on well-defined problems such as competitive programming, or program-augmented reasoning tasks. These scenarios do not however represent all use cases, such as open-ended problems or casual software development. Nevertheless, we believe that the idea of divide-and-conquer and sub-modular consensus utilized by FUNCODER can be extended to a wider range of problems, and we consider this as a future exploration.
Broader Impact While code generation is increasingly utilized in software development, Large Language Models (LLMs) are still prone to generating toxic, vulnerable, or malicious code. Such programs pose risks and should be used or executed with extra caution.

Section: Conclusion
In this paper, we presented FUNCODER, a novel code generation framework that integrates the divide-and-conquer strategy with functional consensus to address complex requirements. FUNCODER demonstrated superior performance compared to state-of-the-art methods on various benchmarks and models. Our findings highlighted the effectiveness of dynamic decomposition and functional consensus in writing complex code, which suggests that FUNCODER may have the potential to empower further improvements in code generation and other fields.

Section: A Appendix
In the supplementary materials, we provide the details of implementation (A.1), baseline information and settings (A.2), benchmarks (A.3), metrics (A.4), settings in the analysis (A.5), and additional experiments (A.9). We also demonstrate the example solutions of our method and baseline in Appendix B, and include all the prompts in Appendix C. 

Section: A.1 Implementation Details
Models We access the OpenAI models GPT-3.5 (gpt-3.5-turbo-0613), GPT-4 (gpt-4-1106-preview) and GPT-4o mini (gpt-4o-mini-2024-07-18) through Azure OpenAI. Weights of community models Llama3 (Meta-Llama-3-8B-Instruct), Codestral (Codestral-22B-v0.1), StableCode (stable-code-instruct-3b), CodeLlama (CodeLlama-34b-Instruct-hf) and StarCoder2 (starcoder2-15b-instruct-v0.1) are downloaded from HuggingFace (Wolf et al., 2019) and served over an OpenAI-like API on a single A100-80G GPU under BF16 precision with vLLM (Kwon et al., 2023).
Divide We instruct the model to write the current function and introduce new functions with clearly defined sub-goals. The prompt (C.2) for the divide process includes two examples: one that introduces new functions that are left unimplemented, and another where the sub-goal is simple enough that no further decomposition is necessary. The model generates a Python code block with a temperature of 0.2, and the code block is extracted to represent a tree of functions, with the new functions as children of the current one. We require that any new sub-function does not refer to existing functions, to avoid circular references. This generation process is attempted at most 3 times, until valid code with a proper signature is found in the output. FUNCODER then traverses the function tree via depth-first search and restricts the max depth of the tree to 6.
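One possible way to turn a generated code block into a tree of functions is to parse it with Python's standard `ast` module, as in the sketch below. This is an assumption about how such an extraction step could look, not the authors' implementation: `extract_functions` and `is_stub` are hypothetical helpers, and a function is treated as unimplemented when its body consists only of an optional docstring followed by `raise NotImplementedError`.

```python
import ast
from typing import Dict


def is_stub(func: ast.FunctionDef) -> bool:
    """True if the body is only an optional docstring plus `raise NotImplementedError`."""
    body = func.body
    first = body[0] if body else None
    if (isinstance(first, ast.Expr) and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)):
        body = body[1:]  # skip the docstring
    if len(body) != 1 or not isinstance(body[0], ast.Raise):
        return False
    exc = body[0].exc
    if isinstance(exc, ast.Call):  # `raise NotImplementedError()`
        exc = exc.func
    return isinstance(exc, ast.Name) and exc.id == "NotImplementedError"


def extract_functions(code: str) -> Dict[str, dict]:
    """Describe each top-level function: whether it is implemented and which
    other functions defined in the same block it calls (its children)."""
    defs = [n for n in ast.parse(code).body if isinstance(n, ast.FunctionDef)]
    names = {d.name for d in defs}
    info = {}
    for d in defs:
        called = {c.func.id for c in ast.walk(d)
                  if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)}
        info[d.name] = {
            "implemented": not is_stub(d),
            "children": sorted((called & names) - {d.name}),
        }
    return info


GENERATED = '''
def sum_common_factors(a, b):
    """sum the common factors"""
    fa = find_factors(a)
    fb = find_factors(b)
    return sum_common(fa, fb)

def find_factors(x):
    """return factors of x"""
    raise NotImplementedError()

def sum_common(a: list, b: list) -> int:
    """sum of common elements"""
    raise NotImplementedError()
'''

tree = extract_functions(GENERATED)
print(tree)
```

A retry loop around generation (up to 3 attempts, as stated above) could simply catch `SyntaxError` from `ast.parse` and request a new sample.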
Conquer We apply the composition of sub-functions to rewrite the parent function after all sub-functions have been fully decomposed. Code for sub-functions is made visible to the LLM, which is requested to rewrite the current function with a 1-shot demonstration (C.3). With functional consensus applied, the model samples multiple implementations with a temperature of 0.8, and the one that reaches consensus is kept for further bottom-up processing.

Section: Functional Consensus
The functional consensus is applied in the conquer stage. Formally, Consensus@k samples k-1 implementations in the conquer stage, and reuses the one produced in the divide stage, resulting in a set F of k candidate programs. Then we prompt the model with 1-shot (C.4) to generate potential inputs X for the given function and use them to feed and execute the program.
As described in Eq. 2, when two functions output the same value on a given input, they both add 1 point to the overall similarity. A thrown exception or timeout during execution assigns -100 points to the candidate, as it indicates potentially problematic code. Similar to self-testing methods, we also leverage the example input/output at the root node to filter out candidates that have wrong functionality. Finally, the candidate with the maximum score over all inputs is selected, as it reaches consensus with other implementations.
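The point-based scoring described above can be sketched as follows. The helper names `run_safely` and `consensus_scores` are hypothetical, timeouts are not modeled, and applying the -100 penalty once per failing input is an assumption of this sketch, since the text does not specify the granularity.

```python
from typing import Any, Callable, List, Sequence, Tuple

Func = Callable[[Any], Any]


def run_safely(f: Func, x: Any) -> Tuple[bool, Any]:
    """Run f(x); report (False, None) if it raises. Timeouts are not modeled here."""
    try:
        return True, f(x)
    except Exception:
        return False, None


def consensus_scores(candidates: Sequence[Func], inputs: Sequence[Any],
                     penalty: int = -100) -> List[int]:
    """+1 for each other candidate that returns the same value on an input;
    `penalty` for each input on which the candidate itself fails (assumed)."""
    scores = [0] * len(candidates)
    for x in inputs:
        outcomes = [run_safely(f, x) for f in candidates]
        for i, (ok_i, out_i) in enumerate(outcomes):
            if not ok_i:
                scores[i] += penalty
                continue
            for j, (ok_j, out_j) in enumerate(outcomes):
                if j != i and ok_j and out_i == out_j:
                    scores[i] += 1
    return scores


# Toy candidates for "10 // x, or 0 when x == 0"; the last one crashes on 0.
def guarded_a(x): return 10 // x if x else 0
def guarded_b(x): return 0 if x == 0 else 10 // x
def unguarded(x): return 10 // x

scores = consensus_scores([guarded_a, guarded_b, unguarded], [0, 1, 2, 5])
print(scores)
```

The large penalty ensures that a candidate that crashes on even one sampled input falls behind candidates that run cleanly, regardless of how often it agrees with them elsewhere.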
Hierarchical Code Interpreter Divide-and-conquer represents the problem hierarchy through structured code. To make use of this information, we design an interpreter that syntactically parses the generated output and organizes it into a graph of functions. We are thus able to decompose complex tasks by representing sub-goals through the connections of multiple functions. LLMs may produce vulnerable code even if prompted by trusted inputs, making direct execution or eval() on generated code especially hazardous. Our framework addresses this by using a sandboxed environment to contain untrusted code execution, preventing generated code from hanging or even breaking the system.

Section: A.2 Baseline Details
Standard Prompting conducts one-time generation and directly outputs the entire code or final results. In code generation tasks, we use a 1-shot prompting setting with a temperature of 0.3. For MATH, we sample 1 question-answer pair per subject in the train set, resulting in a 7-shot prompt, and run self-consistency (Wang et al., 2023) with consistency@5 and a temperature of 0.7.
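Self-consistency at 5 amounts to a majority vote over five sampled answers. A minimal sketch follows, where `self_consistency` is a hypothetical helper and the answer strings are made up for illustration; how answers are normalized before voting is not specified in the text and is left out here.

```python
from collections import Counter
from typing import List


def self_consistency(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


sampled_answers = ["12", "12", "15", "12", "7"]  # consistency@5, toy values
final_answer = self_consistency(sampled_answers)
print(final_answer)
```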
CodeT (Chen et al., 2023a) samples multiple code solutions X and unit tests Y. A unit test is an assertion consisting of a given input and an expected output; in Python it takes the form of "assert func(x) == y". CodeT then checks the programs against the self-tests and divides the functions into sets; the score of such a set is defined as the number of functions within it multiplied by the number of succeeded tests. Finally, CodeT selects the function with the most agreement (in the biggest set). Similar to the setting of FUNCODER, we sample 11 candidate solutions with a temperature of 0.8.
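The set-based scoring described for CodeT can be illustrated with the sketch below. It follows only the description in this paragraph (group programs by the set of self-tests they pass, score each group by group size times number of passed tests) and is not the original CodeT code; `codet_select` and `passes` are hypothetical names, and returning the first program of the best group is an assumption.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, FrozenSet, List, Sequence, Tuple

Func = Callable[[Any], Any]
TestCase = Tuple[Any, Any]  # (input, expected output)


def passes(prog: Func, x: Any, expected: Any) -> bool:
    """A test succeeds only if the program returns the expected value without raising."""
    try:
        return prog(x) == expected
    except Exception:
        return False


def codet_select(programs: Sequence[Func], tests: Sequence[TestCase]) -> int:
    """Return the index of a program from the highest-scoring agreement set."""
    groups: Dict[FrozenSet[int], List[int]] = defaultdict(list)
    for idx, prog in enumerate(programs):
        passed = frozenset(t for t, (x, y) in enumerate(tests) if passes(prog, x, y))
        groups[passed].append(idx)
    # Score of a set = number of programs in it * number of tests they pass.
    best_tests, best_members = max(groups.items(),
                                   key=lambda kv: len(kv[1]) * len(kv[0]))
    return best_members[0]


# Toy example for "square of x"; the last self-test is itself wrong.
programs = [lambda x: x * x, lambda x: x ** 2, lambda x: x * 2]
self_tests = [(2, 4), (3, 9), (0, 0), (1, 2)]
selected = codet_select(programs, self_tests)
print(selected)
```

In this toy case the two correct programs share the passed-test set {0, 1, 2} and score 2 × 3 = 6, while the incorrect program passes {0, 2, 3} alone and scores 1 × 3 = 3.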
AlphaCode-like Clustering is introduced with the model AlphaCode (Li et al., 2022)
Reflexion (Shinn et al., 2023) is a closed-loop agent system that generates unit tests and iteratively refines the program based on the self-test feedback. The results for GPT-4 on HumanEval and MBPP are reported in the original paper. Based on the officially released code, we test results with GPT-3.5 and community models under the Reflexion strategy with max_iters=2 and Pass@1. For the xCodeEval benchmark, as it is judged through standard input/output, we wrap the standard input into function arguments and obtain the return value as the output in the form of "def main(input_str: str) -> str", and the sample input/output are also transformed into visible tests for the Reflexion process.
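To make the input/output wrapping concrete, the sketch below shows how a sample input/output pair might be turned into a visible test for a solution written as `main(input_str: str) -> str`. The helper `to_visible_test`, the toy problem (adding two integers), and the whitespace stripping are assumptions for illustration; the exact test format used in the experiments is not given in the text.

```python
def to_visible_test(sample_input: str, sample_output: str) -> str:
    """Render one sample I/O pair as an assertion on main(); trailing
    whitespace is ignored (an assumption of this sketch)."""
    return f"assert main({sample_input!r}).strip() == {sample_output.strip()!r}"


# Toy solution in the wrapped form: the whole standard input arrives as one string.
def main(input_str: str) -> str:
    a, b = map(int, input_str.split())
    return f"{a + b}\n"


visible_test = to_visible_test("1 2\n", "3\n")
print(visible_test)
exec(visible_test, {"main": main})  # raises AssertionError if the sample fails
```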
MetaGPT (Hong et al., 2024) employs a multi-agent strategy that assigns roles and encodes human-like software development procedures. The scripts for reproducing the results had not been made public when this paper was completed. Therefore, we include the original result for GPT-4 on the HumanEval dataset under the "with feedback" setting.
LDB (Zhong et al., 2024) segments programs into basic blocks and tracks the values of intermediate variables after each block throughout runtime execution, allowing large language models to verify the correctness of smaller code units. We adopt the results as reported in the original paper.
Chain-of-Thought Prompting (Wei et al., 2022) generates step-by-step reasoning leading to the final output answer. The solution is formatted in LaTeX and uses \boxed to mark the final answer. We sample 1 shot per subject in the MATH train set, resulting in a 7-shot demonstration, and run with consistency@5 and a temperature of 0.7.
Program-of-Thought (Chen et al., 2023b) utilizes the coding ability of LLMs to generate programs rather than text-based solutions for reasoning tasks. In MATH, we hint the model with 1-shot prompting to generate a solution() function that returns the final answer to the problem. The program is then executed in a Python environment to obtain the return value. If an exception is thrown during execution, the model regenerates a new program until it succeeds or reaches 3 attempts. Similar to CoT, Program-of-Thought samples 5 programs at a temperature of 0.7 and votes on the final result.
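The retry-then-vote procedure described above can be sketched as below. The generate callable is a hypothetical stand-in for "ask the model for a new solution() program"; no LLM is called here, and the helper names and toy answers are assumptions for illustration.

```python
from collections import Counter


def run_with_retries(generate, max_attempts=3):
    """Regenerate until a program runs without an exception, up to max_attempts."""
    for _ in range(max_attempts):
        program = generate()
        try:
            return program()
        except Exception:
            continue
    return None  # every attempt crashed


def majority_vote(answers):
    """Pick the most frequent answer among samples that produced one."""
    valid = [a for a in answers if a is not None]
    return Counter(valid).most_common(1)[0][0] if valid else None


def make_generator(programs):
    """Stub generator that hands out pre-written programs in order."""
    it = iter(programs)
    return lambda: next(it)


def crash():
    raise ZeroDivisionError("simulated runtime error")


# Five samples: one recovers after a retry, one never succeeds.
samples = [
    make_generator([crash, lambda: 28]),
    make_generator([lambda: 62]),
    make_generator([lambda: 62]),
    make_generator([crash, crash, crash]),
    make_generator([lambda: 62]),
]
answers = [run_with_retries(g) for g in samples]
final = majority_vote(answers)
print(answers, final)
```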
Self-Refine (Madaan et al., 2023) iteratively prompts the model to give feedback and refine the generated code based on it. Self-Refine does not incorporate self-tests, and the refinement is conducted solely on model feedback. In our preliminary study on HumanEval, this feedback is weak and cannot improve performance. However, in MATH, the solution program can be executed without the need for generated assertions. Thus, we extend Self-Refine to capture the runtime error trace as feedback and refine the code until it runs successfully or exceeds 3 retries.
Cumulative Reasoning (Zhang et al., 2024) starts by decomposing the input problem into propositions and conducts bottom-up reasoning until the final answer can be concluded. The results for Cumulative Reasoning are those reported in the original paper under the "with code" setting.

Section: A.3 Benchmark Details


Section: A.4 Metrics
Pass@k A program is considered passed (or accepted) only if it passes all system tests without errors and within the time limit. In our experiments, we set the time limit to 2.5 seconds. Pass@k judges k independent programs; if any of them passes, the result is 1.
In most of our experiments, we use Pass@1 as the metric, as it reflects the accuracy the framework achieves without feedback from humans. Pass@k, on the other hand, is equivalent to filtering programs through hidden, human-annotated test labels.
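A minimal sketch of the Pass@k definition given above follows. It takes per-test outcomes as booleans (True meaning the system test passed within the time limit) and is a direct reading of the text rather than an unbiased sampling estimator; the function names and toy data are assumptions.

```python
def program_passes(test_outcomes):
    """A program is accepted only if every system test passed."""
    return bool(test_outcomes) and all(test_outcomes)


def pass_at_k(candidate_outcomes, k):
    """1 if any of the first k candidate programs is accepted, else 0."""
    return int(any(program_passes(o) for o in candidate_outcomes[:k]))


def benchmark_pass_at_k(problems, k):
    """Average Pass@k over all problems of a benchmark."""
    return sum(pass_at_k(p, k) for p in problems) / len(problems)


# Two toy problems, two candidate programs each, two system tests per program.
problems = [
    [[True, True], [True, False]],   # first candidate already passes
    [[False, True], [True, True]],   # only the second candidate passes
]
p1 = benchmark_pass_at_k(problems, 1)
p2 = benchmark_pass_at_k(problems, 2)
print(p1, p2)
```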

Section: EM-GPT
The ground truth label in MATH is written in LaTeX, so the accuracy between labels and model predictions cannot be directly calculated through exact match (EM). MATH provides a judge program⁵ that preprocesses LaTeX syntax and checks whether two disambiguated strings are equal. However, this is insufficient for evaluating LaTeX-formatted labels against variant program outputs. We follow the evaluation criteria from previous work (Zhang et al., 2024), using GPT-4 to assess the consistency between predictions and ground truths, with the prompt shown in C.6.

Section: A.5 Details of Analysis
Details of Preliminary Analysis on Self-testing (Figure 3.a) The preliminary study is conducted on the HumanEval dataset, which includes system tests S to evaluate the accuracy of the program, as well as one human-annotated canonical solution c. Detailed explanations of these conditions can be found in Table 8.
¬(p, T) ∧ (p, S) ∧ (c, T)
In the event of self-test failure, there must have been at least one error in either the program or the tests, so this condition should never occur.
Details of Ranking Strategy Comparison (Figure 3.b) We obtain 11 candidate programs from FUNCODER on HumanEval with GPT-3.5 and rank them through three strategies. This ensures that the same candidate set is used for a fair comparison. An effective ranking strategy should prioritize placing correct programs at the forefront and filter out those with errors. Thus, we measure the effectiveness by computing Pass@k results on the top-k-ranked programs selected by each strategy. The Pass@11 result serves as an upper bound as it uses all programs to compute the pass rate.
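The top-k evaluation described above can be sketched as follows: candidates are sorted by the score a ranking strategy assigns, and Pass@k is computed on the first k. The scores, correctness labels and function name are made up for illustration and do not come from the experiments.

```python
def pass_at_top_k(scores, correct, k):
    """Rank candidates by score (highest first) and check the top k for a pass."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return int(any(correct[i] for i in ranked[:k]))


# One problem, five shared candidates; True marks a candidate passing system tests.
correct = [False, True, False, True, False]

# Two hypothetical ranking strategies scoring the same candidate set.
strategy_a = [0.9, 0.2, 0.8, 0.1, 0.3]   # puts wrong programs first
strategy_b = [0.1, 0.9, 0.2, 0.8, 0.3]   # puts correct programs first

curve_a = [pass_at_top_k(strategy_a, correct, k) for k in range(1, 6)]
curve_b = [pass_at_top_k(strategy_b, correct, k) for k in range(1, 6)]
print(curve_a)
print(curve_b)
```

Both curves meet at the last point because using all candidates ignores the ranking, which is why the full-set result acts as an upper bound.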
How We Count Frequently Used Functions in MATH (Table 4) In the mathematical reasoning experiments, we used a subset of 500 items from the MATH test set, with an average of 71.4 questions per subject. However, 71.4 programs per subject are too few to reliably identify commonly used functions. Therefore, we sample 3000 problems from the MATH test set for this experiment and run the divide-only setting of FUNCODER on them. Then, after extracting the function nodes of the code trees for each category, the occurrences of sub-functions are counted by name.

Section: A.6 Detailed Explanation of Algorithm
We hereby provide a detailed explanation of how the FUNCODER algorithm works, with respect to Algorithm 1 from Figure 2 (a copy is included below for ease of reading). As mentioned, FUNCODER is a recursive process following a DFS pattern. We use square brackets (e.g. [L1]) below to denote line numbers in the pseudocode.
Algorithm 1 FUNCODER procedure
Require: Entry func, f_root = {h_root, d_root, ϕ}
Require: Large language model, LLM
1: function FUNCODER(f_cur)
2:     - Divide -
3:     f′_cur, {f_i} ← EXTRACT(LLM(f_cur))
4:     for f_i ∈ {f_i} do
5:         if b_i is NOTIMPLEMENTED then
6:             f*_i ← FUNCODER(f_i)    ▷ recursion
7:         end if
8:         ADDCHILD(f_cur, f*_i)
9:     end for
10:    - Conquer -
11:    F_cur ← SAMPLE(LLM(f′_cur, CHILD(f_cur)))
12:    f*_cur ← FUNCONSENSUS(F_cur)

We hierarchically describe how this algorithm works in detail by combining it with the example given in the right half of Figure 5.
• FunCoder goes through the Divide stage and the Conquer stage for each of the functions.
• Based on the current function, Divide generates an implementation of itself and stubs for sub-functions. Within this stage, each function appears at most once in the input and twice in the output. All Divide stages together consume no more than 3N tokens.
• Conquer regenerates the parent function based on its stub and all finalized sub-functions. Here, each function appears at most twice in the input and is sampled k times in the output. If k = 1, consensus is implicitly disabled. All Conquer stages together consume at most (k + 2)N tokens.
So FunCoder requires no more than (k + 5)N input and output tokens, making its token consumption O(kN) even in the worst case, in line with other sampling-based methods such as CodeT and AlphaCode-like clustering. Furthermore, when sampling is disabled (k = 1), our method has a token consumption of O(N), which matches the vanilla Standard method.
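The bound above is simple arithmetic, which the following sketch spells out. The function name and the example value N = 200 are assumptions for illustration; only the 3N and (k + 2)N terms come from the text.

```python
def funcoder_token_bound(n_tokens, k):
    """Worst-case input+output tokens for a final program of n_tokens tokens."""
    divide = 3 * n_tokens          # 1x in input, up to 2x in output
    conquer = (k + 2) * n_tokens   # up to 2x in input, k sampled outputs
    return divide + conquer        # = (k + 5) * n_tokens


# Example with a hypothetical 200-token final program.
bounds = {k: funcoder_token_bound(200, k) for k in (1, 11)}
print(bounds)
```

With k = 1 the bound is 6N, a constant multiple of the program length, while with k = 11 it grows to 16N, linear in the number of samples.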

Section: A.8 Discussion About Functional Consensus
This section focuses on why functional consensus might enhance the correctness of programs and how it differs from other consistency-based methods. Self-consistency (Wang et al., 2023) is widely employed in the realm of LLM reasoning. It samples multiple sets of answers and uses voting to select the most consistent result, where the answers typically consist of named entities, choice options, or numbers. However, this approach faces challenges when voting on sampled programs, as programs describe executable logic rather than data, so it is not obvious from their surface form whether two programs are equivalent.
When it comes to picking programs, functional consensus in FUNCODER looks beyond the literal symbols and takes a different approach: it uses inputs and execution results to compare behavioral differences among programs. There have been similar methods, such as the strict clustering approach in AlphaCode (Li et al., 2022), which samples a set of program inputs and then clusters programs with identical outputs into the same group. The final program is then selected from the largest cluster.
However, grouping programs by the 'identicalness' of outputs has its pitfalls, since programs rarely specialize in solving one single irreducible problem; they deal with a variety of inputs, conditions and obscure cases. The resulting situation, where different solutions share many common behaviors and differ in some distinct behaviors, is referred to as 'special-case similarity' in the FrAngel paper (Shi et al., 2019). Consider a correct program solution that has multiple 'special-case similar' programs which are partially correct in different ways, for example:
• One program behaving correctly on the general case (almost all inputs) but missing a few edge cases
• Another program that got one edge case correct but didn't manage to deal with the general case
• Yet another program that got all edge cases correct but crashed on the general case
• A buggy program that behaves correctly on all available test cases because none of these tests trigger the bug (essentially a test coverage problem)
• And many programs that turn out to be arbitrary mixtures of all the above
If we had a pool of programs that contained the fully-correct program and an assortment of other programs that respected certain cases of the problem as mentioned above, the fully-correct program would clearly be reasonably 'special-case similar' to the rest of the programs, owing to their similar behavior on shared inputs. These execution outputs are obtained programmatically and compared against each other automatically, without any human intervention or LLM calls; this process sits at the core of our functional consensus algorithm.
Therefore, with functional consensus, where the solutions with common behaviors are promoted, we could intuitively expect the result to be a higher likelihood of a fully-correct program. Provided below is a hypothetical example demonstrating why functional consensus prevails:
Example Consider the problem of finding all square roots in the complex domain of a non-zero real number (stored in float32).
To get the answer right for all inputs, the function must consider 2 cases: A) non-zero numbers have 2 square roots; B) square roots of negative numbers are imaginary. 10 candidate functions are sampled as below:
• 5 results (a_1, a_2, a_3, a_4, a_5) only considered case A and got just positive inputs right. For negative numbers, they simply crashed.
• 3 results (b_6, b_7, b_8) remembered to consider case B and gave 2 imaginary results for negative numbers, but forgot to do the same for positive numbers, returning only 1 square root there.
• Only 2 results (c_9, c_10) considered all cases and returned correct results for all inputs.
If we pick the program through 'clustering', the final result would be one of the 5 results (a_1, a_2, ..., a_5) that only considered case A, which is evidently not the correct solution. But with functional consensus, the final result is vastly different, since we consider the similarity between the functions based on their behavior on different inputs. Without loss of generality, suppose that there are 2 test inputs 4.0 and -9.0, one for each of the 2 cases. We calculate the similarity as follows:
• Programs a_i got only √4.0 = [2.0, -2.0] right, so each of them is similar to the programs (a_1, a_2, a_3, a_4, a_5, c_9, c_10), scoring 7 points.
• Since b_j only went well with √-9.0 = [3.0i, -3.0i], these programs only score 5 points for case B, with the programs (b_6, b_7, b_8, c_9, c_10).
• Each program in c_k gets 7 points for √4.0 with (a_1, a_2, a_3, a_4, a_5, c_9, c_10), and gets 5 points for √-9.0 with (b_6, b_7, b_8, c_9, c_10), for a total of 12 points.
The final result clearly leans towards c_k as the correct solution, even though there are fewer than half as many c_k programs as a_i programs. Through this example, we illustrate that functional consensus has the potential to identify the correct samples even when they are in the minority, outperforming other methods such as self-consistency or clustering.
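The square-root example can be replayed with a small sketch. It counts, for every candidate and every input, how many candidates (itself included) return an identical non-crashing output, and compares this with picking the largest cluster of fully identical behavior. The three toy implementations and all function names are assumptions for illustration, and this is not the FUNCODER implementation. One detail differs from the hand count above: the three b candidates here also agree with each other on their incomplete answer for 4.0, so the sketch gives them 8 points rather than 5, which does not change the ranking.

```python
import math
from collections import Counter


def roots_a(x):
    """Case A only: two real roots; math.sqrt raises on negative input."""
    r = math.sqrt(x)
    return (r, -r)


def roots_b(x):
    """Case B only: two imaginary roots, but a single root for positives."""
    if x < 0:
        r = math.sqrt(-x)
        return (complex(0, r), complex(0, -r))
    return (math.sqrt(x),)


def roots_c(x):
    """Both cases handled."""
    r = math.sqrt(abs(x))
    if x < 0:
        return (complex(0, r), complex(0, -r))
    return (r, -r)


def run(program, x):
    """Execute a candidate and record either its output or the fact it crashed."""
    try:
        return ("ok", program(x))
    except Exception:
        return ("error", None)


def consensus_scores(programs, inputs):
    """Sum, over inputs, the number of candidates sharing the same valid output."""
    outputs = [tuple(run(p, x) for x in inputs) for p in programs]
    scores = []
    for mine in outputs:
        total = 0
        for theirs in outputs:
            total += sum(1 for m, t in zip(mine, theirs) if m[0] == "ok" and m == t)
        scores.append(total)
    return scores, outputs


programs = [roots_a] * 5 + [roots_b] * 3 + [roots_c] * 2
scores, outputs = consensus_scores(programs, [4.0, -9.0])
best = max(range(len(programs)), key=scores.__getitem__)
cluster_behavior, cluster_size = Counter(outputs).most_common(1)[0]
print(scores, best, cluster_size)
```

Strict clustering settles on the five a candidates because their overall behavior is identical, whereas the per-input agreement count favors a c candidate.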

Section: A.9 Supplementary Results
Token Usage We provide token usage results in Table 9 for FUNCODER and baseline methods on the HumanEval dataset with the GPT-3.5 model; usage results on other datasets are provided in Table 10. We report the average token usage per problem. The token usage is computed as the sum of prompt tokens and completion tokens returned by the OpenAI API chat completion call⁶. For LDB, we report the token usage given in the original paper (Zhong et al., 2024).
Table 9: Token usage for different settings of FUNCODER and baseline methods on HumanEval, all evaluated on GPT-3.5-turbo. The LDB results are reported in the original paper. The main setting for LDB and FUNCODER is bolded.
Full Results for Code Generation We provide results for all conducted experiments on code generation benchmarks in Table 11. Our method consistently improves over the baseline on community models, by an average of 11% on MBPP and 150% on xCodeEval. It is worth noting that small models tend to have low pass rates on competition problems, leading to relatively high randomness; we therefore run 3 experiments and report the median result.

Section: Method
Full Results for MATH The MATH dataset divides the problems into five levels of difficulty. The difficulty distribution of our test set can be found in Table 7. We report the average accuracy of FUNCODER and other methods for each math subject in Table 12 and results for each level in Table 13. The results of Cumulative Reasoning are obtained from the original paper (Zhang et al., 2024). Experiment results demonstrate that our method consistently enhances the model's reasoning ability across all levels of MATH.

Section: B Examples
We provide example solutions for the baseline and FUNCODER on code generation and mathematical reasoning. All results are generated with GPT-3.5.

Section: ## Input
The only line contains four positive integers cnt1, cnt2, x, y (1 <= cnt1, cnt2 < 10^9; cnt1 + cnt2 <= 10^9; 2 <= x < y <= 3*10^4) - the numbers that are described in the statement. It is guaranteed that numbers x, y are prime.

Section: ## Output
Print a single integer - the answer to the problem.
## Sample Input 1
`3 1 2 3`
## Sample Output 1
`5`
## Sample Input 2
`1 3 2 3`
## Sample

Section: B.2.1 CoT
We can solve for $a$ by multiplying both sides of the congruence by the modular inverse of $27$ modulo $40$. Since $27$ and $40$ are relatively prime, we know that $27$ has a modular inverse modulo $40$. We can find this inverse using the Euclidean algorithm: \begin{align*} \text{gcd}(40,27) &= \text{gcd}(27,13) &= \text{gcd}(13,1) &= 1. \end{align*} Therefore, the modular inverse of $27$ modulo $40$ exists. We can find it using the Euclidean algorithm backwards: \begin{align*} 1 &= 13 - 3 \cdot 4 &= 13 - 3(40 - 27) &= 4 \cdot 27 - 3 \cdot 40. \end{align*} Thus, the modular inverse of $27$ modulo $40$ is $4$. Multiplying both sides of the congruence $27a \equiv 17 \pmod{40}$ by $4$ gives $108a \equiv 68 \pmod{40}$. Simplifying the left-hand side gives $8a \equiv 28 \pmod{40}$. Dividing both sides by $4$ gives $2a \equiv 7 \pmod{10}$. The smallest positive integer $a$ satisfying this congruence is $a=9$, and the second-smallest is $a=19$. Therefore, the sum of the smallest and second-smallest positive integers $a$ satisfying the congruence $27a \equiv 17 \pmod{40}$ is $9+19=\boxed{28}$.

Section: Result
    """get common element in two list $a$ and $b$"""
```python
def prime_factor(x: int) -> list:
    """get a list of prime factors of number $x$"""
    ret = []
    i = 1
    while i * i <= x:
        i += 1
        if x % i == 0 and is_prime(i):
            ret.append(i)
    return ret

def is_prime(x: int) -> bool:
    """determine $x$ is a prime number or not"""
    if x < 2:
        return False
    for i in range(2, int(x**0.5) + 1):
        if x % i == 0:
            return False
    return True

def get_common(a: list, b: list) -> list:
    """get common element in two list $a$ and $b$"""
    ret = []
    for item in a:
        if item in b:
            ret.append(item)
    return ret

def sum_common_factors(a: int, b: int) -> int:
    """Return the sum of all common prime factors of $a$ and $b$"""
    raise NotImplementedError()
```
Let's think step by step and implement the following method `sum_common_factors` using existing functions to solve: "Return the sum of all common prime factors of $a$ and $b$"
<Assistant>:

Section: C.6 Prompt for MATH judging
You are a mathematical teacher, your task is to:
- Judge whether the prediction is matching the answer
- Output "Judge: Correct." or "Judge: Wrong.", please do not output redundant words
- Numerical errors should be ignored ($1$ is equal to $0.99999998$)
- Some answer might be represent in latex format, and some might be float number, this should be consider as correct ($\frac{1}{2}$ is equal to $0.5$, $3 \sqrt{66}$ is equal to $24.37211$)
- Unit in answer should be ignored, and should be consider as correct ($13 cm^2$ is equal to $13.0$, $\$13$ is equal to $13$)
Now, the answer and prediction is:
Answer: {ground_truth}
Prediction: {model_output}
Please output "Judge: Correct." if two answers are literally the same, or "Judge: Wrong." for not same, please do not output redundant words.
• The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
• The assumptions made should be given (e.g., Normally distributed errors).
• It should be clear whether the error bar is the standard deviation or the standard error of the mean.
• It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
• For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
• If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes] Justification: Yes, please refer to the Broader Impacts paragraph in Discussion (§5).
Guidelines:
• The answer NA means that there is no societal impact of the work performed.
• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

Section: Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA] Justification: We do not release new data or models.
Guidelines:
• The answer NA means that the paper poses no such risks.
• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes] Justification: We reference the code (§A.2), data (§A.3) and models (§A.1) in the Appendix.
Guidelines:
• The answer NA means that the paper does not use existing assets.
• The authors should cite the original paper that produced the code package or dataset.
• The authors should state which version of the asset is used and, if possible, include a URL. • The name of the license (e.g., CC-BY 4.0) should be included for each asset.
• For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

Section: Acknowledgments
We would like to acknowledge the reviewers and chairs for their inspiring and constructive feedback. The research in this article is supported by the National Key Research and Development Project (2021YFF0901602), the National Science Foundation of China (U22B2059, 62276083). Ming Liu and Qianglong Chen are the corresponding authors.

Section: 
In Standard, the code is only generated once to complete the given stub. We use parentheses to represent the order of LLM calls in a full process.
(1) a -> A B C D E
input tokens = a
output tokens = A + B + C + D + E
overall = O(N)
In each step of the FUNCODER/Divide stage, the to-be-implemented function will serve as the context. The function (stub) will be implemented and sub-function stubs are to be declared.
These stages in all bring FUNCODER's total token consumption to strictly O(kN) for every problem.
Token Complexity of FUNCODER is O(kN) Define N as the token length of the final program, which is correlated with the inherent complexity of the problem, and define k as the number of sampled candidates. We first explain in detail why the worst-case token cost of FUNCODER is O(kN):

In the first sample you give the set of numbers {1, 3, 5} to the first friend and the set of numbers {2} to the second friend. Note that if you give set {1, 3, 5} to the first friend, then we cannot give any of the numbers 1, 3, 5 to the second friend. In the second sample you give the set of numbers {3} to the first friend, and the set of numbers {1, 2, 4} to the second friend. Thus, the answer to the problem is 4.
- You need to write a Python function that returns the answer.

Section: B.1.1 Standard
- You can import libraries to better solve the problem.
- Do not write any code outside the function (importing is accepted).
<User>:
Let's think step by step and complete the following Python function `sum_factor` that solves:
```python
def sum_factor(a: int, b: int) -> int:
    """Return the sum of all common prime factors of $a$ and $b$"""
    raise NotImplementedError()
```
<Assistant>:
def prime_factor(x: int) -> list:
    """get a list of prime factors of number $x$"""
    ret = []
    i = 1
    while i * i <= x:
        i += 1
        if x % i == 0 and is_prime(i):
            ret.append(i)
    return ret

def is_prime(x: int) -> bool:
    """determine $x$ is a prime number or not"""
    if x < 2:
        return False
    for i in range(2, int(x**0.5) + 1):
        if x % i == 0:
            return False
    return True

def get_common(a: list, b: list) -> list:
    """get common element in two list $a$ and $b$"""

Justification: We conduct experiments to reflect the performance of our methods in code generation (§3.1) and mathematical reasoning (§3.2); we also include analysis and ablation studies in multiple aspects.
Guidelines:
• The answer NA means that the abstract and introduction do not include the claims made in the paper. • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

Section: Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes] Justification: Yes, please refer to the Limitation paragraph in Discussion ( §5).
Guidelines:
• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. • The authors are encouraged to create a separate "Limitations" section in their paper.
• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. • The authors should reflect on the factors that influence the performance of the approach.
For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

Section: Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [NA] Justification: This paper does not include theoretical results.
Guidelines:
• The answer NA means that the paper does not include theoretical results.
• All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. • All assumptions should be clearly stated or referenced in the statement of any theorems.
• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. • Theorems and Lemmas that the proof relies upon should be properly referenced.

Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes] Justification: We include detailed information in the Appendix to support reproducibility, including method details and the model versions we use (§A.1), baseline settings (§A.2), dataset information (§A.3), evaluation metrics (§A.4), details of the analysis process (§A.5), complete examples (§B) and prompts (§C).
Guidelines:
• The answer NA means that the paper does not include experiments.
• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
• Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
(a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

Section: Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Our code is made available at https://github.com/cometeme/funcoder. Guidelines:
• The answer NA means that the paper does not include experiments requiring code.
• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.


References:
[b0] Wasi Uddin Ahmad; Md Golam Rahman Tushar; Saikat Chakraborty; Kai-Wei Chang (2023). AVATAR: A parallel corpus for Java-Python program translation. 
[b1] Jacob Austin; Augustus Odena; Maxwell I Nye; Maarten Bosma; Henryk Michalewski; David Dohan; Ellen Jiang; Carrie J Cai; Michael Terry; Quoc V Le; Charles Sutton (2021). Program synthesis with large language models. 
[b2] Tom B Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam McCandlish; Alec Radford; Ilya Sutskever; Dario Amodei (2020-12-06). Language models are few-shot learners. 
[b3] Federico Cassano; John Gouwar; Daniel Nguyen; Sydney Nguyen; Luna Phipps-Costin; Donald Pinckney; Ming-Ho Yee; Yangtian Zi; Carolyn Jane Anderson; Molly Q Feldman; Arjun Guha; Michael Greenberg; Abhinav Jangda (2023). Multipl-e: A scalable and polyglot approach to benchmarking neural code generation. IEEE Trans. Software Eng
[b4] Bei Chen; Fengji Zhang; Anh Nguyen; Daoguang Zan; Zeqi Lin; Jian-Guang Lou; Weizhu Chen (2023). Codet: Code generation with generated tests. 
[b5] Mark Chen; Jerry Tworek; Heewoo Jun; Qiming Yuan; Henrique Pondé De Oliveira Pinto; Jared Kaplan; Harrison Edwards; Yuri Burda; Nicholas Joseph; Greg Brockman; Alex Ray; Raul Puri; Gretchen Krueger; Michael Petrov; Heidy Khlaaf; Girish Sastry; Pamela Mishkin; Brooke Chan; Scott Gray; Nick Ryder; Mikhail Pavlov; Alethea Power; Lukasz Kaiser; Mohammad Bavarian; Clemens Winter; Philippe Tillet; Felipe Petroski Such; Dave Cummings; Matthias Plappert; Fotios Chantzis; Elizabeth Barnes; Ariel Herbert-Voss; William Hebgen Guss; Alex Nichol; Alex Paino; Nikolas Tezak; Jie Tang; Igor Babuschkin; Suchir Balaji; Shantanu Jain; William Saunders; Christopher Hesse; Andrew N Carr; Jan Leike; Joshua Achiam; Vedant Misra; Evan Morikawa; Alec Radford; Matthew Knight; Miles Brundage; Mira Murati; Katie Mayer; Peter Welinder; Bob Mcgrew; Dario Amodei; Sam Mccandlish; Ilya Sutskever; Wojciech Zaremba (2021). Evaluating large language models trained on code. 
[b6] Wenhu Chen; Xueguang Ma; Xinyi Wang; William W Cohen (2023). Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research
[b7] Xinyun Chen; Maxwell Lin; Nathanael Schärli; Denny Zhou (2024). Teaching large language models to self-debug. 
[b8] Ole-Johan Dahl; Edsger W Dijkstra; Charles Antony Richard Hoare (1972). Studies in data processing. Academic Press
[b9] Dheeru Dua; Shivanshu Gupta; Sameer Singh; Matt Gardner (2022). Successive prompting for decomposing complex questions. 
[b10]  Ebtech (2024). How to Interpret Contest Ratings -Codeforces. 
[b11] Zhangyin Feng; Daya Guo; Duyu Tang; Nan Duan; Xiaocheng Feng; Ming Gong; Linjun Shou; Bing Qin; Ting Liu; Daxin Jiang; Ming Zhou (2020). CodeBERT: A pre-trained model for programming and natural languages. 
[b12] Luyu Gao; Aman Madaan; Shuyan Zhou; Uri Alon; Pengfei Liu; Yiming Yang; Jamie Callan; Graham Neubig (2023-07-29). PAL: program-aided language models. PMLR
[b13] Daya Guo; Qihao Zhu; Dejian Yang; Zhenda Xie; Kai Dong; Wentao Zhang; Guanting Chen; Xiao Bi; Y Wu; Y K Li; Fuli Luo; Yingfei Xiong; Wenfeng Liang (2024). DeepSeek-Coder: When the large language model meets programming - the rise of code intelligence. 
[b14] P R Halmos (1998). Naive Set Theory. Springer
[b15] Dan Hendrycks; Steven Basart; Saurav Kadavath; Mantas Mazeika; Akul Arora; Ethan Guo; Collin Burns; Samir Puranik; Horace He; Dawn Song; Jacob Steinhardt (2021-12). Measuring coding challenge competence with APPS. 
[b16] Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt (2021-12). Measuring mathematical problem solving with the MATH dataset. 
[b17] Sirui Hong; Mingchen Zhuge; Jonathan Chen; Xiawu Zheng; Yuheng Cheng; Jinlin Wang; Ceyao Zhang; Zili Wang; Steven Ka Shing Yau; Zijuan Lin; Liyang Zhou; Chenyu Ran; Lingfeng Xiao; Chenglin Wu; Jürgen Schmidhuber (2024). MetaGPT: Meta programming for a multi-agent collaborative framework. 
[b19] Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt (2019). Codesearchnet challenge: Evaluating the state of semantic code search. 
[b20] Shima Imani; Liang Du; Harsh Shrivastava (2023). MathPrompter: Mathematical reasoning using large language models. 
[b21] Xue Jiang; Yihong Dong; Lecheng Wang; Qiwei Shang; Ge Li (2023). Self-planning code generation with large language model. 
[b22] Xin Jin; Jonathan Larson; Weiwei Yang; Zhiqiang Lin (2023). Binary code summarization: Benchmarking chatgpt/gpt-4 and other large language models. 
[b23] Mohammad Abdullah Matin Khan; M Saiful Bari; Xuan Long Do; Weishi Wang; Md Rizwan Parvez; Shafiq R Joty (2023). xCodeEval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. 
[b24] Woosuk Kwon; Zhuohan Li; Siyuan Zhuang; Ying Sheng; Lianmin Zheng; Cody Hao Yu; Joseph E Gonzalez; Hao Zhang; Ion Stoica (2023). Efficient memory management for large language model serving with pagedattention. 
[b25] Chengshu Li; Jacky Liang; Andy Zeng; Xinyun Chen; Karol Hausman; Dorsa Sadigh; Sergey Levine; Li Fei-Fei; Fei Xia; Brian Ichter (2023). Chain of code: Reasoning with a language model-augmented code emulator. 
[b26] Raymond Li; Loubna Ben Allal; Yangtian Zi; Niklas Muennighoff; Denis Kocetkov; Chenghao Mou; Marc Marone; Christopher Akiki; Jia Li; Jenny Chim; Qian Liu; Evgenii Zheltonozhskii; Terry Yue Zhuo; Thomas Wang; Olivier Dehaene; Mishig Davaadorj; Joel Lamy-Poirier; João Monteiro; Oleh Shliazhko; Nicolas Gontier; Nicholas Meade; Armel Zebaze; Ming-Ho Yee; Logesh Kumar Umapathi; Jian Zhu; Benjamin Lipkin; Muhtasham Oblokulov; Zhiruo Wang; Rudra Murthy V; Jason Stillerman; Siva Sankalp Patel; Dmitry Abulkhanov; Marco Zocca; Manan Dey; Zhihan Zhang; Nour Moustafa-Fahmy; Urvashi Bhattacharyya; Wenhao Yu; Swayam Singh; Sasha Luccioni; Paulo Villegas; Maxim Kunakov; Fedor Zhdanov; Manuel Romero; Tony Lee; Nadav Timor; Jennifer Ding; Claire Schlesinger; Hailey Schoelkopf; Jan Ebert; Tri Dao; Mayank Mishra; Alex Gu; Jennifer Robinson; Carolyn Jane Anderson; Brendan Dolan-Gavitt; Danish Contractor; Siva Reddy; Daniel Fried; Dzmitry Bahdanau; Yacine Jernite; Carlos Muñoz Ferrandis; Sean Hughes; Thomas Wolf; Arjun Guha; Leandro von Werra; Harm de Vries (2023). StarCoder: may the source be with you!. 
[b27] Yujia Li; David H Choi; Junyoung Chung; Nate Kushman; Julian Schrittwieser; Rémi Leblond; Tom Eccles; James Keeling; Felix Gimeno; Agustin Dal Lago; Thomas Hubert; Peter Choy; Cyprien De Masson D'autume; Igor Babuschkin; Xinyun Chen; Po-Sen Huang; Johannes Welbl; Sven Gowal; Alexey Cherepanov; James Molloy; Daniel J Mankowitz; Esme Sutherland Robson; Pushmeet Kohli; Nando de Freitas; Koray Kavukcuoglu; Oriol Vinyals (2022). Competition-level code generation with AlphaCode. 
[b28] Tianyang Liu; Canwen Xu; Julian Mcauley (2024). Repobench: Benchmarking repository-level code autocompletion systems. 
[b29] Anton Lozhkov; Raymond Li; Loubna Ben Allal; Federico Cassano; Joel Lamy-Poirier; Nouamane Tazi; Ao Tang; Dmytro Pykhtar; Jiawei Liu; Yuxiang Wei; Tianyang Liu; Max Tian; Denis Kocetkov; Arthur Zucker; Younes Belkada; Zijian Wang; Qian Liu; Dmitry Abulkhanov; Indraneil Paul; Zhuang Li; Wen-Ding Li; Megan Risdal; Jia Li; Jian Zhu; Terry Yue Zhuo; Evgenii Zheltonozhskii; Nii Osae Osae Dade; Wenhao Yu; Lucas Krauß; Naman Jain; Yixuan Su; Xuanli He; Manan Dey; Edoardo Abati; Yekun Chai; Niklas Muennighoff; Xiangru Tang; Muhtasham Oblokulov; Christopher Akiki; Marc Marone; Chenghao Mou; Mayank Mishra; Alex Gu; Binyuan Hui; Tri Dao; Armel Zebaze; Olivier Dehaene; Nicolas Patry; Canwen Xu; Julian J Mcauley; Han Hu; Torsten Scholak; Sébastien Paquet; Jennifer Robinson; Carolyn Jane Anderson; Nicolas Chapados (2024). Starcoder 2 and the stack v2: The next generation. 
[b30] Shuai Lu; Daya Guo; Shuo Ren; Junjie Huang; Alexey Svyatkovskiy; Ambrosio Blanco; Colin B Clement; Dawn Drain; Daxin Jiang; Duyu Tang; Ge Li; Lidong Zhou; Linjun Shou; Long Zhou; Michele Tufano; Ming Gong; Ming Zhou; Nan Duan; Neel Sundaresan; Shao Kun Deng; Shengyu Fu; Shujie Liu (2021-12). CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. 
[b31] Aman Madaan; Niket Tandon; Prakhar Gupta; Skyler Hallinan; Luyu Gao; Sarah Wiegreffe; Uri Alon; Nouha Dziri; Shrimai Prabhumoye; Yiming Yang; Shashank Gupta; Bodhisattwa Prasad Majumder; Katherine Hermann; Sean Welleck; Amir Yazdanbakhsh; Peter Clark (2023). Self-refine: Iterative refinement with self-feedback. 
[b32] Mistral AI (2024). Codestral: Hello, world!. 
[b33] Erik Nijkamp; Bo Pang; Hiroaki Hayashi; Lifu Tu; Huan Wang; Yingbo Zhou; Silvio Savarese; Caiming Xiong (2023). Codegen: An open large language model for code with multi-turn program synthesis. 
[b35] Theo X Olausson; Jeevana Priya Inala; Chenglong Wang; Jianfeng Gao; Armando Solar-Lezama (2024). Is self-repair a silver bullet for code generation?. 
[b36]  Openai (2023). . 
[b37] Long Ouyang; Jeffrey Wu; Xu Jiang; Diogo Almeida; Carroll L Wainwright; Pamela Mishkin; Chong Zhang; Sandhini Agarwal; Katarina Slama; Alex Ray; John Schulman; Jacob Hilton; Fraser Kelton; Luke Miller; Maddie Simens; Amanda Askell; Peter Welinder; Paul F Christiano; Jan Leike; Ryan Lowe (2022). Training language models to follow instructions with human feedback. 
[b38] Nikhil Pinnaparaju; Reshinth Adithyan; Duy Phung; Jonathan Tow; James Baicoianu; Ashish Datta; Maksym Zhuravinskyi; Dakota Mahan; Marco Bellagente; Carlos Riquelme; Nathan Cooper (2024). Stable code technical report. 
[b39] Chen Qian; Xin Cong; Cheng Yang; Weize Chen; Yusheng Su; Juyuan Xu; Zhiyuan Liu; Maosong Sun (2023). Communicative agents for software development. 
[b40] Baptiste Rozière; Jonas Gehring; Fabian Gloeckle; Sten Sootla; Itai Gat; Xiaoqing Ellen Tan; Yossi Adi; Jingyu Liu; Tal Remez; Jérémy Rapin; Artyom Kozhevnikov; Ivan Evtimov; Joanna Bitton; Manish Bhatt; Cristian Canton-Ferrer; Aaron Grattafiori; Wenhan Xiong; Alexandre Défossez; Jade Copet; Faisal Azhar; Hugo Touvron; Louis Martin; Nicolas Usunier; Thomas Scialom; Gabriel Synnaeve (2023). Code Llama: Open foundation models for code. 
[b41] Max Schäfer; Sarah Nadi; Aryaz Eghbali; Frank Tip (2024). An empirical evaluation of using large language models for automated unit test generation. IEEE Trans. Software Eng
[b42] Kensen Shi; Jacob Steinhardt; Percy Liang (2019). Frangel: component-based synthesis with control structures. Proc. ACM Program. Lang
[b43] Noah Shinn; Federico Cassano; Ashwin Gopinath; Karthik R Narasimhan; Shunyu Yao (2023). Reflexion: language agents with verbal reinforcement learning. 
[b44] Mohammed Latif Siddiq; Joanna C S Santos; Ridwanul Hasan Tanvir; Noshin Ulfat; Fahmid Al Rifat; Vinicius Carvalho Lopes (2023). Exploring the effectiveness of large language models in generating unit tests. 
[b45] Haotian Sun; Yuchen Zhuang; Lingkai Kong; Bo Dai; Chao Zhang (2023). Adaplanner: Adaptive planning from feedback with language models. 
[b46] Hugo Touvron; Louis Martin; Kevin Stone; Peter Albert; Amjad Almahairi; Yasmine Babaei; Nikolay Bashlykov; Soumya Batra; Prajjwal Bhargava; Shruti Bhosale; Dan Bikel; Lukas Blecher; Cristian Canton-Ferrer; Moya Chen; Guillem Cucurull; David Esiobu; Jude Fernandes; Jeremy Fu; Wenyin Fu; Brian Fuller; Cynthia Gao; Vedanuj Goswami; Naman Goyal; Anthony Hartshorn; Saghar Hosseini; Rui Hou; Hakan Inan; Marcin Kardas; Viktor Kerkez; Madian Khabsa; Isabel Kloumann; Artem Korenev; Punit Singh Koura; Marie-Anne Lachaux; Thibaut Lavril; Jenya Lee; Diana Liskovich; Yinghai Lu; Yuning Mao; Xavier Martinet; Todor Mihaylov; Pushkar Mishra; Igor Molybog; Yixin Nie; Andrew Poulton; Jeremy Reizenstein; Rashi Rungta; Kalyan Saladi; Alan Schelten; Ruan Silva; Eric Michael Smith; Ranjan Subramanian; Xiaoqing Ellen Tan; Binh Tang; Ross Taylor; Adina Williams; Jian Xiang Kuan; Puxin Xu; Zheng Yan; Iliyan Zarov; Yuchen Zhang; Angela Fan; Melanie Kambadur; Sharan Narang; Aurélien Rodriguez; Robert Stojnic; Sergey Edunov; Thomas Scialom (2023). Llama 2: Open foundation and fine-tuned chat models. 
[b47] Xingyao Wang; Yangyi Chen; Lifan Yuan; Yizhe Zhang; Yunzhu Li; Hao Peng; Heng Ji (2024). Executable code actions elicit better LLM agents. 
[b48] Xuezhi Wang; Jason Wei; Dale Schuurmans; Quoc V Le; Ed H Chi; Sharan Narang; Aakanksha Chowdhery; Denny Zhou (2023). Self-consistency improves chain of thought reasoning in language models. 
[b50] Yue Wang; Weishi Wang; Shafiq Joty; Steven C H Hoi (2021). CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. 
[b51] Jason Wei; Xuezhi Wang; Dale Schuurmans; Maarten Bosma; Brian Ichter; Fei Xia; Ed H Chi; Quoc V Le; Denny Zhou (2022-11-28). Chain-of-thought prompting elicits reasoning in large language models. 
[b52] Thomas Wolf; Lysandre Debut; Victor Sanh; Julien Chaumond; Clement Delangue; Anthony Moi; Pierric Cistac; Tim Rault; Rémi Louf; Morgan Funtowicz; Jamie Brew (2019). Huggingface's transformers: State-of-the-art natural language processing. 
[b53] Weixiang Yan; Yuchen Tian; Yunzhe Li; Qian Chen; Wen Wang (2023). CodeTransOcean: A comprehensive multilingual benchmark for code translation. 
[b54] Ke Yang; Jiateng Liu; John Wu; Chaoqi Yang; Yi Fung; Sha Li; Zixuan Huang; Xu Cao; Xingyao Wang; Heng Ji; Chengxiang Zhai (2024). If LLM is the wizard, then code is the wand: A survey on how code empowers large language models to serve as intelligent agents. 
[b55] Shunyu Yao; Dian Yu; Jeffrey Zhao; Izhak Shafran; Thomas L Griffiths; Yuan Cao; Karthik R Narasimhan (2023). Tree of thoughts: Deliberate problem solving with large language models. 
[b56] Eric Zelikman; Qian Huang; Gabriel Poesia; Noah Goodman; Nick Haber (2023). Parsel: Algorithmic reasoning with language models by composing decompositions. 
[b57] Fengji Zhang; Bei Chen; Yue Zhang; Jacky Keung; Jin Liu; Daoguang Zan; Yi Mao; Jian-Guang Lou; Weizhu Chen (2023). RepoCoder: Repository-level code completion through iterative retrieval and generation. 
[b58] Yifan Zhang; Jingqin Yang; Yang Yuan; Andrew Chi-Chih Yao (2024). Cumulative reasoning with large language models. 
[b59] Lily Zhong; Zilong Wang; Jingbo Shang (2024). LDB: A large language model debugger via verifying runtime execution step-by-step. 
[b60] Denny Zhou; Nathanael Schärli; Le Hou; Jason Wei; Nathan Scales; Xuezhi Wang; Dale Schuurmans; Claire Cui; Olivier Bousquet; Quoc V Le; Ed H Chi (2023). Least-to-most prompting enables complex reasoning in large language models. 
[b61] Ming Zhu; Karthik Suresh; Chandan K Reddy (2022-03-01). Multilingual code snippets training for program translation. AAAI Press

Figures:
Figure fig_0: 1
Type: figure
Caption: Figure 1: A flowgraph illustrates FUNCODER. FUNCODER branches off new functions to have sub-goals tackled iteratively (left), re-composites sub-functions, and selects the best using functional consensus (right). Bottom-right figure shows how FUNCODER writes functions at hierarchy-level.
Data: 

Figure fig_1: 
Type: figure
Caption: current function def find_factors(x: int) -> list:
Data: 

Figure fig_2: 4
Type: figure
Caption: Figure 4: Average accuracy in each level with the chat model (GPT-3.5) and the code model (StableCode 3b) on the MATH benchmark.
FUNCODER Can Handle Harder Questions. Figure 4 compares CoT, PoT, and FUNCODER across varying difficulty levels. It illustrates that CoT performs comparatively well on the easiest questions, but suffers from a steep decline in performance as difficulty increases. This suggests that text-based reasoning is inadequate for tackling challenging mathematical reasoning problems. The same situation is also observed in PoT. In contrast, our method consistently demonstrates high performance even on challenging problems, particularly excelling on level 5 difficulty with nearly double the performance compared to PoT and CoT. This reflects that our method, with divide-and-conquer applied, can effectively cope with complex problems.
Data: 

Figure fig_3: 5
Type: figure
Caption: Figure 5: Left: Algorithm for FUNCODER. Right: Decomposition example of A[B[DE]C].
Data: 

Figure fig_4: 
Type: figure
Caption: [a.1] FunCoder(a)
|  [a.3] LLM(a) -> A, {b, c}  # divide
|--[b.1] FunCoder(b)
|  |  [b.3] LLM(b) -> B, {d, e}  # divide
• The naive Standard method should naturally generate O(N) tokens. Sampling-based baselines like CodeT cost O(kN) tokens.
Data: 
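The divide-then-conquer traversal in the trace above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `TOY_DECOMPOSITION` is a hard-coded stand-in for what an LLM would propose in the divide step, `solve` is a hypothetical name, and composing a function is reduced to building the string `A[B[DE]C]`. It is not the authors' implementation, only a picture of the recursion order.

```python
from typing import Dict, List

# Hypothetical stand-in for the LLM "divide" step: each function name maps to
# the sub-functions it would introduce. This hard-codes the A[B[DE]C] example.
TOY_DECOMPOSITION: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d", "e"],
    "c": [],
    "d": [],
    "e": [],
}


def solve(name: str, trace: List[str], depth: int = 0) -> str:
    """Recursively divide `name`, solve its children, then conquer."""
    indent = "|  " * depth
    children = TOY_DECOMPOSITION[name]

    # Divide: a provisional implementation plus stubs for its sub-functions.
    stubs = ", ".join(children)
    trace.append(f"{indent}divide({name}) -> {name.upper()}, {{{stubs}}}")

    # Recurse depth-first so every child is finished before its parent.
    solved = [solve(child, trace, depth + 1) for child in children]

    # Conquer: re-implement the parent on top of the finished children.
    composed = name.upper() + (f"[{''.join(solved)}]" if solved else "")
    trace.append(f"{indent}conquer({name}) -> {composed}")
    return composed
```

Calling `solve("a", [])` returns `A[B[DE]C]`, and the collected trace shows the leaves `d` and `e` being conquered before `b`, and `b` and `c` before `a`.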

Figure fig_5: 
Type: figure
Caption: Output 2
    # check if there are cnt1 + cnt2 numbers divisible by both x and y
    lcm = (x * y) // gcd(x, y)
    common_numbers = v - (v // lcm)
    if common_numbers < cnt1 + cnt2:
        return False
    return True

    def gcd(a: int, b: int) -> int:
        """get the greatest common divisor of two numbers $a$ and $b$"""
Info: MATH/number_theory/42 (level 5)
Problem: What is the sum of the smallest and second-smallest positive integers a satisfying the congruence 27a ≡ 17 (mod 40)?
Ground truth: 62
Data: 
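The MATH problem quoted above can be checked with a short worked sketch. The helper names `extended_euclid` and `smallest_solutions` are chosen here for illustration and are not taken from the paper's generated output; the approach is the standard one of inverting the coefficient with the extended Euclidean algorithm.

```python
from typing import List, Tuple


def extended_euclid(a: int, b: int) -> Tuple[int, int, int]:
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_euclid(b, a % b)
    # b*x + (a % b)*y == g  and  a % b == a - (a // b)*b
    return g, y, x - (a // b) * y


def smallest_solutions(a: int, c: int, m: int, count: int = 2) -> List[int]:
    """Smallest `count` positive integers x with a*x ≡ c (mod m)."""
    g, x, _ = extended_euclid(a, m)
    if c % g != 0:
        return []  # the congruence has no solution
    step = m // g  # solutions repeat with this period
    first = (x * (c // g)) % step
    if first == 0:
        first = step  # only positive integers are wanted
    return [first + k * step for k in range(count)]


solutions = smallest_solutions(27, 17, 40)
answer = sum(solutions)
```

Here 27 · 3 = 81 ≡ 1 (mod 40), so the inverse of 27 is 3 and a ≡ 3 · 17 = 51 ≡ 11 (mod 40). The two smallest positive solutions are 11 and 51, whose sum is 62.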

Figure fig_6: 
Type: figure
Caption: # Find the smallest and second-smallest positive integers satisfying the congruence equation
    # using the formula: x + k * (m // gcd(a, m))
    gcd, _, _ = extended_euclid(a, m)
    s1 = x
    s2 = x + (m // gcd)
    return s1 + s2

    def find_inverse(a: int, m: int) -> int:
        """find the inverse of $a$ modulo $m$"""
        _, x, _ = extended_euclid(a, m)
        return x % m if x >= 0 else (x + m) % m

    def extended_euclid(a: int, b: int) -> Tuple[int, int, int]

for Divide
You are a programming copilot, you can solve a problem by writing Python functions. Your task is to:
- For every turn, you need to write a Python function that returns the answer based on Current Code (not code in chat history).
- Do not modify function name, arg names, docstring in given functions.
- You can import libraries to better solve the problem.
- If a single function is too hard to solve, you can decompose it into multiple smaller functions.
- You can leave new function unimplemented for now, but write the function at the end of the code and comment what the function does.

```python
def sum_common_factors(a: int, b: int) -> int:
    """Compute the sum of all common prime factors of $a$ and $b$"""
    raise NotImplementedError()
```
Let's think step by step and complete the following Python function `sum_common_factors` that solves: "Compute the sum of all common prime factors of $a$ and $b$"

<Assistant>:
First, I need to get the prime factors of $a$ and $b$. Second, I can use `for` loop to find common element in two factors list. Finally, sum the common factor list and return the answer.
Here is the `sum_common_factors` function:
```python
def sum_common_factors(a: int, b: int) -> int:
    """Compute the sum of all common prime factors of $a$ and $b$"""
    factors_a = prime_factor(a)
    factors_b = prime_factor(b)
    common_factors = get_common(factors_a, factors_b)
    return sum(common_factors)

def prime_factor(x: int) -> list:
    """get a list of prime factors of number $x$"""
    raise NotImplementedError()

def get_common(a: list, b: list) -> list:
    """get common element in two list $a$ and $b$"""
    raise NotImplementedError
```

```python
def sum_common_factors(a: int, b: int) -> int:
    """Compute the sum of all common prime factors of $a$ and $b$"""
    factors_a = prime_factor(a)
    factors_b = prime_factor(b)
    common_factors = get_common(factors_a, factors_b)
    return sum(common_factors)

def get_common(a: list, b: list) -> list:
    """get common element in two list $a$ and $b$"""
    raise NotImplementedError()
```
Let's think step by step and complete the following Python function `get_common` that solves: "get common element in two list $a$ and $b$"

<Assistant>:
Here is the `get_common` function:
```python
def get_common(a: list, b: list) -> list:
```
Data: 

Figure fig_7: 
Type: figure
Caption: First, I need to get the prime factors of $a$ and $b$. Second, I can use `for` loop to find common element in two factors list. Finally, sum the common factor list and return the answer. Here is the `sum_common_factors` function: ```python def sum_common_factors(a: int, b: int) -> int: """Compute the sum of all common prime factors of $a$ and $b$ "
Data: 

Figure tab_1: 1
Type: table
Caption: Experiment results on code generation benchmarks. We report Pass@1 as evaluate metric. Results from the original paper are underlined, and the best results are bold.
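Data:
| Model | Method | HumanEval Pass@1 | ∆ ↑ | MBPP Pass@1 | ∆ ↑ | xCodeEval Easy | Mid | Hard | Expert | All |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | Standard | 68.3 | - | 72.0 | - | 44.4 | 15.2 | 4.6 | 0.0 | 20.2 |
| GPT-3.5 | CodeT | 81.1 | +12.8 | 76.0 | +4.0 | 50.6 | 16.1 | 8.0 | 0.0 | 23.2 |
| GPT-3.5 | Reflexion | 69.5 | +1.2 | 72.5 | +0.5 | 44.4 | 17.0 | 5.7 | 0.0 | 20.6 |
| GPT-3.5 | LDB | 82.9 | +14.6 | 76.0 | +4.0 | - | - | - | - | - |
| GPT-3.5 | FUNCODER | 85.4 | +17.1 | 78.5 | +6.5 | 62.4 | 29.5 | 11.6 | 0.0 | 31.4 |
| GPT-4 | Standard | 82.9 | - | 73.5 | - | 68.5 | 39.3 | 19.5 | 1.7 | 37.4 |
| GPT-4 | Parsel | 85.0 | +2.1 | - | - | - | - | - | - | - |
| GPT-4 | CodeT | 90.9 | +8.0 | 77.0 | +3.5 | 76.4 | 51.8 | 21.8 | 3.4 | 44.0 |
| GPT-4 | Reflexion | 91.0 | +8.1 | 77.1 | +3.6 | 71.3 | 41.1 | 19.5 | 2.5 | 38.6 |
| GPT-4 | MetaGPT | 85.9 | +3.0 | - | - | - | - | - | - | - |
| GPT-4 | FUNCODER | 94.5 | +11.6 | 79.5 | +6.0 | 83.1 | 58.0 | 26.4 | 3.4 | 48.6 |
| Llama3 8b | Standard | 61.6 | - | 60.5 | - | 9.0 | 1.8 | 0.0 | 0.0 | 3.6 |
| Llama3 8b | CodeT | 68.9 | +7.3 | 61.5 | +1.0 | 12.4 | 0.0 | 0.0 | 0.0 | 4.4 |
| Llama3 8b | FUNCODER | 79.7 | +18.1 | 62.5 | +2.0 | 22.0 | 0.9 | 0.0 | 0.0 | 8.0 |
| StableCode 3b | Standard | 61.0 | - | 51.5 | - | 7.3 | 0.9 | 0.0 | 0.0 | 2.8 |
| StableCode 3b | CodeT | 75.0 | +14.0 | 57.5 | +6.0 | 11.2 | 1.8 | 0.0 | 0.0 | 4.6 |
| StableCode 3b | FUNCODER | 81.0 | +20.0 | 63.5 | +12.0 | 13.5 | 4.5 | 1.1 | 0.0 | 6.2 |
| CodeLlama 34b | Standard | 43.9 | - | 53.5 | - | 2.3 | 0.0 | 0.0 | 0.0 | 0.8 |
| CodeLlama 34b | CodeT | 55.5 | +11.6 | 56.5 | +3.0 | 10.1 | 0.0 | 0.0 | 0.0 | 3.6 |
| CodeLlama 34b | FUNCODER | 66.5 | +22.6 | 58.5 | +5.0 | 10.2 | 0.0 | 0.0 | 0.0 | 3.6 |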

Figure tab_2: 
Type: table
Caption: 
Data: [Figure 3 plot residue] (a) Self-testing outcome shares, two values per category in extraction order: final passed 43.9% / 44.5%; final failed 4.3% / 19.5%; program wrong and unit-test wrong 25.0%, 12.8%, 5.5%, 16.5%; both incorrect 14.0% / 14.0%. (b) Pass@k against Num Selected Programs (1 to 11) for the strategies consensus, self-test, and random; legend: GPT-3.5, StableCode 3b.

Figure tab_3: 
Type: table
Caption: 
Data: Figure 3: (a) Preliminary study on self-testing, the programs are evaluated using unit-tests generated by LLMs. (b) The effectiveness of different ranking strategies. We compute the Pass@k over top-k programs ranked by functional consensus, self-test, and random on 11 candidates. (higher is better)
enhance LLM's capability of solving more complex programming tasks, with an average accuracy improvement of 82.3% over the baseline on the Mid and Hard subsets of xCodeEval. Expert level programs, however, still remain a colossal challenge for even the most cutting-edge LLMs.
Evaluation is also performed over community LLMs, Llama3 (Meta AI, 2024), StableCode (Pinnaparaju et al., 2024) and CodeLlama (Rozière et al., 2023) with results in Table 1. FUNCODER consistently boosts the performance of smaller models in code generation, demonstrating notable improvements compared to standard prompting on HumanEval, which gained +29.4% on Llama3, +32.8% on StableCode, and even +51.5% on CodeLlama, outperforming that from the previous best method CodeT. We also supplement results on GPT-4o mini, Codestral and StarCoder2 in Table

Figure tab_4: 2
Type: table
Caption: Note that to generate every program FUNCODER costs only O(kN) tokens, where k is the number of sampled candidates,
Data: 

Figure tab_5: 2
Type: table
Caption: Ablation study of FUNCODER on HumanEval with GPT-3.5. The setting in our main experiment is highlighted in bold. Tokens are calculated as the sum of prompts and completions.
Data:
| Setting | Divide | Conquer | Ranking | Pass@1 | Avg. Tokens |
|---|---|---|---|---|---|
| Standard | ✗ | ✗ | ✗ | 68.3 | 886.7 |
| One-pass | ✓ | ✗ | ✗ | 72.6 (+4.3) | 1233.7 |
| Two-pass | ✓ | ✓ | ✗ | 78.7 (+10.4) | 3343.2 |
| Two-pass + ST@11 | ✓ | ✓ | Self-Test@11 | 80.5 (+12.2) | 5408.3 |
| Two-pass + CL@11 | ✓ | ✓ | Clustering@11 | 75.0 (+6.7) | 5070.7 |
| FUNCODER@5 | ✓ | ✓ | Consensus@5 | 83.5 (+15.2) | 4040.9 |
| FUNCODER@11 | ✓ | ✓ | Consensus@11 | 85.4 (+17.1) | 5402.0 |

and N is the token length of the final program. This is further exemplified and explained in §A.7. We observe that function decomposition and re-composition deliver cumulative performance improvements. Functional consistency is also shown to prevail over self-testing. Putting them all together, FUNCODER received a +17.1 improvement with just 5.09× more tokens over baseline. Compared to previous SOTA LDB (≈ 23K tokens), we are able to gain +2.5 in performance with 76.5% token usage reduction.
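To make the Consensus@k ranking in the table concrete, here is a minimal sketch of selecting a candidate by behavioral agreement. It assumes one simple reading of functional consensus: each candidate is scored by how many of its outputs match the outputs of the other candidates on a shared set of inputs, and the top scorer is kept. The names `run`, `select_by_consensus`, and the three toy candidates are hypothetical, and the paper's exact similarity measure and input sampling may differ.

```python
from typing import Any, Callable, List, Sequence, Tuple


def run(func: Callable[[Any], Any], x: Any) -> Any:
    """Call func(x); a crash yields a fresh sentinel that equals nothing else."""
    try:
        return func(x)
    except Exception:
        return object()


def select_by_consensus(
    candidates: Sequence[Callable[[Any], Any]], inputs: Sequence[Any]
) -> Tuple[int, List[int]]:
    """Return (index of best candidate, agreement score per candidate)."""
    outputs = [[run(f, x) for x in inputs] for f in candidates]
    scores = []
    for i, mine in enumerate(outputs):
        score = 0
        for j, theirs in enumerate(outputs):
            if i != j:
                # Count the inputs on which the two candidates agree.
                score += sum(1 for a, b in zip(mine, theirs) if a == b)
        scores.append(score)
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best, scores


# Toy candidates for "absolute value"; the last one is deliberately buggy.
def abs_v1(x): return abs(x)
def abs_v2(x): return x if x > 0 else -x
def abs_buggy(x): return x

best_index, agreement = select_by_consensus([abs_v1, abs_v2, abs_buggy], [-2, 0, 3])
```

On these inputs the two correct variants agree with each other everywhere and score 5 each, while the buggy one scores 4, so a correct variant is selected without any reference outputs or generated unit tests.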

Figure tab_6: 3
Type: table
Caption: Experimental results on MATH, a competition-level mathematical reasoning benchmark. Best results are in bold. Text-based reasoning methods are denoted with † , while others use programaided reasoning. We report both overall results and results in seven subjects: Prealgebra, Algebra, Number Theory, Counting & Probability, Geometry, Intermediate Algebra, and Precalculus.
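Data:
| Model | Method | Prealg. | Alg. | NT | Prob. | Geo. | InterAlg. | Precalc. | Overall |
|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | Standard † | 62.2 | 37.4 | 20.0 | 29.8 | 31.0 | 24.4 | 21.8 | 34.6 |
| GPT-3.5 | CoT † | 59.8 | 51.1 | 28.9 | 29.8 | 28.6 | 26.7 | 30.9 | 40.0 |
| GPT-3.5 | PoT | 68.3 | 50.4 | 33.3 | 48.9 | 21.4 | 18.2 | 29.1 | 41.0 |
| GPT-3.5 | Self-Refine | 74.4 | 49.6 | 48.9 | 57.4 | 28.6 | 35.6 | 36.4 | 48.6 |
| GPT-3.5 | FUNCODER | 76.8 | 61.2 | 55.6 | 59.6 | 34.1 | 36.0 | 41.8 | 54.0 |
| GPT-4 | Standard † | 81.7 | 82.7 | 71.1 | 72.3 | 59.5 | 46.7 | 47.3 | 68.2 |
| GPT-4 | CoT † | 84.1 | 87.1 | 62.2 | 68.1 | 45.2 | 48.9 | 54.5 | 68.6 |
| GPT-4 | PoT | 79.3 | 80.6 | 75.6 | 72.3 | 50.0 | 47.8 | 58.2 | 68.2 |
| GPT-4 | Self-Refine | 82.9 | 82.0 | 77.8 | 76.6 | 54.8 | 55.6 | 63.6 | 72.2 |
| GPT-4 | CR | 86.6 | 86.3 | 88.7 | 71.1 | 53.7 | 51.5 | 51.8 | 72.2 |
| GPT-4 | FUNCODER | 89.0 | 92.8 | 82.2 | 83.0 | 59.5 | 63.3 | 56.4 | 78.2 |
| Llama3 8b | CoT † | 56.1 | 47.5 | 31.1 | 34.0 | 40.5 | 14.4 | 38.2 | 38.6 |
| Llama3 8b | PoT | 67.1 | 32.4 | 24.4 | 34.0 | 16.7 | 21.1 | 18.2 | 32.6 |
| Llama3 8b | FUNCODER | 67.9 | 45.7 | 51.1 | 53.2 | 19.0 | 37.8 | 30.9 | 45.0 |
| StableCode 3b | PoT | 20.7 | 14.4 | 17.8 | 25.5 | 4.8 | 8.9 | 9.1 | 14.4 |
| StableCode 3b | FUNCODER | 46.3 | 30.2 | 20.0 | 29.8 | 4.8 | 20.0 | 18.2 | 26.6 |
| CodeLlama 34b | PoT | 35.5 | ? | ? | ? | ? | 5.5 | 33.3 | 15.2 |
| CodeLlama 34b | FUNCODER | 44.8 | ? | ? | ? | ? | 24.6 | 37.5 | 24.4 |

CodeLlama 34b, Alg. through Geo. for both methods, as extracted (row assignment not recoverable): 26.1 15.0 16.7 46.1 37.8 34.1 13.6 0.0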

Figure tab_7: 4
Type: table
Caption: Top-3 most commonly used functions in each subject of MATH, listed in descending order.
Data:
| Subject | Functions |
|---|---|
| Prealgebra | is_prime / factorial / gcd |
| Algebra | find_roots / is_perfect_square / find_domain |
| Number Theory | get_divisors / mod_inverse / gcd |
| Counting & Probability | factorial / combinations / binomial_coefficient |
| Geometry | distance / simplify_fraction / calculate_triangle_area |
| Intermediate Algebra | find_roots / evaluate_polynomial / lagrange_interpolation |
| Precalculus | cross_product / fraction_from_angle / dot |

Figure tab_8: 
Type: table
Caption: prompt the model to generate a program that solves mathematical or symbolic problems. MathPrompter (Imani et al., 2023) and Chain-of-Code (Li et al., 2023a) fuse the text-based chain-of-thought with code-based program-of-thought prompting to complement each other in mathematical reasoning. Cumulative Reasoning (Zhang et al., 2024) conducts bottom-up reasoning to derive the final answer progressively. Numerous works (Sun et al., 2023; Wang et al., 2024; Yang et al., 2024) also use code as an intermediate component to bridge LLM agents with external environments.
Data: 

Figure tab_9: 5
Type: table
Caption: Symbols and Glossary.
Data:
(i) Symbols

| Symbol | Alias | Description |
|---|---|---|
| f(x) | Function | In the programming language, a function consists of header, documentation, and its body {h_f, d_f, b_f}. A function can also be viewed as a mapping f : D(f) → Y. |
| h_f | Function Header | Declares the function name, arguments, and return type, and is used as a signature to identify the function in a program. |
| d_f | Function Docstring (or Documentation) | Provides additional usage details for this function, but is optional. We encourage the model to generate docstrings to describe sub-goals precisely. |
| b_f | Function Body (or Implementation) | The function body contains a subroutine that describes its control flow and behavior. Functions may be invoked from within. |
| f′(x) | Partially Implemented | A provisional function structure generated by the LLM where sub-procedures are not yet implemented. |
| f  F | Function Composition | To implement a certain function f respecting sub-procedures as potentially reusable components. |

(ii) Glossary

| Term | Alias | Description |
|---|---|---|
| System Test | Hidden Test | System testing is a phase where a set of previously invisible test cases are run against the submitted program to validate if the code is correct and produces the expected output for different categories of inputs. |
| Unit Test | Assertion | A unit test is an assertion consisting of given input and expected output, whereas in Python, it takes the form of assert func(x) == y. |
| Self-testing | - | Self-testing is an evaluation process that prompts the model to generate unit tests (assertions) to assess the correctness of the generated program. |
| AlphaCode-like Clustering | - | AlphaCode proposed a clustering process that elects candidate program from a number of samples, recognizing programs that produce exactly identical outputs as equivalent, and picks one program from the largest cluster. |
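The AlphaCode-like clustering entry in the glossary can be sketched in a few lines. This is a simplified illustration under stated assumptions, not AlphaCode's or the authors' code: `behaviour_signature` and `select_by_clustering` are hypothetical names, outputs are compared through `repr`, and a crash is recorded as a marker string so that it can be hashed.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple


def behaviour_signature(func: Callable[[Any], Any], inputs: Sequence[Any]) -> Tuple[str, ...]:
    """Summarise a program by the repr of its output on every input."""
    signature = []
    for x in inputs:
        try:
            signature.append(repr(func(x)))
        except Exception as exc:
            signature.append(f"<error: {type(exc).__name__}>")
    return tuple(signature)


def select_by_clustering(
    candidates: Sequence[Callable[[Any], Any]], inputs: Sequence[Any]
) -> Tuple[Callable[[Any], Any], List[int]]:
    """Group programs with exactly identical outputs; pick from the largest group."""
    clusters: Dict[Tuple[str, ...], List[Callable[[Any], Any]]] = defaultdict(list)
    for f in candidates:
        clusters[behaviour_signature(f, inputs)].append(f)
    largest = max(clusters.values(), key=len)
    return largest[0], [len(members) for members in clusters.values()]
```

Unlike a pairwise agreement score, this grouping is all-or-nothing: two programs that differ on a single input land in different clusters, however similar they are elsewhere.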

Figure tab_10: 
Type: table
Caption: Parsel(Zelikman et al., 2023) consists of three generation stages: high-level sketch, Parsel program, and final program. The Parsel program is an intermediate representation of code that describes and organizes program structure. We report the result of HumanEval with GPT-4 from the original paper.
Data: 

Figure tab_11: 6
Type: table
Caption: Overview and details of HumanEval, MBPP, xCodeEval, and MATH dataset.
HumanEval (Chen et al., 2021) is a hand-crafted programming dataset designed to evaluate a model's code generation capability. It consists of 164 instances involving programming skills in language comprehension, reasoning, algorithms, and simple mathematics. The problem contains 2.8 sample inputs and outputs on average in the function document, which can be leveraged to provide additional guidance for the LLM to select or self-improve the programs. We conduct experiments on all 164 instances using accuracy (Pass@1) as the evaluation metric. The details of the Pass@1 metric are described in Appendix A.4.
MBPP (Austin et al., 2021) consists of fundamental Python programming problems, with a total of 974 examples covering Python programming basics, standard library usage, and related assessment. Following Shinn et al. (2023), we adopt the mbpp-typed split from MultiPL-E (Cassano et al., 2023) and sample 200 instances, using Pass@1 as the metric. The original prompt from MBPP includes all hidden tests in the input problem, which may cause label leakage when using these tests to refine or select programs. To ensure a fair comparison, MultiPL-E removes the test information in the prompt.
xCodeEval (Khan et al., 2023) is a competition-level multilingual and multitask benchmark consisting of 17 programming languages. xCodeEval collects 25 million openly available samples from codeforces.com, a platform for competitive programming. The data we use include problem descriptions in problem_descriptions.jsonl and system tests from unittest_db.json which consists of 7,635 competition problems and averaged 51.1 tests per problem. Note that the tests in xCodeEval are crawled, some of them are incomplete due to the website context limit (they end with an ellipsis and the further content is missing); we filter out problems having invalid test cases.
Data:
| | HumanEval | MBPP | xCodeEval | MATH |
|---|---|---|---|---|
| Task | Code Generation | Code Generation | Programming Contest | Mathematical Reasoning |
| Attribute | - | - | tags, difficulty | subject, level |
| Metric | Pass@1 | Pass@1 | Pass@1 | EM-gpt |
| # Sample (original) | 164 | 427 | 7,635 | 5,000 |
| # Sample (ours) | 164 | 200 | 500 | 500 |
| Entry func | variant | variant | main() | solution() |
| Input | arguments | arguments | standard input | n/a |
| Output | return | return | standard output | return |
| # Examples Tests | ~2.8 | 0 | ~2.1 | n/a |
| # System Tests | ~8.1 | ~3.1 | 51.1 | n/a |

Figure tab_12: 
Type: table
Caption: For each question, we: (1) Obtain one solution program p from Standard Prompting. (2) Prompt the model to generate 7 self-tests T based on the question and entry function. The self-test is in the form of the unit test assert f(x) == y. We then judge the generated program p and canonical solution c over the self-tests T and system tests S.
Data: 

Figure tab_13: 8
Type: table
Caption: Explanation of how we classify cases in the self-testing preliminary study.
Data:
| Class | Subclass | Condition | Explanation |
|---|---|---|---|
| self-test passed | final passed | (p, T) ∧ (p, S) | The self-test result is consistent with the final judge. However, self-testing methods cannot improve performance in this case, as the program from the baseline (Standard Prompt) is already correct. |
| self-test passed | final failed | (p, T) ∧ ¬(p, S) | The self-test is too weak to detect errors in the program; there could be edge cases that have not been considered. |
| self-test failed | program wrong | ¬(p, T) ∧ ¬(p, S) ∧ (c, T) | This is a good example in which self-testing detects errors in the program. Feedback from the test will be used to select or refine the solution. |
| self-test failed | unit-test wrong | ¬(p, T) ∧ (p, S) ∧ ¬(c, T) | Bad case: the self-test produced an erroneous result and filtered out a correct solution. Continuously revising the code for this test will lead to a performance downgrade. |
| self-test failed | both wrong | ¬(p, T) ∧ ¬(p, S) ∧ ¬(c, T) | The model is unable to generate a correct solution or correct test cases. Refining the program over faulty test samples will not lead to the correct answer. |
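The classification conditions can be read as a small decision procedure. The sketch below is only an illustration of the table: the function name and the three boolean arguments are hypothetical, standing for (p, T), (p, S) and (c, T) respectively. The one combination that the table does not list is returned as "unclassified" rather than guessed.

```python
def classify_self_test_case(p_passes_T: bool, p_passes_S: bool, c_passes_T: bool) -> str:
    """Map the three judgements onto the classes of the table.

    p_passes_T: generated program p passes the self-tests T
    p_passes_S: generated program p passes the system tests S
    c_passes_T: canonical solution c passes the self-tests T
    """
    if p_passes_T:
        # Self-test passed: only the final judge on S matters.
        if p_passes_S:
            return "self-test passed / final passed"
        return "self-test passed / final failed"
    # Self-test failed from here on.
    if not p_passes_S and c_passes_T:
        return "self-test failed / program wrong"
    if p_passes_S and not c_passes_T:
        return "self-test failed / unit-test wrong"
    if not p_passes_S and not c_passes_T:
        return "self-test failed / both wrong"
    # Remaining combination, not(p, T) and (p, S) and (c, T), has no row in the table.
    return "unclassified"


print(classify_self_test_case(False, True, False))  # self-test failed / unit-test wrong
```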

Figure tab_14: 10
Type: table
Caption: Token usage of FUNCODER and baseline methods on other datasets, i.e. MBPP, xCodeEval and MATH. Results are evaluated on GPT-3.5-turbo.
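Data:
| Method | Setting | Pass@1 | Tokens Min. | Tokens Max. | Tokens Avg. | Tokens Med. |
|---|---|---|---|---|---|---|
| Standard | One-time | 68.3 | 648 | 1477 | 886.7 | 861 |
| CodeT | One-time + Self-Test@11 | 81.1 (+12.8) | 2298 | 9645 | 4479.1 | 4166 |
| Reflexion | maxiter=2 | 69.5 (+1.2) | 416 | 4906 | 1416.1 | 754 |
| LDB (reported) | line-level | 80.5 (+12.2) | - | - | 24K | - |
| LDB (reported) | block-level | 82.9 (+14.6) | - | - | 23K | - |
| LDB (reported) | function-level | 79.9 (+11.6) | - | - | 27K | - |
| FUNCODER | One-pass | 72.6 (+4.3) | 826 | 3489 | 1233.7 | 1132 |
| FUNCODER | Two-pass | 78.7 (+10.4) | 2197 | 8406 | 3343.2 | 3078 |
| FUNCODER | Two-pass + Consensus@5 | 83.5 (+15.2) | 2455 | 9432 | 4040.9 | 3800 |
| FUNCODER | Two-pass + Consensus@11 | 85.4 (+17.1) | 3015 | 13850 | 5402.0 | 5166 |
| FUNCODER (ablation) | Two-pass + Self-Test@11 | 80.5 (+12.2) | 2967 | 13758 | 5408.3 | 5184 |
| FUNCODER (ablation) | Two-pass + Clustering@11 | 75.0 (+6.7) | 3044 | 9958 | 5070.7 | 4888 |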

Figure tab_15: 12
Type: table
Caption: Experimental results on MATH, a competition-level mathematical reasoning benchmark. Best results are in bold. Text-based reasoning methods are denoted with †, while others use program-aided reasoning. We report both overall results and results in seven subjects: Prealgebra, Algebra, Number Theory, Counting & Probability, Geometry, Intermediate Algebra, and Precalculus.
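Data:
| Model | Method | Prealg. | Alg. | NT | Prob. | Geo. | InterAlg. | Precalc. | Overall |
|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | Standard † | 62.2 | 37.4 | 20.0 | 29.8 | 31.0 | 24.4 | 21.8 | 34.6 |
| GPT-3.5 | CoT † | 59.8 | 51.1 | 28.9 | 29.8 | 28.6 | 26.7 | 30.9 | 40.0 |
| GPT-3.5 | PoT | 68.3 | 50.4 | 33.3 | 48.9 | 21.4 | 18.2 | 29.1 | 41.0 |
| GPT-3.5 | Self-Refine | 74.4 | 49.6 | 48.9 | 57.4 | 28.6 | 35.6 | 36.4 | 48.6 |
| GPT-3.5 | FUNCODER | 76.8 | 61.2 | 55.6 | 59.6 | 34.1 | 36.0 | 41.8 | 54.0 |
| GPT-4 | Standard † | 81.7 | 82.7 | 71.1 | 72.3 | 59.5 | 46.7 | 47.3 | 68.2 |
| GPT-4 | CoT † | 84.1 | 87.1 | 62.2 | 68.1 | 45.2 | 48.9 | 54.5 | 68.6 |
| GPT-4 | PoT | 79.3 | 80.6 | 75.6 | 72.3 | 50.0 | 47.8 | 58.2 | 68.2 |
| GPT-4 | Self-Refine | 82.9 | 82.0 | 77.8 | 76.6 | 54.8 | 55.6 | 63.6 | 72.2 |
| GPT-4 | CR | 86.6 | 86.3 | 88.7 | 71.1 | 53.7 | 51.5 | 51.8 | 72.2 |
| GPT-4 | FUNCODER | 89.0 | 92.8 | 82.2 | 83.0 | 59.5 | 63.3 | 56.4 | 78.2 |
| GPT-4o mini | Standard † | 79.3 | 83.5 | 75.6 | 87.2 | 47.6 | 57.8 | 56.4 | 71.8 |
| GPT-4o mini | CoT † | 90.2 | 95.7 | 82.2 | 68.1 | 50.0 | 61.1 | 61.8 | 77.2 |
| GPT-4o mini | PoT | 80.5 | 84.2 | 77.8 | 72.3 | 50.0 | 60.0 | 50.9 | 71.0 |
| GPT-4o mini | Self-Refine | 79.3 | 83.5 | 75.6 | 87.2 | 47.6 | 57.8 | 56.4 | 71.8 |
| GPT-4o mini | FUNCODER | 81.7 | 83.5 | 80.0 | 80.9 | 59.5 | 60.0 | 54.5 | 73.2 |
| Llama3 8b | CoT † | 56.1 | 47.5 | 31.1 | 34.0 | 40.5 | 14.4 | 38.2 | 38.6 |
| Llama3 8b | PoT | 67.1 | 32.4 | 24.4 | 34.0 | 16.7 | 21.1 | 18.2 | 32.6 |
| Llama3 8b | FUNCODER | 67.9 | 45.7 | 51.1 | 53.2 | 19.0 | 37.8 | 30.9 | 45.0 |
| Codestral 22b | PoT | 70.7 | 56.1 | 46.7 | 44.7 | 21.4 | 26.7 | 30.9 | 45.6 |
| Codestral 22b | FUNCODER | 81.7 | 61.9 | 46.7 | 55.3 | 28.6 | 45.6 | 38.2 | 54.8 |
| StableCode 3b | PoT | 20.7 | 14.4 | 17.8 | 25.5 | 4.8 | 8.9 | 9.1 | 14.4 |
| StableCode 3b | FUNCODER | 46.3 | 30.2 | 20.0 | 29.8 | 4.8 | 20.0 | 18.2 | 26.6 |
| CodeLlama 34b | PoT | 35.5 | 26.1 | 15.0 | 16.7 | 13.6 | 5.5 | 33.3 | 15.2 |
| CodeLlama 34b | FUNCODER | 44.8 | 46.1 | 37.8 | 34.1 | 0.0 | 24.6 | 37.5 | 24.4 |
| StarCoder2 15b | PoT | 46.3 | 29.5 | 28.9 | 25.5 | 21.4 | 27.8 | 23.6 | 30.2 |
| StarCoder2 15b | FUNCODER | 72.0 | 39.6 | 40.9 | 46.8 | 23.8 | 28.1 | 27.3 | 40.8 |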

Figure tab_16: 13
Type: table
Caption: Full results of each method at different levels of MATH. The best results are in bold. Text-based reasoning methods are denoted with † , while others use program-aided reasoning.
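Data:
| Model | Method | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Overall |
|---|---|---|---|---|---|---|---|
| GPT-3.5 | Standard † | 61.5 | 51.1 | 43.5 | 25.9 | 17.7 | 34.6 |
| GPT-3.5 | CoT † | 76.9 | 48.9 | 50.9 | 33.6 | 21.8 | 40.0 |
| GPT-3.5 | PoT | 61.5 | 51.1 | 56.5 | 33.6 | 24.1 | 41.0 |
| GPT-3.5 | Self-Refine | 84.6 | 61.1 | 65.7 | 32.8 | 31.3 | 48.6 |
| GPT-3.5 | FUNCODER | 84.6 | 65.9 | 68.5 | 43.1 | 37.4 | 54.0 |
| GPT-4 | Standard † | 89.7 | 85.6 | 83.3 | 55.2 | 51.0 | 68.2 |
| GPT-4 | CoT † | 94.9 | 81.1 | 77.8 | 64.7 | 50.3 | 68.6 |
| GPT-4 | PoT | 94.9 | 80.0 | 74.1 | 63.8 | 53.1 | 68.2 |
| GPT-4 | Self-Refine | 94.9 | 81.1 | 83.3 | 62.1 | 60.5 | 72.2 |
| GPT-4 | CR | 90.7 | 90.0 | 81.9 | 66.4 | 52.2 | 72.2 |
| GPT-4 | FUNCODER | 94.9 | 90.0 | 81.5 | 75.9 | 66.0 | 78.2 |
| GPT-4o mini | Standard † | 87.2 | 82.2 | 80.6 | 62.9 | 61.9 | 71.8 |
| GPT-4o mini | CoT † | 97.4 | 90.0 | 87.0 | 71.6 | 61.2 | 77.2 |
| GPT-4o mini | PoT | 89.7 | 81.1 | 76.9 | 63.8 | 61.2 | 71.0 |
| GPT-4o mini | Self-Refine | 87.2 | 82.2 | 80.6 | 62.9 | 61.9 | 71.8 |
| GPT-4o mini | FUNCODER | 94.9 | 82.2 | 81.5 | 62.9 | 63.9 | 73.2 |
| Llama3 8b | CoT † | 76.9 | 46.7 | 46.3 | 25.9 | 27.9 | 38.6 |
| Llama3 8b | PoT | 64.1 | 43.3 | 41.7 | 25.0 | 17.0 | 32.6 |
| Llama3 8b | FUNCODER | 79.5 | 60.0 | 52.3 | 37.4 | 27.9 | 45.0 |
| Codestral 22b | PoT | 79.5 | 56.7 | 57.4 | 34.5 | 29.9 | 45.6 |
| Codestral 22b | FUNCODER | 84.6 | 66.7 | 67.6 | 43.1 | 39.5 | 54.8 |
| StableCode 3b | PoT | 35.9 | 22.2 | 19.4 | 7.8 | 5.4 | 14.4 |
| StableCode 3b | FUNCODER | 53.8 | 37.8 | 35.2 | 21.6 | 10.2 | 26.6 |
| CodeLlama 34b | PoT | 36.1 | 30.7 | 28.0 | 13.0 | 8.8 | 15.2 |
| CodeLlama 34b | FUNCODER | 60.6 | 52.1 | 44.3 | 28.8 | 16.3 | 24.4 |
| StarCoder2 15b | PoT | 43.6 | 44.4 | 45.4 | 20.7 | 14.3 | 30.2 |
| StarCoder2 15b | FUNCODER | 71.8 | 57.8 | 55.1 | 26.7 | 23.3 | 40.8 |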

Figure tab_17: 
Type: table
Caption: You want to present each of them several positive integers. You want to present cnt1 numbers to the first friend and cnt2 numbers to the second friend. Moreover, you want all presented numbers to be distinct, that also means that no number should be presented to both friends. In addition, the first friend does not like the numbers that are divisible without remainder by prime number x. The second one does not like the numbers that are divisible without remainder by prime number y. Of course, you're not going to present your friends numbers they don't like. Your task is to find such minimum number v, that you can form presents using numbers from a set 1, 2, ..., v. Of course you may choose not to present some numbers at all. A positive integer number greater than 1 is called prime if it has no positive divisors other than 1 and itself.
Data: .1 Code Generation
Info: xCodeEval/ff3c39b759a049580a6e96c66c904fdc (binary search, difficulty 1800)
Problem
time limit per test: 1 second
memory limit per test: 256 megabytes
input: standard input
output: standard output
## Description
You have two friends.
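For orientation, the problem above admits a short binary-search solution, consistent with its "binary search" tag. The sketch below is not output from FUNCODER or any baseline; the function names `min_v` and `feasible` are hypothetical, and the feasibility test is a standard counting argument assumed here: a prefix 1..v works when enough numbers avoid x, enough avoid y, and enough avoid being divisible by both.

```python
from math import gcd


def min_v(cnt1: int, cnt2: int, x: int, y: int) -> int:
    """Smallest v such that 1..v can serve both friends (illustrative sketch)."""
    both = x * y // gcd(x, y)  # numbers divisible by both x and y suit nobody

    def feasible(v: int) -> bool:
        enough_for_first = v - v // x >= cnt1          # not divisible by x
        enough_for_second = v - v // y >= cnt2         # not divisible by y
        enough_overall = v - v // both >= cnt1 + cnt2  # usable by at least one friend
        return enough_for_first and enough_for_second and enough_overall

    # Since x, y >= 2, at least half of 1..v is usable for each friend,
    # so v = 2 * (cnt1 + cnt2) is always feasible and bounds the search.
    lo, hi = 1, 2 * (cnt1 + cnt2)
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


print(min_v(3, 1, 2, 3))  # 5
```

Feasibility is monotone in v because each of the three counts never decreases as v grows, which is what makes binary search applicable.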

Figure tab_20: 
Type: table
Caption: - Do not bring in imports other than what's already imported. Use the pre-declared imports in the original function only.
- The callee may have multiple arguments, treat them with care.
- You **must** respect the function signature and docstring, and be aware so you don't generate illegal inputs.
- Keep the inputs & outputs simple but general, and that either edge cases or common cases are meaningful.
Data: Extract tests for the following function `{cur_func_name}(...)` in Python.
```python
{prev_code}
```
Store your function calls for `{cur_func_name}(...)` as function calls, one per line. They will be called later.

C.5 Prompt for Self-Test Generation

You are a proficient software engineer and architect, specialized in testing, capable of observing and finding security holes and logical issues in code that spans across multiple computer science fields and mathematics. You are given a piece of Python function, and are expected to produce some test cases for that function. Specifically:
- You should invoke the function and assert its results in a one-liner fashion.
- Here follows a series of mutually uncorrelated functions to test, one per conversation.

<User>:
Let's think step by step and create some tests for the following function `lcm(...)` in Python.
```python
def lcm(a: int, b: int) -> int:
    """Find the least common multiple of `a` and `b`. Samples:
    >>> lcm(3, 5)
    15
    >>> lcm(4, 6)
    12
    """
    return round(a * b / gcd(a, b))
```
Store your test cases for `lcm(...)` as assertions, one per line. They will be called later.

<Assistant>:
Sure, I can create some test cases for the `check_valid_brackets` function. We consider the following cases: 1. the two operands are not co-prime and has common factors; 2. the two operands are equal; 3. one of them is 1; 4. two of them is 1; 5. both operands are primes.
Here is an example of these test cases in Python:
```python
assert lcm(15, 25) == 75
assert lcm(32, 32) == 32
assert lcm(1, 5) == 5
assert lcm(1, 1) == 1
assert lcm(17, 19) == 17 * 19
```

<User>:
Let's think step by step and create some tests for the following function `{cur_func_name}(...)` in Python.
```python
{prev_code}
```
Store your test cases for `{cur_func_name}(...)` as assertions, one per line. They will be called later.

```python
{prev_code}
```
Let's think step by step and implement the following method `{cur_func_name}` using existing functions to solve: "{cur_func_doc}"
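The `lcm` few-shot example from the self-test prompt can be checked directly. The sketch below assumes that the `gcd` used inside the prompt is `math.gcd`, which the prompt itself does not state, and simply runs the five assertions from the example answer against the example function.

```python
from math import gcd  # assumption: the prompt's `gcd` is math.gcd


def lcm(a: int, b: int) -> int:
    """Find the least common multiple of `a` and `b`, as in the prompt example."""
    return round(a * b / gcd(a, b))


# Self-tests in the one-assertion-per-line form that the prompt asks for.
assert lcm(15, 25) == 75        # operands share a common factor
assert lcm(32, 32) == 32        # operands are equal
assert lcm(1, 5) == 5           # one operand is 1
assert lcm(1, 1) == 1           # both operands are 1
assert lcm(17, 19) == 17 * 19   # both operands are prime
print("all self-tests passed")
```

Each assertion exercises one of the five cases that the example answer enumerates, which is the shape of output the prompt is steering the model toward.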

Figure tab_21: 
Type: table
Caption: • If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We include the models and compute resources in Appendix A.1.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
• The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: I have checked the Code Of Ethics.
Guidelines:
• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
Data: 


Formulas:
Formula formula_0: for $f_i \in \{f_i\}$ do

Formula formula_1: $f^*_i \leftarrow \text{FUNCODER}(f_i)$ ▷ recursion

Formula formula_2: $f^*_{\text{cur}} \leftarrow F(f'_{\text{cur}}, \{f^*_1, f^*_2, \dots\}) \mid f^*_i \in \text{CHILD}(f_{\text{cur}})$

Formula formula_3: $\mathrm{sim}(f, g) = \frac{\sum_{x \in D(f)} \mathbb{1}[f(x) = g(x)]}{|D(f)|} \approx \frac{\sum_{x \in X \mid X \sim D(f)} \mathbb{1}[f(x) = g(x)]}{|X|} \quad (1)$

Formula formula_4: $\forall x \in D(f) : f(x) = g(x) \Leftrightarrow \mathrm{sim}(f, g) = 1.$

Formula formula_5: $f^* = \text{FUNCONSENSUS}(F) = \arg\max_{f^{(i)} \in F} \sum_{f^{(j)} \in F \setminus \{f^{(i)}\}} \mathrm{sim}(f^{(i)}, f^{(j)}) \quad (2)$
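A minimal sketch of Equations (1) and (2) in Python follows. It is an illustration under stated assumptions, not the authors' implementation: the helper names `observe`, `sim` and `fun_consensus` are hypothetical, the sampled inputs X are passed in explicitly as argument tuples, and treating a raised exception as an observable outcome is a choice made here for robustness.

```python
from typing import Any, Callable, Sequence, Tuple

Args = Tuple[Any, ...]


def observe(f: Callable[..., Any], args: Args) -> Tuple[str, Any]:
    """Run f on one input; an exception counts as a comparable outcome (assumption)."""
    try:
        return ("ok", f(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def sim(f: Callable[..., Any], g: Callable[..., Any], inputs: Sequence[Args]) -> float:
    """Sampled estimate of Eq. (1): fraction of inputs on which f and g agree."""
    agree = sum(observe(f, x) == observe(g, x) for x in inputs)
    return agree / len(inputs)


def fun_consensus(candidates: Sequence[Callable[..., Any]], inputs: Sequence[Args]):
    """Eq. (2): return the candidate with the highest total similarity to the others."""
    def total_agreement(i: int) -> float:
        return sum(
            sim(candidates[i], candidates[j], inputs)
            for j in range(len(candidates))
            if j != i
        )

    best = max(range(len(candidates)), key=total_agreement)
    return candidates[best]


# Three hypothetical candidates for "absolute value"; the last one is buggy.
candidates = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,
]
inputs = [(-2,), (0,), (3,)]
chosen = fun_consensus(candidates, inputs)
print(chosen(-5))  # 5
```

In this toy run the two correct candidates agree with each other on every input, so each of them outscores the buggy one, which disagrees on the negative input. No expected outputs are needed, only agreement between candidates.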

Formula formula_6: $\neg(p, T) \land (p, S) \land (c, T)$

Formula formula_7:
$f'_{\text{cur}}, \{f_i\} \leftarrow \text{EXTRACT}(\text{LLM}(f_{\text{cur}}))$
for $f_i \in \{f_i\}$ do
    if $b_i$ is NOTIMPLEMENTED then
        $f^*_i \leftarrow \text{FUNCODER}(f_i)$ ▷ recursion
    end if
    $\text{ADDCHILD}(f_{\text{cur}}, f^*_i)$

Formula formula_8: $F_{\text{cur}} \leftarrow \text{SAMPLE}(\text{LLM}(f'_{\text{cur}}, \text{CHILD}(f_{\text{cur}})))$
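The algorithm fragments above (divide, recurse into unimplemented sub-functions, then re-sample the parent given its solved children) can be sketched as a short recursive procedure. Everything LLM-related is replaced by hypothetical stubs here: `divide`, `conquer` and `select` are placeholder callables, and the `Func` class, the `PLAN` table and the toy stubs exist only for illustration. This is a structural sketch, not the authors' code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Func:
    name: str
    body: Optional[str] = None  # None stands in for NOTIMPLEMENTED
    children: List["Func"] = field(default_factory=list)


# Hypothetical stand-ins for the LLM-backed steps.
Divide = Callable[[Func], Tuple[str, List[Func]]]  # EXTRACT(LLM(f_cur))
Conquer = Callable[[str, List[Func]], List[str]]   # SAMPLE(LLM(f'_cur, CHILD(f_cur)))
Select = Callable[[List[str]], str]                # e.g. functional consensus


def funcoder(cur: Func, divide: Divide, conquer: Conquer, select: Select) -> Func:
    draft, subs = divide(cur)               # draft of f_cur plus sub-function stubs
    for sub in subs:
        if sub.body is None:                # sub-function not implemented yet
            sub = funcoder(sub, divide, conquer, select)  # recursion
        cur.children.append(sub)            # ADDCHILD
    candidates = conquer(draft, cur.children)
    cur.body = select(candidates)           # pick one candidate implementation
    return cur


# Toy stubs: a fixed decomposition plan instead of a model.
PLAN = {"solve": ["parse", "count"], "parse": [], "count": []}


def toy_divide(f: Func) -> Tuple[str, List[Func]]:
    return f"draft of {f.name}", [Func(n) for n in PLAN[f.name]]


def toy_conquer(draft: str, children: List[Func]) -> List[str]:
    deps = ", ".join(c.name for c in children) or "nothing"
    return [f"{draft} using {deps}"]


def toy_select(candidates: List[str]) -> str:
    return candidates[0]


root = funcoder(Func("solve"), toy_divide, toy_conquer, toy_select)
print(root.body)  # draft of solve using parse, count
```

Children are completed before their parent is re-sampled, so the parent's candidates are generated with the finished sub-functions in view. Replacing `toy_select` with a behavior-based consensus over executable candidates would correspond to the selection step of Equation (2).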

Formula formula_9:
## Sample Input 1
```
3 1 2 3
```
## Sample Output 1
```
5
```
## Sample Input 2
```
1 3 2 3
```
## Sample
