
# LiveBench translation

To build CodeAlignBench, we sought coding tasks that require producing functionally correct code from natural language descriptions. A critical requirement for task selection was the absence of contamination from model pretraining corpora. Data leakage remains a major concern in LLM benchmarking, as exposure to evaluation data during pretraining can compromise the validity of benchmark results. To address this, we leveraged LiveBench tasks. LiveBench provides Python-based code generation tasks collected from competitive programming platforms such as LeetCode and AtCoder, where problems are frequently updated and are therefore less likely to appear in pretraining data.

Evaluating only Python would limit the applicability of CodeAlignBench and its relevance to diverse use cases. To assess how instructions vary across programming languages and how well models follow them, we extended LiveBench to support two additional languages: Java and JavaScript. This selection balances language popularity, diversity of programming paradigms, and practical relevance.

Extending these tasks to additional languages involves two core steps: (1) translating the task prompt, in particular its function signature, into the target language, and (2) developing language-specific code execution and evaluation frameworks. We describe each of these steps in detail below.

## Prompt translation

LiveBench tasks are presented in natural language and are generally language-agnostic. However, some competition platforms include a predefined function signature that models must adhere to. When available, this signature is included in the prompt provided to models. Function signatures vary across programming languages, and LiveBench does not natively provide multilingual versions. Because LiveBench is continuously updated with new tasks, a one-time manual translation of signatures is insufficient for maintaining comprehensive language coverage. To address this challenge, we developed an automated translation pipeline that converts function signatures into equivalent forms across all supported programming languages. The pipeline can also be extended to languages not covered in this work.

The translation pipeline begins by prompting an LLM to translate a given Python function signature into a target language. To check the correctness of these translations, we implemented an automated validator that parses both the original and translated signatures and enforces three constraints:

1. The function name must remain unchanged.
2. The number and names of the arguments must be preserved.
3. The argument types and return type must be valid in the target language.

If any of these constraints is violated, the validator prompts the LLM to revise its translation based on the identified issue. This iterative refinement process is designed to produce accurate and consistent function signature translations across languages. We evaluated the validator’s effectiveness on 78 tasks from the LiveBench 2024-11-25 release. Three developers with intermediate expertise in each language independently reviewed the translated signatures and evaluated their correctness. When annotators disagreed, two of the authors manually reviewed the translations and resolved the discrepancies through discussion. This process confirmed the correctness of all auto-translated signatures.
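The three validator constraints and the refinement loop described above can be sketched as follows for the Python-to-Java case. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the names `validate`, `translate_with_refinement`, and `JAVA_TYPES` are hypothetical, the regex-based parsing and the small type allow-list are simplifications, and the `translate` callable stands in for the LLM call.

```python
import re
from typing import Callable, List, Optional

# Illustrative allow-list (assumption); a real validator needs a fuller model of Java types.
JAVA_TYPES = {"int", "long", "double", "boolean", "char", "String", "Integer",
              "Long", "Double", "Boolean", "Character", "List", "Map"}
JAVA_MODIFIERS = {"public", "private", "protected", "static", "final"}


def split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside brackets (List[int], Map<K, V>)."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch in "[<(":
            depth += 1
        elif ch in "]>)":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def parse_python(sig: str):
    """Return (function name, argument names) for a Python `def` line, skipping `self`."""
    m = re.search(r"def\s+(\w+)\s*\((.*)\)", sig, re.S)
    if m is None:
        raise ValueError(f"not a Python signature: {sig!r}")
    names = [re.split(r"[:=]", p)[0].strip() for p in split_top_level(m.group(2))]
    return m.group(1), [n for n in names if n != "self"]


def parse_java(sig: str):
    """Return (name, argument names, argument types, return type), or None if unparseable."""
    m = re.match(r"(.*?)(\w+)\s*\((.*)\)", sig, re.S)
    if m is None:
        return None
    params = split_top_level(m.group(3))
    return_type = " ".join(t for t in m.group(1).split() if t not in JAVA_MODIFIERS)
    arg_names = [p.split()[-1] for p in params]
    arg_types = [p.rsplit(None, 1)[0] for p in params]
    return m.group(2), arg_names, arg_types, return_type


def validate(py_sig: str, java_sig: str) -> List[str]:
    """Return the violated constraints; an empty list means the translation passes."""
    py_name, py_args = parse_python(py_sig)
    parsed = parse_java(java_sig)
    if parsed is None:
        return ["translation could not be parsed as a Java method header"]
    name, arg_names, arg_types, return_type = parsed
    problems = []
    if name != py_name:  # constraint 1: function name unchanged
        problems.append(f"function name must stay {py_name!r}, got {name!r}")
    if arg_names != py_args:  # constraint 2: number and names of arguments preserved
        problems.append(f"arguments must be {py_args}, got {arg_names}")
    for type_str in arg_types + [return_type]:  # constraint 3: types valid in Java
        unknown = [t for t in re.findall(r"\w+", type_str) if t not in JAVA_TYPES]
        if unknown or not type_str:
            problems.append(f"invalid Java type {type_str!r}")
    return problems


def translate_with_refinement(
    py_sig: str,
    translate: Callable[[str, Optional[str]], str],  # stand-in for the LLM call
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask for a translation, feeding validator findings back until it passes."""
    feedback = None
    for _ in range(max_rounds):
        candidate = translate(py_sig, feedback)
        problems = validate(py_sig, candidate)
        if not problems:
            return candidate
        feedback = "; ".join(problems)
    return None  # no valid translation within the round budget
```

In this sketch, `translate` receives the Python signature plus the previous round's validator feedback (or `None` on the first round), so any LLM client can be plugged in behind that interface. The round limit is an assumption; the text above does not state how many revisions the pipeline allows.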

## Code evaluation framework

To evaluate the functional correctness of generated code, each code snippet is executed against a predefined set of test cases. To support this evaluation across multiple programming languages, we developed an evaluation framework that handles code execution consistently across languages.

For each supported language, the evaluation framework defines how test cases are formatted and run: input/output handling, function invocation syntax, and any required runtime dependencies or language-specific execution environments. The framework is modular and extensible, so a new language can be added with minimal effort by specifying language-specific components such as test runners and execution wrappers. To enable further extensions, we provide this framework, along with full implementations for all supported languages, in our supplementary materials, which will be made available online upon approval.
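One way to picture the modular design described above is a registry that maps each language to an execution wrapper and a run command. The sketch below is an assumption-laden illustration rather than the released framework: `LanguageSpec`, `REGISTRY`, `run_case`, and the wrapper functions are hypothetical names, the harness compares printed output only, and the Java and JavaScript entries assume that `java` (11 or later, for single-file source launch) and `node` are installed.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LanguageSpec:
    filename: str                    # file the generated harness is written to
    command: List[str]               # command that executes that file
    wrap: Callable[[str, str], str]  # (solution code, call expression) -> full program


def wrap_python(solution: str, call: str) -> str:
    return f"{solution}\nprint({call})\n"


def wrap_javascript(solution: str, call: str) -> str:
    return f"{solution}\nconsole.log({call});\n"


def wrap_java(solution: str, call: str) -> str:
    # Assumes the solution consists of static methods that fit inside class Main.
    return (
        "public class Main {\n"
        f"{solution}\n"
        "    public static void main(String[] args) {\n"
        f"        System.out.println({call});\n"
        "    }\n"
        "}\n"
    )


# Adding a language means adding one entry here (hypothetical registry).
REGISTRY: Dict[str, LanguageSpec] = {
    "python": LanguageSpec("main.py", [sys.executable, "main.py"], wrap_python),
    "javascript": LanguageSpec("main.js", ["node", "main.js"], wrap_javascript),
    "java": LanguageSpec("Main.java", ["java", "Main.java"], wrap_java),
}


def run_case(language: str, solution: str, call: str, expected: str,
             timeout: float = 10.0) -> bool:
    """Run one test case and compare the printed result with the expected string."""
    spec = REGISTRY[language]
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, spec.filename).write_text(spec.wrap(solution, call))
        try:
            result = subprocess.run(spec.command, cwd=workdir, capture_output=True,
                                    text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False  # treat a timeout as a failed test case
    return result.returncode == 0 and result.stdout.strip() == expected
```

For example, `run_case("python", "def add(a, b):\n    return a + b", "add(2, 3)", "5")` writes a small harness to a temporary directory, executes it, and checks the printed value. Comparing printed strings is a simplification; structured outputs such as lists would need a language-neutral serialization step in each wrapper.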
