Abstract: Large language models (LLMs) have demonstrated remarkable performance in code generation and evaluation tasks, particularly for Python, which dominates pre-training corpora. However, evaluating code in low-resource programming languages remains challenging due to limited data and suboptimal model alignment. In this paper, we propose CrossPyEval, a novel cross-language code evaluation framework that uses an LLM to translate code from other languages into Python, verifies consistency with an SMT solver, and then analyzes the translated code via abstract syntax trees before performing the final evaluation. Experiments on public benchmarks and our custom low-resource datasets demonstrate that CrossPyEval substantially boosts evaluation accuracy for non-Python languages, achieving up to an 8.83\% improvement, and significantly enhances alignment with human judgments, raising the Kendall correlation to as high as 0.689.
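The abstract outlines a three-stage pipeline: LLM translation to Python, SMT-based consistency checking, and AST analysis before evaluation. Below is a minimal sketch of such a pipeline, under stated assumptions: the prompt, the exact SMT encoding, and the AST features used by CrossPyEval are not given in the abstract, so the helper names (`translate_to_python`, `check_consistency`, `ast_features`) are hypothetical and the Z3 encoding is only one illustrative choice, not the paper's method.

```python
"""Illustrative CrossPyEval-style pipeline sketch (assumptions noted in comments)."""
import ast
import z3  # pip install z3-solver


def translate_to_python(source_code: str, source_lang: str) -> str:
    """Hypothetical stand-in for the LLM translation step.
    The paper uses an LLM; here we return a hard-coded translation
    of a trivial Lua function for demonstration."""
    return "def f(x):\n    return 2 * x + 1"


def check_consistency(reference_expr, translated_expr) -> bool:
    """Check that two symbolic expressions over the same input agree for all
    inputs by asking Z3 for a counterexample (one possible SMT encoding)."""
    solver = z3.Solver()
    solver.add(reference_expr != translated_expr)  # look for any disagreeing input
    return solver.check() == z3.unsat              # unsat => no counterexample exists


def ast_features(python_code: str) -> dict:
    """Collect simple structural features from the translated code's AST."""
    tree = ast.parse(python_code)
    nodes = list(ast.walk(tree))
    return {
        "num_functions": sum(isinstance(n, ast.FunctionDef) for n in nodes),
        "num_loops": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        "num_branches": sum(isinstance(n, ast.If) for n in nodes),
    }


if __name__ == "__main__":
    translated = translate_to_python("function f(x) return 2*x + 1 end", "lua")

    # Symbolic input shared by the source semantics and the translated code.
    # In a real pipeline both sides would be derived automatically; here they
    # are written out by hand for a straight-line arithmetic function.
    x = z3.Int("x")
    reference = 2 * x + 1   # intended semantics of the source function
    candidate = 2 * x + 1   # semantics of the translated Python body

    print("consistent:", check_consistency(reference, candidate))
    print("AST features:", ast_features(translated))
```

The sketch only handles straight-line arithmetic that can be written directly as Z3 expressions; the downstream evaluation step that combines the consistency verdict and AST features is not shown, since the abstract does not specify it.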
Supplementary Material: zip
Submission Number: 198