SUT: Active Defects Probing for Transcompiler Models

Mengnan Qi; Yufan Huang; Maoquan Wang; Yongqiang Yao; Zihan Liu; Bin Gu; Colin Clement; Neel Sundaresan

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Main

Submission Type: Regular Short Paper

Submission Track: Theme Track: Large Language Models and the Future of NLP

Submission Track 2: Language Modeling and Analysis of Language Models

Keywords: Program translation, LLM Evaluation, Unit Test, Syntax Error Analysis

TL;DR: We propose a challenging code translation evaluation dataset and provide supporting analysis tools that help developers find a model's weak points in translating syntax elements.

Abstract: Automatic program translation has enormous application value and has therefore attracted significant interest from AI researchers. However, we observe that current program translation models still make elementary syntax errors, particularly when the target language lacks syntax elements that exist in the source language. Metrics such as BLEU, CodeBLEU and computational accuracy may not expose these issues. In this paper, we introduce new metrics for programming language translation that address these basic syntax errors. We develop a novel active defects probing suite called Syntactic Unit Tests (SUT), which includes a highly interpretable evaluation harness for accuracy and test scoring. Experiments show that even powerful models like ChatGPT still make mistakes on these basic unit tests. Specifically, compared with a previous program translation evaluation dataset, ChatGPT's pass rate on our unit tests decreases by 26.15%. Furthermore, our evaluation harness reveals the syntactic elements on which these models are deficient.
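The abstract describes a harness that scores unit-test pass rates and attributes failures to specific syntax elements. The sketch below shows one way such per-element scoring could be aggregated. It is not the SUT implementation: the `UnitTestResult` record, the helper functions, and the syntax-element labels are hypothetical, and it assumes each unit test is tagged with the single syntax element it probes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitTestResult:
    """Outcome of one hypothetical syntactic unit test on a translated program."""
    test_id: str
    syntax_element: str  # illustrative label, e.g. "ternary operator" (assumption)
    passed: bool


def pass_rate(results):
    """Fraction of unit tests that passed (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def per_element_pass_rate(results):
    """Group results by the syntax element they probe and score each group."""
    groups = defaultdict(list)
    for r in results:
        groups[r.syntax_element].append(r)
    return {elem: pass_rate(rs) for elem, rs in groups.items()}


def weakest_elements(results, k=3):
    """Return the k syntax elements with the lowest pass rate, worst first."""
    rates = per_element_pass_rate(results)
    # Sort by pass rate, breaking ties alphabetically for a stable report.
    return sorted(rates.items(), key=lambda item: (item[1], item[0]))[:k]


if __name__ == "__main__":
    # Toy, made-up results for illustration only.
    results = [
        UnitTestResult("t1", "ternary operator", True),
        UnitTestResult("t2", "ternary operator", False),
        UnitTestResult("t3", "list comprehension", False),
        UnitTestResult("t4", "list comprehension", False),
        UnitTestResult("t5", "for loop", True),
    ]
    print(f"overall pass rate: {pass_rate(results):.2f}")
    for elem, rate in weakest_elements(results):
        print(f"{elem}: {rate:.2f}")
```

Grouping by syntax element, rather than reporting only an aggregate score, is what makes this kind of report interpretable: a low overall pass rate can be traced to the specific constructs a model fails to translate.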

Submission Number: 2807

Loading