Success is in the details: Test and Improve Sensitivity to Details of Code LLMs through Counterfactuals

ACL ARR 2025 February Submission 8502 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · License: CC BY 4.0
Abstract: Code sensitivity refers to the ability of Code LLMs to recognize and respond to changes in the details of problem descriptions, and it serves as an important dimension for evaluating Code LLMs. While current code benchmarks and instruction-tuning methods primarily focus on difficulty and diversity, sensitivity is overlooked. We first introduce the CTF-Code benchmark, constructed using counterfactual perturbations that minimize changes to the input while maximizing changes to the output. The evaluation shows that many models suffer a performance drop of more than 10\% on CTF-Code. To fully exploit sensitivity, we propose CTF-Instruct, an incremental instruction fine-tuning framework that extends existing data and uses a selection mechanism to cover the three dimensions of difficulty, diversity, and sensitivity. Experiments show that models fine-tuned with CTF-Instruct data achieve over a 2\% improvement on CTF-Code and a performance boost of more than 10\% on LiveCodeBench, validating the feasibility of enhancing the sensitivity of LLMs to improve performance.
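To make the counterfactual-perturbation idea concrete, here is a minimal, hypothetical Python sketch (the problem texts and reference solutions are illustrative assumptions, not items from the CTF-Code benchmark): a one-word edit to the description leaves the input format untouched but flips the expected behavior, so a model that ignores the detail fails the perturbed variant.

```python
# Illustrative sketch: a counterfactual perturbation makes a minimal edit to the
# problem description while maximizing the change in expected program behavior.
# The problems and solutions below are hypothetical, not taken from CTF-Code.

original_problem = "Given a list of integers, return the sum of all even numbers."
counterfactual_problem = "Given a list of integers, return the sum of all odd numbers."

def solve_original(nums):
    # Reference solution for the original description.
    return sum(x for x in nums if x % 2 == 0)

def solve_counterfactual(nums):
    # Reference solution after the single-word perturbation.
    return sum(x for x in nums if x % 2 == 1)

if __name__ == "__main__":
    test_input = [1, 2, 3, 4, 5]
    # The one-word change in the description flips the expected output.
    print(solve_original(test_input))        # 6
    print(solve_counterfactual(test_input))  # 9
```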
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmark, resources, evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: Python
Submission Number: 8502