Abstract: As AI code-assistant tools become widespread, automatic assessment of the correctness of generated code becomes a significant challenge. Code-generating LLMs are prone to hallucinations, which may lead to code that does not solve the required problem or even contains severe security vulnerabilities.
In this paper, we propose a new approach to assessing code correctness. Our solution is based on topological data analysis (TDA) of the attention maps of code LLMs.
We carry out experiments with two benchmarks, HumanEval and MBPP, and five code LLMs: StarCoder2-7B, CodeLlama-7B, DeepSeek-Coder-6.7B, Qwen2.5-Coder-7B, and Magicoder-S-DS-6.7B.
Experimental results show that the proposed method outperforms several baselines. Moreover, the trained classifiers transfer between coding benchmarks.
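To make the abstract's pipeline concrete, the following is a minimal sketch of one way to derive topological features from an attention map and feed them to a correctness classifier. The model name, the choice of layer and head, the distance construction, and the barcode statistics are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: persistence-barcode statistics from one attention head.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ripser import ripser

MODEL = "bigcode/starcoder2-7b"  # any of the five code LLMs could be substituted
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def attention_persistence_features(code: str, layer: int = -1, head: int = 0):
    """Compute simple TDA features (assumed statistics) from one attention map."""
    inputs = tok(code, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # out.attentions is a tuple of (batch, heads, seq, seq) tensors, one per layer
    attn = out.attentions[layer][0, head].float().numpy()
    # Symmetrize the attention weights and turn them into a distance matrix
    sym = (attn + attn.T) / 2.0
    dist = 1.0 - sym / (sym.max() + 1e-12)
    np.fill_diagonal(dist, 0.0)
    # Persistent homology (H0, H1) of the induced metric space
    dgms = ripser(dist, distance_matrix=True, maxdim=1)["dgms"]
    feats = []
    for dgm in dgms:
        finite = dgm[np.isfinite(dgm[:, 1])]
        lifetimes = finite[:, 1] - finite[:, 0] if len(finite) else np.zeros(1)
        feats += [lifetimes.sum(), lifetimes.max(), lifetimes.mean(), len(finite)]
    return np.array(feats)

# Such features would then be used to train a binary classifier (e.g. gradient
# boosting) predicting whether the generated code passes the benchmark's tests.
```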
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: NLP Applications, Generation, Generalization of NLP Models
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: Python
Submission Number: 7851