An Empirical Study of Code Clones from Commercial AI Code Generators

Weibin Wu, Haoxuan Hu, Zhaoji Fan, Yitong Qiao, Yizhan Huang, Yichen Li, Zibin Zheng, Michael R. Lyu

Published: 01 Jan 2025, Last Modified: 03 Sept 2025Proc. ACM Softw. Eng. 2025EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Deep learning (DL) has revolutionized various software engineering tasks. Particularly, the emergence of AI code generators has pushed the boundaries of automatic programming to synthesize entire programs based on user-defined specifications in natural language. However, it remains a mystery if these AI code generators rely on the copy-and-paste programming practice, resulting in code clone concerns. In this work, to comprehensively study the code cloning behavior of AI code generators, we conduct an empirical study on three state-of-the-art commercial AI code generators to investigate the existence of all types of clones, which remains underexplored. Our experimental results show that the total Type-1 and Type-2 clone rates of the state-of-the-art commercial AI code generators can reach up to 7.50%, indicating marked code clone issues. Furthermore, it is observed that AI code generators risk infringing copyrights and propagating buggy and vulnerable code resulting from cloning code and show a certain degree of stability in generating code clones.

External IDs:dblp:journals/pacmse/WuHFQHLZL25