How Can ChatGPT Support Human Security Testers to Help Mitigate Supply Chain Attacks?

Ying Zhang, Wenjia Song, Zhengjie Ji, Danfeng Yao, Na Meng

Published: 2026, Last Modified: 26 May 2026. IEEE Trans. Software Eng. 2026. CC BY-SA 4.0

Abstract: Developers often build software on top of third-party libraries (Libs) to improve programmer productivity and software quality. These libraries may contain vulnerabilities that hackers can exploit to attack the applications (Apps) built on top of them. Such attacks are known as software supply chain attacks, and their documented number has increased by 600% since 2021. Researchers and developers have created tools to mitigate such attacks by scanning the library dependencies of Apps, identifying the usage of vulnerable library versions, and suggesting secure alternatives to vulnerable dependencies. However, recent studies show that many developers do not trust the reports produced by these tools; they need code or evidence demonstrating how library vulnerabilities lead to security exploits in order to assess vulnerability severity and the necessity of modification. Unfortunately, manually crafting demos of application-specific attacks is challenging and time-consuming, and there is insufficient tool support to automate that procedure. To help developers enhance software security, in this study we systematically explored the use of a large language model (LLM), ChatGPT-4.0, to generate security tests, i.e., unit tests that demonstrate how vulnerable library dependencies facilitate supply chain attacks on given Apps. In our exploration, we defined prompt templates that take in the various pieces of vulnerability-relevant information we manually collected, and generated prompts from those templates to query ChatGPT for security test generation. We found that ChatGPT-generated tests provided 24 pieces of evidence or proof of vulnerability for 49 Apps. To assess the consistency of test generation, we also evaluated another five state-of-the-art LLMs. All the models generated security tests that successfully demonstrate the vulnerabilities in at least 17 cases. We filed six reports for the newly revealed vulnerabilities in Apps, and four Common Vulnerabilities and Exposures (CVE) entries were assigned. Our use of ChatGPT outperformed two state-of-the-art security test generators (TRANSFER and SIEGE) by generating many more tests and demonstrating more attacks. Our research will shed light on new research in security test generation.
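
As a rough illustration of the prompt-template step described in the abstract, the sketch below fills a template with vulnerability-relevant information and produces a prompt string that could be sent to an LLM. It is only a minimal sketch under stated assumptions: the template wording, the field names (`app_name`, `lib_name`, `lib_version`, `vuln_id`, `vuln_description`, `vulnerable_api`), and the example values are all hypothetical and are not the templates or data used in the paper.

```python
from dataclasses import asdict, dataclass
from string import Template


@dataclass
class VulnContext:
    """Hypothetical container for manually collected vulnerability information."""
    app_name: str
    lib_name: str
    lib_version: str
    vuln_id: str
    vuln_description: str
    vulnerable_api: str


# Hypothetical template text; the paper's actual templates are not shown in the abstract.
PROMPT_TEMPLATE = Template(
    "The application $app_name depends on $lib_name version $lib_version, "
    "which is affected by $vuln_id: $vuln_description\n"
    "The application calls the vulnerable API $vulnerable_api.\n"
    "Write a unit test for $app_name that demonstrates how this "
    "vulnerability can be exploited through the application's own code."
)


def build_prompt(ctx: VulnContext) -> str:
    """Fill every placeholder in the template with the fields of ctx."""
    return PROMPT_TEMPLATE.substitute(asdict(ctx))


# Placeholder values for illustration only; they do not refer to a real App or CVE.
example = VulnContext(
    app_name="demo-app",
    lib_name="example-lib",
    lib_version="1.2.3",
    vuln_id="CVE-0000-0000",
    vuln_description="unsafe deserialization of untrusted input.",
    vulnerable_api="ExampleParser.parse",
)
prompt = build_prompt(example)
```

The resulting `prompt` string would then be passed to whichever LLM client is in use; that call is deliberately left out here because the abstract does not specify how the model was queried.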

External IDs: dblp:journals/tse/ZhangSJYM26