Phishing Kits Source Code Similarity Distribution: A Case Study

Ettore Merlo; Mathieu Margier; Guy-Vincent Jourdan; Iosif-Viorel Onut

Published: 01 Jan 2022, Last Modified: 12 May 2025. Venue: SANER 2022. License: CC BY-SA 4.0

Abstract: Attackers (“phishers”) typically deploy source code on a host website to impersonate a brand or, more generally, to stage a situation in which a user is expected to provide personal information of interest to phishers (e.g., credentials or a credit card number). Phishing kits are ready-to-deploy sets of files that can simply be copied onto a web server and used almost as they are. In this paper, we consider the static similarity analysis of the source code of 20,871 phishing kits, totalling over 182 million lines of PHP, JavaScript and HTML code, that were collected during phishing attacks and recovered by forensics teams. The reported experimental results show that as many as 90% of the analyzed kits share 90% or more of their source code with at least one other kit. Differences are small: in 40% of cases they amount to fewer than about 1000 programming words (identifiers, constants, strings and so on). A plausible lineage of phishing kits is presented by connecting kits with the highest similarity. The results show that the reconstructed lineage of phishing kits is very different from that of a publicly available application such as WordPress. The observed distribution of kit similarity is consistent with the hypothesis that kits often propagate as identical or near-identical copies with low-cost changes. The proposed approach may help classify new incoming phishing kits as “near-copies” of, or “intellectual leaps” from, known and already encountered kits. This could facilitate the identification and classification of new kits as derived from older known kits.
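
To make the idea of connecting each kit to its most similar neighbour concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not specify the similarity measure or the tokenizer, so this sketch substitutes a multiset Jaccard score over a crude regex tokenization of "programming words". The names `tokenize`, `similarity`, `token_difference` and `nearest_neighbours` are hypothetical and introduced only for illustration.

```python
from __future__ import annotations

import re
from collections import Counter

# Assumption: a crude stand-in for "programming words". Identifiers
# (including PHP-style $names), integer literals, and any other single
# non-space character each count as one token.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|\S")


def tokenize(source: str) -> Counter:
    """Return the multiset (bag) of tokens found in a source string."""
    return Counter(TOKEN_RE.findall(source))


def similarity(a: Counter, b: Counter) -> float:
    """Multiset Jaccard: shared tokens divided by the union of tokens."""
    shared = sum((a & b).values())  # per-token minimum count
    total = sum((a | b).values())   # per-token maximum count
    return shared / total if total else 1.0


def token_difference(a: Counter, b: Counter) -> int:
    """Number of tokens present in one bag but not matched in the other."""
    return sum(((a - b) + (b - a)).values())


def nearest_neighbours(kits: dict[str, str]) -> dict[str, tuple[str, float]]:
    """Link every kit to the other kit with which it is most similar."""
    bags = {name: tokenize(src) for name, src in kits.items()}
    links: dict[str, tuple[str, float]] = {}
    for name, bag in bags.items():
        scored = [(similarity(bag, bags[o]), o) for o in bags if o != name]
        if scored:
            score, best = max(scored)
            links[name] = (best, score)
    return links


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    kits = {
        "kit_a": "<?php $to = 'a@example.com'; mail($to, 'login', $_POST['user']); ?>",
        "kit_b": "<?php $to = 'b@example.com'; mail($to, 'login', $_POST['user']); ?>",
        "kit_c": "<html><body><h1>Hello</h1></body></html>",
    }
    for name, (best, score) in nearest_neighbours(kits).items():
        print(f"{name} -> {best} (similarity {score:.2f})")
```

In this toy setting, two kits that differ only in the drop address link to each other, while an unrelated page attaches with a much lower score. A bag-of-tokens comparison ignores token order, so it is only a rough proxy for the kind of source-level similarity the study reports.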
