Efficient Phishing Website Detection via HTML Tag Sequence Analysis Using Encoder Models

Jemin Ahn, Zuobin Xiong, Homook Cho, Kyungtae Kang, Junggab Son

Published: 2025, Last Modified: 07 Nov 2025 | ICCCN 2025 | CC BY-SA 4.0

Abstract: The rapid proliferation of Internet of Things (IoT) devices has led to a significant increase in the number of network users, prompting advancements in security mechanisms. Consequently, traditional attacks targeting specific vulnerabilities have become less effective due to these enhanced defense systems, leading attackers to increasingly adopt phishing strategies as a primary means of bypassing security measures. Among these, phishing websites have been increasing rapidly, exploiting the carelessness of countless users. In response, numerous phishing website detection methods have been investigated, with machine learning-based approaches emerging as a leading strategy. However, these machine learning-based classification methods require substantial computational resources, posing challenges for their direct application in the already widespread IoT environment. To address these challenges, we propose an efficient phishing website detection method based on HTML tag sequences, the core structural elements of websites, by leveraging encoder models known for their effectiveness in classifying sequential data. Our approach also incorporates a customized tokenizer and dictionary specifically tailored for HTML tags. Experiments conducted on publicly available datasets demonstrate that the proposed method achieves over 95% accuracy across key performance metrics. Furthermore, comparative analyses highlight several advantages of our method, including reduced model size and faster detection times compared to existing approaches.
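
As a rough illustration of the kind of input the abstract describes, the sketch below turns a page into a sequence of HTML tag names and maps each tag to an integer ID through a small tag dictionary, which is the form an encoder model would consume. The abstract does not specify the authors' tokenizer, dictionary, or model, so everything here is an assumption made for illustration: the vocabulary `TAG_VOCAB`, the special tokens `[PAD]` and `[UNK]`, the helper names `extract_tag_sequence` and `encode_tags`, and the choice to record opening tags only.

```python
from html.parser import HTMLParser

# Hypothetical special tokens and tag dictionary (not taken from the paper).
PAD, UNK = "[PAD]", "[UNK]"
TAG_VOCAB = [PAD, UNK, "html", "head", "title", "body", "div",
             "form", "input", "a", "script", "img"]
TAG_TO_ID = {tag: idx for idx, tag in enumerate(TAG_VOCAB)}


class TagSequenceParser(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attributes and text are ignored.
        self.tags.append(tag)


def extract_tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def encode_tags(tags, max_len=16):
    """Map tags to IDs, truncating or padding to a fixed length."""
    ids = [TAG_TO_ID.get(tag, TAG_TO_ID[UNK]) for tag in tags[:max_len]]
    ids += [TAG_TO_ID[PAD]] * (max_len - len(ids))
    return ids


if __name__ == "__main__":
    page = "<html><body><form><input name='u'><blink>x</blink></form></body></html>"
    tags = extract_tag_sequence(page)
    print(tags)
    print(encode_tags(tags, max_len=8))
```

Tags missing from the dictionary (here `blink`) fall back to the `[UNK]` ID, and short sequences are padded to a fixed length. Working with tag names rather than full page text keeps the vocabulary small, which is in line with the abstract's emphasis on reduced model size, although the actual design may differ.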

External IDs: dblp:conf/icccn/AhnXCKS25