Towards Efficient GNN-Based Phishing Detection from HTML Source Code

Warre Hofmans; Wei Wei; Simon Vanneste; Kevin Mets

Published: 15 Oct 2025, Last Modified: 31 Oct 2025

Venue: BNAIC/BeNeLearn 2025 Poster

License: CC BY 4.0

Track: Type A (Regular Papers)

Keywords: Phishing detection, GNN

Abstract: Phishing websites are a common cyber fraud strategy that deceives users into disclosing personal or sensitive information by impersonating legitimate websites. Such attacks often take a long time to identify and incur high costs. Phishing has been a cyberthreat for a long time, but with the introduction of generative AI tools it is becoming more frequent, more sophisticated, and more accessible. Although previous research has achieved great success in detecting phishing websites, most earlier techniques are becoming obsolete as the quality of phishing pages increases. This paper introduces a robust approach based on graph neural networks (GNNs) that detects phishing pages by identifying irregularities in their HTML source code, such as poor semantics, the presence of malicious code, or the use of phishing kits. An HTML-reduction algorithm is introduced to a) reduce structural noise and b) lower computational costs. Combining a simple node feature extraction process with the reduction algorithm yields a computationally efficient model that achieves 95.57% F1. The HTML DOM tree-based approach was additionally validated by a) an in-depth dataset analysis, which shows a clear difference between benign and phishing source code, and b) traditional machine learning models (Random Forest and XGBoost), which achieve up to 96.00% F1 using manually extracted graph features.
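
To make the DOM tree-based idea concrete, the following is a minimal sketch of how an HTML page could be turned into a graph and summarized with a few hand-picked graph features. It is an illustration only and not the authors' pipeline: the class `DomGraphBuilder`, the functions `html_to_graph` and `graph_features`, and the chosen features are all assumptions made for this example, and the paper's HTML-reduction algorithm is not reproduced here.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements have no closing tag, so they never become a parent node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Hypothetical helper: collects DOM elements as nodes and parent-child edges."""

    def __init__(self):
        super().__init__()
        self.tags = []     # node id -> tag name
        self.depths = []   # node id -> depth in the tree (root = 0)
        self.edges = []    # (parent id, child id) pairs
        self._stack = []   # ids of currently open elements

    def handle_starttag(self, tag, attrs):
        node_id = len(self.tags)
        self.tags.append(tag)
        self.depths.append(len(self._stack))
        if self._stack:
            self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.tags[self._stack[i]] == tag:
                del self._stack[i:]
                break


def html_to_graph(html):
    """Parse an HTML string into (tags, edges, depths)."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.tags, builder.edges, builder.depths


def graph_features(tags, edges, depths):
    """Assumed example features; the paper's actual feature set may differ."""
    child_counts = Counter(parent for parent, _ in edges)
    return {
        "num_nodes": len(tags),
        "num_edges": len(edges),
        "max_depth": max(depths, default=0),
        "max_children": max(child_counts.values(), default=0),
        "num_scripts": tags.count("script"),
        "num_forms": tags.count("form"),
    }


page = "<html><body><form><input><input></form><script>x</script></body></html>"
features = graph_features(*html_to_graph(page))
```

A feature dictionary like `features` could be fed to a tabular classifier, while the node list and edge list could serve as input to a GNN library; both uses are assumptions about how such a sketch might be extended.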

Submission Number: 42