Keywords: Phishing detection; LLM agent; Multimodal; Cybersecurity
Abstract: Phishing website detection traditionally relies on static heuristics or few-shot classifiers, which struggle to adapt to rapidly evolving attack patterns. Recent systems incorporate large language models (LLMs) but still rely on prompt-based, deterministic pipelines that under-utilize LLM reasoning. In this work, we introduce MemoPhishAgent, the first memory-augmented multi-modal LLM agent framework that dynamically orchestrates five specialized tools to gather the evidence needed for phishing detection. Central to our design is an episodic memory system that captures past reasoning trajectories and final judgments and supports three retrieval modes: (1) majority vote for instant, high-confidence decisions, (2) in-context exemplars for guided LLM prompting, and (3) full ReAct for novel threats. Crucially, we evaluate under realistic conditions on two public benchmark datasets. Experimental results show that MemoPhishAgent outperforms state-of-the-art (SOTA) baselines across four metrics, achieving significantly higher recall while keeping latency manageable. Analysis of the memory design demonstrates that episodic memory boosts recall by over 20% while reducing computational overhead. An ablation study that compares MemoPhishAgent to two simplified variants further validates the necessity of the agent-based approach. Together, our results show that combining multi-modal reasoning with episodic memory yields robust, adaptable phishing detection in realistic user-exposure settings.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15531
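
The three retrieval modes named in the abstract can be illustrated with a minimal routing sketch. This is not the authors' implementation: the `Episode` record, the `route` function, the cosine-similarity lookup, and the `k`, `high`, and `low` thresholds are all hypothetical choices made for illustration, since the abstract does not say how memory is indexed or how a mode is selected.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt


@dataclass
class Episode:
    """One past case in episodic memory (hypothetical schema)."""
    embedding: list[float]  # assumed feature vector for a previously seen page
    trajectory: str         # stored reasoning trace
    verdict: str            # final judgment, e.g. "phishing" or "benign"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query, memory, k=3, high=0.9, low=0.6):
    """Pick a retrieval mode for `query`; thresholds are illustrative only."""
    # Keep the k most similar past episodes.
    scored = sorted(
        ((cosine(query, e.embedding), e) for e in memory),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Mode 1: k very close neighbours with a strict majority -> reuse their verdict.
    strong = [e for score, e in scored if score >= high]
    if k > 0 and len(strong) >= k:
        verdict, count = Counter(e.verdict for e in strong).most_common(1)[0]
        if count > len(strong) / 2:
            return ("majority_vote", verdict)

    # Mode 2: moderately similar episodes -> pass them to the LLM as exemplars.
    related = [e for score, e in scored if score >= low]
    if related:
        return ("in_context", related)

    # Mode 3: nothing similar in memory -> fall back to the full ReAct tool loop.
    return ("react", [])
```

In this sketch, a query that lands next to several agreeing past cases is answered without calling the LLM, a loosely related query receives stored trajectories as exemplars, and an unfamiliar query triggers the full tool-using loop. That ordering is one plausible way episodic memory could reduce computational overhead as the abstract reports.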