WatchLog: From a Glimpse to Decision—Rapid Event Reasoning in Endpoint Detection and Response Logs with Multimodal LLMs

20 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: Endpoint Detection and Response; Interpretable Reasoning; Cyberattack Detection; Video-Language Modeling; Log Analysis
Abstract: Endpoint Detection and Response (EDR) systems are essential for detecting malicious activity on endpoint devices, yet existing approaches struggle to process ultra-long log sequences efficiently and to provide interpretable reasoning for security analysts. This paper presents WatchLog, a framework that models raw logs as video-structured representations to enable efficient video-language modeling of endpoint behaviors. Specifically, each event is encoded into a key-value guided image, and the resulting images are temporally arranged into a video-structured sequence. A temporal cross-attention mechanism then performs pixel-wise temporal aggregation, producing compact sequence embeddings that preserve behavioral fidelity while reducing computational cost. We conduct two-stage pre-training followed by supervised fine-tuning to generate behavioral explanations grounded in the semantics of event sequences and final judgments. Experiments on our newly constructed EDR8M-20R dataset demonstrate that WatchLog achieves higher detection accuracy and recall than state-of-the-art baselines, while also generating reliable reasoning traces and enabling more efficient inference. Furthermore, our real-world application of WatchLog has validated its efficiency, effectiveness, and strong generalization capabilities.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 22954
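
The pipeline the abstract outlines (events rendered as key-value images, images stacked in time, then aggregated per pixel across time) can be illustrated with a toy sketch. This is not the authors' implementation: the event fields, the byte-to-intensity rendering in `encode_event`, and the fixed scalar `query` in `temporal_aggregate` are all assumptions made for illustration, and the real model presumably uses learned encoders and learned attention parameters.

```python
import math

def encode_event(event, keys, width=8):
    """Hypothetical rendering: one row per key, value bytes as intensities in [0, 1]."""
    image = []
    for key in keys:
        raw = str(event.get(key, "")).encode("utf-8")[:width]
        # Pad short values with zeros so every row has the same width.
        row = [b / 255.0 for b in raw] + [0.0] * (width - len(raw))
        image.append(row)
    return image

def temporal_aggregate(frames, query=1.0):
    """Collapse T frames into one frame using softmax attention over time at each pixel.

    Assumption: a single scalar query stands in for learned attention weights.
    """
    height, width = len(frames[0]), len(frames[0][0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            values = [frame[i][j] for frame in frames]
            scores = [query * v for v in values]
            peak = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            out[i][j] = sum(w * v for w, v in zip(weights, values)) / total
    return out

# Illustrative events; the field names are assumptions, not taken from EDR8M-20R.
KEYS = ["process", "action", "target"]
events = [
    {"process": "cmd.exe", "action": "spawn", "target": "powershell.exe"},
    {"process": "powershell.exe", "action": "connect", "target": "10.0.0.5"},
]
video = [encode_event(e, KEYS) for e in events]  # T frames of shape len(KEYS) x width
summary = temporal_aggregate(video)              # one frame, same shape as each input
```

The point of the sketch is the shape change: a sequence of T event images is reduced to a single image-sized representation, which is the kind of compaction the abstract attributes to the temporal cross-attention step.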