The Case for Learned Provenance-based System Behavior Baseline

Yao Zhu; Zhenyuan LI; Yangyang Wei; Shouling Ji

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0

Abstract: Provenance graphs describe the data flows and causal dependencies of host activities, making it possible to track how data propagates and is manipulated throughout a system, and thereby provide a foundation for intrusion detection. However, Provenance-based Intrusion Detection Systems (PIDSes) face significant challenges in storage, representation, and analysis, which impede the efficacy of machine learning models such as Graph Neural Networks (GNNs) in processing and learning from these graphs. This paper presents a novel learning-based anomaly detection method designed to efficiently embed and analyze large-scale provenance graphs. Our approach integrates dynamic graph processing with adaptive encoding, producing compact embeddings that effectively handle out-of-vocabulary (OOV) elements and adapt to normality shifts in dynamic real-world environments. We then incorporate this refined baseline into a tag-propagation framework for real-time detection. Our evaluation demonstrates the method's accuracy and adaptability in anomaly path mining, significantly advancing the state of the art in handling and analyzing provenance graphs for anomaly detection.

Lay Summary: Computer systems handle large amounts of information every day. A provenance graph acts like a timeline that records what happened, who did it, and when it happened. When a hacker attacks a system, this graph can help us trace their actions and find out which users and data were affected. However, because the graph keeps track of even the smallest events in the system, it grows quickly and can become huge, which makes it hard to spot what the hacker did. We decompose system activities into individual events, such as one process reading a file. By checking how frequently each event occurs, we can tell whether it is normal or unusual. We also take advantage of out-of-vocabulary words (those not seen during training) to help us mine for threats. Using a lightweight detection model, our system monitors the growing provenance graph in real time and raises an alert as soon as it detects an anomalous path. Our method can recognize new types of system behavior and tell them apart from actual attacks. It runs efficiently, using only a small amount of computing resources, and can manage and analyze large-scale provenance graphs.
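The frequency idea in the lay summary can be illustrated with a minimal sketch. This is not the paper's learned embedding model: it is a plain counting stand-in, and the event format (subject, action, object), the function names, the add-one smoothing, and the mean-based path score are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumed event format (hypothetical): (subject process, action, object).

def fit_baseline(benign_events):
    """Count how often each event triple appears in benign training data."""
    return Counter(benign_events)

def rarity_score(event, counts):
    """Negative log of the add-one-smoothed relative frequency of an event.

    An event never seen in training (out-of-vocabulary) has count 0, so it
    receives the highest score any event can get under this baseline.
    """
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen events
    prob = (counts.get(event, 0) + 1) / (total + vocab)
    return -math.log(prob)

def path_score(path, counts):
    """Score a path (a list of events) as the mean rarity of its events."""
    return sum(rarity_score(e, counts) for e in path) / len(path)

# Toy benign log: one frequent event and one less frequent event.
common = ("bash", "read", "/etc/passwd")
rare = ("bash", "exec", "/usr/bin/ls")
unseen = ("bash", "write", "/tmp/x.sh")  # never appears in training

counts = fit_baseline([common] * 8 + [rare] * 2)

for name, event in [("common", common), ("rare", rare), ("unseen", unseen)]:
    print(name, round(rarity_score(event, counts), 3))
```

In this sketch, frequent events score low, infrequent events score higher, and unseen events score highest. A real system would also need to separate benign novelty from attacks, which the summary attributes to the learned model and which plain counting does not do.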

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Primary Area: Applications->Everything Else

Keywords: graph representation learning, provenance graph, cyber attack detection

Submission Number: 15752