% \documentclass{uai2025}
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions

% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}


\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{multirow}
\usepackage{makecell}
\usepackage{caption}

%\usepackage{marvosym}
\usepackage[capitalize,noabbrev]{cleveref}

% Todonotes is useful during development; simply uncomment the next line
%    and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
\usepackage[textsize=tiny]{todonotes}
\usepackage[misc]{ifsym}
\usepackage{ulem}
\usepackage{float}
\usepackage{microtype} 

%% Self-defined macros
% \newcommand{\swap}[3][-]{#3#1#2} % just an example
% \begin{document}
% \maketitle

\title{FALCON: Adaptive Cross-Domain APT Attack Investigation with Federated Causal Learning}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Jialu Tang}
\author[1,*]{\href{mailto:<gaoyali@bupt.edu.cn>?Subject=FALCON Adaptive Cross-Domain APT Attack Investigation with Federated Causal Learning}{Yali Gao}}
\author[1]{Xiaoyong Li}
\author[1]{Jiawei Li}
\author[2]{Shui Yu}
\author[3]{Binxing Fang}

% Add affiliations after the authors
\affil[1]{%
    School of Cyberspace Security\\
    Beijing University of Posts and Telecommunications\\
    Beijing, China
}
\affil[2]{%
    School of Computer Science\\
    University of Technology Sydney\\
    Ultimo, NSW, Australia
}
\affil[3]{%
    Cyberspace Institute of Advanced Technology\\
    Guangzhou University\\
    Guangzhou, China
}
\affil[*]{%
    Corresponding author\\
    email <gaoyali@bupt.edu.cn>
}
\begin{document}
\maketitle

\begin{abstract}
 With the extensive deployment and application of Internet of Things (IoT) devices, vulnerable edge nodes have emerged as primary targets for Advanced Persistent Threat (APT) attacks. Attackers compromise IoT terminal devices to establish an initial foothold and subsequently exploit lateral movement techniques to progressively infiltrate core business networks. Prior investigation methods struggle with fragmented threat intelligence and sparse attack samples in heterogeneous audit logs, resulting in incomplete attack chain reconstruction and high false positives. We propose a novel approach to APT attack investigation, FALCON, which captures complex causal relationships between entities from discrete audit logs and constructs cross-domain provenance graphs, enabling rapid and accurate identification of potential APT activities. FALCON trains an adaptive edge-side local model with cross-domain behavior sequences containing extensive and remote contextual information, and employs a bidirectional transformer pre-trained model to learn latent representations from unlabeled sequences. To the best of our knowledge, FALCON is the first APT investigation method to conduct causal provenance based on cross-domain audit logs while ensuring privacy protection. The experimental results demonstrate that FALCON effectively detects APT attacks with accuracy $99.71\%$ and reconstructs attack scenarios with accuracy $87.4\%$.
\end{abstract}

\section{Introduction}
\label{section:A}
As artificial intelligence reshapes the cybersecurity landscape, organizations globally are encountering a growing array of security and privacy challenges. The extensive deployment of Internet of Things (IoT) devices in critical infrastructure has significantly benefited sectors such as smart cities and industrial automation. Unfortunately, the inherent security vulnerabilities of these devices have made them prime targets for cyberattacks, and the number and variety of cyberattacks on the IoT are rapidly increasing \citep{al2021x}. According to the 2023 report by Palo Alto Networks \citep{palo}, $75\%$ of IoT devices have critical vulnerabilities, with each device experiencing an average of 5,200 attack attempts per week. According to the 2024 ENISA report \citep{enisa}, $76.29\%$ of enterprise networks were targeted by cybercriminals, and $40\%$ of supply chain attacks involved IoT devices.

Advanced Persistent Threat (APT) attacks are increasingly characterized by long-term stealth, cross-domain penetration, and multi-target outbreaks, posing significant risks to network infrastructure. APT groups employ sophisticated intrusion kill chains, coordinated attack campaigns, and bespoke tactics, techniques, and procedures (TTPs) \citep{sun2023cyber} to compromise networks and exfiltrate sensitive information assets. The MITRE ATT\&CK framework indicates that lateral movement and persistence are critical stages in APT attacks \citep{MIRTE}, enabling attackers to exploit vulnerabilities to traverse and penetrate networks continuously; for example, APT28 and APT29 have utilized zero-day exploits \citep{ACMCCS2022}. Additionally, attackers expand their influence through supply chain compromise \citep{supplychain} or third-party-associated attacks \citep{usenixThird}, such as SolarWinds \citep{hassija2020survey} and APT41's infiltration of the global manufacturing industry \citep{APT41}. Therefore, there is a critical need to adopt a proactive approach to investigate attacks and uncover latent threats.

To identify potential security risks and enhance APT investigation capabilities, collaboration between different organizations and departments is essential. However, it is challenging to pool raw audit logs from multiple parties due to privacy policies. Federated Learning (FL) is an emerging decentralized collaborative paradigm initially proposed by McMahan et al., aiming to address the challenges of data silos and privacy preservation \citep{mcmahan2017communication}. Collaborative analysis of multi-source threat intelligence based on the FL framework can not only reconstruct a complete APT attack chain without privacy disclosure, but also provide clear technical anchor points for tracing APT attacks.

%Lateral movement is a critical technique in APT attacks \citep{MIRTE}, enabling attackers to progressively expand their control by hopping and penetrating. 


\begin{figure}[!htb]
\centering
\includegraphics[width=\linewidth]{Figs/figure-1.png}
\caption{A typical IoT attack scenario. The vulnerability IDs and attack steps are described in red dashed boxes. Attackers gain unauthorized access and leverage the compromised IoT device as a foothold for lateral movement. They then bypass the firewalls and IDS by exploiting the vulnerabilities to infiltrate the server.}
\label{fig:fig1}
\end{figure}

A sophisticated example of an attack that compromises IoT devices is shown in Figure \ref{fig:fig1}. It includes several types of cross-domain attack scenarios: cross-terminal, cross-department, and cross-organization. Multi-source intelligence collaboration among different security domains enhances security detection and attack investigation. Prior APT attack investigation methods have three limitations: limited cross-domain analysis, insufficient awareness of potential threats, and privacy constraints. Data privacy concerns significantly restrict threat intelligence sharing, hindering the effectiveness of security operations \citep{FLforIIoT}. Inaccurate causal relationships between provenance graphs from different domains can result in fragmented attack scenarios. Furthermore, audit logs collected by terminal devices are frequently limited and homogeneous, leading to a scarcity of well-annotated attack samples and reduced accuracy in identifying unknown behavior.

To address these limitations, we propose \uline{F}ederated C\uline{A}usal Provenance \uline{L}earning for \uline{C}r\uline{O}ss-Domai\uline{N} Attacks (FALCON), which aims to explore cross-domain threats with limited data samples and reconstruct complete attack scenarios. The main contributions can be summarized as follows:

\begin{itemize}
%\setlength{\itemsep}{1pt}
%\setlength{\parsep}{0pt}
%\setlength{\parskip}{0pt}
  \item We introduce FALCON, a system designed to enable efficient collaborative intelligence analysis while preserving privacy. It mines fine-grained causal relationships from multi-source logs with provenance graphs to trace threat behavior.
  \item We propose a heuristic method with few-shot learning. The local model exploits pre-training tasks to learn more accurate semantic information and provides optional downstream training modules to accommodate both labeled and unlabeled samples.
  \item The experimental results show that FALCON significantly outperforms existing methods, achieving higher AUC values of 0.9625 and 0.9497 for alarm authenticity investigation and system event discovery, respectively. The system exhibits robust generalization on public datasets.
\end{itemize}

\section{Related Works}
\label{section:B}
\subsection{APT Attack Investigation}
In recent years, causality analysis through provenance graphs built from audit logs has been widely applied to attack detection \citep{zengy2022shadewatcher,cheng2023kairos}, attack investigation \citep{alsaheel2021atlas,ding2023airtag}, and attack scenario reconstruction. Holmes \citep{milajerdi2019holmes} and RapSheet \citep{hassan2020tactical} match TTP rules within provenance graphs, aiming to discover threat behaviors at both the technical and strategic levels. The coarse-grained nature of audit logs introduces the challenge of dependency explosion. MORSE \citep{hossain2020combating} addresses this challenge by introducing tag decay and tag propagation rules.

Graph summarization encapsulates the semantics of behaviors, enabling efficient and accurate attack investigation. DEPCOMM \citep{xu2022depcomm} identifies clusters as process-centric communities within large-scale provenance graphs. OmegaLog \citep{hassan2020omegalog} generates more concise attack provenance graphs with rich semantic information. The vast number of system events recorded in logs may reduce the accuracy of attack path tracing. DEPIMPACT \citep{fang2022back} assigns distinguishable dependency weights to edges to differentiate critical edges from non-critical ones. WATSON \citep{zeng2021watson} combines event semantics as representations of behaviors and reduces the analysis workload by two orders of magnitude.

Deep learning can build attack investigation models by learning the features of normal and attack behavior. ATLAS \citep{alsaheel2021atlas} constructs sequences that include rich system-level semantic information and trains a Long Short-Term Memory (LSTM) model \citep{memory2010long} on them. However, LSTM training is time-consuming and requires a large amount of high-quality labeled data. Log2Vec \citep{liu2019log2vec} constructs heterogeneous graphs with predefined rules and employs clustering to separate malicious behavior from benign behavior without leveraging GNNs. AIRTAG \citep{ding2023airtag} directly performs representation learning on audit logs; it relies solely on BERT, which limits its ability to capture rich context.

\subsection{FL for Security}
IoT devices are vulnerable to cyberattacks owing to their dispersed locations, limited computational resources, and handling of sensitive data. Wang et al. \citep{wang2023edgeguard} proposed a lightweight FL framework for real-time anomaly detection on resource-constrained IoT devices. Existing security technologies leveraging FL predominantly focus on traffic analysis \citep{rel-1-rodriguez2023survey,rel-3-salim2024fl} rather than log analysis. Prior work \citep{rrr-40-8884802, wang2024hierarchical} proposed autonomous self-learning distributed systems for detecting anomalies in IoT devices. DeepFed \citep{rrr-33-9195012,tan2022federated} applies federated deep learning to detect cyber threats against industrial cyber-physical systems.

Federated learning enables collaborative analysis of multi-source threat intelligence without sharing original log data. Some researchers developed a multimodal FL model \citep{bahadoripour2024explainable} that integrates logs, traffic, and sensor data to improve the accuracy of APT attack detection. Mimura et al. \citep{hu2023privacy} proposed a privacy-preserving few-shot traffic detection method, treating the APT detection task as a model generalization optimization process to identify unknown local samples. Saeed et al. \citep{saeed2020federated} proposed a self-supervised method based on the wavelet transform to learn models from scattered data, which performed well in both centralized and federated environments. Xiong et al. \citep{8979384} introduced a practical real-time design for detecting known and unknown APT attacks in real-world scenarios.
%Moustafa et al. \citep{moustafa2023explainable} provide a comprehensive review of artificial intelligence and security incidents for anomaly-based intrusion detection in IoT networks.

\section{Preliminaries}
\label{section: C}
This section provides formal definitions of the required preliminaries to facilitate understanding of the proposed methodology. The threat model is given in Appendix \ref{A}.

\textbf{Definition 1}: \textit{System Event}. \textit{System events} are formal representations of audit logs. A system event is defined as a quadruple $event=\langle sub, obj, oper, T_s\rangle$, where $sub$ and $obj$ denote the subject and the object, both of which are system entities. The entity type set of $sub$ is $\{Process\}$ and the entity type set of $obj$ is $\{Process, File, Socket\}$. $oper$ denotes the operation from a subject $sub$ to an object $obj$, which also denotes the causal relationship and information flow between the entities. $T_s$ represents the timestamp of the system event.

\textbf{Definition 2}: \textit{Provenance Graph}. A \textit{provenance graph} is generated from system events by linking entities with causal relationships, representing the behavior processes and information flows at the operating system level. Provenance graphs are labeled directed graphs, formalized as $G_p=(V_{entity}, E_{oper})$, where $V_{entity}$ is the set of entity nodes with attributes and $E_{oper}$ is the set of directed edges with labels. In a provenance graph, multiple edges may exist between two entities, representing operations performed at different times.

\textbf{Definition 3}: \textit{Behavior Sequence}. The \textit{behavior sequence} is introduced in this work to describe the interaction process of behavior instances at the system level. A behavior sequence is a temporally ordered chain of system events, represented as $Seq_{B}^l=\{event_1, event_2, \dots, event_l\}$. Behavior sequences contain extensive and distant contextual information about system events. This contextual information allows sequence learning models to learn the features and patterns of behavior sequences, leading to accurate classification.
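
For illustration, with hypothetical entity names that are not drawn from our datasets, a process that reads a file and then opens a network connection yields two system events,
\begin{align*}
  event_1 &= \langle \texttt{bash},\ \texttt{/etc/passwd},\ read,\ t_1\rangle,\\
  event_2 &= \langle \texttt{bash},\ \texttt{10.0.0.5:443},\ connect,\ t_2\rangle,
\end{align*}
with $t_1 < t_2$. Ordering them by timestamp gives the behavior sequence $Seq_{B}^{2}=\{event_1, event_2\}$ of length $l=2$.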

\textbf{Definition 4}: \textit{Attack Scenario}. An \textit{attack scenario} provides a comprehensive representation of the entire attack process, encompassing the entities relevant to the attack and the causal relationships between these entities. It is represented as $G_{as}=(V_{att\_entity}, E_{att\_oper})$. Compared with lengthy manual examination of audit logs to analyze the entry points and potential impact of an attack, attack scenarios enable security analysts to understand the complete attack process more intuitively.

\section{Model Architecture}
\label{section:D}
This section introduces the architecture of FALCON; the overall workflow in IoT systems is illustrated in Figure \ref{fig:fig2}. The model assumes a typical horizontal FL framework with personalized optimization, in which clients hold private datasets but share the same feature space.

\begin{figure}[!htb]
  \centering
  \includegraphics[width=\columnwidth]{Figs/figure-2.png}
  \caption{Overall architecture of FALCON. The local client module trains local models to detect APT attacks. The processes are provenance graph partitioning and optimization, behavior sequence construction, and optimal fine-tuning training. The server module aggregates and updates models and facilitates cross-domain APT attack investigation.}
  \label{fig:fig2}
\end{figure}

\subsection{FL structure for FALCON}
We leverage a typical horizontal FL paradigm to enable collaborative attack investigation, where clients hold private audit logs as local datasets but share the same feature space. As shown in Figure \ref{fig:fig2}, the overall framework comprises a local client module and a server module. Cross-terminal causal tracing and localized pre-training optimization substantially improve the model's capability for APT attack investigation. For local model updates, clients extract behavior sequences to capture causal relationships. Inspired by personalized FL \citep{tan2022towards}, we aim to learn shared data representations across clients while establishing a unique local output layer for each client. To keep the optimization objective consistent across clients, a proximal term is incorporated into the client loss function. The proximal term penalizes large deviations from the global model, thereby stabilizing training with heterogeneous logs. Client $i$ approximately minimizes the following objective, which safely accommodates variable amounts of local work:

\begin{equation}
  \min_{w}h_i(w;w^t)=F_i(w)+\frac{\mu}{2}\left \| w-w^t \right \| ^2,
\end{equation}

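where $F_i(\cdot)$ is the local objective function of client $i$, $w^t$ is the global model at round $t$, and $\mu$ is the coefficient of the proximal term. The central server collects the local model updates $w^{t+1}_i$ from the randomly chosen clients $i \in S_t$, then aggregates the $N$ selected clients by averaging their updates with equal weights $1/N$, which is calculated as follows: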

\begin{equation}
  w^{t+1} = \frac{1}{N}\sum_{i \in S_t} w^{t+1}_i.
\end{equation}
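
As a brief worked note on the two equations above (a sketch under assumptions, not a description of our exact optimizer), suppose $F_i$ is differentiable and client $i$ uses plain gradient descent with a hypothetical step size $\eta$. The gradient of the local objective is
\begin{equation*}
  \nabla_w h_i(w;w^t)=\nabla F_i(w)+\mu\,(w-w^t),
\end{equation*}
so one local step reads $w \leftarrow w-\eta\bigl(\nabla F_i(w)+\mu\,(w-w^t)\bigr)$. Setting $\mu=0$ recovers an ordinary local gradient step, while a larger $\mu$ pulls $w$ toward the global model $w^t$. On the server side, with $N=3$ selected clients the aggregation is simply $w^{t+1}=\tfrac{1}{3}\bigl(w^{t+1}_1+w^{t+1}_2+w^{t+1}_3\bigr)$.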

The updated global model is then broadcast back to all clients at the start of each new training round, so that their subsequent local training cycles benefit from the latest global insights. This iterative process accommodates variable local workloads and is further enhanced by adaptive hyperparameter tuning. The inference flow is as follows: audit log $\to$ provenance graph $\to$ behavior sequence $\to$ semantic vector $\to$ detection and investigation. During inference, when a new alert is generated, the corresponding behavior sequence is processed to produce an embedding that feeds into a classifier (or an unsupervised technique), which determines whether the sequence is indicative of an ongoing APT attack. To achieve cross-domain attack scenario reconstruction, critical information regarding malicious behaviors is shared while the client's detection capability is being improved. FALCON therefore scales across distributed environments, improving detection accuracy for APT investigations and ensuring robust convergence under heterogeneous system capabilities.


\subsection{Behavior Sequences Construction}
\textbf{Provenance graph construction.} To address missing correlations in cross-terminal provenance graphs, FALCON uses a cross-terminal entity correlation method based on event occurrence time and information flow. We conduct information flow analysis on time-aligned system events to determine the real relationships. Figure \ref{fig:fig3} depicts the constructed cross-terminal provenance graph. FALCON identifies system events with causal relationships from the provenance graphs of terminals that have established communication. The cross-terminal provenance graph is constructed by removing the directed edges and socket nodes in the two system events and adding a correlation edge between the processes involved.

\begin{figure}[!htb]
  \centering
  \includegraphics[width=\linewidth]{Figs/figure-3.png}
  \caption{Cross-domain provenance graph partitioning and reduction: This process involves reducing provenance graphs while preserving essential relationships and information. The lateral movement process is marked in red.}
  \label{fig:fig3}
\end{figure}

\textbf{Provenance graph partition and optimization.} In response to the coarse-grained redundancy of audit logs, FALCON partitions the original provenance graph according to event similarity and removes redundant events and calls. To calculate the similarity between system events, FALCON extracts three features: the time interval feature $f_{TI}$, the probability feature $f_{Prob}$, and the entity attribute feature $f_{EA}$ (see Appendix \ref{B.2}). The probability that two system events belong to the same behavior is calculated as follows:
%\vskip -0.1in
\begin{equation}
  P_{sim}(event)=\lambda \cdot f_{TI}+\mu \cdot f_{Prob}+\delta \cdot f_{EA},
\end{equation}
%\vskip -0.1in
where the coefficients $\lambda$, $\mu$, and $\delta$ are weighting factors that can be adjusted to the characteristics of the data to balance the contributions of the three features. Figure \ref{fig:fig3} illustrates the provenance partition and optimization process with examples. FALCON uses HDBSCAN \citep{mcinnes2017accelerated} for the clustering task, since it accepts a matrix as input and does not require the number of clusters to be declared in advance. By segmenting long-running processes and consolidating redundant events, the initial provenance graphs are converted into behavior-focused provenance graphs that concisely capture system activities.
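
As a worked example of the similarity score, with hypothetical weights and feature values chosen only for illustration (assuming each feature is normalized to $[0,1]$ and the weights sum to one), let $\lambda=0.5$, $\mu=0.3$, $\delta=0.2$ and $f_{TI}=0.8$, $f_{Prob}=0.6$, $f_{EA}=1.0$. Then
\begin{equation*}
  P_{sim}(event)=0.5\times 0.8+0.3\times 0.6+0.2\times 1.0=0.40+0.18+0.20=0.78 .
\end{equation*}
Under these assumptions $P_{sim}$ stays in $[0,1]$, so it can be read directly as a probability, and $1-P_{sim}$ could serve as a pairwise distance for clustering.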

\textbf{Behavior sequence extraction.} FALCON designs a Depth-First Search (DFS) method with specific conditions to extract behavior sequences for each system event from Behavior Partition Graphs (BPGs). Since obtaining labeled data is not feasible in real IoT environments, the traversal paths are determined based on the frequency of event occurrences. This ensures obtaining distant contextual relationships without excessively long behavior sequences. The detailed definitions are provided in Appendix \ref{B}.

\subsection{Local Model Training}
\textbf{Tokenization of behavior sequence.} Tokenization splits the events in a sequence into smaller units (tokens) and uses these tokens to represent a behavior sequence. FALCON constructs a token dictionary, $Dict_{Seq_B}$, from the words in the behavior sequences, including entities and operations. As in the BERT model, the special tokens [CLS], [SEP], [PAD], and [MASK] are added to the dictionary. The embedding representation of a tokenized sequence is constructed by integrating token embeddings, positional embeddings, and segment embeddings.
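
As a sketch of this step, a hypothetical event $\langle \texttt{bash}, \texttt{/etc/passwd}, read, t_1\rangle$ could be tokenized as \texttt{[CLS] bash read /etc/passwd [SEP]} and padded with \texttt{[PAD]} up to a fixed length. Assuming, as in the original BERT model, that the three embeddings are combined by element-wise summation (an assumption made here for illustration; other combinations are possible), the input vector of the $k$-th token is
\begin{equation*}
  e_k = E_{tok}(tok_k) + E_{pos}(k) + E_{seg}(s_k),
\end{equation*}
where $E_{tok}$, $E_{pos}$, and $E_{seg}$ denote the token, positional, and segment embedding tables and $s_k$ is the segment index of token $k$. These symbols are introduced for this illustration only.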


\textbf{Pre-training.} FALCON achieves representation learning on a large set of unlabeled behavior sequences by designing pre-training tasks and maps words and sequences into a vector space. In this vector space, semantically similar words and sentences have closer distances. FALCON designs two pre-training tasks along with corresponding loss functions for representation learning at the word and sequence levels. We present details in Appendix \ref{B.5}.

(1) Masked Entity Prediction (MEP) task. FALCON employs a multi-layer bidirectional Transformer architecture for training. This model architecture leverages the MEP task to capture bidirectional contextual information for embedding a specific word. Here, FALCON employs the negative log-likelihood function as the loss function.
    \begin{equation}
    \mathcal{L}_{MEP}=-\textstyle{\sum_{i = 1}^{M}\log\bigl(p(Seq_{i}^{mask}=tok_i \mid \theta,\theta_1)\bigr)},
    \end{equation}
where $M$ is the number of masked entities, $\theta$ denotes the parameters of the Transformer encoder, and $\theta_1$ denotes the parameters of the output layer connected to the encoder in the MEP task. The probability function $p(\cdot)$ depends on the parameters $\theta$ and $\theta_1$, and $Seq_{i}^{mask}$ represents the token masked at the $i$-th position in the tokenized behavior sequence.
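
As a worked example with hypothetical numbers, assuming the natural logarithm, suppose $M=2$ entities are masked and the model assigns probabilities $0.9$ and $0.6$ to the correct tokens. Then
\begin{equation*}
  \mathcal{L}_{MEP}=-(\ln 0.9+\ln 0.6)=-\ln 0.54\approx 0.616,
\end{equation*}
and the loss approaches $0$ as both probabilities approach $1$.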

(2) Sequence Homology Prediction (SHP) task. The goal of the SHP task, which is based on sequence-level representation learning, is to predict whether two behavior sequences originate from the same BPG. Behavior sequences from the same origin graph may exhibit potential causal relationships. SHP is a binary classification task, so FALCON employs the binary cross-entropy loss function for training:
\begin{align}
    \mathcal{L}_{SHP}&=-\textstyle{\sum_{i = 1}^{N}\log\bigl(p_i(n=n_i \mid \theta,\theta_2)\bigr)},\\
    n_i&\in \{Homologous,NonHomologous\} \nonumber,
\end{align}
where $N$ is the number of input sequence samples, $\theta_2$ denotes the parameters of the output layer connected to the encoder in the SHP task, and $p_i$ denotes the predicted value of the model for the $i$-th sample.
To capture both token-level and sequence-level features, FALCON uses the sum of the MEP and SHP objectives as the overall pre-training loss:
\begin{equation}
\mathcal{L}_{overall}=\mathcal{L}_{MEP}+\mathcal{L}_{SHP}.
\end{equation}
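
As a worked example with hypothetical numbers, assuming the natural logarithm, suppose $N=2$ sequence pairs are evaluated and the model assigns probabilities $0.8$ and $0.7$ to their true labels. Then
\begin{equation*}
  \mathcal{L}_{SHP}=-(\ln 0.8+\ln 0.7)=-\ln 0.56\approx 0.580 .
\end{equation*}
If the MEP loss on the same batch were $0.62$ (again a hypothetical value), the overall loss would be $\mathcal{L}_{overall}\approx 0.62+0.58=1.20$. Both terms enter with equal weight, so neither the token-level nor the sequence-level objective is favored a priori.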

\textbf{Fine-tune for the downstream task.} FALCON provides optional fine-tuning modules for the downstream task. In scenarios where obtaining high-quality labeled data is challenging, FALCON trains on unlabeled datasets with unsupervised classification, for which we employ a One-Class Support Vector Machine (OC-SVM). When high-quality labeled data is available from the TDS in the IoT system, FALCON uses this labeled data to learn from both attack and normal behavior sequences simultaneously, thereby fine-tuning the model (more details in Appendix \ref{B.6}).

\subsection{Attack Investigation Analysis}
The goal of FALCON is to automatically investigate the authenticity of TDS alarms in IoT systems, identify undetected attack events, and construct complete attack scenarios. An attack case study was conducted to demonstrate the effectiveness of FALCON in IoT attack investigation (see Appendix \ref{E.6}). Each client constructs an independent APT attack scenario based on local system events associated with known malicious behaviors. Moreover, clients upload critical information about malicious behaviors, allowing us to reconstruct a global APT attack scenario across domains on the server. Since only model updates and malicious behavior information are shared, FALCON essentially mitigates the potential privacy risks associated with sharing raw audit logs.

FALCON chooses low-frequency events as traversal paths to construct behavior sequences, with the system events corresponding to alarms as root nodes. FALCON employs the trained model to determine whether the behavior sequences associated with alerts are malicious. Attacks that evade the TDS and remain latent can cause more serious harm to IoT systems. To find potential undetected attack events, FALCON analyzes all system events in the IoT system. Specifically, FALCON constructs BPGs and extracts behavior sequences. The extracted behavior sequences are classified by the trained model, and the system events corresponding to sequences predicted as malicious are flagged as malicious.

Reconstructing the attack scenario involves establishing associations between system events corresponding to real alerts and malicious events, and generating attack scenario graphs to depict the entire attack process. Specifically, FALCON constructs the attack graph by considering the reachability of malicious events belonging to the same behavior origin graph. When searching for reachable paths in the behavior origin graph, multiple paths may exist. To minimize the cost of attack implementation, attackers typically choose the shortest path to execute the attack. Therefore, FALCON calculates scores for all paths based on the frequency of events occurring on the path and the path length.
    \begin{align}
        Score_{path_i}&=\frac{1}{len_{path_i}}\sum_{n = 1}^{L} Freq_{event_n}\times \frac{len_{path_i}}{max_{len}} \nonumber\\
        &=\frac{\sum_{n = 1}^{L} Freq_{event_n}}{max_{len}},
    \end{align}
where $path_i \in \mathcal{PATH}=\{path_1,\dots,path_m\}$ is one of the reachable paths between two malicious system events, $L$ is the number of events on $path_i$, $Freq_{event_n}$ is the occurrence frequency of the $n$-th event on the path, and $max_{len}=\max_{path \in \mathcal{PATH}} len_{path}$ is the length of the longest reachable path. FALCON selects the path with the minimum score to connect two malicious events, which represents the path most likely chosen by the attacker when low event frequency and short path length are considered jointly. In the end, FALCON can reconstruct a comprehensive attack scenario spanning multiple hosts. The resulting scenario graph is concise and contains crucial information about the attack.
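
As a worked example with hypothetical values, suppose two reachable paths connect a pair of malicious events. The first path contains three events with frequencies $2$, $1$, and $3$, and the second contains four events, each with frequency $1$, so that $max_{len}=4$. Then
\begin{equation*}
  Score_{path_1}=\frac{2+1+3}{4}=1.5, \qquad Score_{path_2}=\frac{1+1+1+1}{4}=1.0,
\end{equation*}
and the second path is selected. Because all paths share the same denominator $max_{len}$, the ranking is determined by the summed event frequencies along each path; a longer path can therefore still be preferred when its events are sufficiently rare.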

\section{Experiment}
\label{section: E}
In this section, we evaluate FALCON along multiple experimental dimensions and present the main results. We also perform ablation studies and explainability analyses.
\subsection{Experiment Setup}
%In order to verify FALCON's performance of APT attack investigation, 
The IoT system comprises 20 IoT devices, 6 edge servers, and one cloud service, with each edge server connected to at least two IoT devices. We analyze audit logs in a Docker container on each server, which is equipped with an Intel(R) Xeon(R) Silver 4215R CPU (8 cores at 3.20 GHz), a GeForce RTX 3090 GPU, and 256 GB of memory, running Ubuntu 18.04.5 LTS. IoT devices serve as data collection terminals and edge servers as clients for local model training, while the cloud server performs global model aggregation to facilitate threat intelligence sharing. For the distributed FL structure, we deploy a central server, six edge servers, and a high-performance host as isolated client devices. Each client manages 2 or 3 IoT terminals and performs model training within Docker containers. We constrain the number of communication rounds to between 200 and 500, dynamically adjusting the number of local iterations as required to improve the convergence of the global model. We simulate five APT attacks based on detailed reports \citep{hanh2022vietnam}; each complete attack lasted at least 2 days. More implementation details are listed in Appendix \ref{E.1}.

\begin{table*}[ht]
  \centering
  \caption{APT Investigation: The results of FALCON in investigating the authenticity of alerts and identifying lurking attack events within system events, along with the ground truth information for each executed attack.}
    \begin{tabular}{c|ccc|ccc|ccc}
    \toprule
    \multicolumn{1}{c}{\multirow{2}[2]{*}{\textbf{Scenarios}}} & \multicolumn{3}{c}{\textbf{Ground Truth}} & \multicolumn{3}{c}{\textbf{Alerts Investigation Result}} & \multicolumn{3}{p{12.57em}}{\textbf{Events Investigation Result}} \\
    \multicolumn{1}{c}{} & \multicolumn{1}{c}{Events} & \multicolumn{1}{c}{True} & \multicolumn{1}{c}{False} & \multicolumn{1}{c}{\textit{Precision}} & \multicolumn{1}{c}{\textit{Recall}} & \multicolumn{1}{c}{\textit{F1-score}} & \multicolumn{1}{c}{\textit{Precision}} & \multicolumn{1}{c}{\textit{Recall}} & \multicolumn{1}{c}{\textit{F1-score}} \\
    \midrule
    OceanLotus & 57    & 7     & 48    & 100.00\% & 97.92\% & 98.95\% & 95.00\% & 100.00\% & 97.44\% \\
    APT28 & 68    & 8     & 39    & 97.50\% & 100.00\% & 98.73\% & 95.65\% & 97.06\% & 96.35\% \\
    Kimsuky & 44    & 6     & 52    & 100.00\% & 96.15\% & 98.04\% & 97.67\% & 95.45\% & 96.55\% \\
    attack 1 & 38    & 0     & 31    & 100.00\% & 96.77\% & 98.36\% & 97.30\% & 94.74\% & 96.00\% \\
    attack 2 & 31    & 4     & 23    & 95.65\% & 95.65\% & 95.65\% & 96.77\% & 96.77\% & 96.77\% \\ 
    \hline
    Total or Avg. & 238   & 25    & 193   & 98.95\% & 97.41\% & 98.17\% & 96.25\% & 97.06\% & 96.65\% \\
    \bottomrule
    \end{tabular}%
  \label{tab:tab1}%
\end{table*}%

\textbf{Datasets.} Based on detailed reports of real-world APT campaigns, we conducted five simulated attacks and generated audit logs within a controlled IoT testbed environment; we refer to these logs as the IoT dataset. In addition to the classic attack scenarios OceanLotus \citep{hanh2022vietnam}, APT28 \citep{freebuf2018}, and Kimsuky \citep{yoroi2020}, we designed two sophisticated attacks that exploit new vulnerabilities. Two widely used datasets, the ATLAS dataset \citep{alsaheel2021atlas} and the CADETS dataset \citep{torrey2020}, are additionally used to evaluate the generalizability of FALCON. The attribute information of the datasets is listed in Appendix \ref{E.2}. We also introduce two mixed datasets: one combining the IoT and ATLAS datasets and another combining all three datasets. Note that mixing is applied to the samples before they enter the model, not to the original audit data.

\textbf{Evaluation Setup.} In the experimental environment, we simulated unauthorized actions by normal users, such as elevating process privileges (which triggers an alarm) and then running a program that reads and writes multiple files. This series of actions exhibits behavior patterns similar to those of attackers running malicious software. \textit{Precision}, \textit{Recall}, and \textit{F1-score} (see Appendix \ref{E.3}), as well as the common \textit{ROC} curve and \textit{AUC} metrics, are used to evaluate the performance of FALCON in attack investigation. To quantitatively evaluate the reconstruction results, we design three metrics: $\textit{SNE}$, $\textit{DNE}$, and $\textit{SIM}$.

\begin{align}
%\setlength\abovedisplayskip{-6em}
     \textit{SNE} &= \frac{|SN|+|SE|}{|N_{GT}|+|E_{GT}|},\\
     \textit{DNE} &= \frac{|DN|+|DE|}{|N_{GT}|+|E_{GT}|},\\
     \textit{SIM} &\!=\! \frac{|SN|+|SE|}{\max\{(|N_{GT}|\!+\!|E_{GT}|),(|N_{R}|\!+\!|E_{R}|)\}},
%\setlength\belowdisplayskip{-6em}
\end{align} %\vskip -0.1in

where $SN$ and $SE$ denote the nodes and edges shared by the reconstructed graph and the ground truth graph, and $DN$ and $DE$ denote the nodes and edges that differ. $N_{R}$ and $N_{GT}$ denote the nodes in the reconstructed graph and the ground truth graph, respectively, and $E_{R}$ and $E_{GT}$ denote the corresponding edges. When the same nodes and edges account for a large proportion of the graph (a high $\textit{SNE}$), FALCON's reconstructed attack scenario includes most of the critical attack events. The different nodes and edges ($\textit{DNE}$) are those that appear in the reconstructed scenario graph but not in the ground truth attack graph. %\vskip -0.4in
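
As an illustration with hypothetical counts (not taken from the experiments), suppose that the ground truth graph has $|N_{GT}|+|E_{GT}|=100$ nodes and edges, that the reconstructed graph shares $|SN|+|SE|=90$ of them, and that it contains $|DN|+|DE|=15$ additional nodes and edges. Assuming the reconstructed graph consists exactly of these shared and additional elements, $|N_{R}|+|E_{R}|=105$, and the three metrics evaluate to
\begin{align*}
     \textit{SNE} &= \tfrac{90}{100} = 90.0\%,\\
     \textit{DNE} &= \tfrac{15}{100} = 15.0\%,\\
     \textit{SIM} &= \tfrac{90}{\max\{100,105\}} \approx 85.7\%.
\end{align*}
In this example $\textit{SIM}$ is lower than $\textit{SNE}$ because the reconstructed graph is larger than the ground truth graph, so the extra elements are penalized in the denominator.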

\subsection{APT Attack Investigation Evaluation}
Threat Detection Systems (TDS) can detect only a limited number of attack events and tend to generate many false alerts, as the ``Ground Truth'' columns of Table \ref{tab:tab1} show. There are 238 system events directly related to the attacks, as well as numerous system calls generated alongside these attack events. Within two weeks, the IDS and the firewall generated a total of 218 alerts, of which 25 were true alerts and 193 were false alerts. The reason is that attackers often disguise their behavior to evade detection.

The ``Alerts Investigation Result'' in columns 5 to 7 of Table \ref{tab:tab1} indicates that FALCON can accurately determine the authenticity of alerts, achieving an average \textit{precision} of $98.95\%$, a \textit{recall} of $97.41\%$, and an \textit{F1-score} of $98.17\%$. The fifth column shows that among all predicted true alarms, $98.95\%$ were triggered by attack events. Our analysis shows that the false positives were caused by normal behaviors that resemble attack behavior patterns. The sixth column shows an average \textit{recall} of $97.41\%$; the errors originated from failed initial intrusion attempts.
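
As a consistency check, assuming the standard definition of the \textit{F1-score} as the harmonic mean of \textit{precision} and \textit{recall}, the averaged values in Table \ref{tab:tab1} give
\begin{equation*}
     \frac{2\times 0.9895\times 0.9741}{0.9895+0.9741}\approx 0.9817,
\end{equation*}
which agrees with the reported average \textit{F1-score} of $98.17\%$.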

The ``Events Investigation Result'' in columns 8 to 10 of Table \ref{tab:tab1} shows that FALCON can recognize benign events with an average \textit{precision} of $96.25\%$, a \textit{recall} of $97.06\%$, and an \textit{F1-score} of $96.65\%$. The results indicate that FALCON performs slightly better at identifying attack events than benign events. The false negatives in the event investigation are also attributed to failed access attempts during the initial access phase. (More details are given in Appendix \ref{E.4}.)

\begin{figure}[!htb]
  \centering
  \includegraphics[width=\linewidth]{Figs/figure-4.png}
    \caption{ROC curve and AUC of FALCON in attack investigation}
    \label{fig:fig4}
\end{figure}

Figure \ref{fig:fig4} shows the ROC curve and AUC of FALCON during alert investigation, which demonstrate that FALCON can accurately determine the authenticity of alerts. AuditLogBERT accurately identifies the system events corresponding to attacks, with a \textit{precision} of $95.87\%$, a \textit{recall} of $97.48\%$, and an \textit{F1-score} of $96.67\%$. The figure also shows the ROC curve of FALCON during event investigation, with an AUC of $0.9497$. These results indicate that FALCON can effectively judge the authenticity of TDS alerts and, through AuditLogBERT, compensate for the shortcomings of TDS by detecting the critical attack steps that TDS overlooks.

\begin{figure}[!htb]
  \centering
  \includegraphics[width=\columnwidth]{Figs/figure-5.png}
    \caption{Comparing the identical and different nodes and edges between the reconstructed attack and the ground truth scenarios.}
    \label{fig:fig5}
\end{figure}

\subsection{APT Attack Scenario Reconstruction}
To evaluate the effectiveness of FALCON in reconstructing attack scenarios, we compared the reconstructed attack scenario graphs with the ground truth attack graphs. Figure \ref{fig:fig5} shows the number of nodes and edges that are the same or different between them. The reconstructed attack scenario graphs generally contain more nodes and edges than the ground truth attack graphs. The differences are mainly benign events misclassified as attack events and events on incorrectly chosen paths during the reconstruction of the attack scenario. Conversely, some attack events that FALCON did not identify are recovered in the attack scenario through path selection.
% Table generated by Excel2LaTeX from sheet 'Sheet4'
\begin{table}[hbt]
    %\vspace{-1em}
    \caption{Results of attack scenario reconstruction, in terms of ground truth and three metrics, \textit{SNE}, \textit{DNE}, and \textit{SIM}.}
    \label{tab:tab2}%
    \begin{center}
    %\begin{small}
      \begin{tabular}{c|cc|ccc}
      \toprule
      \multicolumn{1}{c}{\multirow{2}{*}{\textbf{Scenarios}}} & \multicolumn{2}{c}{\textbf{G-Truth}} & \multicolumn{3}{c}{\textbf{Reconstruction results}} \\
      \multicolumn{1}{c}{} & \multicolumn{1}{c}{$\left\lvert N \right\rvert $} & \multicolumn{1}{c}{$\left\lvert E \right\rvert $} & \multicolumn{1}{c}{\textit{SNE}} & \multicolumn{1}{c}{\textit{DNE}} & \textit{SIM} \\
      \midrule
      OceanLotus & 48    & 57  & 92.4\% & 7.5\% & \multicolumn{1}{c}{91.5\%} \\
      APT28 & 43    & 68  & 90.1\% & 14.4\% & \multicolumn{1}{c}{86.2\%} \\
      Kimsuky & 29    & 44  & 94.5\% & 12.3\% & \multicolumn{1}{c}{88.5\%} \\
      Attack4 & 31    & 38  & 87.0\% & 10.1\% & \multicolumn{1}{c}{87.0\%} \\
      Attack5 & 25    & 31  & 91.1\% & 17.9\% & \multicolumn{1}{c}{83.6\%} \\ 
      \hline
      Total or Avg. & 176   & 238   & 91.1\% & 12.3\% & 87.4\% \\
      \bottomrule
      \end{tabular}%\
%\end{small}
\end{center}
%\vspace{-1em}
\end{table}%

The results of attack scenario reconstruction are shown in Table \ref{tab:tab2}. On average, the attack scenarios reconstructed by FALCON include $91.1\%$ of the attack nodes and edges, with a similarity ranging from $83.6\%$ to $91.5\%$. The experimental results indicate that FALCON can accurately reconstruct concise APT attack scenario graphs from multi-source, heterogeneous audit logs. It removes system events that are not directly related to the attack and retains only the key attack events. These simplified attack scenario graphs help security analysts quickly understand the complete attack process during an investigation, identify the attack entry points, and assess the impact.

\subsection{Ablation Experiment}
The ablation experiments are conducted on multiple datasets to evaluate the factors that influence FALCON's performance. We verify that the proposed behavior provenance graph, the pre-training model, and the downstream task classifiers each enhance FALCON's capability in conducting APT attack investigations. The comparison of downstream task classifiers and further details are given in Appendix \ref{E.5}.
% Table generated by Excel2LaTeX from sheet 'Sheet3'
\begin{table}[H]
    %\vspace{-1em}
    \caption{Performance of attack investigations using graph partition and optimization algorithms and raw graphs.}
    %\vspace{-1em} 
    \label{tab:tab3}%
    \begin{center}
    %\begin{small}
    \resizebox{0.52\textwidth}{!}{
    \belowrulesep=1pt
    \aboverulesep=0pt
      \begin{tabular}{c|c|c|ccc}
      \toprule
      \multirow{2}{*}{\textbf{Datasets}} & \multirow{2}{*}{\textbf{Graphs}} & \multicolumn{1}{c|}{\textbf{Time}} & \multicolumn{3}{c}{\textbf{Events Investigation Result}} \\
            &       & \multicolumn{1}{c|}{\textit{(h:m:s)}} & \multicolumn{1}{c}{\textit{Precision}} & \multicolumn{1}{c}{\textit{Recall}} & \multicolumn{1}{c}{\textit{F1-score}} \\
      \midrule
      \multirow{2}{*}{IoT Dataset} & $G_{Raw}$    &   3:47:49    &  56.56\%  &  81.51\%  &  66.78\%  \\
         & $G_{Opt}$    &   1:14:09    &   96.25\%  &  97.06\%  &  96.65\%  \\
      \midrule
        ATLAS & $G_{Raw}$    &    2:39:06   &   68.32\%     &   75.89\%    & 71.91\% \\
        \citep{alsaheel2021atlas}    & $G_{Opt}$    &   0:58:23    &  97.39\%  &  98.60\%   &  97.99\%  \\
      \midrule
        CADETS & $G_{Raw}$    &   5:29:54    &   74.76\%    &   76.95\%	& 75.84\%  \\
        \citep{torrey2020}    & $G_{Opt}$    &   1:35:47    &   96.89\%    &   98.92\%    &  97.89\% \\
      \bottomrule
      \end{tabular}%
    }
    %\end{small}
    \end{center}
    %\vspace{-1em}
  \end{table}%
  
\textbf{Raw graph vs. optimized graph.} Removing redundant events significantly improves the efficiency of attack investigations, while eliminating erroneous dependencies improves the accuracy of identifying attack events. The results in Table \ref{tab:tab3} show that, on all three datasets, the runtime of attack investigation decreases significantly when the optimized BPGs are used instead of the raw graphs, and all metrics of the event investigation results are higher than those obtained with the raw graphs. The raw graphs perform worst on the IoT dataset, where the \textit{F1-score} is 29.87 percentage points lower than with the optimized graphs. This decline can be attributed to the heterogeneity of the data and semantic differences, which introduce considerable noise and make it difficult for the model to learn patterns within the behavior sequences. Therefore, obtaining provenance graphs through partitioning and optimization improves both the efficiency and the performance of attack investigations.
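
To make the runtime reduction concrete, the times in Table \ref{tab:tab3} can be converted to seconds (for example, 3:47:49 is $13669$\,s and 1:14:09 is $4449$\,s). This is simple arithmetic on the reported values and gives the following relative reductions:
\begin{align*}
     \text{IoT:}\quad & 1-\tfrac{4449}{13669}\approx 67.5\%,\\
     \text{ATLAS:}\quad & 1-\tfrac{3503}{9546}\approx 63.3\%,\\
     \text{CADETS:}\quad & 1-\tfrac{5747}{19794}\approx 71.0\%.
\end{align*}
That is, the optimized graphs shorten the investigation time by roughly a factor of $2.7$ to $3.4$.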

\textbf{Pre-training model.} To evaluate how effectively the pre-trained model proposed in FALCON embeds behavior sequences, we compared it on three datasets with four typical deep learning models: CNN, LSTM, and two state-of-the-art pre-trained models for sequence analysis and natural language processing, BERT and RoBERTa.

\begin{figure*}[!htb]
  \centering
  \includegraphics[width=\linewidth]{Figs/figure-6l.png}
  \caption{Effectiveness of the pre-trained model proposed by FALCON in embedding behavior sequences, compared with several deep learning models.}\label{fig:fig6}
\end{figure*}

\begin{figure*}[!htb]
  \begin{center}
  \centerline{\includegraphics[width=\linewidth]{Figs/figure-7l.png}}
  \caption{Comparison with the existing advanced APT attack investigation methods ATLAS and AIRTAG.}
  \label{fig:fig7}
  \end{center}
\end{figure*}

The comparative results between FALCON and the other deep learning models are shown in Figure \ref{fig:fig6}. FALCON achieved the best results on all three datasets, with \textit{F1-scores} of $96.65\%$, $97.99\%$, and $97.89\%$ on the IoT dataset, ATLAS, and CADETS, respectively. Compared to FALCON, BERT and RoBERTa showed a decrease in \textit{F1-score} ranging from $3.92\%$ to $8.34\%$ across the three datasets. The improvement of FALCON over BERT indicates that the pre-training tasks proposed in this paper effectively benefit the downstream task of attack investigation. The gap between the \textit{F1-scores} of BERT and FALCON is largest on the IoT dataset, which indicates that the Next Sentence Prediction (NSP) task has a negative impact on attack investigation in the IoT context. The slight improvement of RoBERTa over BERT also indicates that the NSP task is not suitable for attack investigations.

\subsection{Comparison Experiments}
We performed a comparative analysis between FALCON and existing advanced attack investigation methods to assess the strengths and weaknesses of the different approaches. The comparative results are presented in Figure \ref{fig:fig7}. ATLAS and AIRTAG achieved a maximum \textit{F1-score} of only $84.67\%$ on the IoT dataset, whereas FALCON's \textit{F1-score} is $11.98$ percentage points higher. The results indicate that existing attack investigation methods cannot be directly applied to IoT systems. The poor performance of AIRTAG is attributed to its tokenization strategy, which does not adequately cover semantically rich and diverse IoT audit logs. ATLAS performs worst, not only because of semantic differences but also because of the scarcity of training samples available in the IoT dataset.

In the mixed datasets, FALCON maintains strong performance with \textit{F1-score} values of $97.78\%$ and $98.33\%$. The primary reason for ATLAS's poor performance across multiple datasets remains the insufficient number of training samples. Comparing the \textit{F1-score} values of the three methods on the mixed dataset with their respective values on the IoT dataset, we observe a slight improvement in the effectiveness of attack investigation in IoT by adding datasets. The increased number of attack behaviors facilitates learning more attack patterns, thereby improving the recognition of attack events.

\section{Conclusion}
We address the challenges that cross-domain APT attacks and the scarcity of samples pose to attack investigation. We propose FALCON, a novel FL-based APT attack investigation method that captures complex causal relationships. FALCON constructs cross-terminal BPGs from heterogeneous audit logs. It trains adaptive local models on behavior sequences that contain extensive and remote contextual information, and it learns latent representations from unlabeled sequences. The results demonstrate that FALCON conducts efficient attack investigations in IoT systems and achieves strong performance. Scaling FALCON to larger networks will require addressing both computational and communication challenges. In future work, we will extend its APT investigation capabilities to large-scale IoT networks while preserving the balance between performance, privacy, and computational efficiency.

\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
  This work was supported by the National Key Research and Development Program of China under grant No. 2023YFB3107601.
\end{acknowledgements}


\bibliography{uai2025-274-refs}


% contributions, acknowledgments and references has been removed in pdf for initial submission
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
% \bibliographystyle{uai2025-template}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage
\onecolumn
\title{FALCON: Adaptive Cross-Domain APT Attack Investigation with Federated Causal Learning\\(Supplementary Material)}
\maketitle
\appendix
\section{Threat Model}
\label{A}
The above attack cases intuitively reveal how APT organizations carefully plan attack strategies to penetrate the defense lines of enterprise networks. Analyzing the system events of such APT attacks requires correlating heterogeneous log data from different types of terminal devices in the enterprise network to uncover the complete attack process. However, for communication with other terminals or external IPs, the provenance graph built from the audit logs records only socket attributes, such as $ \langle bash\_1931,10.46.146.2:2495,connect,2023/08/27 \rangle $, and provides no information about the processes that establish the communication on the connected host. As a result, all remote connection events are attached to the same socket node. As shown in Figure \ref{fig:figureA1}, when Report.pdf is transferred remotely from 10.135.22.6 to 10.46.146.2, the causal relationship between $bash\_1931$ and $bash\_2371$ cannot be constructed directly from the audit log. This makes it impossible to determine the accurate causal relationships and information flows across hosts, which seriously restricts the discovery of cross-terminal APT attacks based on the provenance graph. To address these problems, this paper studies how to construct cross-terminal provenance graphs.

\begin{figure}[!htb]
%\vskip 0.1in
\begin{center}
  \centerline{\includegraphics[scale=0.6]{Figs/A-figure1.png}}
  \caption{The correlation between the provenance graph of different terminals.}
  \label{fig:figureA1}
  \end{center}
  %\vskip -0.1in
\end{figure}

Similar to previous efforts in threat discovery based on audit logs \citep{hassan2019nodoze,alsaheel2021atlas,fang2022back,xu2022depcomm,gao2021enabling}, our approach operates under the prerequisite of ensuring the integrity and authenticity of audit logs. Therefore, our threat model assumes that the underlying operating system, audit engine, and monitoring data are integral components of a Trusted Computing Base (TCB). Our methodology does not account for kernel attacks or attacks targeting audit logs, including activities such as log deletion or modification. Although ensuring the integrity and authenticity of logs is a crucial aspect of a security framework, this specific aspect is beyond the scope of our research in this paper. 

\section{Implementation Details}
\label{B}
\subsection{Preprocessing}
\label{B.1}
FALCON analyzes various types of audit logs with different structures, converts them into system events, and builds the original provenance graphs from these system events. First, the set $E_s$ of system events that involve sockets is extracted from the provenance graph. For each event $ e\in E_s $, the extracted attributes include the time at which the event occurred and the information flow of the event within the terminal. When the event sends information out, the information flow $Flow^{in}$ denotes the system events that carry the inflow of that information; conversely, when the event receives information, the information flow $Flow^{out}$ denotes the system events that carry its subsequent outflow.

Second, the times of the system events $e_i^A$ and $e_j^B$ on different terminals are aligned. The system events that establish a connection between two terminals usually occur at the same time, so the correlation between the process entities in the two events can be established through time alignment as follows, where $R_{talig}$ is the set of time-aligned system event pairs:
\begin{equation}
  R_{talig}=\{(e_i^A,e_j^B)\mid T_{aligned}=e_i^A.time,\ e_j^B.time=T_{aligned}\}.
\end{equation}
Finally, on busy terminal servers, multiple system events often occur at the same time, and these events may correspond to multiple independent behaviors executed simultaneously. Therefore, we conduct information flow analysis on the aligned system events to determine the real relationship:
\begin{equation}
  \begin{split}
    R_{relation}&= \{\left(e_i^A,e_j^B\right)\mid\left(e_i^A,e_j^B\right)\in R_{talig},\\
  & \text{if } same\left({start}_{entity},{end}_{entity}\right)\},
  \end{split}
\end{equation}
where, when $|R_{talig}|=1$, $(e_i^A,e_j^B)$ is taken as the correct correlation. When $|R_{talig}|>1$, FALCON takes the file-type entities at the start or end of the information flow and compares their file names and file types with the Boolean function $same(\cdot)$. If the two nodes ${start}_{entity}\in{Flows}^{in}$ and ${end}_{entity}\in{Flows}^{out}$ have the same name or file type, $(e_i^A,e_j^B)$ is correct. Through the above steps, FALCON identifies causally related system events in the provenance graphs of terminals that have established communication.

\subsection{Details of Partition and Optimization}
\label{B.2}
For each long-running process, FALCON extracts three features to calculate the similarity between system events with this process as the subject or object. Similarity quantifies the likelihood that two events belong to the same behavior. FALCON utilizes this similarity to cluster system events belonging to the same behavior into the same execution partition, thus achieving graph partition.

\textbf{Time Interval Feature $f_{TI}(event)$}. Intuitively, the time interval between system events that belong to the same behavior on a process node is shorter than that between events belonging to different behaviors. Therefore, we design the time interval feature to model this intuition.
\begin{equation}
    f_{TI}(event_i,event_j)=\tanh\left(\dfrac{max\_interval}{|t_{event_i}-t_{event_j}|+\alpha}-\beta\right),
\end{equation}
where $t_{event_i}$ and $t_{event_j}$ are the occurrence times of the system events $event_i$ and $event_j$, and $max\_interval=t_{event_{end}}-t_{event_{start}}$ is the maximum time interval between system events within a process. $\alpha$ (we set $\alpha=0.001$) is a small positive number that keeps the denominator from being 0, and $\beta=\frac{max\_interval}{max\_interval+\alpha}$ ensures that $f_{TI}(event_i,event_j)$ is 0 when the time interval between the events equals $max\_interval$. The $\tanh(\cdot)$ function captures the nonlinear relationship between similarity and time interval and maps the values into the range $\left[0,1\right)$.
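
As a numerical illustration with hypothetical values, take $max\_interval=100$\,s and $\alpha=0.001$, so that $\beta=100/100.001\approx 1$. The feature then evaluates to
\begin{align*}
    |t_{event_i}-t_{event_j}|=10\,\text{s}:&\quad f_{TI}\approx\tanh(9.0)\approx 1.00,\\
    |t_{event_i}-t_{event_j}|=50\,\text{s}:&\quad f_{TI}\approx\tanh(1.0)\approx 0.76,\\
    |t_{event_i}-t_{event_j}|=100\,\text{s}:&\quad f_{TI}=\tanh(0)=0.
\end{align*}
Events that occur close together therefore receive a similarity near 1, and the similarity falls to 0 as the interval approaches the lifetime of the process.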

\textbf{Probability Feature $f_{Prob}(event)$}. During the life cycle of a process, system events belonging to the same behavior tend to occur together within a short period of time. Moreover, similar behaviors share the same pattern of system events. For example, running code always reads library files, although not necessarily the same ones. In audit data, this manifests as a higher probability that events belonging to the same behavior occur together. Therefore, the probability feature is designed to model this observation.
\begin{equation}
    f_{Prob}(event_i,event_j)=\dfrac{Num_{event_j}}{Num_{processes}},  
\end{equation}
where $Num_{event_j}$ is the number of times $event_i$ occurs when $event_j$ is present in the same process, and $Num_{processes}$ is the number of processes that contain $event_i$. When counting, we abstract the system events to eliminate the influence of noise. Specifically, we remove the process PID, remove the port of the IP entity, and retain only the file type or suffix for files. For example, the system event $\langle 7z.exe\_38012,C:/Users/Desktop/Threat\_Report.pdf,write,\allowbreak timestamp\rangle $ is abstracted as $\langle 7z.exe,pdf,write\rangle $.

\textbf{Entity Attributes Feature $f_{EA}(event)$}. Intuitively, entities or objects in the system events belonging to the same behavior have a high degree of similarity. For example, when installing a program or software, resource files are mostly written to a specified folder, and the types of files written are mostly similar. FALCON designs entity attribute features as an important indicator for quantifying the similarity of system events.

FALCON only calculates the similarity of events with the same entity type, i.e., when the types of two entities are different, $f_{EA}(event)=0$. If the types of two entities are the same, $f_{EA}(event)$ is calculated by the following formula.
\begin{equation}
    \begin{aligned}
f_{EA}&(event_i,event_j) =
\left\{
\begin{aligned}
    &\frac{num\_token(entity_i,entity_j)}{MAX\{len_{token_i},len_{token_j}\}}, \quad if \quad type=File\\
    &\frac{num\_bit(entity_i,entity_j)}{33}, \quad if \quad type=Socket \\
    &same_{name}(entity_i,entity_j), \quad if \quad type=Process \\
\end{aligned}
\right.
\end{aligned}
\end{equation}
When both system events operate on file-type entities, each path is tokenized by directory name, and the entity attribute feature is quantified by counting the number of identical initial tokens. Before $num\_token(entity_i,entity_j)$ is calculated, the $entity_i$ in $event_i$ is tokenized as $Dires_i=[directory_1,directory_2,\dots,file\_type].$ If the entity type is Socket, the IP addresses are first converted to binary, and $num\_bit(entity_i,entity_j)$ counts the number of identical initial bits. $same_{name}(entity_i,entity_j)$ is a Boolean function: if the two process names are identical, the value of $f_{EA}$ is set to 1; otherwise, it is set to 0.
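
As an illustration with hypothetical entities, consider the file paths \texttt{/usr/lib/python3/os.py} and \texttt{/usr/lib/python3/json/decoder.py}. Assuming that the file type is appended as the last token and that the comparison stops at the first mismatch, they are tokenized as [\texttt{usr}, \texttt{lib}, \texttt{python3}, \texttt{py}] and [\texttt{usr}, \texttt{lib}, \texttt{python3}, \texttt{json}, \texttt{py}]. The two lists share three initial tokens and the longer list has five tokens, so
\begin{equation*}
    f_{EA}=\frac{3}{\max\{4,5\}}=0.6.
\end{equation*}
For the Socket case, take the addresses 10.46.146.2 and 10.46.146.9 (the second address is hypothetical). Their first three octets are identical (24 bits), and the last octets, $00000010$ and $00001001$ in binary, share four leading bits, so $num\_bit=28$ and $f_{EA}=28/33\approx 0.85$.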

The processed subgraph can succinctly describe the corresponding high-level behaviors, enhancing the efficiency of subsequent analysis and ensuring the accuracy of generated behavior sequences. We obtained two similarity matrices describing the system events similarities, both $in$ and $out$, within a particular process, $P_{sim}^{in}(i,j)\in \textbf{P}^{N\times N}_{in}$ and $P_{sim}^{out}(i,j)\in \textbf{P}^{N\times N}_{out}$. These events are grouped into clusters, where events with the same entity type and operation are merged to reduce the graph's complexity. Each cluster represents an execution partition, and FALCON utilizes the reachability of information to associate $in$ and $out$ events within a partition. In each partition, the occurrence time of $in$ events should precede that of $out$ events to ensure that information does not flow from the future to the past. Finally, by partitioning these long-running processes and merging redundant events, the raw provenance graphs are transformed into the behavior provenance graphs that succinctly describe the behaviors.

\subsection{Behavior Sequence Extraction}
\label{B.3}
System events are categorized as high-frequency or low-frequency relative to the average frequency. The frequency of a system event is obtained from the proportion of events of that type among all events, formalized as follows:
\begin{equation}
    Freq=\ln (Times(event_{type})/Times(event_{all})),
\end{equation}
where $event_{type}$ represents a specific type of event, that is, an event that has been processed to remove noise information. In audit data, the events preceding and succeeding a low-frequency event are usually also low-frequency. Consequently, when traversing from a low-frequency system event as the root node, low-frequency events are chosen as the traversal paths. Similarly, when traversing from a high-frequency system event as the root node, high-frequency events are selected as the traversal paths. The constructed behavior sequence can be represented as $Seq_B=\{event_{prec_m},\dots,event_{prec_2},event_{prec_1},event_0,event_{succ_1}, \allowbreak event_{succ_2},\dots,event_{succ_n}\}$. The timestamps of events in the behavior sequence are monotonically increasing.
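
As an illustration with hypothetical counts, consider audit data containing $10{,}000$ events in total. An event type that occurs $50$ times and one that occurs $2{,}000$ times obtain
\begin{align*}
    Freq_1&=\ln(50/10000)=\ln(0.005)\approx -5.30,\\
    Freq_2&=\ln(2000/10000)=\ln(0.2)\approx -1.61.
\end{align*}
Assuming that event types whose $Freq$ lies below the average over all types are treated as low-frequency, the first type would be labeled low-frequency whenever the average exceeds $-5.30$. The logarithm compresses the wide range of raw proportions, so rare event types remain distinguishable from one another.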

Although FALCON represents heterogeneous audit data using platform-independent origin graphs, semantic differences remain in the data. These differences stem primarily from device information and operating systems. In addition, manually assigned file names act as noise that affects behavior pattern learning at the sequence level. Therefore, FALCON uses lemmatization techniques to remove noise from the entities in a behavior sequence and to map semantically different entities to the same semantic layer. FALCON employs the lemmatization rules proposed in our previous work \citep{li2023conlbs}. Entities of process type use the process name as the semantic description. Entities of file type use the content or type of the file record as the semantic description; for example, .py and .java are both mapped to ``code file''. Entities of socket type use the IP address as the semantic description.

\subsection{MEP \& SHP}
\label{B.5}
\textbf{[MEP]}. The basic idea behind MEP is similar to that of MLM in BERT: it uses [MASK] to randomly mask tokens in the sequence and predicts the masked tokens from bidirectional context information. Specifically, 15\% of the tokens in the input sequence $Seq_{toks}=[{tok}_1,{tok}_2,\dots,{tok}_m,\dots,{tok}_n]$ are randomly selected; each selected token is replaced by [MASK] with 80\% probability, replaced by another token $tok_x$ of the same type with 10\% probability, and left unchanged with 10\% probability. Unlike MLM, MEP does not choose replacement tokens entirely at random: the replacement must share the type of the original token. For example, a token describing a process is replaced with another process token. The input behavior sequence is thus converted into $Seq^{mask}_{toks}=[{tok}_1,[MASK],\dots,{tok}_x,\dots,{tok}_n]$.
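
As an illustration, for a hypothetical input sequence of $n=200$ tokens, $0.15\times 200=30$ tokens are selected. In expectation, $0.8\times 30=24$ of them are replaced by [MASK], $0.1\times 30=3$ are replaced by another token of the same type, and $0.1\times 30=3$ are left unchanged. Relative to the whole sequence, these correspond to $12\%$, $1.5\%$, and $1.5\%$ of the tokens, respectively.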

\textbf{[SHP]}. FALCON introduces the SHP task for sequence-level representation learning. Positive examples are generated by pairing two sequences that both consist of either low-frequency or high-frequency events and that originate from the same BPG. Negative examples are created by pairing two sequences from different BPGs. Positive and negative examples are sampled with equal probability. The fundamental idea behind this task is that behavior sequences from the same origin graph may exhibit potential causal relationships. These relationships include scenarios where the actions in one sequence serve as prerequisites for the operations in another sequence (sequential), or where the actions in both sequences collaborate to achieve a user's or attacker's objective (parallel).
\subsection{Fine-tuning}
\label{B.6}
\textbf{Supervised Fine-tuning.} In the pretraining phase, the model has already learned patterns of behavior sequences, but it lacks guidance on how to distinguish attack behavior sequences from normal ones. In real-world IoT environments, information systems often record TDS alert information analyzed by security analysts and label the logs associated with attacks. We can leverage this labeled data to fine-tune the model so that it learns the patterns of both attack and normal behavior sequences. Specifically, attack investigation is a binary classification task, so we add a linear classifier on top of the output layer of the pretrained model. Labeled behavior sequences are then fed into the adapted model for training, which yields an attack investigation model.
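
As a minimal sketch of this adaptation, we use notation introduced here only for illustration and not taken from the FALCON implementation. Let $\mathbf{h}_i \in \mathbb{R}^{d}$ be the sequence-level output of the pretrained model for the $i$-th labeled behavior sequence, $y_i \in \{0,1\}$ its label ($1$ for attack), and $\mathbf{w}, b$ the parameters of the added linear classifier. Assuming the common choice of a sigmoid output trained with binary cross-entropy (an assumption on our part, since the loss is not specified above), the classifier and its loss over $N$ labeled sequences can be written as
\begin{align*}
    \hat{y}_i &= \sigma\!\left(\mathbf{w}^{\top}\mathbf{h}_i + b\right), \qquad \sigma(z) = \frac{1}{1+e^{-z}},\\
    \mathcal{L} &= -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\,y_i \log \hat{y}_i + (1-y_i)\log\bigl(1-\hat{y}_i\bigr)\Bigr].
\end{align*}
Under this sketch, a sequence would be reported as an attack sequence when $\hat{y}_i$ exceeds a chosen threshold such as $0.5$.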

\textbf{Unsupervised Classification.} Following previous work, we employ a One-Class Support Vector Machine (OC-SVM) for unsupervised classification. OC-SVM learns the patterns of normal behavior sequences in the embedding space and fits a decision boundary around the training data. In the context of attack investigation, behavior sequences that fall outside this boundary are classified as attack sequences.
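
For reference, the decision rule of a standard OC-SVM can be sketched as follows. The symbols are the usual ones from the OC-SVM literature and are our notational assumptions rather than part of FALCON: $\mathbf{x}_i$ are the embeddings of the normal training sequences, $\alpha_i$ the learned coefficients, $\rho$ the learned offset, and $k$ the kernel, shown here as an RBF kernel with width parameter $\gamma$:
\begin{align*}
    f(\mathbf{x}) &= \operatorname{sgn}\Bigl(\sum_{i} \alpha_i\, k(\mathbf{x}_i,\mathbf{x}) - \rho\Bigr),\\
    k(\mathbf{x},\mathbf{x}') &= \exp\bigl(-\gamma\,\lVert \mathbf{x}-\mathbf{x}'\rVert^{2}\bigr).
\end{align*}
A sequence embedding with $f(\mathbf{x}) = +1$ lies inside the learned boundary and is treated as normal, whereas $f(\mathbf{x}) = -1$ marks it as falling outside the boundary, that is, as an attack sequence.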

\section{Additional Experiments and Details}
\label{E}
\subsection{Environment Supplement}
\label{E.1}
To verify FALCON's performance in IoT attack investigation, we designed a real, controllable IoT environment. To eliminate unpredictable factors, all behaviors and data in the system are transparent and controllable. We deployed TDS such as IDS and firewalls to detect attacks and generate alarms, so as to reproduce a realistic IoT operating environment. The generated alarms and the audit logs collected from terminals and servers are aggregated on independent GPU servers for analysis. Before simulating attacks, we replicated several typical IoT security issues in the devices. Specifically, four IoT devices used weak or default passwords, three IoT devices had known vulnerabilities \citep{sophos2022}, and two servers had system vulnerabilities exploitable by malicious software \citep{mitre2020}. Three attacks utilized IoT devices as initial access points, while two used phishing emails to access terminal servers and execute lateral movement.

\subsection{Datasets Extension}
\label{E.2}
A prevalent challenge in APT attack traceability analysis is the scarcity of publicly available attack datasets and well-annotated audit logs. Table \ref{tab:table1} summarizes the number and characteristics of system events, entities, and incident alarms in the different attack scenarios.



\begin{table}[htb]
  \centering
  %\vskip -0.1in
  \caption{Overview of the simulated IoT attack scenarios. Under Attack Features, \textbf{SA} indicates that the attack involves a server, \textbf{IoT EA} indicates that the attack involves an IoT edge device, \textbf{LM} indicates that the attacker moves laterally inside the system, and \textbf{C\&C} indicates that the attack involves establishing a connection with a C\&C server.}
  %\vskip 0.1in
    \begin{tabular}{lccccccccc}
    \toprule
    \multicolumn{1}{c|}{\multirow{2}[2]{*}{Attack Scenarios}} & \multicolumn{4}{c|}{Attack Features} & \multicolumn{1}{c|}{\multirow{2}[2]{*}{\makecell{Number \\ of Devices}}} & \multicolumn{1}{c|}{\multirow{2}[2]{*}{Size(GB)}} & \multicolumn{3}{c}{Alarms and Events} \\
    \multicolumn{1}{c|}{} & \multicolumn{1}{c}{SA} & \multicolumn{1}{c}{IoT EA} & \multicolumn{1}{c}{LM} & \multicolumn{1}{c|}{C\&C} & \multicolumn{1}{c|}{} & \multicolumn{1}{c|}{} & \multicolumn{1}{p{4.11em}}{\#Alarms} & \#Events & \multicolumn{1}{p{4.11em}}{\#Entities} \\
    \midrule
    OceanLotus &   $\checkmark$    &       &      &   $\checkmark$    & 1     & 1.49  & 55    & 792.2K & 83,984 \\
    APT28 &   $\checkmark$    &       &      &   $\checkmark$    & 1     & 1.67  & 47    & 846.4K & 79,325 \\
    Kimsuky &   $\checkmark$    &       &  $\checkmark$    &   $\checkmark$    & 2     & 1.32  & 58    & 813.9K & 75,687 \\
    attack1 &   $\checkmark$    &   $\checkmark$    &   $\checkmark$    &       & 3     & 3.28  & 31    & 1,763.7K & 186,355 \\
    attack2 &   $\checkmark$    &   $\checkmark$    &   $\checkmark$    &   $\checkmark$  & 3     & 2.54  & 27    & 1,459.3K & 112,329 \\
    \midrule
    Total & -     & -     &    -   &   -   & 10     & 10.3  & 218   & 5,675.5K & 537,680 \\
    \bottomrule
    \end{tabular}%
  \label{tab:table1}%
  %\vskip -0.1in
\end{table}%

Two widely used public datasets are used to evaluate the generalizability of FALCON. The ATLAS dataset \citep{alsaheel2021atlas} contains 10 simulated APT attacks with different vulnerabilities and different attack strategies. The CADETS dataset \citep{torrey2020} was released by the DARPA Transparent Computing program and was collected from hosts during DARPA's two-week red team vs. blue team exercise. This dataset includes attacks against the FreeBSD system, an open-source system used in some high-performance servers for the IoT. The attributes of the three datasets are shown in Table \ref{tab:table2}.

\begin{table}[htb]
    \centering
    \caption{Attributes of the three datasets.}
     %\vskip 0.1in
      \begin{tabular}{ccccc}
      \toprule
      \textbf{Datasets} & \multicolumn{1}{c}{\textbf{Scenarios}} & \textbf{Size} & \multicolumn{1}{c}{\textbf{Entities}} & \textbf{Events} \\
      \midrule
      IoT dataset & 5     & 10.3GB & 537,680 & 5,675.5K \\
      ATLAS \citep{alsaheel2021atlas} & 10    & 6.43GB & 200,884 & 2,488.2K \\
      CADETS \citep{torrey2020} & 3     & 35.7GB & 986,139 & 41,350.9K \\
      \bottomrule
      \end{tabular}%
    \label{tab:table2}%
  \end{table}%
  
\subsection{Evaluation Metrics}
\label{E.3}
Errors fall into two categories: False Positives (FP) and False Negatives (FN). An FP is a false alarm classified as a true alarm, or a normal event classified as an attack event. An FN is a true alarm classified as a false alarm, or an attack system event classified as normal.
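\begin{align}
    \mathit{Precision} &= \frac{TP}{TP+FP},\\
    \mathit{Recall} &= \frac{TP}{TP+FN},\\
    \mathit{F1}\text{-}\mathit{score} &= \frac{2\times \mathit{Precision} \times \mathit{Recall}}{\mathit{Precision}+\mathit{Recall}},
\end{align}
where $TP$ denotes the number of True Positives, i.e., true alarms or attack events that are correctly identified.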

\subsection{Complementary Explanation}
\label{E.4}
\textbf{Error of the ``Alerts Investigation Result''.} During the initial access phase, attackers may run certain commands to test whether access has succeeded. When the test indicates that access has failed (which triggers an alarm), they may change their attack strategy, adopt a new initial access method, and therefore not proceed with the subsequent attack steps. Currently, most provenance analysis methods find it challenging to detect attack behaviors whose initial access fails. We will attempt to address this issue in future research.

\textbf{Error of the ``Events Investigation Result''.} The unsuccessful execution of some attacks in their initial stages is attributable to certain environment configurations during the simulated attack process. Another reason for misclassification is the provenance graph partition and optimization method. When benign events and attack events occur close together in time and involve similar operations, some benign events may be partitioned into the provenance subgraph that describes attack behaviors. This situation does not increase false negatives, but it causes a few benign events to be identified as attack events, which is acceptable in practical APT attack investigations.


\subsection{Implementation Details}
\label{E.5}
\textbf{(1) Graph Partition and Optimization.} After constructing the raw provenance graphs from audit logs, FALCON applies a graph partitioning and optimization method to eliminate the dependency errors and redundant events caused by redundancy in the audit logs. Figure \ref{fig:figureA2} illustrates the optimization effects of FALCON on the three datasets. On average, FALCON reduces the number of system events in the three datasets by 88\% and partitions large, complex provenance graphs into behavior provenance graphs that describe behaviors more accurately. Obtaining behavior provenance graphs through partitioning and optimization improves both the efficiency and the performance of attack investigations.



\begin{minipage}[t]{0.48\textwidth} % left 45%
    \centering
    \includegraphics[width=\linewidth]{Figs/A-figure2.png} 
    \captionof{figure}{Reduction in the number of system events on the three datasets: IoT Malicious, ATLAS, and CADETS.}
    \label{fig:figureA2}
        % \caption{The architecture of the proposed model.}  
\end{minipage}
\hfill
\begin{minipage}[t]{0.48\textwidth} % right45%
    \centering
    \includegraphics[width=\linewidth]{Figs/A-figure3.png} 
    \captionof{figure}{Attack investigation results of different classifiers. Spectral Cluster, HDBSCAN, and OC-SVM (RBF) are unsupervised classifiers; Sigmoid (MLP), Linear, and SVM (RBF) are used for supervised fine-tuning.}
    \label{fig:figureA3}
    % \caption{The architecture of the proposed model.}   
\end{minipage}


During the experiments, working directly with the original provenance graph imposed high hardware requirements and frequently caused host-memory or GPU-memory overflow. This is attributed to the large number and relatively long average length of the constructed behavior sequences. Another factor is the larger vocabulary produced when embedding behavior sequences without lemmatization.

\textbf{(2) Pre-training.} 
Although increasing the number of samples could improve ATLAS's attack investigation performance, obtaining a large quantity of high-quality annotated samples for training is challenging in the real world. To assess the efficiency of FALCON in attack investigation, we compared FALCON with four typical deep learning models on the three datasets: a Convolutional Neural Network (CNN) \citep{zhang2015sensitivity}, Long Short-Term Memory (LSTM) \citep{memory2010long}, and two state-of-the-art pre-trained models for sequence analysis and natural language processing, BERT and RoBERTa. We used Word2Vec \citep{church2017word2vec} to convert behavior sequences into feature vectors as inputs to the CNN and LSTM. BERT and RoBERTa used the same fine-tuning strategy as described in this paper. The CNN performed worst on all three datasets because its limited convolutional kernel and window size prevent it from learning complete features of longer behavior sequences. Although LSTM mitigates gradient vanishing and exploding when training on long sequences, it cannot handle cases where attackers share entities with regular users. Moreover, both the CNN and LSTM rely on supervised learning, which makes good performance difficult to achieve with limited labeled data.

\textbf{(3) Downstream task classifiers.} Depending on the availability of labeled data, FALCON employs different classifiers for the downstream attack investigation task. Specifically, when high-quality labeled data is not available, FALCON uses HDBSCAN for unsupervised downstream task training. When labeled data is available, FALCON employs an MLP as the classifier for fine-tuning on the downstream attack investigation task. To show that the chosen classifiers are well suited to IoT attack investigation, we compare the performance of several typical unsupervised and supervised classifiers. The unsupervised classifiers include spectral clustering, HDBSCAN, and the OC-SVM with an RBF kernel used by AIRTAG, while the supervised classifiers include an MLP with a Sigmoid activation function, a linear classifier, and an SVM with an RBF kernel.

In supervised fine-tuning, we use 1,000 attack and 1,000 normal behavior sequences to train the model. The experimental results, shown in Figure \ref{fig:figureA3}, indicate that the linear classifier performs significantly worse than the other classifiers because attack investigation is a non-linear classification task. Among the supervised classifiers, Sigmoid slightly outperforms SVM, with \textit{Precision}, \textit{Recall}, and \textit{F1-score} of $96.25\%$, $97.06\%$, and $96.65\%$, respectively. The three unsupervised classifiers perform similarly, with HDBSCAN achieving \textit{F1-score} values $1.48\%$ and $1.38\%$ higher than spectral clustering and OC-SVM, respectively. Overall, the unsupervised classifiers perform slightly worse than the supervised approach.
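
As a consistency check on the Sigmoid scores reported above, substituting the \textit{Precision} and \textit{Recall} values into the \textit{F1-score} definition of Section~\ref{E.3} gives
\begin{equation*}
    \frac{2 \times 0.9625 \times 0.9706}{0.9625 + 0.9706} = \frac{1.868405}{1.9331} \approx 0.9665,
\end{equation*}
which agrees with the reported \textit{F1-score} of $96.65\%$.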

\subsection{Case Study}
\label{E.6}
An attack case study was conducted to demonstrate the effectiveness of FALCON in IoT attack investigation. Kimsuky is an APT group known for orchestrating sophisticated attacks on industrial IoT systems. In this particular case, Kimsuky deceived users into downloading a malicious zip file from the internet. Upon automatic extraction, the group leveraged process hollowing techniques to evade the TDS and acquire sensitive information. An alert ($Alert_1$) was triggered by the TDS because the suspicious file (an scr file) was written to the hard drive. However, the hollowed process \texttt{explorer.exe} did not trigger any alert. In addition, we simulated an illegal operation performed by a normal user, which triggered another alert ($Alert_2$). The upper part of Figure \ref{fig:figureA4} shows the two alerts and their context information. $Alert_1$ is a true alert, and the provenance graph containing $Alert_1$ describes the process of the attack case. Meanwhile, $Alert_2$ is a false alert that describes the process of elevating permissions on configuration files.

\begin{figure}[!hbt]
    %\vskip -0.1in
    \centering
    \includegraphics[scale=0.75]{Figs/A-figure4.png}
    \caption{A case study of FALCON illustrates the process of investigating alerts and system events, and reconstructing attack scenarios.}
    \label{fig:figureA4}
    %\vskip -0.1in
\end{figure}

First, FALCON constructs the provenance subgraphs (Step S1). Taking the system events corresponding to $Alert_1$ and $Alert_2$ as the root nodes, FALCON obtains the context information (system events) and constructs it into behavior sequences (Step S2). It should be noted that $\langle$explorer.exe execute encryptor.exe$\rangle$ is a low-frequency system event. The behavior sequences are tokenized and fed into the trained model for prediction (Step S3). The model outputs the classification result for each behavior sequence (Step S4). According to the results, system events are associated with one another to reconstruct the attack scenarios (Step S5). The behavior sequence constructed from the low-frequency events contains the three normal entities and relationships on the left of Figure 14. FALCON classifies this behavior sequence as an attack behavior sequence because most of the context in the sequence is unchanged. This low-frequency system event is reconstructed into the same attack scenario as $Alert_1$ based on dependencies. Finally, FALCON judges that $Alert_1$ is a true alert and $Alert_2$ is a false alert, and outputs an attack scenario after analyzing all system events.

This work primarily focuses on detecting sophisticated attacks that originate from external sources and that either target vulnerabilities within the system or trick users into downloading malicious files to compromise IoT systems. Attacks directed at the system kernel are beyond the scope of this study. FALCON employs low-frequency events to build behavior sequences from the behavior provenance graphs. However, this approach may lead to the misclassification of some low-frequency normal events as attack events, such as infrequent policy violations in a system. Although these events are anomalous, they may not compromise system security or information confidentiality.


\end{document}
