% \documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


%%%%%%%%%% Packages used in this paper
\usepackage{Definitions}
\usepackage{wrapfig}
\usepackage{graphicx}
\usepackage{amsfonts}   
\usepackage{float}
\usepackage{bbm}    
\usepackage{amsmath}
\usepackage{empheq}
\usepackage{multirow}
\usepackage{multicol}
\usepackage{makecell}
\usepackage{enumerate}
\usepackage{natbib}
\usepackage{pifont}
\usepackage{colortbl}
\usepackage{titlesec}

%%%%%%%%%% Define color
\definecolor{bestCombine}{RGB}{220,230,255}  % light blue
\definecolor{ours}{RGB}{255,220,220} % light red


\title{Flow-Based Delayed Hawkes Process}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2025 paper}{Chao Yang}{}}
\author[1]{Chao Yang}
\author[1]{Wendi Ren}
\author[1]{Shuang Li\thanks{Corresponding Author: lishuang@cuhk.edu.cn.}}
% \author[1]{Shuang Li$^{\dagger, }$}

% Add affiliations after the authors
\affil[1]{%
    School of Data Science \\
    The Chinese University of Hong Kong (Shenzhen) \\
    China
}

  \begin{document}
\maketitle
% \def\thefootnote{$\dagger$}\footnotetext{Corresponding Author: lishuang@cuhk.edu.cn.}

\begin{abstract}
Multivariate Hawkes processes are classic temporal point process models for event data. These models are simple and parametric in nature, offering interpretability by capturing the triggering effects between event types. However, these parametric models often struggle with low model capacity, limiting their expressive power to capture heterogeneous data patterns influenced by latent variables. In this paper, we propose a simple yet powerful extension: the Flow-based Delayed Hawkes Process, which integrates Normalizing Flows as a generative model to parameterize the Hawkes process. By generating all model parameters through the flow-based network, our approach significantly improves flexibility and expressiveness while preserving interpretability. We provide theoretical guarantees by proving the identifiability of the model parameters and the consistency of the maximum likelihood estimator under mild assumptions. Extensive experiments on both synthetic and real-world datasets show that our model outperforms existing baselines in capturing intricate and heterogeneous event dynamics.
\end{abstract}
\titlespacing*{\section}
{0pt}{1ex plus 1ex minus .2ex}{0.5ex plus .2ex}
\titlespacing*{\subsection}
{0pt}{1ex plus 1ex minus .2ex}{0.5ex plus .2ex}
\titlespacing*{\subsubsection}
{0pt}{1ex plus 1ex minus .2ex}{0.5ex plus .2ex}
\setlist[enumerate]{itemsep=0ex,topsep=0ex}
\setlist[itemize]{itemsep=0ex,topsep=0ex}

\setlength{\abovedisplayskip}{1pt}
\setlength{\abovedisplayshortskip}{0pt}
\setlength{\belowdisplayskip}{1pt}
% \setlength{\belowdisplayshortskip}{0pt}
% \setlength{\jot}{0pt}
\setlength{\floatsep}{1.5ex}
\setlength{\textfloatsep}{1.5ex}
% \setlength{\intextsep}{0.3ex}
% \setlength{\topsep}{0ex}
% \setlength{\partopsep}{0ex}
\setlength{\parskip}{0.5ex}


\section{Introduction}
\label{sec:intro}
Complex systems often produce voluminous event data with {\it stochastic} and {\it irregularly-spaced} occurrence times. Temporal point processes (TPPs) provide an elegant tool for modeling the dynamics of these event sequences in {\it continuous time}, directly treating inter-event times as random variables \citep{daley2003introduction}. Among the various TPPs, the Hawkes process is a classic and transparent model whose intensity functions are designed to capture the {\it triggering effects} of previous events. These intensity functions encode self-exciting and mutually exciting effects across event types, which can be interpreted, under certain conditions, as forming a Granger causality graph that reflects the underlying temporal dependencies~\citep{eichler2017graphical,gao2021causal}. 
% There is a surging interest in understanding the Granger causality, i.e., which event triggers which event, and estimating the causal effect based on the multivariate Hawkes processes. 

\begin{figure}[ht] 
\centering 
\includegraphics[width=0.45\textwidth]{UAI 2025 Camera-Ready/Fig/diff_standard_and_delayed.pdf}
\caption{Illustration of a multivariate Hawkes process without ({\bf top}) and with ({\bf bottom}) delay effects. In the bottom case, the time lag $\delta_{uu'}$ captures the delayed triggering from dimension $u'$ to $u$—that is, an event in $u'$ affects the intensity of $u$ only after a delay of $\delta_{uu'}$, not immediately.}
\label{fig:diff_standard_and_delayed_main_text} 
\end{figure}

Hawkes processes are widely used with exponential kernels due to their simplicity and interpretability. To enhance modeling flexibility, various extensions have introduced alternative triggering kernels—such as nonparametric~\citep{eichler2017graphical} and Gaussian mixture kernels~\citep{xu2016learning}. While these parametric or nonparametric variants improve expressiveness, they often struggle to capture the \emph{heterogeneous triggering patterns} found in real-world data. To address this, \emph{neural-based Hawkes processes} have been proposed~\citep{du2016recurrent,mei2017neural,zuo2020transformer}, offering increased expressiveness through data-driven modeling. Yet, their black-box nature sacrifices interpretability, raising important concerns about the trade-off between flexibility and transparency.

\emph{Can we enhance model capacity while preserving interpretability through a simple, plug-and-play approach?}

To answer this, we propose a novel framework that leverages Normalizing Flows (NFs)~\citep{dinh2016density,papamakarios2021normalizing} to {\it model distributions over Hawkes process parameters}. Our approach achieves both flexibility and interpretability by combining a simple, parametric {\it main model} with expressive, data-driven {\it parameter generation}. Specifically, the main model is a delayed Hawkes process featuring time lags, where all parameters—including the base intensity and those of the exponential triggering kernel—are generated by a NF. To further enhance model capacity and mitigate mode collapse, we introduce an ensemble of multiple NFs, allowing the model to capture a broader range of behaviors.

Concretely, the {\bf main model} adopts an exponential triggering kernel with delay:
\[
g(t) = \alpha \exp(-\beta (t - \delta)) \mathbbm{1}\{t - \delta \geq 0\},
\]
where $\alpha > 0$ controls the strength of the triggering effect, $\beta > 0$ governs how rapidly the effect decays, and $\delta \geq 0$ introduces a delay before the effect begins. Figure~\ref{fig:diff_standard_and_delayed_main_text} illustrates the difference between a standard multivariate Hawkes process and our delayed variant. While traditional models assume immediate triggering, the delay parameter $\delta$ captures the time lag between an event and its influence on future occurrences. This modeling capability is essential in many real-world settings—for instance, during the COVID-19 pandemic, the concept of {\it incubation periods} helps explain why symptoms and infectiousness appear only several days after exposure~\citep{quesada2021incubation, koyama2021estimating}. Similarly, in chronic diseases such as cancer, environmental exposures or genetic mutations may not manifest clinically until years later. By explicitly modeling such delays, we enhance both the realism and expressiveness of the temporal process.

Our framework leverages flow-based generative models to generate Hawkes process parameters, allowing the model to flexibly capture {\it heterogeneous dynamics} driven by unobserved or latent factors. In domains like healthcare, hidden variables—such as patient age, comorbidities, drug resistance, or psychological status—can cause wide variation in how interventions affect outcomes. A fixed parameter model cannot account for this variability. Instead, our method learns a {\it rich, joint distribution} over all the key parameters of the Hawkes process, capturing both their marginal variability and their interdependencies. Using deep NFs, we are able to model complex, multimodal parameter distributions, which enables the system to represent diverse behaviors across individuals or subpopulations. 

In summary, our contributions are threefold:\\
\emph{i)} We propose a simple yet flexible framework that generates Hawkes process parameters using a deep generative model, allowing the capture of heterogeneous triggering patterns, including delay effects that are often overlooked.\\
\emph{ii)} We provide theoretical analysis showing identifiability of the parameter distribution under our model and the consistency of the estimator.\\
\emph{iii)} Empirical results on both synthetic and real-world datasets demonstrate the competitive performance of the model and the ability to handle complex event dynamics.


\section{Related Work}
\label{sec:related_work}
\paragraph{Temporal Point Processes (TPPs)} provide a principled framework for modeling the timing of discrete events in continuous time. Among them, the Hawkes process \citep{hawkes1971spectra, xu2016learning} is one of the most widely used, particularly for inferring inter-type Granger causality \citep{granger1969investigating, dahlhaus2003causality}. The classic Hawkes model assumes that past events independently and additively increase the intensity of future events through a set of pairwise kernel functions. While the exponential kernel is most common, several studies have explored alternative parametric forms to increase modeling flexibility, such as the Gamma kernel~\citep{lesage2022hawkes}, Weibull kernel~\citep{zhang2020survival}, and power-law kernel~\citep{zhang2016modeling}.

More recently, neural-based TPP models have been proposed to improve expressiveness by parameterizing the intensity function directly with deep networks. These approaches include RNN- and LSTM-based models~\citep{du2016recurrent, mei2017neural, xiao2017modeling, mei2020neural} and Transformer-based architectures~\citep{zuo2020transformer, zhang2020self, zhu2021deep, yang2021transformer}. However, {\it these methods primarily focus on directly approximating the intensity function, which limits their ability to recover meaningful insights such as Granger causality or delay effects}.

In contrast, our work focuses on modeling the full distribution over the parameters of a Hawkes process. This {\it distributional view} enables the model to capture heterogeneous triggering patterns, while maintaining interpretability through the use of a simple parametric backbone.

\paragraph{Parameter Estimation for TPPs} has been explored from both frequentist and Bayesian perspectives. Classical approaches such as maximum likelihood estimation (MLE)~\citep{lewis2012self} and the EM algorithm~\citep{lewis2011nonparametric, wheatley2014estimation} {\it typically yield point estimates of model parameters}. Kernel-based and other nonparametric methods~\citep{zhou2013learning, joseph2020shallow, kirchner2017estimation, eichler2017graphical} estimate intensity functions or kernel shapes without assuming specific parametric forms, {\it but generally provide functional or point estimates rather than full parameter distributions}.

Bayesian methods~\citep{zhang2018efficient, santos2023bayesian} aim to infer posterior distributions over parameters, offering uncertainty quantification and limited modeling of heterogeneity. {\it However, these approaches often rely on simplifying assumptions or approximations that restrict their ability to capture complex, multimodal structures}. 

More recently, generative models such as hypernetworks~\citep{dubey2022hyperhawkes, dubey2023time}, variational autoencoders (VAEs)~\citep{mehrasa2019variational}, and NFs~\citep{mehrasa2019point, shchur2019intensity} have been applied to TPPs—{\it primarily for modeling latent dynamics or inter-event time distributions}—{\it rather than directly learning distributions over the underlying process parameters}.


In contrast, our approach explicitly learns flexible joint distributions over key Hawkes process parameters—base intensity, triggering strength, decay rate, and delay—using NFs trained via a differentiable maximum likelihood objective. This enables direct optimization over expressive parameter families, capturing rich, multimodal patterns and better reflecting heterogeneity in real-world temporal dynamics.


\section{Model: Flow-Based Delayed Hawkes Processes}
\label{sec:main_model}

\begin{figure*}
\centering 
\includegraphics[width=1.0\textwidth]{UAI 2025 Camera-Ready/Fig/model_framework_delayed_hawkes_new.pdf}
\caption{Model framework: the normalizing flow ensembles with mixture weights are presented in dashed boxes.}
\label{fig:model_framework}
\end{figure*}
Consider a $U$-dimensional temporal point process with event sequences $\left\{N_u(t)\right\}_{u=1}^U$, where $N_u(t)$ denotes the number of events in dimension $u$ up to time $t$. The corresponding event histories are defined as
$$
\mathcal{H}_t=\left\{t_n^u: 1 \leq n \leq N_u(t), u=1, \ldots, U\right\}.
$$
In our interpretable {\bf main model}, the conditional intensity for dimension $u$ is defined by a Hawkes process with delayed triggering and exponentially decaying kernels:
\begin{align}
&  f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right) =\mu_u+ \nonumber \\
& \sum_{u^{\prime}=1}^U \sum_{n=1}^{N_{u^{\prime}}(t)} \alpha_{u u^{\prime}} e^{-\beta\left(t-t_n^{u^{\prime}}-\delta_{u u^{\prime}}\right)} \mathbbm{1}\{t-t_n^{u^{\prime}}\geq\delta_{u u^{\prime}}\},
\label{eq:intensity_haw}
\end{align}
where $\mu_u \in \mathbb{R}^{+}$ is the base intensity at which events occur spontaneously, $\alpha_{u u^{\prime}} \geq 0$ (for all $u, u^{\prime} \in$ $[U]$) quantifies the strength of the triggering effect from events in dimension $u^{\prime}$ to events in dimension $u$, and $\beta>0$ controls the decay rate of this effect (with $\beta$ being shared across all event types). We further introduce $\delta_{u u^{\prime}} \geq 0$ (for all $u, u^{\prime} \in[U]$) to indicate the delay before the triggering effect becomes effective, such that the triggering kernel is active only when $t-t_n^{u^{\prime}} \geq \delta_{u u^{\prime}}$. Finally, we denote the complete set of parameters by
$$
\boldsymbol{\theta}:=\{\boldsymbol{\mu}, \boldsymbol{\alpha}, \beta, \boldsymbol{\delta}\}
$$
where $\boldsymbol{\mu}:=\left[\mu_u\right], \boldsymbol{\alpha}:=\left[\alpha_{u u^{\prime}}\right]$, and $\boldsymbol{\delta}:=\left[\delta_{u u^{\prime}}\right]$.
 
To capture heterogeneity in the dynamics of the event sequences, we extend the main model by assuming that the parameters are not fixed but are drawn from a learnable distribution $p(\boldsymbol{\theta})$, i.e., $\boldsymbol{\theta} \sim p(\boldsymbol{\theta})$, modeled via a NF.
Accordingly, the expected (or marginal) intensity function becomes
\begin{align}
\lambda_u\left(t \mid \mathcal{H}_t ; p(\boldsymbol{\theta})\right):=\mathbb{E}_{\boldsymbol{\theta} \sim p(\boldsymbol{\theta})}[f_u(t ; \boldsymbol{\theta})]
\label{eq:marginal_intensity}
\end{align}
where we write $f_u(t ; \boldsymbol{\theta}):= f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right)$ for notational simplicity. This formulation marginalizes over a learned parameter distribution rather than relying on a fixed setting. It naturally captures heterogeneity across sequences, as different parameter samples induce different dynamics. Effectively, it acts like a mixture of Hawkes processes in which each parameter sample defines a component, allowing the model to flexibly represent diverse triggering patterns while retaining interpretability.
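To make the mixture interpretation concrete, the expectation in Eq.~(\ref{eq:marginal_intensity}) can be expanded by linearity. The following is a sketch that assumes all expectations are finite; the symbol $\bar{g}_{u u^{\prime}}$ is notation introduced here only for illustration:
\begin{align*}
\lambda_u\left(t \mid \mathcal{H}_t ; p(\boldsymbol{\theta})\right) &= \mathbb{E}_{\boldsymbol{\theta} \sim p(\boldsymbol{\theta})}\left[\mu_u\right] + \sum_{u^{\prime}=1}^U \sum_{n:\, t_n^{u^{\prime}} \leq t} \bar{g}_{u u^{\prime}}\left(t-t_n^{u^{\prime}}\right), \\
\bar{g}_{u u^{\prime}}(s) &:= \mathbb{E}_{\boldsymbol{\theta} \sim p(\boldsymbol{\theta})}\left[\alpha_{u u^{\prime}} e^{-\beta\left(s-\delta_{u u^{\prime}}\right)} \mathbbm{1}\{s \geq \delta_{u u^{\prime}}\}\right].
\end{align*}
Under this reading, the marginal process has an effective kernel $\bar{g}_{u u^{\prime}}$ that is a mixture of delayed exponentials, and it reduces to a single delayed exponential kernel only when $p(\boldsymbol{\theta})$ is a point mass.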
\paragraph{Modeling Joint Dependencies with NFs}
We explicitly model the {\bf joint distribution} of the parameters $\boldsymbol{\theta}$ using a NF that captures their inherent dependencies. The concatenated main-model parameters form a vector $\boldsymbol{\theta} \in \mathbb{R}^d$ with $d=2U^2+U+1$: $U$ base intensities, $U^2$ triggering strengths, $U^2$ delays, and one shared decay rate.

We assume a latent variable $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ and learn an invertible transformation:
$$
\boldsymbol{\theta}=F_{\boldsymbol{\phi}}(\boldsymbol{\epsilon}), \quad \text { where } F_{\boldsymbol{\phi}}: \mathbb{R}^d \rightarrow \mathbb{R}^d.
$$
Here, $F_{\boldsymbol{\phi}}$  is implemented using a flexible flow-based generative model (e.g., RealNVP~\citep{dinh2016density}, Glow~\citep{kingma2018glow}, or Neural Spline Flows~\citep{durkan2019neural}). Specifically, we define the transformation as a composition of $K$ invertible layers:
$$
F_{\boldsymbol{\phi}}=h_K \circ h_{K-1} \circ \cdots \circ h_1
$$
where each $h_k$ is an invertible mapping with a tractable Jacobian determinant. For example, in a RealNVP-style flow, each layer $h_k$ may be defined by an affine coupling layer. In such a layer, the input is split into two parts $\boldsymbol{u}$ and $\boldsymbol{v}$; then one updates the output as
$$
\begin{aligned}
\boldsymbol{u}^{\prime} & = \boldsymbol{u}, \quad
\boldsymbol{v}^{\prime} = \boldsymbol{v} \odot \exp \left(s_k(\boldsymbol{u})\right)+b_k(\boldsymbol{u})
\end{aligned}
$$
where $s_k(\cdot)$ and $b_k(\cdot)$ are neural networks parameterized by $\boldsymbol{\phi}$, and $\odot$ denotes elementwise multiplication. Each $h_k$ is invertible in closed form, and its Jacobian determinant is easy to compute because the Jacobian matrix is triangular.
Using the change-of-variables formula, the target density $p_{\boldsymbol{\phi}}(\boldsymbol{\theta})$ induced by the flow is given by:
$$
p_{\boldsymbol{\phi}}(\boldsymbol{\theta})=p_{\boldsymbol{\epsilon}}\left(F_{\boldsymbol{\phi}}^{-1}(\boldsymbol{\theta})\right) \cdot\left|\operatorname{det}\left(\frac{\partial F_{\boldsymbol{\phi}}^{-1}}{\partial \boldsymbol{\theta}}\right)\right|
$$
where $p_{\boldsymbol{\epsilon}}(\boldsymbol{\epsilon})$ is the density of the base multivariate Gaussian. In practice, we compute the inverse $F_{\boldsymbol{\phi}}^{-1}(\boldsymbol{\theta})$ layer by layer, and accumulate the log-determinants of the Jacobian matrices from each transformation. This construction allows us to evaluate the target density $p_{\boldsymbol{\phi}}(\boldsymbol{\theta})$ efficiently. 
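As a worked illustration for the affine coupling layer above (a sketch; the superscript $(k)$ and the all-ones vector $\mathbf{1}$ are notation introduced here), the inverse and the log-determinant of a single layer are
\begin{align*}
\boldsymbol{v} = \left(\boldsymbol{v}^{\prime}-b_k(\boldsymbol{u}^{\prime})\right) \odot \exp \left(-s_k(\boldsymbol{u}^{\prime})\right), \qquad
\log \left|\operatorname{det} \frac{\partial h_k}{\partial(\boldsymbol{u}, \boldsymbol{v})}\right| = \mathbf{1}^{\top} s_k(\boldsymbol{u}).
\end{align*}
Composing the $K$ layers and taking the logarithm of the change-of-variables formula then gives
\begin{align*}
\log p_{\boldsymbol{\phi}}(\boldsymbol{\theta}) = \log p_{\boldsymbol{\epsilon}}\left(F_{\boldsymbol{\phi}}^{-1}(\boldsymbol{\theta})\right) - \sum_{k=1}^K \mathbf{1}^{\top} s_k\left(\boldsymbol{u}^{(k)}\right),
\end{align*}
where $\boldsymbol{u}^{(k)}$ denotes the pass-through block of layer $k$, which is recovered during the layer-by-layer inversion. This expression covers the coupling layers only and does not account for any additional transformation applied to the output of the flow.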

Using a single NF can sometimes lead to {\it mode collapse}.  This is particularly problematic when modeling the joint distribution of the parameters in our Hawkes process, as the parameters (such as $\boldsymbol{\mu}$, $\boldsymbol{\alpha}$, and $\boldsymbol{\delta}$) often exhibit complex, multimodal dependencies reflecting heterogeneous triggering behaviors. We propose to use a mixture of NFs (depicted in Figure \ref{fig:model_framework}) to address this challenge by combining several component flows, each of which can specialize in capturing different modes of the distribution. Concretely, the target density of the Hawkes parameters is represented as
$$
p(\boldsymbol{\theta})=\sum_{m=1}^M \pi_m p_m(\boldsymbol{\theta})
$$
where each component $p_m(\boldsymbol{\theta})$ is modeled by its own NF, with parameters denoted as $\boldsymbol{\phi}_m$, and $\pi_m$ are the mixture weights (summing to 1). When computing the marginal intensity, the mixture formulation results in a weighted sum of expectations from each component (due to the linearity of expectation):
\begin{align}
\lambda_u\left(t \mid \mathcal{H}_t ; p(\boldsymbol{\theta}) \right)=\sum_{m=1}^M \pi_m \mathbb{E}_{\boldsymbol{\theta} \sim p_m(\boldsymbol{\theta})}\left[f_u(t ; \boldsymbol{\theta})\right]
\label{eq:intensity_mixture_NFs}
\end{align}
ensuring that all modes contribute proportionally to the final intensity function according to their mixture weights. 

Mixture models have also been explored in other generative models such as GANs, where multiple adversarial networks are used to mitigate mode collapse and improve diversity~\citep{hoang2018mgan, nguyen2017dual, durugkar2016generative, mordido2020microbatchgan}. Similarly, \citet{berry2023normalizing} extended this idea to normalizing flows. Building on this, we incorporate a mixture of NFs to improve mode coverage in our flow-based delayed Hawkes process. 

\section{Model Learning} 
The overall framework is shown in Figure~\ref{fig:model_framework}: the model first computes the marginal intensity function by averaging over parameters sampled from the flow-based generator, and then uses it to evaluate the log-likelihood of the observed event sequences. The learnable parameters are now the flow parameters $\boldsymbol{\phi}=[\boldsymbol{\phi}_m]$ and the mixture weights $\boldsymbol{\pi} = [\pi_m]\in \Delta^{M}$ (the probability simplex). We learn $\boldsymbol{\phi}$ and $\boldsymbol{\pi}$ by maximizing the log-likelihood of the observed event sequences:
\begin{align}
\max_{\boldsymbol{\phi}, \boldsymbol{\pi} \in \Delta^{M}} \mathcal{L}\left(p_{\boldsymbol{\phi}, \boldsymbol{\pi}} (\boldsymbol{\theta})\right)
\end{align}
where the log-likelihood $ \mathcal{L} \left(p_{\boldsymbol{\phi}, \boldsymbol{\pi}} \right)$ is computed as
\begin{align}
\sum_{u}\left[\sum_{n=1}^{N_u(T)} \log \lambda^*_u\left(t_n^u\right)-\int_0^T \lambda^*_u\left(t \right) d t\right]
\label{eq:log-likelihood}
\end{align}
and $ \lambda^*_u\left(t \right) := \lambda_u\left(t \mid \mathcal{H}_{t} ; p_{\boldsymbol{\phi}, \boldsymbol{\pi}}(\boldsymbol{\theta})\right)$ is the marginal intensity as defined in Eq.~(\ref{eq:intensity_mixture_NFs}), with $T$ the time horizon.
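As a sketch of how the integral term can be evaluated (one option, assuming the exact indicator of Eq.~(\ref{eq:intensity_haw}) rather than a smoothed version), each delayed exponential integrates in closed form for a fixed $\boldsymbol{\theta}$:
\begin{align*}
\int_0^T f_u(t ; \boldsymbol{\theta})\, d t = \mu_u T + \sum_{u^{\prime}=1}^U \sum_{n:\, t_n^{u^{\prime}}+\delta_{u u^{\prime}} \leq T} \frac{\alpha_{u u^{\prime}}}{\beta}\left(1-e^{-\beta\left(T-t_n^{u^{\prime}}-\delta_{u u^{\prime}}\right)}\right).
\end{align*}
Because the integrand is nonnegative, the expectation and the integral can be exchanged, so that $\int_0^T \lambda^*_u(t)\, d t = \sum_{m=1}^M \pi_m \mathbb{E}_{\boldsymbol{\theta} \sim p_m(\boldsymbol{\theta})}\left[\int_0^T f_u(t ; \boldsymbol{\theta})\, d t\right]$. In principle, the same parameter samples could therefore be reused for both terms of Eq.~(\ref{eq:log-likelihood}).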

To approximate the marginal intensity, we first draw samples from each NF component. For each component $m$, we generate samples
$$
\boldsymbol{\theta}^{(s)} \sim p_m(\boldsymbol{\theta})
$$
using the reparameterization $\boldsymbol{\theta}^{(s)}=F_{\boldsymbol{\phi}_m}(\boldsymbol{\epsilon}^{(s)})$ with $\boldsymbol{\epsilon}^{(s)} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_d\right)$, where the samples are drawn separately for each component. Since the model parameters $\boldsymbol{\mu}, \boldsymbol{\alpha}, \beta$, and $\boldsymbol{\delta}$ are constrained to be nonnegative, we apply a softplus activation to the output of the last layer of the NF so that the generated $\boldsymbol{\theta}$ always satisfies this constraint. The expectation is then approximated via Monte Carlo, yielding the marginal intensity:
\begin{align}
\lambda_u\left(t \mid \mathcal{H}_t ; p(\boldsymbol{\theta})\right) \approx \sum_{m=1}^{M} \pi_m\left[\frac{1}{S} \sum_{s=1}^S f_u\left(t ; \boldsymbol{\theta}^{(s)}\right)\right].
\label{eq:finiteM}
\end{align}
\paragraph{Gradient Computation with Respect to $\boldsymbol{\phi}_m$} To compute the gradient of the log-likelihood with respect to the parameters $\boldsymbol{\phi}_m$ of the $m$-th NF component, we apply the reparameterization trick. For each sample $\boldsymbol{\theta}^{(s)}=F_{\boldsymbol{\phi}_m}\left(\boldsymbol{\epsilon}^{(s)}\right)$, where $\boldsymbol{\epsilon}^{(s)}\sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_d\right)$, we approximate the gradient as:
$$
\nabla_{\boldsymbol{\phi}_m} \mathbb{E}_{\boldsymbol{\theta} \sim p_m(\boldsymbol{\theta})}\left[f_u(t ; \boldsymbol{\theta})\right] \approx \frac{1}{S} \sum_{s=1}^S \nabla_{\boldsymbol{\phi}_m} f_u\left(t ; F_{\boldsymbol{\phi}_m}\left(\boldsymbol{\epsilon}^{(s)}\right)\right).
$$
This enables efficient end-to-end training by backpropagating through the NF generator using automatic differentiation.
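To connect this estimator to the log-likelihood in Eq.~(\ref{eq:log-likelihood}), the chain rule can be applied to each log-intensity term. As a sketch, write $\hat{\lambda}_u(t)$ (notation introduced here) for the Monte Carlo estimate in Eq.~(\ref{eq:finiteM}) and note that only the $m$-th mixture term depends on $\boldsymbol{\phi}_m$:
\begin{align*}
\nabla_{\boldsymbol{\phi}_m} \log \lambda^*_u(t) \approx \frac{\pi_m}{\hat{\lambda}_u(t)} \cdot \frac{1}{S} \sum_{s=1}^S \nabla_{\boldsymbol{\phi}_m} f_u\left(t ; F_{\boldsymbol{\phi}_m}\left(\boldsymbol{\epsilon}^{(s)}\right)\right).
\end{align*}
Automatic differentiation carries out this computation in practice; the expression is meant only to show how the mixture weight and the current intensity scale the gradient of each component.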
\paragraph{Handling the Non-differentiability of $\boldsymbol{\delta}$}
The intensity function is inherently non-differentiable with respect to the delay parameter $\boldsymbol{\delta}$ due to the indicator function in the triggering kernel (see Eq.~(\ref{eq:intensity_haw})). To obtain gradient estimates for $\delta_{uu'}$, we approximate the indicator function with a smooth sigmoid function $\sigma(z)=\frac{1}{1+\exp (-\tau z)}$, where $\tau>0$ is a temperature parameter that controls the steepness of the sigmoid (a larger $\tau$ gives a sharper transition and a closer match to the indicator). In other words, we approximate
\begin{align}
\mathbbm{1}\left\{t-t_n^{u^{\prime}} \geq \delta_{u u^{\prime}}\right\} \approx \sigma\left(t-t_n^{u^{\prime}}-\delta_{u u^{\prime}}\right).
\label{eq:smooth}
\end{align}
This approximation makes the intensity function differentiable with respect to the delay parameters $\delta_{u u^{\prime}}$, enabling gradient-based optimization. Details can be found in Appendix \ref{appendix_subsec:dynamic_sigmoid_mask}.
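As a worked illustration of the resulting gradient (a sketch under the approximation in Eq.~(\ref{eq:smooth}); the shorthand $z$ and $k$ is introduced here), let $z=t-t_n^{u^{\prime}}-\delta_{u u^{\prime}}$ and let $k(z)=\alpha_{u u^{\prime}} e^{-\beta z} \sigma(z)$ be the smoothed contribution of a single event. Using $\sigma^{\prime}(z)=\tau \sigma(z)(1-\sigma(z))$ and $\partial z / \partial \delta_{u u^{\prime}}=-1$,
\begin{align*}
\frac{\partial k}{\partial \delta_{u u^{\prime}}} = \alpha_{u u^{\prime}} e^{-\beta z} \sigma(z)\left[\beta-\tau\left(1-\sigma(z)\right)\right].
\end{align*}
For $z \rightarrow-\infty$ the smoothed contribution behaves like $\alpha_{u u^{\prime}} e^{(\tau-\beta) z}$, so with the smoothing exactly as written, the contribution well before the delay vanishes only if $\tau>\beta$.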

\paragraph{Gradient Computation with Respect to $\boldsymbol{\pi}$}
For the mixture weights $\boldsymbol{\pi} \in \Delta^M$, we eliminate the probability simplex constraint by reparameterizing them via a softmax function. Define $\pi_m=\frac{\exp \left(w_m\right)}{\sum_{j=1}^M \exp \left(w_j\right)}$, where $w_m \in \mathbb{R}$ are unconstrained parameters. This reparameterization allows us to compute gradients with respect to $w_m$ (and hence $\boldsymbol{\pi}$) via standard backpropagation through the softmax, simplifying optimization over the simplex.
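Concretely, the softmax Jacobian is $\partial \pi_m / \partial w_j=\pi_m\left(\mathbbm{1}\{m=j\}-\pi_j\right)$. As a sketch, substituting this into Eq.~(\ref{eq:intensity_mixture_NFs}) gives
\begin{align*}
\frac{\partial}{\partial w_j} \lambda_u\left(t \mid \mathcal{H}_t ; p(\boldsymbol{\theta})\right) = \pi_j\left(\mathbb{E}_{\boldsymbol{\theta} \sim p_j(\boldsymbol{\theta})}\left[f_u(t ; \boldsymbol{\theta})\right]-\lambda_u\left(t \mid \mathcal{H}_t ; p(\boldsymbol{\theta})\right)\right),
\end{align*}
so increasing $w_j$ raises the intensity at $t$ exactly when the expected intensity of component $j$ exceeds the current mixture average.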

{\it Discussion: Frequentist vs. Bayesian Approach?} While our method models distributions over Hawkes process parameters, it follows a {\bf frequentist approach}, not a Bayesian one. Instead of specifying priors and performing posterior inference, we {\bf learn the parameter distributions directly} by optimizing the log-likelihood of observed event sequences. Specifically, we first compute the marginal intensity function by averaging the Hawkes intensity over sampled parameters from a flow-based generator, and then use this marginal intensity to evaluate the log-likelihood. This formulation enables us to flexibly capture heterogeneity in temporal dynamics without relying on approximate Bayesian inference or prior assumptions.

\section{Theoretical Analysis}
We begin by establishing the identifiability of the fixed parameter \( \boldsymbol{\theta} \) in the delayed Hawkes process (Theorem~\ref{them_1}) and extend this to show that the distribution over parameters \( p(\boldsymbol{\theta}) \) is also identifiable (Theorem~\ref{them_2}). Unlike traditional approaches that assume fixed parameters, our method learns a distribution over parameters to capture population-level heterogeneity. Therefore, establishing the identifiability of \( p(\boldsymbol{\theta}) \) is critical to ensuring meaningful and interpretable inferences. Building on this, we prove the consistency of the maximum likelihood estimator (MLE) for \( p(\boldsymbol{\theta}) \) under standard regularity conditions (Theorem~\ref{them_3}). Together, these results form the theoretical foundation of our delayed Hawkes process framework—demonstrating why it is identifiable and statistically reliable in practice.

% We first establish the identifiability of fixed parameter $\boldsymbol{\theta}$ and extend this result to distribution $p(\boldsymbol{\theta})$. Then we analyze the consistency of the maximum likelihood estimator for $p(\boldsymbol{\theta})$ and the universal approximation error. The following theorems collectively justify \textbf{why (identifiability, Theorem~\ref{them_1} and Theorem~\ref{them_2})}, \textbf{under what conditions (consistency, Theorem~\ref{them_3})}, and \textbf{how (approximation, Theorem~\ref{them_4})} our framework works, ensuring theoretical rigor that guides practical design.

\subsection{Identifiability} 
We first establish the identifiability of the fixed parameters $\boldsymbol{\theta}$ in the delayed Hawkes process. That is, the model parameters can be uniquely recovered from the conditional intensity functions under mild and practically reasonable assumptions.

\begin{theorem}[Identifiability of fixed $\boldsymbol{\theta}$]
Let \(\mathcal{H}_t\) be a realization of the delayed multivariate Hawkes process as defined in Eq.~\eqref{eq:intensity_haw}. Suppose the conditional intensity functions satisfy
\begin{align}
f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) = f_u(t \mid \mathcal{H}_t; \boldsymbol{\tilde{\theta}}), \quad \forall u \in[U]
\label{eq:condition_1}
\end{align}
almost everywhere. Then, under the conditions listed below, it follows that $\boldsymbol{\theta} = \boldsymbol{\tilde{\theta}}$.
\label{them_1}
\end{theorem}

\begin{proof}
The proof proceeds in four steps:

(1) \textbf{Baseline rate $\mu_u$}: Integrating both sides of Eq.~\eqref{eq:condition_1} over $[0, t_{(1)}]$, where $t_{(1)}$ is the first event time in $\mathcal{H}_t$, and using $t_{(1)} > 0$ almost surely, we obtain $\mu_u = \tilde{\mu}_u$.

(2) \textbf{Delays $\delta_{uu'}$}: Due to the exponential triggering kernel, each past event contributes a peak to the intensity at $t = t_n^{u'} + \delta_{uu'}$. If $\delta_{uu'} \ne \tilde{\delta}_{uu'}$, the peak locations differ, violating Eq.~\eqref{eq:condition_1}.

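(3) \textbf{Decay rate $\beta$}: Between consecutive jump times of the intensity, Eq.~\eqref{eq:intensity_haw} gives $\frac{d}{dt} f_u(t) = -\beta (f_u(t) - \mu_u)$, since $\beta$ is shared across all kernels. Differentiating both sides of Eq.~\eqref{eq:condition_1} and using $\mu_u = \tilde{\mu}_u$ from step (1) therefore yields $\beta (f_u(t) - \mu_u) = \tilde{\beta} (f_u(t) - \mu_u)$. Since $f_u(t) - \mu_u > 0$ with nonzero probability, we conclude $\beta = \tilde{\beta}$.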

(4) \textbf{Triggering strengths $\alpha_{uu'}$}: With known $\mu_u$, $\delta_{uu'}$, and $\beta$, the equality of $f_u(t)$ implies $\alpha_{uu'} = \tilde{\alpha}_{uu'}$.
\end{proof}
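One way to make step (4) explicit, given here as a sketch that relies on event times being distinct and continuously distributed (so that activation times almost surely do not coincide), is to read off the jump of the intensity at an activation time $t^{\star} = t_n^{u'} + \delta_{uu'}$ (notation introduced here):
\begin{align*}
\lim_{t \downarrow t^{\star}} f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) - \lim_{t \uparrow t^{\star}} f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) = \alpha_{uu'}.
\end{align*}
The term belonging to the event $t_n^{u'}$ switches from $0$ to $\alpha_{uu'} e^{0} = \alpha_{uu'}$ at $t^{\star}$, while all other terms are continuous there. Since the two intensities agree almost everywhere, their one-sided limits agree as well, which forces $\alpha_{uu'} = \tilde{\alpha}_{uu'}$.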

\paragraph{Mild Conditions for Identifiability.}
The theorem holds under the following assumptions, which are easily satisfied in practice:

\begin{itemize}
    \item \textbf{Model assumptions:}
    \begin{itemize}
        \item[(i)] $\beta$ is shared across all $(u, u')$.
        \item[(ii)] Each $u$ has at least one $u'$ such that $\alpha_{uu'} > 0$.
        \item[(iii)] Delays $\delta_{uu'}$ are fixed and non-negative.
        \item[(iv)] $\mu_u > 0$ for all $u$.
    \end{itemize}
    \item \textbf{Data assumptions:}
    \begin{itemize}
        \item[(i)] Event times are continuous and distinct.
        \item[(ii)] For every $(u, u')$ with $\alpha_{uu'} > 0$, at least one event in $u'$ triggers an event in $u$ after delay $\delta_{uu'}$.
        \item[(iii)] $t_{(1)} > 0$ almost surely.
        \item[(iv)] Observation window $[0, T]$ is long enough to observe delayed interactions.
    \end{itemize}
\end{itemize}

These conditions ensure identifiability while being mild and verifiable in real-world applications. Violations (e.g., simultaneous events, zero delays, or degenerate parameters) may lead to non-identifiability.


In our method, rather than estimating a fixed parameter \( \boldsymbol{\theta} \), we aim to learn a distribution \( p(\boldsymbol{\theta}) \) to capture heterogeneous triggering patterns across event sequences. Therefore, establishing the identifiability of \( p(\boldsymbol{\theta}) \) is critical to ensure that our model learns a meaningful and unique distribution consistent with observed data.
% \subsection{Identifiability} 
% \begin{theorem}[Identifiability of fixed $\boldsymbol{\theta}$]
% Let \(\mathcal{H}_t\) be a realization of the multivariate delayed Hawkes process characterized by Eq.~\eqref{eq:intensity_haw}. Under mild conditions, if 
% \begin{align}
% f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) = f_u(t \mid \mathcal{H}_t; \boldsymbol{\tilde{\theta}}), \forall u \in[U]
% \label{eq:condition_1}
% \end{align}
% almost everywhere, then \( \boldsymbol{\theta} = \boldsymbol{\tilde{\theta}}\).
% \label{them_1}
% \end{theorem}

% \begin{proof} We will proceed the proof by showing that $\mu_u =\tilde{\mu}_u$, $\delta_{uu'} = \tilde{\delta}_{uu'}$, $\beta= \tilde{\beta}$, and $\alpha_{uu'}= \tilde{\alpha}_{uu'}$ in order. The following proof is for any $u$.

% Let $t_{(1)}$ be the first event time in $\mathcal{H}_t$. Since condition (\ref{eq:condition_1}) holds, integrating both sides over $\left[0, t_{(1)}\right]$ yields $\left(\mu_u-\tilde{\mu}_u\right) t_{(1)}=0$. Since $t_{(1)}>0$ almost surely, we conclude $\mu_u=\tilde{\mu}_u$, $\forall u\in [U]$.

% Because the arrival times of events are continuously distributed, for almost every event $t_n^{u^{\prime}}$ (the $n$-th event of process $u'$), the contribution of that event to the intensity will jump to peak at the moment $t=t_n^{u^{\prime}}+\delta_{u u^{\prime}}$ and strictly decrease after the peak. If $\delta_{u u^{\prime}} \neq \tilde{\delta}_{u u^{\prime}}$, the peaks of the intensities under different parameters will be misaligned, leading to a contradiction. A key property of point processes is that all arrival times such as $t_n^{u'}$ are distinct. This distinctness plays a crucial role in our proof: if two events shared the same arrival time, it would be ambiguous whether an event was triggered by one history or another. 

% Under condition (\ref{eq:condition_1}), differentiating the intensity function with respect to $t$ yields $\beta (f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta})- \mu_u )=\tilde{\beta}  (f_u(t \mid \mathcal{H}_t; \boldsymbol{\tilde{\theta}})- \mu_u ) $. Since $f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta})- \mu_u>0$, we conclude $\beta = \tilde{\beta}$. Substituting into $f_u(t)$ and integrating further ensures $\alpha_{uu'}=\tilde{\alpha}_{uu'}$. Thus, \( \boldsymbol{\theta} = \boldsymbol{\tilde{\theta}}\), completing the proof.
% \end{proof}

\begin{theorem}[Identifiability of \( p(\boldsymbol{\theta}) \)]  
Let \( \Theta \subset \mathbb{R}^d \) be the parameter space, and let \( f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) \) denote the conditional intensity function. Suppose the following conditions hold:  
\begin{itemize}
    \item[(i)] The mapping \( \boldsymbol{\theta} \mapsto f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) \) is injective for almost every \( t \in \mathbb{R}^+ \).  
    \item[(ii)] The function class \( \mathcal{F} = \{ f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta \} \) is complete, meaning that if a measurable function \( g: \Theta \to \mathbb{R} \) satisfies  
    \[
    \int_{\Theta} f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) g(\boldsymbol{\theta}) d \boldsymbol{\theta} = 0 \quad \text{for all } t,
    \]  
    then \( g(\boldsymbol{\theta}) = 0 \) almost everywhere on \( \Theta \).  
\end{itemize}

Then, if two distributions \( p(\boldsymbol{\theta}) \) and \( q(\boldsymbol{\theta}) \) induce the same marginal intensities (as defined in Eq.~(\ref{eq:marginal_intensity})):
\[
\lambda_u\left(t \mid \mathcal{H}_t ; p(\boldsymbol{\theta}) \right) = \lambda_u\left(t \mid \mathcal{H}_t ; q(\boldsymbol{\theta}) \right), \quad \forall u \in [U],
\]
for almost every \( t \), it follows that \( p(\boldsymbol{\theta}) = q(\boldsymbol{\theta}) \) almost everywhere on \( \Theta \).  
\label{them_2}  
\end{theorem}  
We have already established the injectivity of the mapping \( \boldsymbol{\theta} \mapsto f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) \) in Theorem~\ref{them_1}. For the smoothed intensity function used in our model (Eq.~(\ref{eq:smooth})), we also prove the completeness of the function class \( \mathcal{F} \) (Appendix~\ref{appendix:proof_completeness}); the proof of Theorem~\ref{them_2} itself is given in Appendix~\ref{appendix:proof_them_2}. Together, these results verify the two conditions required by Theorem~\ref{them_2}, thereby ensuring the identifiability of \( p(\boldsymbol{\theta}) \) in our delayed Hawkes framework.
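
To see how completeness is used, the core of the argument can be sketched in one line. The sketch assumes that the marginal intensity in Eq.~(\ref{eq:marginal_intensity}) is the mixture $\lambda_u(t \mid \mathcal{H}_t; p(\boldsymbol{\theta})) = \int_{\Theta} f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}$, and writes $g = p - q$ (notation introduced only for this sketch):
\begin{equation*}
0 = \lambda_u\left(t \mid \mathcal{H}_t; p(\boldsymbol{\theta})\right) - \lambda_u\left(t \mid \mathcal{H}_t; q(\boldsymbol{\theta})\right) = \int_{\Theta} f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta})\, g(\boldsymbol{\theta})\, d\boldsymbol{\theta}
\quad \Longrightarrow \quad g(\boldsymbol{\theta}) = 0 \ \text{almost everywhere},
\end{equation*}
where the implication is the completeness condition (ii). Technical points, such as passing from almost every $t$ to all $t$, are not addressed in this sketch.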





\subsection{Consistency}
In this paper, we learn the parameter distribution \(p(\boldsymbol{\theta})\) by maximizing the log-likelihood \(\mathcal{L}(p_{\boldsymbol{\phi}, \boldsymbol{\pi}})\) defined in Eq.~(\ref{eq:log-likelihood}). Therefore, establishing consistency of the MLE \(\hat{p}(\boldsymbol{\theta})\) is critical to guarantee that as the observation window \(T\) grows, our learned distribution converges to the true underlying distribution \(p^*(\boldsymbol{\theta})\). 
\begin{theorem}[Consistency of MLE \(\hat{p}(\boldsymbol{\theta})\)]  
Assume the true parameter distribution is \(p^*(\boldsymbol{\theta})\) and the Hawkes process model is correctly specified. Suppose the following conditions hold:  
\begin{itemize}  
\item The parameter space for \(p(\boldsymbol{\theta})\) is compact (or satisfies appropriate regularity conditions).  
\item The mapping \(\boldsymbol{\theta} \mapsto f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right)\) is injective for almost every \(t\) (by Theorem~\ref{them_1}).  
\item The function class \(\mathcal{F} = \left\{ f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right) : \boldsymbol{\theta} \in \Theta \right\}\) is complete, ensuring identifiability of \(p(\boldsymbol{\theta})\) (by Theorem~\ref{them_2}).  
\item The empirical log-likelihood converges uniformly to its expectation as \(T \to \infty\) (via standard point process law of large numbers arguments).  
\end{itemize}  

 

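Then the MLE \(\hat{p}(\boldsymbol{\theta})\), obtained by maximizing \(\mathcal{L}(p_{\boldsymbol{\phi}, \boldsymbol{\pi}})\) in Eq.~(\ref{eq:log-likelihood}), satisfies  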
\[
\hat{p}(\boldsymbol{\theta}) \xrightarrow{p} p^*(\boldsymbol{\theta}) \quad \text{as} \quad T \to \infty.
\]  

That is, the MLE is consistent.  
\label{them_3}  
\end{theorem}  

Theorem~\ref{them_3} ensures that with sufficient data (\(T \to \infty\)), the MLE \(\hat{p}(\boldsymbol{\theta})\) converges in probability to the true distribution \(p^*(\boldsymbol{\theta})\). A proof sketch is provided in Appendix~\ref{appendix:proof_theorem_MLE}.
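
The standard argument behind results of this type can be outlined as follows; this is a generic sketch under the stated conditions and not a substitute for the appendix proof. Write $\ell_T(p) = \mathcal{L}(p)/T$ for the normalized log-likelihood and $\ell(p)$ for its limit (both symbols are introduced here for illustration only), and assume that $\ell$ is maximized at $p^*$ under correct specification. Since $\hat{p}$ maximizes $\ell_T$, we have $\ell_T(\hat{p}) \ge \ell_T(p^*)$, and therefore
\begin{align*}
0 \le \ell(p^*) - \ell(\hat{p}) &= \bigl[\ell(p^*) - \ell_T(p^*)\bigr] + \bigl[\ell_T(p^*) - \ell_T(\hat{p})\bigr] + \bigl[\ell_T(\hat{p}) - \ell(\hat{p})\bigr] \\
&\le 2 \sup_{p} \bigl|\ell_T(p) - \ell(p)\bigr| \xrightarrow{p} 0 \quad \text{as} \quad T \to \infty.
\end{align*}
If $\ell$ is continuous and, by identifiability, uniquely maximized at $p^*$ over a compact parameter space, then $\ell(\hat{p}) \to \ell(p^*)$ forces $\hat{p} \to p^*$ in probability.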




% \subsection{Consistency}
% \begin{theorem}[Consistency of MLE $\hat{p}(\boldsymbol{\theta})$ ] Assume that the true parameter distribution is $p^*(\boldsymbol{\theta})$ and that the Hawkes process model is correctly specified. Suppose that the following conditions hold. 
% \begin{itemize}
% \item The parameter space for $p(\boldsymbol{\theta})$ is compact (or appropriate regularity conditions are met). 
% \item The mapping $\boldsymbol{\theta} \mapsto f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right)$ is injective for almost every $t$ (by Theorem~\ref{them_1}). 
% \item The function class $\mathcal{F}=\left\{f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right): \boldsymbol{\theta} \in \Theta\right\}$ is complete, ensuring the identifiability of $p(\boldsymbol{\theta})$ (by Theorem \ref{them_2}). 
% \item The empirical log likelihood converges uniformly to its expectation as $T \rightarrow \infty$ (via standard point process law-of-large-numbers arguments). 
% \end{itemize} 
% Then the MLE $ \hat {p}(\boldsymbol{\theta }) $ obtained by maximizing $\mathcal{L} \left(p_{\boldsymbol{\phi}, \boldsymbol{\pi}} \right)$ (defined in Eq.~(\ref{eq:log-likelihood})) satisfies
% $$
% \hat{p}(\boldsymbol{\theta}) \xrightarrow{p} p^*(\boldsymbol{\theta}) \quad \text {as}\quad T \rightarrow \infty .
% $$
% That is, the MLE is consistent. 
% \label{them_3} 
% \end{theorem}
% Theorem~\ref{them_3} ensures that with sufficient data ($T \rightarrow \infty$), the MLE $\hat{p}(\boldsymbol{\theta})$ converges to $p^{*}(\boldsymbol{\theta})$. The proof sketch can be found in Appendix \ref{appendix:proof_theorem_MLE}.
% \subsection{Universal Approximation}
% % \subsection{Mixture Universality for Hawkes Processes}
% A natural question we want to answer is, \emph{what benefits does this mixture of NFs bring for Hawkes process modeling}? The main conclusion is that although a fixed $\boldsymbol{\theta}$ (i.e., simple delayed Hawkes process model) is low in capacity, using the mixture of NFs as the distribution of $p(\boldsymbol{\theta})$  will {\it significantly} increase the model's expressive power. In particular, we show that the mixture of NFs can approximate the true marginal intensity function arbitrarily well, which is crucial when the underlying $p(\boldsymbol{\theta})$ is multimodal and heterogeneous.

% % Below is a theorem with a proof sketch that demonstrates the benefits of using a mixture of normalizing flows (NFs) to increase the expressive power of the Hawkes process model. % via a flexible prior. 

% \begin{theorem}[Universal Approximation]
% Let the true marginal intensity function for the $u$-th process be
% $
% \lambda_u^*\left(t \mid \mathcal{H}_t\right)=\int_{\Theta} f_u\left(t \mid \mathcal{H}_t ; \theta\right) p^*(\boldsymbol{\theta}) d \boldsymbol{\theta}
% $
% where $p^*(\boldsymbol{\theta})$ is the true latent parameter distribution over a compact parameter space $\Theta \subset \mathbb{R}^d$ and the smooth intensity function 
% $
% f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right)
% $ (as defined in Eq.~(\ref{eq:intensity_haw}) and (\ref{eq:smooth})) is continuous in $\boldsymbol{\theta}$. Then, for any $\epsilon>0$ and any compact time interval $[0, T]$, there exists a mixture of normalizing flows
% $
% \hat{p}(\theta)=\sum_{m=1}^M \pi_m p_m(\theta)
% $
% such that the corresponding marginal intensity
% $
% \lambda_u\left(t \mid \mathcal{H}_t ; \hat{p}\right)=\sum_{m=1}^M \pi_m \mathbb{E}_{\boldsymbol{\theta} \sim p_m(\boldsymbol{\theta})}\left[f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta} \right)\right]
% $
% satisfies
% $$
% \sup _{t \in[0, T]}\left|\lambda_u\left(t \mid \mathcal{H}_t ; \hat{p}\right)-\lambda_u^*\left(t \mid \mathcal{H}_t\right)\right|<\epsilon.
% $$
% \label{them_4}
% \end{theorem}
% Theorem~\ref{them_4} implies the general ability of our proposed ensemble of NFs to approximate arbitrarily marginal intensity functions (for any dimension) of one Hawkes process. With enough mixture components, our model can be considered ``correctly specified'' for practical purposes, since approximation error can be made arbitrarily small. The proof sketch can be found in Appendix \ref{appendix:proof_universal}.
\section{Experiment}
\label{sec:experiment}
% To validate the performance and practical significance of our method, we initially assess on synthetic datasets with known time lag and other parameter distributions. Subsequently, we analyze two healthcare datasets to shed light on the often-overlooked delayed effects of medication strategies and disease prevention policies.

\subsection{Experimental Setup}
\paragraph{Baselines} We select several state-of-the-art baselines grouped by their evaluation focus:

\emph{(i) Parameter Distribution Learning Tasks}: This task aims to accurately learn the underlying distribution of model parameters \( p(\boldsymbol{\theta}) \). Hypernet~\citep{ha2016hypernetworks, chauhan2023brief} approaches this by training a hypernetwork to produce samples from the parameter distribution. The \(\beta\)-VAE~\citep{higgins2017beta} frames this as a Bayesian inference problem, inferring a posterior over parameters \(\boldsymbol{\theta}\) given a prior and data. Both aim to capture uncertainty and variability in parameters beyond point estimates.

\emph{(ii) Comparison of Different Flow Models}: This group evaluates various NF architectures for flexible and expressive parameter distribution modeling, including Planar flows~\citep{rezende2015variational}, RealNVP~\citep{dinh2016density}, Glow~\citep{kingma2018glow}, RQ-NSF (Rational-Quadratic Neural Spline Flow)~\citep{durkan2019neural}, and ResFlow (Residual Flow)~\citep{chen2019residual}.

\emph{(iii) Prediction Tasks}: Here, we compare established TPP models for event prediction. Non-parametric baselines include GM-NLF~\citep{eichler2017graphical}, MMEL~\citep{zhou2013learning}, and Gibbs-Hawkes~\citep{zhang2018efficient}. Parametric baselines include RMTPP~\citep{du2016recurrent}, THP~\citep{zuo2020transformer}, PromptTPP~\citep{xue2023prompt}, HYPRO~\citep{xue2022hypro}, MLE-SGL~\citep{xu2016learning}, and GC-CGD~\citep{wei2022granger}. For PromptTPP and HYPRO, we use AttNHP~\citep{yang2021transformer} as the base model.


% \paragraph{Baselines} We choose several state-of-the-art baselines for three different evaluation scenarios: \emph{(i) Parameter Distribution Learning Tasks}: Hypernet~\citep{ha2016hypernetworks, chauhan2023brief} and $\beta$-VAE~\citep{higgins2017beta}. We utilize samples from Hypernets and the latent representation of $\beta$-VAE to estimate the distributions. % Flow VRAE (Flow-based Variational Sequence Autoencoder)~\cite{chien2022flow}.
% \emph{(ii) Comparsion of Different Flow Models}: Planer~\citep{rezende2015variational}, RealNVP~\citep{dinh2016density}, Glow~\citep{kingma2018glow}, RQ-NSF (Rational-Quadratic Neural Spline Flow)~\citep{durkan2019neural}, and ResFlow (Residual Flow)~\citep{chen2019residual}.
% \emph{(iii) Prediction Tasks}: We explore a range of models, including non-parametric ones like GM-NLF~\citep{eichler2017graphical}, MMEL~\citep{zhou2013learning} and Gibbs-Hawkes~\citep{zhang2018efficient}, and parametric ones such as RMTPP~\citep{du2016recurrent}, THP~\citep{zuo2020transformer}, PromptTPP~\citep{xue2023prompt},  HYPRO~\citep{xue2022hypro}, MLE-SGL~\citep{xu2016learning}, and GC-CGD~\citep{wei2022granger}. For PromptTPP and HYPRO, we choose AttNHP~\citep{yang2021transformer} as their base model. %, which is an attention-based auto-regressive model. 
% Among them, RMTPP, THP, MLE-SGL, GC-CGD, GM-NLF, MMEL, and Gibbs-Hawkes are specially designed for Hawkes processes while PromptTPP and HYPRO are designed for general point processes. 
\paragraph{Evaluation Metrics} 
In multivariate TPPs, parameter learning can be decomposed over event types. We {\it fix a target type \( u \) and evaluate how well the model captures how other types \( u' \) influence it}. Specifically, we consider:

\emph{(i) Parameter Distribution Accuracy}: We evaluate the quality of the learned parameter distributions (e.g., \( [\alpha_{uu'}]_{u'\in [U]} \), \( [\delta_{uu'}]_{u'\in [U]} \)) using the average marginal KL divergence:
\begin{equation}
    \text{aKL} = \frac{1}{U}\sum_{u'=1}^{U}\frac{1}{N}\sum_{n=1}^{N}p(x_{(uu'), n})\log\left ( \frac{p(x_{(uu'), n})}{\hat{p}(x_{(uu'), n})} \right )
    \label{eq:avg_kl}
\end{equation}
Here, \( p \) is the true density, \( \hat{p} \) is the estimated one, \( x_{(uu'), n} \) denotes the $n$-th sampled parameter (either $\alpha_{uu'}$ or $\delta_{uu'}$), $U$ is the number of event types, and $N$ is the number of samples drawn from our model. Since joint distribution estimation suffers from the curse of dimensionality, we report the marginal KL divergence as a tractable proxy. The detailed computation is described in Appendix~\ref{appendix_subsec:computation_of_kl}.
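
As a toy illustration of Eq.~\eqref{eq:avg_kl} with made-up numbers (not taken from our experiments), take a single source type ($U = 1$) and $N = 2$ samples, with true densities $0.5$ and $0.25$ and estimated densities $0.25$ and $0.25$ at the two samples. Then
\begin{equation*}
\text{aKL} = \frac{1}{2}\left( 0.5 \log\frac{0.5}{0.25} + 0.25 \log\frac{0.25}{0.25} \right) = 0.25 \log 2 \approx 0.173,
\end{equation*}
using the natural logarithm. Only the first sample contributes, since the estimate matches the true density at the second one.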

\emph{(ii) Prediction Accuracy}: We use the Root Mean Squared Error (RMSE) between the predicted times $\hat{t}_i$ and the true times $t_i$ of the $N$ evaluated events of the target type $u$, following prior work \citep{du2016recurrent, zuo2020transformer}:
\begin{equation}
    \text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N(\hat{t}_i - t_i)^2}
\end{equation}
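
For instance, with hypothetical values $N = 3$ and prediction errors $\hat{t}_i - t_i$ equal to $1$, $-2$, and $2$ (chosen only for illustration), the metric evaluates to
\begin{equation*}
\text{RMSE} = \sqrt{\frac{1^2 + (-2)^2 + 2^2}{3}} = \sqrt{3} \approx 1.73,
\end{equation*}
expressed in the same time units as the event times. Larger individual errors are penalized more heavily than under a mean absolute error.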



% \paragraph{Evaluation metrics} For evaluation metrics, we consider two aspects: 
% \emph{(i) Accuracy of Parameter Distribution Learning}: For the target dimension-$u$, we compute the average KL divergence:
% \begin{equation}
%     aKL = \frac{1}{U}\sum_{u'=1}^{U}\frac{1}{N}\sum_{n=1}^{N}p(x_{(uu'), n})\log\left ( \frac{p(x_{(uu'), n})}{\hat{p}(x_{(uu'), n})} \right )
%     \label{eq:avg_kl}
% \end{equation}
% where $U$ indicates the overall dimensions and $N$ indicates all the samples from our model. We denote the learned density of the $n$-th sample $x_{(uu'), n}$ as $\hat{p}(x_{(uu'), n})$ and denote the corresponding true density as $p(x_{(uu'), n})$. Here, $x_{(uu')}$ can be $\alpha_{uu'}$ or $\delta_{uu'}$. We report the detailed computation process in Appendix~\ref{appendix_subsec:computation_of_kl}.

% \emph{(ii) Accuracy of Prediction}: We use Root Mean Squared Error (RMSE) to evaluate the prediction of occurrence times, as done by \citet{du2016recurrent} and \citet{zuo2020transformer}:
% \begin{equation}
%     \text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N(\hat{t}_i - t_i)^2}
% \end{equation}

% {\small \begin{equation}
%     \text{RMSE} = \left (\frac{1}{N}\sum_{i=1}^N(\hat{t}_i - t_i)^2 \right )^{1/2}
% \end{equation}}

\subsection{Synthetic Data Experiments}

\begin{figure*}[htbp] 
\centering 
\includegraphics[width=1.0\textwidth]{UAI 2025 Camera-Ready/Fig/compare_true_density_vs_learned_density_joint_simple_version.pdf} 
\caption{Examples comparing the models on parameter distribution learning for 3-dimensional datasets with 7500 samples. We show the learned marginal distributions of $\alpha_{31}$ and $\delta_{31}$. Complete results can be found in Appendix~\ref{appendix_subsec:complete_comparsion_distributions}.}
\label{fig:example_true_density_learned_density} 
\end{figure*}

\paragraph{Preprocessing} 
We consider both \emph{uni-modal} (Gaussian) and \emph{multi-modal} (Gaussian mixture) marginal distributions for each parameter. To evaluate scalability, we vary the dimension of the synthetic Hawkes process datasets in $\{2, 3, 5, 7, 9\}$ and the number of training samples in $\{2500, 5000, 7500, 10000, 12500, 15000\}$. Leveraging the decomposability of the Hawkes likelihood, we focus on a single target dimension $u$ in each case. For synthetic data, we learn the distributions of the impact parameters $[\alpha_{uu'}]_{u'\in [U]}$ and delays $[\delta_{uu'}]_{u'\in [U]}$ and assume that the base intensity $\mu_u$ and decay rate $\beta$ are known; for real-world datasets, we learn the distributions of all parameters. We further test the model's robustness when the decay rate $\beta$ follows different distributions across event types (Appendix~\ref{appendix_subsec:compare_performance_vary_beta}).


% \paragraph{Preprocessing} We analyze two distribution modes: \emph{uni-modal (Gaussian distribution)} and \emph{multi-modal (Gaussian mixture distribution)} for each parameter marginal distribution. To explore the scalability of our model, we consider the dimensions of synthetic Hawkes process datasets within $\{2, 3, 5, 7, 9\}$ and vary the size of training samples within
% $\{2500, 5000, 7500, 10000, 12500, 15000\}$. For each case, due to the decomposability of the Hawkes process likelihood, we focus solely on a single target dimension. For ease of interpretation, we specifically investigate the impact $\boldsymbol{\alpha}$ and delay effect $\boldsymbol{\delta}$ on the intensity and assume base $\boldsymbol{\mu}$ and decay rate $\beta$ are known for synthetic datasets, whereas learn all parameter distributions for real-world datasets in the main text. We also validate the performance of our model on synthetic datasets with varied $\beta$ distribution, which can also be learned, among different event types. Detailed analysis can be found in Appendix~\ref{appendix_subsec:compare_performance_vary_beta}. 


\paragraph{Parameter Distribution Learning Performance} 
Figure~\ref{fig:example_true_density_learned_density} compares the marginal parameter distributions learned by our model, Hypernet, and $\beta$-VAE. Hypernet fails to capture multi-modal patterns and lacks an explicit density form, limiting its ability to perform accurate distribution estimation. $\beta$-VAE performs competitively on uni-modal distributions but struggles with multi-modal cases and requires prior knowledge of the underlying distribution, limiting its generalizability. In contrast, our model accurately recovers both uni-modal and multi-modal distributions without any prior assumptions. To quantitatively evaluate performance, we adopt a consistent KL divergence computation (Appendix~\ref{appendix_subsec:computation_of_kl}). The numerical results in Table~\ref{tab:compare_accuracy} (Appendix~\ref{appendix_subsec:complete_comparsion_distributions}) further confirm that our method consistently outperforms both Hypernet and $\beta$-VAE across various sample sizes.





% \paragraph{Parameter Distribution Learning Performance} Figure \ref{fig:example_true_density_learned_density} presents examples of learning performance of marginal parameter distribution among three generative models. Seeing from the results, Hypernet fails to accurately uncover the parameter distributions with multi-modal patterns. In addition, unlike other two models, Hypernet can only generate samples but cannot yield corresponding densities, limiting its ability for distribution learning. Although $\beta$-VAE shows competitive performance in uni-modal distributions, it also fails to accurately capture multi-modal patterns, exhibiting significant deviations from the true distribution. Furthermore, the generalizability of $\beta$-VAE is inherently constrained by its dependency on prior knowledge, i.e., the prerequisite of knowing the true underlying distribution format, which limits its applicability across diverse datasets and real-world scenarios where such prior information may not be readily available or precisely definable. Our model, without the need for prior knowledge, successfully uncovers both uni-modal and multi-modal distributions. For fair comparison, we address the aligned computation approach of KL divergence in Appendix \ref{appendix_subsec:computation_of_kl} and the numerical results with varying sample sizes in Table \ref{tab:compare_accuracy}, Appendix \ref{appendix_subsec:complete_comparsion_distributions}, further corroborate the aforementioned conclusion, since our model outperforms both Hypernet and $\beta$-VAE in most cases.

% As the training sample size increases, our model demonstrates improved accuracy, reflected in the decreasing KL divergence between the learned distribution and the true distribution. This outperformance is notable compared to both Hypernet and $\beta$-VAE in most cases.

Our flow-based model also captures the joint distribution of the parameter set $\boldsymbol{\theta}$ and the dependencies within it. As shown in Figure~\ref{fig:joint_distribution_and_samples_main_text}, samples of $\boldsymbol{\alpha}$ and $\boldsymbol{\delta}$ drawn from the trained model closely match the ground-truth joint density and recover its underlying patterns.

\begin{figure}[ht] 
\centering 
\includegraphics[width=0.48\textwidth]{UAI 2025 Camera-Ready/Fig/joint_density_and_samples.pdf}
\caption{True joint distribution (\textbf{contours}) of $\alpha_{31}$ and $\delta_{31}$ and samples (\textbf{red circles}) from our trained model on the 3-dimensional multi-modal dataset with 7500 samples.}
\label{fig:joint_distribution_and_samples_main_text} 
\end{figure}

% Furthermore, we also compare our model with traditional parametric models, including mixture
% of uniform and mixture of gaussian models in Appendix~\ref{appendix_subsec:compare_traditional_prob_models}

\paragraph{Scalability and Ablation Study} 
% Our method demonstrates comparable convergence speed and achieves the lowest prediction error while maintaining reasonable storage consumption.

% \begin{figure*}[htb] 
% \centering 
% \includegraphics[width=1.0\textwidth]{UAI 2025 Camera-Ready/Fig/scalability.pdf}
% \caption{{\small Scalability experiments with varying training samples and dimensions. Taking uni-modal distribution datasets as examples. All the experiments are conducted over 5 random runs and the standard error is reflected in the shaded areas.}}
% \label{fig:scalability} 
% \end{figure*}

To evaluate the scalability of our model, we vary the dimensionality and the training sample size. As shown in Figure~\ref{fig:scalability} (Appendix~\ref{appendix_subsection:scalability}), as the training sample size increases, the training time grows, the converged negative log-likelihood decreases, and the distribution learning accuracy improves. Training remains efficient: the model converges within 1.2 hours even in the largest setting, with 15000 training samples and 9 dimensions. As the dimensionality of the Hawkes process increases, the distribution learning accuracy decreases slightly but remains satisfactory, and it stabilizes as the training sample size grows.

\begin{table*}[ht]
\centering
% \small
\caption{Ablation study on synthetic and real-world datasets. The module combination used in our model is highlighted in blue; best results are in bold and second-best results are underlined. For synthetic datasets, we use the 3-dimensional case with 7500 samples. We ablate the following modules: \emph{i) Delay}: whether a time lag (delay effect) is assumed to be present in the data; \emph{ii) Dist}: whether the parameters (impact $\boldsymbol{\alpha}$ and delay $\boldsymbol{\delta}$) of the underlying Hawkes process are assumed to follow a distribution; \emph{iii) Ensem}: whether multiple normalizing flows are ensembled; and \emph{iv) DiffBase}: whether the input base distributions are varied.}
% \footnotesize % 将字体大小设置为小号
% \resizebox{\textwidth}{!}
% {
\begingroup
\setlength{\tabcolsep}{4pt}
\begin{tabular}{cccc|cccc|cccc}
\hline 
\multicolumn{12}{c}{\textbf{Synthetic Dataset}} \\ 
\hline
\multirow{2}{*}{Delay.}& \multirow{2}{*}{Dist.}& \multirow{2}{*}{Ensem.}& \multirow{2}{*}{DiffBase.}& \multicolumn{4}{c|}{Uni-Modal} & \multicolumn{4}{c}{Multi-Modal} \\
\cline{5-12} 
& & & & NLL $\downarrow$& aKL ($\alpha$)  $\downarrow$ & aKL ($\delta$) $\downarrow$& Time $\downarrow$ & NLL $\downarrow$ & aKL ($\alpha$) $\downarrow$ & aKL ($\delta$) $\downarrow$ & Time $\downarrow$ \\
\hline
\textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & 32.62 & $-$ & $-$ & \textbf{0.15h} & 38.43 & $-$ & $-$ & \textbf{0.36h} \\
\textcolor{green}{\ding{51}} &  \textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & 28.64 & $-$ & $-$ & \underline{0.16h} & 37.35 & $-$ & $-$ & \underline{0.40h} \\
\textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & \textbf{25.08} & \underline{1.26} & \underline{2.20} & 0.18h & 36.52 & 4.33 & 3.67 & 0.42h \\
\rowcolor{bestCombine}
\textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{red}{\ding{55}} & \underline{25.26} & \textbf{1.22} & \textbf{2.16} & 0.21h & \textbf{30.42} & \textbf{3.16} & \textbf{2.25} & 0.56h
\\
\textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{red}{\ding{55}} & \textcolor{green}{\ding{51}} & 26.52 & 1.37 & 2.32 & 0.20h & 33.97 & 3.86 & 3.11 & 0.52h \\
\textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & 25.71 & 1.32 & 2.28 & 0.38h & \underline{33.68} & \underline{3.82} & \underline{2.95} & 0.67h \\
\hline
\multicolumn{12}{c}{\textbf{Real-World Dataset}} \\ 
\hline
\multirow{2}{*}{Delay.} & \multirow{2}{*}{Dist.} & \multirow{2}{*}{Ensem.} & \multirow{2}{*}{DiffBase.} & \multicolumn{4}{c|}{MIMIC-IV} & \multicolumn{4}{c}{Covid Policy} \\
\cline{5-12} 
& & & & NLL $\downarrow$ 
& \multicolumn{2}{c}{RMSE $\downarrow$} 
& Time $\downarrow$ 
& NLL $\downarrow$ 
& \multicolumn{2}{c}{RMSE $\downarrow$}
& Time $\downarrow$ \\
\hline
\textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & 26.15 & \multicolumn{2}{c}{3.52} & \textbf{0.18h} & 42.80 & \multicolumn{2}{c}{4.25} & \textbf{0.13h} \\
\textcolor{green}{\ding{51}} &  \textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & 24.52 & \multicolumn{2}{c}{3.20} & \underline{0.24h} & 39.46 & \multicolumn{2}{c}{4.08} & \underline{0.17h} \\
\textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{red}{\ding{55}} & \textcolor{red}{\ding{55}} & 22.55 & \multicolumn{2}{c}{2.92} & 0.31h & 37.80 & \multicolumn{2}{c}{3.93} & 0.22h \\
\rowcolor{bestCombine}
\textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{red}{\ding{55}} & \textbf{21.32} & \multicolumn{2}{c}{\textbf{2.86}} & 0.34h & \underline{36.94} & \multicolumn{2}{c}{\textbf{3.35}} & 0.25h \\
\textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{red}{\ding{55}} & \textcolor{green}{\ding{51}} & 22.08 & \multicolumn{2}{c}{2.95} & 0.38h & 37.56 & \multicolumn{2}{c}{3.72} & 0.30h \\
\textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \textcolor{green}{\ding{51}} & \underline{21.67} & \multicolumn{2}{c}{\underline{2.90}} & 0.53h & \textbf{36.25} & \multicolumn{2}{c}{\underline{3.54}} & 0.42h \\
\hline
\end{tabular}
\endgroup
% }
\label{tab:ablation_study}
\end{table*}

To assess the importance of the individual components of our model, we ablate several modules; see Table~\ref{tab:ablation_study}. Removing all modules degrades performance substantially. Removing the ensemble module alone also hurts performance, especially for multi-modal distributions. With the selected combination of modules, our model attains the best or second-best converged negative log-likelihood and the best learning accuracy (aKL on synthetic data, RMSE on real-world data) in every setting, while keeping the training time moderate.

Moreover, we investigate the performance when using different normalizing flow models in Table \ref{tab:selection_nf}, Appendix \ref{appendix_subsec:compare_different_flow_models}. Taking into account factors such as data volume and dimensionality, our model strikes a balance between model effectiveness and training efficiency and can select suitable normalizing flow models for different datasets. Detailed selections of flow models are reported in Appendix \ref{appendix_subsection:hyper_param_selection}.

\paragraph{Prediction} The learned parameter distributions will facilitate  prediction of upcoming events. The prediction results on synthetic datasets are presented in Table \ref{tab:prediction_event_time}, from which one can observe that our model outperforms all baselines.
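
For reference, and assuming the standard definition of the metric, the RMSE can be written as follows, where $N$, $t_i$, and $\hat{t}_i$ (the number of test events, the observed event times, and the predicted event times) are notation introduced here only for illustration:
\[
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\hat{t}_i - t_i\right)^2},
\]
so lower values indicate predicted event times that lie closer to the observed ones.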

\begin{table*}[ht]
\centering
% \small
\caption{Event time prediction RMSE $\downarrow$ on two synthetic datasets (3-dimensional case with 7500 samples; ``Uni-Modal'' and ``Multi-Modal'' indicate the underlying parameter distributions), the MIMIC-IV dataset (predicting the occurrence time of patients' normal urine events), and the Covid Policy dataset (predicting the occurrence time of drops in confirmed cases/infections). Results from our model are shaded in red.}

\begin{tabular}{cccccc} 
\toprule
Category & Method & Uni-Modal &  Multi-Modal & MIMIC-IV & Covid Policy \\ 
\hline
\multirow{3}{*}{Non-Param.} 
 ~ & GM-NLF~\citep{eichler2017graphical} & 2.36 & 2.72 & 4.29 & 6.72 \\
% \cline{2-4}
 ~  & MMEL~\citep{zhou2013learning} & 2.41 & 2.85 & 4.47 & 6.45 \\
% \cline{2-4}
 ~  & Gibbs-Hawkes~\citep{zhang2018efficient} & 1.98 & 2.64 & 3.87 & 6.12 \\
 \hline
 \multirow{7}{*}{Param.} & RMTPP~\citep{du2016recurrent} & 2.15 & 2.77 & 3.82 & 5.24 \\
% \cline{2-4}
 ~  & THP~\citep{zuo2020transformer} & 1.92 & 2.46 & 3.26 & 5.08 \\
% \cline{2-4}
 ~  & PromptTPP~\citep{xue2023prompt} & \underline{1.85} & 2.40 & 3.13 & \textbf{3.18} \\
% \cline{2-4}
 ~  & HYPRO~\citep{xue2022hypro} & 1.89 & \underline{2.37} & \underline{3.05} & 3.42 \\
% \cline{2-4}
 ~  & MLE-SGL~\citep{xu2016learning} & 1.96 & 2.57 & 3.63 & 5.81 \\
% \cline{2-4}
 ~  & GC-CGD~\citep{wei2022granger} & 1.90 & 2.45 & 3.18 & 5.26  \\
% \cline{2-4}
\rowcolor{ours}
 ~  & \textbf{Ours*} & \textbf{1.79} & \textbf{2.25} &  \textbf{2.86} & \underline{3.35} \\
      \bottomrule
\end{tabular}

\label{tab:prediction_event_time}
\end{table*}


\subsection{MIMIC-IV Dataset Experiments} 
\paragraph{Preprocessing} MIMIC-IV\footnote{\url{https://mimic.mit.edu/}} is an electronic health record dataset of patients admitted to the intensive care unit (ICU)~\citep{johnson2023mimic}.
We focused on patients diagnosed with sepsis \citep{saria2018individualized}, a leading cause of mortality in the ICU. Following the approach suggested by \citet{komorowski2018artificial},
% We considered patients diagnosed with sepsis \cite{saria2018individualized}, one of the major causes of mortality in ICU due to septic shock. Suggested by \citep{komorowski2018artificial}, 
we selected 21 treatments, categorized as vasopressors, antibiotics, and auxiliary treatment (details in Appendix \ref{appendix_subsection:mimic_iv}), and extracted a total of 7377 samples. Since normal urine reflects the impact of drugs and treatments on improving the physical condition of a patient, our objective is to uncover the impact and delay effect of treatments on patients' physical well-being, as observed through normal urine events.

\paragraph{Ablation Study} % We want to emphasize the importance of conducting ablation studies on real-world datasets. 
Since we are uncertain about the presence of delay effects and whether parameters adhere to specific distributions in real datasets, we need to validate our assumptions through ablation studies first. In Table \ref{tab:ablation_study}, for the MIMIC-IV dataset, assuming no delay effect and fixed parameters results in a higher converged negative log-likelihood and decreased accuracy in prediction tasks, validating the rationale behind our current configuration.

\paragraph{Case Study and Prediction} As shown in Figure \ref{fig:delay_effect_mimic_iv}, the positive impacts of vasoconstrictors, antibiotics, and auxiliary treatment are comparable. The delay effect distribution of vasoconstrictors is right-skewed with a mean of around 0.5 hours, indicating that vasoconstrictors act on the circulatory system with a clearly short-term response. Antibiotics typically require a longer time to take effect: their time-lag distribution across the population is approximately normal, centered around a mean of 1.2 hours. Because auxiliary treatment encompasses various therapies, such as Furosemide and Invasive Ventilation, its delay effect is multi-modal, with a first local peak around 0.3 hours and a second local peak near 0.8 hours.

As shown in Table \ref{tab:prediction_event_time}, our model achieves the lowest RMSE among all methods when predicting the time of the next normal urine event.

\begin{figure}[ht] 
\centering 
\includegraphics[width=0.48\textwidth]{UAI 2025 Camera-Ready/Fig/delay_effect_mimic_iv_new.pdf}
\caption{Learned impact $\boldsymbol{\alpha}$ and delay effect $\boldsymbol{\delta}$ distributions for MIMIC-IV dataset (\textbf{left and bottom bar plots}) and 2D scatter plot for samples of these two parameters (\textbf{top-right}).}
\label{fig:delay_effect_mimic_iv} 
\end{figure}


\subsection{Covid Policy Dataset Experiments}

\paragraph{Preprocessing} %COVID-19 is an unprecedented pandemic and various control measures have been introduced to curb the spread of the virus. 
The Covid-19 Policy dataset\footnote{\url{https://github.com/OxCGRT/covid-policy-dataset}} records governments' implementation of specific measures, and their timing, to control the COVID-19 pandemic~\citep{hale2021global, hale2020oxford}. Epidemic prevention policies are organized into four categories: Containment Closure, Healthcare System, Vaccination, and Economic policies (see Appendix \ref{appendix_subsection:covid_policy_tracker}). We conducted experiments on data from Australia and France for the years 2021--2022, selected on the basis of the most severe COVID-19 situations and effective governmental policies, aiming to investigate the impact of policies from different categories on the event of a drop in the daily average number of confirmed cases.

\paragraph{Ablation Study} As with MIMIC-IV, the ablation study in Table \ref{tab:ablation_study} shows that assuming the existence of delay effects and letting the parameters follow specific distributions improves model performance, validating that the data align with our assumptions.

\paragraph{Case Study and Prediction} As shown in Figures \ref{fig:delay_effect_covid_policy_Australia} and \ref{fig:delay_effect_covid_policy_France} (Appendix \ref{appendix_subsection:covid_policy_tracker}), the lag before government policies take effect is overall shorter in Australia than in France. We take Containment Closure (CC) policies as an example; their positive impact is approximately normally distributed and larger than that of almost all other policies in both countries. In Australia, when the government enforces CC policies, the population generally divides into two groups, visible as two peaks in the distribution. One group promptly complies with isolation measures, leading to a decrease in new COVID-19 cases around 5 days after policy implementation. The other group responds more slowly, requiring approximately 6.5 days for the policies to take effect. The pattern of CC policies in France is different, roughly following a normal distribution with a mean of 7.5 days. For Healthcare System policies, the delay effects in Australia also exhibit a bimodal distribution, with peaks at around 7 and 9 days, whereas in France the onset times are longer, with peaks at approximately 7.5 and close to 11 days, but the variance is smaller. In both countries, the impact of Vaccination and Economic policies is smaller than that of the two policy categories discussed above. In Australia, the time lag for Vaccination policies typically peaks at 10 and 11 days, while the mean time lag for Economic policies is 14.5 days. In France, the mean time lags for these two policy categories are approximately 11.5 and 15 days, respectively.

% In terms of Healthcare System policies, delay effects in Australia also exhibit a bimodal distribution, with peaks at around 7 and 9 days, whereas in France with longer onset times, approximately 7.5 and close to 11 days, respectively, but the variance is smaller. In both countries, the impact of Vaccination and Economic policies is smaller than that of the two policies mentioned above. In Australia, the time lag for Vaccination to take effect typically peaks at 10 and 11 days, while for Economic policies, the mean time lag is 14.5 days. In France, the mean time lags are approximately 11.5 and 15 days, respectively, for these two policies.

Our model also demonstrates competitive prediction performance on the Covid Policy dataset. As shown in Table \ref{tab:prediction_event_time}, its RMSE for predicting the time of the next infection-drop event (3.35) is the second lowest among all methods, close to the lowest (3.18, achieved by PromptTPP).

\subsection{Other Real-World Dataset Experiments} 
In addition to the healthcare datasets, we also considered the StackOverflow~\citep{leskovec2014snap}, Taobao~\citep{xue2022hypro}, and Taxi~\citep{whong2014foiling} datasets for prediction tasks (predicting both the next event type and its time, with per-event negative log-likelihood and event type prediction error rate as additional evaluation metrics) to assess broader applicability and validate our approach across diverse domains. As shown in Table~\ref{tab:experiment_other_real_world_dataset}, Appendix~\ref{appendix_subsection:other_real_world_dataset_experiments}, our model maintains
strong performance relative to the baselines on almost all datasets. This also suggests that the StackOverflow and Taobao datasets may inherently exhibit delayed event-triggering effects among different event types (while the Taxi dataset may not), further validating our approach.

\section{Conclusion}
\label{sec:conclusion}
In this paper, we propose the Flow-based Delayed Hawkes Process, an extension of multivariate Hawkes models that uses normalizing flows to flexibly model the distribution of parameters, capturing heterogeneous event dynamics while preserving interpretability. We provide theoretical guarantees on parameter identifiability and MLE consistency under mild conditions. Experiments on synthetic and real-world data demonstrate consistent superiority over state-of-the-art baselines in modeling diverse temporal event patterns. This work advances accurate and interpretable analysis of event data with delay effects and complex triggering behaviors.


% \begin{contributions} % will be removed in pdf for initial submission 
% 					  % (without ‘accepted’ option in \documentclass)
%                       % so you can already fill it to test with the
%                       % ‘accepted’ class option
%     Briefly list author contributions. 
%     This is a nice way of making clear who did what and to give proper credit.
%     This section is optional.

%     H.~Q.~Bovik conceived the idea and wrote the paper.
%     Coauthor One created the code.
%     Coauthor Two created the figures.
% \end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    % Briefly acknowledge people and organizations here.

    % \emph{All} acknowledgements go in this section.
Shuang Li’s research was in part supported by the Key Program of the NSFC under grant No. 72495131, NSFC
under grant No. 62206236, Shenzhen Stability Science Program 2023, Shenzhen Science and Technology Program ZDSYS20230626091302006, Longgang District Key Laboratory of Intelligent Digital Economy Security and SRIBD Innovation Fund SIF20240010.

\end{acknowledgements}

% References
% \bibliography{uai2025-template}
\bibliography{reference}

\newpage

\onecolumn

% \title{Flow-Based Delayed Hawkes Process\\(Supplementary Material)}
% \maketitle

% This Supplementary Material should be submitted together with the main paper.

\appendix
\section*{Appendix Overview}
In the following, we will provide supplementary materials to better illustrate our methods and experiments.

\begin{itemize}
    \item Section \ref{appendix:proof} provides theoretical guarantees.
    \item Section \ref{appendix:implementation} presents more details of our model and implementation.
    \item Section \ref{appendix:reproducibility analysis} reports the reproducibility analysis.
    \item Section \ref{appendix:more_synthetic_data_experiment} provides more synthetic dataset experiments and corresponding analysis.
    \item Section \ref{appendix:more_real_world_data_experiment} provides more real-world dataset experiments and corresponding analysis.
    \item Section \ref{appendix:limitations_broader_impacts} states the limitations and broader impacts of our proposed model.
\end{itemize}


\section{Theoretical Proofs}
\label{appendix:proof}
\subsection{Proof of Theorem \ref{them_2}}
\label{appendix:proof_them_2}
\begin{proof}[Theorem \ref{them_2}] 
Define the operator \( \mathcal{T} \) that maps a latent distribution \( p(\boldsymbol{\theta}) \) to its marginal intensity:  
\[
\mathcal{T}(p)(t) = \int_{\Theta} f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) p(\boldsymbol{\theta}) d\boldsymbol{\theta}.
\]  
By assumption, \( \mathcal{T}(p) = \mathcal{T}(q) \) for almost every \( t \), so defining \( g(\boldsymbol{\theta}) = p(\boldsymbol{\theta}) - q(\boldsymbol{\theta}) \), we obtain:  
\[
\int_{\Theta} f_u(t \mid \mathcal{H}_t; \boldsymbol{\theta}) g(\boldsymbol{\theta}) d\boldsymbol{\theta} = 0 \quad \text{for almost every } t.
\]  
By the completeness assumption, this implies \( g(\boldsymbol{\theta}) = 0 \) almost everywhere, concluding that \( p(\boldsymbol{\theta}) = q(\boldsymbol{\theta}) \) almost everywhere.  
\end{proof}   

\subsection{Proof of Completeness for the Smooth Triggering Function}
\label{appendix:proof_completeness}
\begin{proof}[Completeness for the Smooth Triggering Function] 
Consider the intensity function:
$$
f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right)=\mu_u+\sum_{u^{\prime}=1}^U \sum_{n=1}^{N_u(t)} \alpha_{u u^{\prime}} h\left(-\beta\left(t-t_n^{u^{\prime}}-\delta_{u u^{\prime}}\right)\right)
$$
where $\mu_u \geq 0$, $\alpha_{u u^{\prime}} \geq 0$, $\beta>0$, $\delta_{u u^{\prime}} \geq 0$, and $h(\cdot)$ is a smooth function (i.e., sigmoid-exponential product). We prove the family $f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right)$ is complete, i.e., if

$$
\int_{\Theta} f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right) p(\boldsymbol{\theta}) d \boldsymbol{\theta}=0 \quad \forall t
$$
then $p(\boldsymbol{\theta})=0$ almost everywhere.

Substitute $f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right)$ into the integral equation:
$$
\int_{\Theta} \mu_u p(\boldsymbol{\theta}) d \boldsymbol{\theta}+\sum_{u^{\prime}=1}^U \sum_{n=1}^{N_u(t)} \int_{\Theta} \alpha_{u u^{\prime}} h\left(-\beta\left(t-t_n^{u^{\prime}}-\delta_{u u^{\prime}}\right)\right) p(\boldsymbol{\theta}) d \boldsymbol{\theta}=0 \quad \forall t
$$
The first term is $t$-independent, while the second term depends on $t$ through $h(\cdot)$. For equality to hold globally, both terms must vanish individually.

The $t$-independence of the first term implies:
$$
\int_{\Theta} \mu_u p(\boldsymbol{\theta}) d \boldsymbol{\theta}=0
$$


Since $\mu_u \geq 0$, this forces $\int_{\Theta} p(\boldsymbol{\theta}) d \boldsymbol{\theta}=0$. For the second term, smoothness and the parametric structure of $h(\cdot)$ ensure the family $\left\{h\left(-\beta\left(t-t_n^{u^{\prime}}-\delta_{u u^{\prime}}\right)\right)\right\}$ is linearly independent for distinct $\left(\beta, \delta_{u u^{\prime}}, t_n^{u^{\prime}}\right)$. By the Haar condition (for Chebyshev systems), a nontrivial linear combination of these functions cannot vanish identically unless all coefficients are zero.

The integral equation reduces to a moment problem:
$$
\sum_{u^{\prime}=1}^U \sum_{n=1}^{N_u(t)} \int_{\Theta} \alpha_{u u^{\prime}} h\left(-\beta\left(t-t_n^{u^{\prime}}-\delta_{u u^{\prime}}\right)\right) p(\boldsymbol{\theta}) d \boldsymbol{\theta}=0 \quad \forall t
$$


Because $h(\cdot)$ generates a complete basis (via exponential/sigmoid properties), the only solution is $p(\boldsymbol{\theta})=0$ almost everywhere.

In conclusion, the linear independence of $\left\{h\left(-\beta\left(t-t_n^{u^{\prime}}-\delta_{u u^{\prime}}\right)\right)\right\}$ and the Haar condition ensure uniqueness. Hence, the family $\{f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right)\}$ is complete, and:
$$
\lambda_u\left(t \mid \mathcal{H}_t ; p(\boldsymbol{\theta})\right)=\lambda_u\left(t \mid \mathcal{H}_t ; q(\boldsymbol{\theta})\right) \Longrightarrow p(\boldsymbol{\theta})=q(\boldsymbol{\theta})
$$
\end{proof}


\subsection{Proof of Theorem \ref{them_3}}
\label{appendix:proof_theorem_MLE}
\begin{proof}[Proof Sketch of Theorem~\ref{them_3}] 
First, we show \textbf{uniform convergence}: under standard regularity conditions for point processes, the empirical log-likelihood function converges uniformly (by the law of large numbers) to its expected value as the observation window $T$ grows. That is, for all candidate distributions $p$,
$$
\frac{1}{T} \mathcal{L}\left(p\right) \rightarrow \mathbb{E}\left[\frac{1}{T} \mathcal{L}\left(p\right)\right] \quad \text { almost surely }
$$


Second, we establish \textbf{identifiability}. By Theorems \ref{them_1} and \ref{them_2}, the mapping from the latent parameters $\boldsymbol{\theta}$ to the intensity function $f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right)$ is injective, and the family $\mathcal{F}$ is complete. Therefore, the expected log-likelihood has a unique maximum at the true distribution $p^*(\boldsymbol{\theta})$.

Finally, we apply standard MLE consistency results:
Wald's consistency theorem implies that the maximizer of the empirical log-likelihood, $\hat{p}(\boldsymbol{\theta})$, converges in probability to $p^*(\boldsymbol{\theta})$ as $T \rightarrow \infty$.
\end{proof} 

% \subsection{Proof of the Universal Approximation for Marginal Intensity (Theorem~\ref{them_4})}
% \label{appendix:proof_universal}


% \begin{proof}[Proof Sketch of Theorem~\ref{them_4}] \newcommand{\bx}{\mathbf{x}}
% \newcommand{\entmax}{\mathrm{entmax}}
% \newcommand{\E}{\mathbb{E}}We prove that for any $\epsilon>0$, there exists a mixture of $M$ normalizing flows, each with $\pi_m=1/M$ such that the corresponding mixture $\hat{p}$ satisfies the desired bound.

% Since the expected squared error of the best $H$-class model is at most the expected squared error of any $H$-class model whose parameters are i.i.d.~samples from the true distribution. As such, to prove the theorem, it suffices to show that the expected squared error of an $H$-class model whose parameters are i.i.d.~samples from the true distribution is at most $\epsilon$. 

% For event $i$, we denote the predicted probability of occurrence under this $H$-class model as 
% \[
%   \bar{q}_H^i = \frac1H \sum_{h=1}^H \hat{q}_h^i,
% \]
% where each $\hat{q}_h^i$ is computed using a set of parameters sampled from the true parameter distribution $\mu_*$. As such, it satisfies $\E[\hat{q}_h^i] = q_*^i $, where the expectation is taken over the above sampling distribution.
% Now, let us compute 
% \[
%   \E\Big[ \frac1N \sum_{i=1}^N  \| \bar{q}_H^i - q_*^i \|^2 \Big],
% \]
% where the expectation is taken over the sampling distribution.
% Substituting the definition of $\bar{q}_H^i$, this equals
% \[
% \E\Big[ \frac1N \sum_{i=1}^N  \|  \frac1H\sum_{h=1}^H \hat{q}_h^i - q_*^i \|^2  \Big].
% \]
% Using the independence among $\hat{q}_h^i$, $h=1,\ldots,H$, we apply the variance reduction property of averaging i.i.d.~samples:
% \[
% \frac1{H} \E\Big[ \frac1N \sum_{i=1}^N  \| \hat{q}_1^i - q_*^i \|^2 \Big],
% \]
% where $\hat{q}_1^i$ is a single unbiased estimation of $q_*^i$.
% We bound the variance of $\hat{q}_1^i$ as
% \[
%   \E\big[ \| \hat{q}^i_1 - q_*^i \|^2\big] \le 2,\quad \forall i.
% \]
% Thus, the expected square error is bounded by $2/H$. Therefore, to ensure an $\epsilon$ error, it suffices to choose $H=2/\epsilon$, which completes the proof.
% \end{proof}


% \begin{proof}
% Let's first show the continuity and boundedness of $f_u$.
% By assumption, $h(\cdot)$ is smooth, and $\Theta$ is compact. Since $\mu_u, \alpha_{u u^{\prime}}, \beta, \delta_{u u^{\prime}}$ are parameters in $\boldsymbol{\theta}$, and $t_n^{u^{\prime}}$ are event times in $\mathcal{H}_t$, the intensity $f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta}\right)$ is continuous in $\boldsymbol{\theta}$ for all $t \in[0, T]$. Compactness of $\Theta$ ensures $f_u$ is uniformly bounded and Lipschitz in $\boldsymbol{\theta}$ (by the Heine-Cantor theorem).

% Existing research~\citep{teshima2020universal,ishikawa2023universal} has proved that NFs are universal approximators: for any $\eta>0$, there exists a mixture $\hat{p}(\theta)=$ $\sum_{m=1}^M \pi_m p_m(\theta)$ of NFs such that
% $$
% \sup _{g \in C(\Theta)}\left|\mathbb{E}_{\boldsymbol{\theta}\sim \hat{p}}[g(\boldsymbol{\theta})]-\mathbb{E}_{\boldsymbol{\theta}\sim p^*}[g(\boldsymbol{\theta})]\right|<\eta
% $$
% where $C(\Theta)$ is the space of continuous functions on $\Theta$ (see \citep{teshima2020universal}). This follows from the density of mixtures of NFs in the space of probability measures under the weak topology.

% We show that the family $\left\{f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta} \right)\right\}_{t \in[0, T]}$ is equicontinuous in $t$. $h\left(-\beta\left(t-t_n^{u^{\prime}}-\delta_{u u^{\prime}}\right)\right)$ is smooth in $t$, with derivatives bounded uniformly over $\boldsymbol{\theta} \in \Theta$ (due to compactness). By the Arzelà-Ascoli theorem, equicontinuity and uniform boundedness imply that $\left\{f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta} \right)\right\}$ is precompact in $C([0, T])$.

% For $g(\boldsymbol{\theta})=f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta} \right)$, the approximation error satisfies:
% $$
% \left|\lambda_u(t ; \hat{p})-\lambda_u^*(t)\right|=\left|\mathbb{E}_{\hat{p}}\left[f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta} \right)\right]-\mathbb{E}_{p^*}\left[f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta} \right)\right]\right| \leq \sup _{g \in \mathcal{F}}\left|\mathbb{E}_{\hat{p}}[g]-\mathbb{E}_{p^*}[g]\right|,
% $$
% where $\mathcal{F}=\left\{f_u\left(t \mid \mathcal{H}_t ; \boldsymbol{\theta} \right)\mid t \in[0, T]\right\}$. Choose $\eta=\epsilon$. By equicontinuity, $\mathcal{F}$ is compact in $C([0, T])$, so the supremum is achieved, yielding:
% $$
% \sup _{t \in[0, T]}\left|\lambda_u(t ; \hat{p})-\lambda_u^*(t)\right|<\epsilon .
% $$
% \end{proof}




\section{Implementation Details}
\label{appendix:implementation}

% \subsection{Illustration of Our Newly Introduced Delayed Effect in Hawkes Process}
% \label{appendix_subsection:illustrate_delay_effect}
% In Figure \ref{fig:diff_standard_and_delayed}, we visualize the difference between standard multivariate Hawkes process and the multivariate Hawkes process with delay effect.

% \begin{figure}[ht] 
% \centering 
% \includegraphics[width=0.5\textwidth]{UAI 2025 Camera-Ready/Fig/diff_standard_and_delayed.pdf}
% \caption{Illustration of the standard multivariate Hawkes Process without (\textbf{top}) and with (\textbf{bottom}) delay effects. The time lag effect $\delta_{uu'}$ captures the delay effect on intensity from dimension $u'$ to $u$.}
% \label{fig:diff_standard_and_delayed} 
% \end{figure}

\subsection{Dynamic sigmoid mask module}
\label{appendix_subsec:dynamic_sigmoid_mask}
The exponential kernel of the Hawkes process contains an indicator function, which interrupts the flow of gradients during backpropagation. To address this issue, we propose a dynamic sigmoid mask module,
\begin{equation}
    \mathbb{I}(t - t_{n}^{u'} - \delta_{uu'} \geq 0) := \operatorname{sigmoid}\left(\frac{C}{\gamma_t}\cdot\left(t - t_{n}^{u'} - \delta_{uu'}\right)\right)
\end{equation}
where $C$ is a large constant and ${\gamma_t}$ is the cyclical annealing temperature, which is given by
\begin{equation}
    \gamma_t = \begin{cases}
                 h(\tau), & \tau \leq R, \\
                 c, & \tau > R,
               \end{cases} \quad \text{with}\quad \tau = \frac{\operatorname{mod}(t - 1, \left \lceil B/M \right \rceil)}{B/M}
\end{equation}
where $t$ (in $\gamma_t$ and $\tau$) is the training iteration number rather than the event time, $B$ is the total number of iterations, $c < C$ is a fixed constant, $h(\cdot)$ is a monotonically increasing annealing function (distinct from the triggering function) whose value starts at $1$, $M$ is the number of cycles, and $R$ is the proportion of a cycle used to increase $\gamma$. In other words, we split the training process into $M$ cycles, each starting with $\gamma = 1$ and ending
with $\gamma = C$. Each cycle consists of two consecutive stages separated by $R$: an annealing stage followed by a fixing stage. This ensures that the output of the dynamic sigmoid mask module approximates the binary output (0 or 1) of the original indicator function while preserving gradient flow and preventing gradient stagnation.
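
As a rough numerical illustration of the schedule, take the hypothetical values $B = 1000$, $M = 4$, and $R = 0.5$ (chosen here only for illustration). Each cycle then spans $\lceil B/M \rceil = 250$ iterations, and
\begin{align*}
t = 1: &\quad \tau = \operatorname{mod}(0, 250)/250 = 0, \\
t = 126: &\quad \tau = \operatorname{mod}(125, 250)/250 = 0.5, \\
t = 250: &\quad \tau = \operatorname{mod}(249, 250)/250 = 0.996, \\
t = 251: &\quad \tau = \operatorname{mod}(250, 250)/250 = 0,
\end{align*}
so iterations $1$ to $126$ fall in the annealing stage ($\tau \leq R$), iterations $127$ to $250$ fall in the fixing stage, and the schedule restarts at $t = 251$.

The trade-off behind the annealing can be seen from a short calculation. Writing $k = C/\gamma_t$ for the effective sharpness, $x$ for the argument $t - t_{n}^{u'} - \delta_{uu'}$ (with $t$ the event time here), and $\sigma$ for the sigmoid (shorthand introduced only for this illustration), we have
\[
\frac{\partial}{\partial x}\, \sigma(kx) = k\, \sigma(kx)\left(1 - \sigma(kx)\right).
\]
At $x = 0.1$, a soft mask with $k = 1$ gives $\sigma(0.1) \approx 0.525$ with derivative $\approx 0.249$, whereas a sharp mask with $k = 100$ gives $\sigma(10) \approx 0.99995$ with derivative $\approx 0.0045$. A sharp mask is therefore close to the indicator but passes very little gradient away from $x = 0$, while a soft mask passes gradient but approximates the indicator poorly.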

Our implementation strictly enforces temporal causality through three mechanisms: (i) Explicit temporal masking: we apply strict masking to ensure that only events with $t_n^{u'} < t$ can contribute to the intensity function at time $t$. This masking is applied immediately after computing the sigmoid values but before they contribute to the intensity calculation. (ii) Batched computation structure: while we parallelize kernel computations for efficiency, the implementation enforces a strict time-ordering constraint. The code includes explicit conditional filtering that zeros out any influence from event times $t_n^{u'}$ that occur after the evaluation time $t$. (iii) Training--evaluation consistency: this masking remains consistent between the training and evaluation phases, regardless of the annealing schedule of the sigmoid temperature parameters.

\subsection{Computation of KL Divergence}
\label{appendix_subsec:computation_of_kl}
Because deep generative models differ in what they expose, we standardize the computation of the KL divergence to ensure a fair comparison of their performance. For our flow-based model, we can obtain samples and the corresponding learned densities from the well-trained model, so the average KL divergence can be computed directly according to Eq.~\eqref{eq:avg_kl}. Hypernet can only generate samples and
cannot yield the corresponding densities. To align with this computation of the KL divergence, after obtaining samples from the well-trained Hypernet model, we fit the learned distributions to these samples and then obtain the corresponding learned densities. Note that in this process, we assume the distributional form is known for the Hypernet samples. For $\beta$-VAE, we directly obtain the learned parameter distribution from the latent representation. Therefore, we can sample from the latent distributions and evaluate the corresponding learned densities.

For our flow-based model, Hypernet model, and $\beta$-VAE, we plug the samples from well-trained models into the ground truth distributions to get the corresponding ground truth densities for these samples so that we can compute the KL divergence according to Eq.~\eqref{eq:avg_kl}.
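
One Monte Carlo estimator consistent with this description is sketched below. The sample count $S$ and the samples $\boldsymbol{\theta}^{(s)}$ are notation introduced here for illustration, and the exact averaging used in Eq.~\eqref{eq:avg_kl} may differ:
\[
\widehat{\mathrm{KL}}\left(\hat{p} \,\|\, p^*\right) = \frac{1}{S} \sum_{s=1}^{S} \left[\log \hat{p}\left(\boldsymbol{\theta}^{(s)}\right) - \log p^*\left(\boldsymbol{\theta}^{(s)}\right)\right], \qquad \boldsymbol{\theta}^{(s)} \sim \hat{p},
\]
where $\hat{p}$ is the learned density (exact for the flow-based model, fitted from samples for Hypernet, and given by the latent distribution for $\beta$-VAE) and $p^*$ is the ground truth density evaluated at the same samples.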

\section{Reproducibility Analysis}
\label{appendix:reproducibility analysis}
\subsection{Baselines} 

\paragraph{Parameter Distribution Learning Tasks}

\begin{itemize}
    \item Hypernet~\citep{ha2016hypernetworks, chauhan2023brief}: We use a hypernetwork to generate parameter samples, compute the likelihood of the Hawkes process with these samples, and backpropagate through this likelihood to train the hypernetwork.
    \item $\beta$-VAE~\citep{higgins2017beta}: We utilize the latent representation of $\beta$-VAE to estimate the parameter distributions of Hawkes processes.
\end{itemize}

\paragraph{Comparison of Different Flow Models}

\begin{itemize}
    \item Planar~\citep{rezende2015variational}: This model approximates distributions with a normalizing flow, which transforms a simple initial density into a more complex one by applying a sequence of invertible
transformations until a desired level of complexity is attained.
    \item RealNVP~\citep{dinh2016density}: It uses real-valued non-volume preserving (Real NVP) transformations, which are stably invertible and learnable transformations. 
    \item Glow~\citep{kingma2018glow}: It is a simple type of generative flow using an invertible $1 \times 1$ convolution. 
    \item RQ-NSF (Rational-Quadratic Neural Spline Flow)~\citep{durkan2019neural}: A fully-differentiable module based on monotonic rational-quadratic splines, which enhances the flexibility of both coupling and autoregressive transforms while retaining analytic invertibility.
    \item ResFlow (Residual Flow)~\citep{chen2019residual}: A flow-based generative model that produces an unbiased estimate of the log density and has memory-efficient backpropagation through the log density computation, which allows us to use expressive architectures and train via maximum likelihood.
\end{itemize}
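
All of the flow models above rely on the change-of-variables formula. As a reminder, in generic notation introduced here for illustration ($\mathbf{z}_0$, $p_0$, $f_k$, and $K$ are not tied to any specific model above), if $\boldsymbol{\theta} = f_K \circ \cdots \circ f_1(\mathbf{z}_0)$ with $\mathbf{z}_0 \sim p_0$ and each $f_k$ invertible and differentiable, then
\[
\log p(\boldsymbol{\theta}) = \log p_0(\mathbf{z}_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right|, \qquad \mathbf{z}_k = f_k(\mathbf{z}_{k-1}).
\]
For example, a single planar layer $f(\mathbf{z}) = \mathbf{z} + \mathbf{u}\, a(\mathbf{w}^{\top} \mathbf{z} + b)$ with a smooth activation $a(\cdot)$ has the closed-form Jacobian determinant
\[
\det \frac{\partial f}{\partial \mathbf{z}} = 1 + a'(\mathbf{w}^{\top} \mathbf{z} + b)\, \mathbf{u}^{\top} \mathbf{w},
\]
so its log-density correction costs only an inner product. The other flows differ mainly in how they trade this cost against the expressiveness of each $f_k$.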

\paragraph{Prediction Tasks}
\begin{itemize}
    \item \textbf{Non-parametric Models}
    \begin{itemize}
        % \item EM~\cite{lewis2011nonparametric}: It is a maximum penalized likelihood estimation method for simultaneously estimating the background rate and the triggering density of Hawkes process intensities that vary over multiple time scales.
        \item GM-NLF~\citep{eichler2017graphical}: It shows that the Granger causality structure of the process is fully encoded in the corresponding Hawkes kernels. It introduces a new nonparametric estimator of the Hawkes kernels based on a time-discretized version of the point process, using an infinite-order autoregression, and derives the consistency and asymptotic normality of the estimator.
        \item MMEL~\citep{zhou2013learning}: The model focuses on the nonparametric learning of the triggering kernels of multi-dimensional Hawkes processes. The algorithm decouples the parameters by constructing a tight upper bound of the objective function and applies the Euler--Lagrange equations for optimization in an infinite-dimensional functional space.
        \item Gibbs-Hawkes~\citep{zhang2018efficient}: An efficient nonparametric Bayesian method for estimating the kernel function of Hawkes processes. The method is based on the cluster representation of Hawkes processes. Utilizing the finite support assumption of the Hawkes process, it efficiently samples random branching structures and thus splits the Hawkes process into clusters of Poisson processes. By using a block Gibbs sampler, the samples on which the estimate is built converge to the desired posterior.
    \end{itemize}
    \item \textbf{Parametric Models}
    \begin{itemize}
        \item RMTPP~\citep{du2016recurrent}: The approach considers the intensity function of a temporal point process as a nonlinear function that depends on the history. It utilizes a recurrent neural network to automatically learn a representation of the influences from the event history, which includes past events and time intervals, thereby fitting the intensity function of the temporal point process.
        \item THP~\citep{zuo2020transformer}: The model employs a concurrent self-attention module to embed historical events and generate hidden representations for discrete time stamps. These hidden representations are then used to model the interpolated continuous time intensity function. THP can also incorporate additional structural knowledge. Importantly, THP surpasses RNN-based approaches in terms of computational efficiency and the ability to capture long-term dependencies.
        \item PromptTPP~\citep{xue2023prompt}: The model incorporates a continuous-time retrieval prompt pool into the base TPP, enabling sequential learning of event streams without the need for buffering past examples or task-specific attributes. Specifically, this approach consists of a base TPP model, a pool of continuous-time retrieval prompts, and a prompt-event interaction layer. By addressing the challenges associated with modeling streaming event sequences, this design enhances the model's performance.
        \item HYPRO~\citep{xue2022hypro}: The hybridly normalized probabilistic (HYPRO) model is capable of making long-horizon predictions for event sequences. This model consists of two modules: the first module is an auto-regressive base TPP model that generates prediction proposals, while the second module is an energy function that assigns weights to the proposals, prioritizing more realistic predictions with higher probabilities. This design effectively mitigates the cascading errors commonly experienced by auto-regressive TPP models in prediction tasks, thereby improving the model's accuracy in long-term forecasting.
        \item MLE-SGL~\citep{xu2016learning}: This work proposes an effective method for learning Granger causality in Hawkes processes. The model represents impact functions using a series of basis functions and recovers the Granger causality graph via group sparsity of the impact functions' coefficients. The proposed learning algorithm combines a maximum likelihood estimator (MLE) with a sparse group-lasso (SGL) regularizer. Additionally, the flexibility of the model allows the clustering structure of event types to be incorporated into the learning framework.
        \item GC-CGD~\citep{wei2022granger}: This work proposes a linear Hawkes process model coupled with a ReLU link function to recover a Granger causal graph with both exciting and inhibiting effects. It uses a scalable two-phase gradient-based method to obtain a maximum surrogate-likelihood estimator. In the first phase, it constrains all parameters to be non-negative and performs projected gradient descent with a fixed step length. In the second phase, it performs batch coordinate gradient descent on those variables whose corresponding rows (in the triggering effect matrix) could have negative values.
    \end{itemize}

\end{itemize}






\subsection{Computing Infrastructure}
All experiments on synthetic and real-world datasets, including the comparisons with baselines, are performed on an Ubuntu 20.04.3 LTS system with an Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz and 227 GB of memory.

\subsection{Hyper-Parameters Selection}
\label{appendix_subsection:hyper_param_selection}
Our model is easy to implement, and its results are easy to reproduce. We present the selected hyper-parameters for the synthetic and real-world datasets
in Table \ref{tab:hyperparameter_selection}. Hyper-parameters are selected by trading off the converged training log-likelihood, prediction performance, and time efficiency.

\begin{table*}[ht]
\centering
\caption{Descriptions and values of hyper-parameters used for models trained on the synthetic and real-world datasets.}
\begin{tabular}{ccccc}
\hline 
Hyper-parameters & \multicolumn{4}{c}{Value Used} \\
\hline
& Syn-Data (Uni-Modal) 
& Syn-Data (Multi-Modal) 
& MIMIC-IV & Covid Policy Tracker
\\
\hline
Max Epochs & 128 & 128 & 256 & 256 \\  
Batch Size & 64 & 64 & 64 & 64 \\
Hidden Size & 32 & 32 & 32 & 32 \\
$\#$ NFs Ensembled & 2 & 2 & 3 & 3 \\
$\#$ Layers for single NF & 6 & 6 & 8 & 8 \\
$\#$ Samples for single NF & 100 & 100 & 100 & 100 \\
Base Dist. & $\mathcal{N}(0, 1)$ & $\mathcal{N}(0, 1)$ & $\mathcal{N}(0, 1)$ & $\mathcal{N}(0, 1)$ \\
Learning Rate & 1e-3 & 1e-3 & 5e-4 & 5e-4 \\ 
Optimizer & Adam & Adam & Adam & Adam \\
Flow Model & RealNVP & RealNVP & ResFlow & RealNVP \\
\hline
\end{tabular}
\label{tab:hyperparameter_selection}
\end{table*}



\section{More Synthetic Dataset Experiments}
\label{appendix:more_synthetic_data_experiment}
\subsection{Complete Visualization Examples Comparing the Performance on Parameter Distribution Learning Tasks}
\label{appendix_subsec:complete_comparsion_distributions}
In Figure \ref{fig:example_true_density_learned_density_complete_results}, we report the complete visualization results of the learned marginal distributions for the target dimension ($u=3$), i.e., $\alpha_{31}, \alpha_{32}, \alpha_{33}$ and $\delta_{31}, \delta_{32}, \delta_{33}$, using 3-dimensional datasets with 7500 samples. The results demonstrate that our model accurately captures not only uni-modal distributions but also multi-modal distributions, outperforming Hypernet and $\beta$-VAE.

To test the scalability and fairly compare different deep generative models for learning parameter distributions in our problem setting, we vary the number of training samples over $\left \{ 2500, 5000, 7500, 10000, 12500, 15000 \right \}$. The results are shown in Table \ref{tab:compare_accuracy}: our model outperforms Hypernet and $\beta$-VAE in most cases.
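
As a point of reference for the KL-based results reported in this appendix, the metric can be sketched as follows. This is a minimal sketch that assumes the standard definition of the KL divergence; the direction of the divergence, the averaging set $\mathcal{S}$, and the symbols $p_{s}$ and $q_{s}$ are notational assumptions introduced here for illustration. For a parameter entry $s$ (e.g., $\alpha_{31}$) with ground-truth marginal density $p_{s}$ and learned marginal density $q_{s}$,
\begin{equation*}
\mathrm{KL}\left(p_{s} \,\|\, q_{s}\right) = \int p_{s}(x) \log \frac{p_{s}(x)}{q_{s}(x)} \, dx,
\end{equation*}
and an average over a set $\mathcal{S}$ of parameter entries is
\begin{equation*}
\mathrm{aKL} = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \mathrm{KL}\left(p_{s} \,\|\, q_{s}\right).
\end{equation*}
The divergence is non-negative and equals zero only when the two densities coincide almost everywhere, so lower values indicate a more accurate learned distribution.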

\begin{figure*}[htb] 
\centering 
\includegraphics[width=1.0\textwidth]{UAI 2025 Camera-Ready/Fig/compare_true_density_vs_learned_density_joint.pdf}
\caption{Visualization examples comparing different models on parameter distribution learning tasks with 3-dimensional datasets and 7500 samples. We report the learned marginal distribution for target dimension ($u=3$) in these figures.}
\label{fig:example_true_density_learned_density_complete_results} 
\end{figure*}

\begin{table*}[ht]
\centering
\caption{Accuracy of the learned parameter distributions across different models, measured by \textbf{KL divergence}, for varying sample sizes. Bold marks the best result and underline marks the second-best result.}
\begin{tabular}{c|cccccc|cccccc}
\hline 
\multirow{2}{*}{Model} & \multicolumn{6}{c|}{Uni-Modal} & \multicolumn{6}{c}{Multi-Modal} \\
\cline{2-13} 
& 2500 & 5000 & 7500 
& 10000 & 12500 & 15000 & 2500 & 5000 & 7500 
& 10000 & 12500 & 15000
 \\
\hline
Hypernet ($\alpha$) & 5.51 & 8.22 & 6.70 & 5.08 & 5.09 & 5.30 & 39.68& 37.92& 33.95 & 33.53& 33.30&32.61
\\
$\beta$-VAE ($\alpha$) & \underline{1.83} & \textbf{1.29} & \underline{1.53} & \underline{1.26} & \underline{1.24} & \underline{1.19} & \underline{5.12} & \underline{4.55} & \underline{4.42} & \textbf{3.23} & \underline{2.94} & \underline{2.87}
\\
% \rowcolor{ours}
\textbf{Ours*} ($\alpha$) & \textbf{1.48} & \underline{1.42} & \textbf{1.22} & \textbf{1.17} & \textbf{1.05} & \textbf{0.91} & \textbf{4.62} & \textbf{4.18} & \textbf{3.16} & \underline{3.38} & \textbf{2.79} & \textbf{2.52}
\\
\hline
Hypernet ($\delta$) & {6.22}  & \underline{3.54} & 3.99 & \underline{1.57} & 1.47 & \textbf{1.38}& 14.79 & 11.74 & 7.86 & 5.95 & 11.24 & 8.60
\\
$\beta$-VAE ($\delta$) & \textbf{3.62} & {4.17} & \underline{3.22} & {1.69} & \textbf{1.37} & {1.52} & \textbf{2.88} & \textbf{2.74} & \underline{2.43} & \underline{2.37} & \underline{2.19} & \underline{2.10}
\\
% \rowcolor{ours}
\textbf{Ours*} ($\delta$) & \underline{3.83} & \textbf{2.57} & \textbf{2.16} & \textbf{1.56} & \underline{1.43} & \underline{1.45} & \underline{2.92} & \underline{2.85} & \textbf{2.25} & \textbf{2.29} & \textbf{1.86} & \textbf{1.63} 
\\
\hline
\end{tabular}
\label{tab:compare_accuracy}
\end{table*}

Because the parameters in the generating process of the synthetic datasets are drawn from joint distributions, our flow-based model can capture their dependencies. As shown in Figure \ref{fig:joint_distribution_and_samples}, the samples of $\boldsymbol{\alpha}$ and $\boldsymbol{\delta}$ from our well-trained model closely match the ground-truth joint densities.
\begin{figure}[ht] 
\centering 
\includegraphics[width=0.48\textwidth]{UAI 2025 Camera-Ready/Fig/joint_density_and_samples.pdf}
\caption{True joint distribution (\textbf{contours}) of $\alpha_{31}$ and $\delta_{31}$ and samples (\textbf{red circles}) from our well-trained model, using the 3-dimensional multi-modal dataset with 7500 samples.}
\label{fig:joint_distribution_and_samples} 
\end{figure}

\subsection{Comparison with Other Flow-Based Models}
\label{appendix_subsec:compare_different_flow_models}

In Table \ref{tab:selection_nf}, we compare the performance of different flow-based models. In our setting, simple normalizing flow models such as RealNVP are capable of uncovering the ground-truth parameter distributions. Dense flows improve model performance but require substantially more computational resources; for example, ResFlow attains the best NLL in Table \ref{tab:selection_nf} but also the longest training time. In practical applications, we must strike a balance between model effectiveness and training efficiency. Our framework allows a suitable normalizing flow model to be selected according to factors such as data volume and dimensionality. The flow models selected for the synthetic and real-world dataset experiments are listed in Appendix \ref{appendix_subsection:hyper_param_selection}.

\begin{table}[ht]
\centering
\caption{Comparison of different normalizing flow models, using the uni-modal distribution dataset with 3 dimensions and 7500 samples as an example.}
\begin{tabular}{ccccc}
\hline 
 Model & NLL $\downarrow$ & aKL ($\alpha$) & aKL ($\delta$)
& Time $\downarrow$ \\
\hline
Planar~\citep{rezende2015variational} & 29.52 & 1.56 & 2.83 & \textbf{0.18h} \\
Glow~\citep{kingma2018glow} & 25.70 & 1.32 & 2.28 & 0.38h \\
% \rowcolor{gray!50}
RealNVP~\citep{dinh2016density} & \underline{25.26} & \underline{1.22} & \textbf{2.16} & \underline{0.21h}  \\
RQ-NSF~\citep{durkan2019neural} & 25.54 & 1.24 & 2.23 & 0.27h \\
ResFlow~\citep{chen2019residual} & \textbf{24.93} & \textbf{1.18} & \underline{2.20} & 0.56h \\
 \hline
\end{tabular}

\label{tab:selection_nf}
\end{table}


\subsection{Comparison with Traditional Parametric Models}
\label{appendix_subsec:compare_traditional_prob_models}
Using normalizing flows as complex priors offers two key advantages. The first is flexible modeling of irregular distributions (e.g., skewed patterns) that capture the complexity of population variance. The second is stronger exploration and expressive power than traditional parametric models, although this requires larger datasets. To illustrate this, we compare against conventional probabilistic models with simple priors, namely mixture-of-uniform and mixture-of-Gaussian models (abbreviated as ``MoU'' and ``MoG'', respectively). As shown in Table~\ref{tab:compare_other_prob_model_syn_data} and Table~\ref{tab:compare_other_prob_model_real_data}, our flow-based model outperforms both mixture models on synthetic distribution learning and on real-world prediction tasks. Furthermore, the performance of our model improves with more training data, as verified by the lower KL divergence.
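
To make the contrast concrete, the three families of priors can be written side by side. This is an illustrative sketch: the number of components $K$, the weights $\pi_k$, the component parameters $(\mu_k, \sigma_k^2)$ and $(a_k, b_k)$, and the flow map $f$ are generic notation assumed here, not the exact configurations used in the experiments. For a scalar parameter $\theta$,
\begin{align*}
p_{\mathrm{MoG}}(\theta) &= \sum_{k=1}^{K} \pi_k \, \mathcal{N}\left(\theta; \mu_k, \sigma_k^2\right), \\
p_{\mathrm{MoU}}(\theta) &= \sum_{k=1}^{K} \pi_k \, \frac{\mathbbm{1}\left[a_k \le \theta \le b_k\right]}{b_k - a_k},
\end{align*}
with $\pi_k \ge 0$ and $\sum_{k} \pi_k = 1$. A normalizing flow instead transforms a base variable $z \sim p_Z$ through an invertible map $\theta = f(z)$, which gives, by the change-of-variables formula,
\begin{equation*}
p_{\mathrm{flow}}(\theta) = p_Z\left(f^{-1}(\theta)\right) \left| \det \frac{\partial f^{-1}(\theta)}{\partial \theta} \right|.
\end{equation*}
The mixture densities are limited to the shapes that $K$ fixed-form components can express, whereas the shape of $p_{\mathrm{flow}}$ is determined by the learned map $f$.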

\begin{table}[ht]
\centering
\caption{Distribution learning ability of our flow-based model and the mixture models for varying training sample sizes (on multi-modal synthetic datasets). The metric is the average KL divergence between the learned and ground-truth distributions.}
\begin{tabular}{c|ccc|ccc}
\hline 
\multirow{2}{*}{Metric} & \multicolumn{3}{c|}{aKL ($\alpha$)} & \multicolumn{3}{c}{aKL ($\delta$)} \\
\cline{2-7} 
& 2500 & 7500 & 12500 & 2500 & 7500 & 12500 \\
\hline
MoU & 38.27 & 35.10 & 34.33 & 21.82 & 15.40 & 13.36 \\
MoG & \underline{5.45} & \underline{5.08} & \underline{4.93} & \underline{3.79} & \underline{3.21} & \underline{2.85} \\
% \rowcolor{ours}
\textbf{Ours*} & \textbf{4.62} & \textbf{3.16} & \textbf{2.79} & \textbf{2.92} & \textbf{2.25} & \textbf{1.86} \\ 
\hline
\end{tabular}
\label{tab:compare_other_prob_model_syn_data}
\end{table}


\begin{table}[ht]
\centering
\caption{Compare the prediction performance between our flow-based model and other mixture models using two real-world datasets.}
\begin{tabular}{c|cc|cc}
\hline 
\multirow{2}{*}{Metric} & \multicolumn{2}{c|}{MIMIC-IV} & \multicolumn{2}{c}{Covid Policy} \\
\cline{2-5} 
& NLL $\downarrow$ & RMSE $\downarrow$ & NLL $\downarrow$ & RMSE $\downarrow$ \\
\hline
MoU & 28.90 & 4.13 & 43.67 & 4.28 \\
MoG & \underline{24.25} & \underline{3.54} & \underline{39.02} & \underline{3.80} \\
% \rowcolor{ours}
\textbf{Ours*} & \textbf{21.32} & \textbf{2.86} & \textbf{36.94} & \textbf{3.35} \\ 
\hline
\end{tabular}
\label{tab:compare_other_prob_model_real_data}
\end{table}


\subsection{Experiments on Synthetic Datasets with Varying Decay}
\label{appendix_subsec:compare_performance_vary_beta}
Although estimating additional distributions adds computational complexity, it remains feasible. We extend our synthetic datasets so that $\beta$ varies across event types. We observe marginal declines in accuracy when learning the distributions of $\alpha$, $\beta$, and $\delta$ on datasets with varied $\beta$, but the overall performance remains satisfactory. Importantly, our flow-based model outperforms Hypernet and $\beta$-VAE in nearly all experimental settings of the parameter distribution learning task, and it also outperforms the other baseline models in prediction tasks, as shown in Table~\ref{tab:syn_data_vary_beta_dist_learning} and Table~\ref{tab:syn_data_vary_beta_prediction}, respectively.

\begin{table}[ht]
\centering
\caption{Accuracy of the learned parameter distributions across different models, measured by KL divergence, for varying sample sizes and different settings of the decay parameter ($\beta$). Here we focus on multi-modal distributions.}
\begin{tabular}{c|ccc|ccc}
\hline 
\multirow{2}{*}{Synthetic} & \multicolumn{3}{c|}{Shared $\beta$} & \multicolumn{3}{c}{Varied $\beta$} \\
\cline{2-7} 
& 2500 & 7500 & 12500 & 2500 & 7500 & 12500\\
\hline
Hypernet ($\alpha$) & 39.68 & 33.95 & 33.30 & 40.12 & 39.61 & 39.30 \\
$\beta$-VAE ($\alpha$) & \underline{5.12} & \underline{4.42} & \underline{2.94} & \underline{5.60} & \underline{5.29} & \underline{4.72} \\
% \rowcolor{ours}
\textbf{Ours*} ($\alpha$) & \textbf{4.62} & \textbf{3.16} & \textbf{2.79} & \textbf{4.93} & \textbf{3.75} & \textbf{3.38} \\
\hline
Hypernet ($\beta$) & 39.82 & 29.82 & 27.41 & 65.92 & 62.56 & 59.63 \\
$\beta$-VAE ($\beta$) & \underline{5.94} & \underline{5.52} & \underline{5.17} & \underline{6.30} & \underline{5.89} & \underline{5.22} \\
% \rowcolor{ours}
\textbf{Ours*} ($\beta$) & \textbf{4.82} & \textbf{4.55} & \textbf{3.76} & \textbf{5.13} & \textbf{4.79} & \textbf{4.21} \\
\hline
Hypernet ($\delta$) & 14.79 & 7.89 & 11.24 & 39.47 & 39.20 & 38.67 \\
$\beta$-VAE ($\delta$) & \textbf{2.88} & \underline{2.43} & \underline{2.19} & \underline{3.61} & \underline{3.42} & \underline{3.10} \\
% \rowcolor{ours}
\textbf{Ours*} ($\delta$) & \underline{2.92} & \textbf{2.25} & \textbf{1.86} & \textbf{3.18} & \textbf{2.75} & \textbf{2.34} \\
\hline
\end{tabular}
\label{tab:syn_data_vary_beta_dist_learning}
\end{table}

\begin{table}[ht]
\centering
\caption{Prediction tasks on synthetic datasets for varied and shared decay parameter. Here we focus on multi-modal distributions.}
\begin{tabular}{cccccc}
\hline 
\multicolumn{2}{c}{Synthetic Datasets} & \multicolumn{2}{c}{Shared $\beta$} & \multicolumn{2}{c}{Varied $\beta$} \\
\hline
Category & Method & NLL $\downarrow$ & RMSE $\downarrow$ & NLL $\downarrow$ & RMSE $\downarrow$ \\
\cline{1-6} 
 & GM-NLF & 34.27 & 2.72 & 34.47 & 3.15 \\
Non-Param. & MMEL & 34.55 & 2.85 & 34.98 & 3.27 \\
  & Gibbs-Hawkes & 33.86 & 2.64 & 34.50 & 3.04 \\
\hline
 & RMTPP & 32.67 & 2.77 & 34.19 & 2.90 \\
 & THP & \underline{32.10} & 2.46 & 33.80 & 2.86 \\ 
 & PromptTPP & 32.52 & 2.40 & \underline{33.67} & \underline{2.79} \\
Param. & HYPRO & -- & \underline{2.37} & -- & 2.83 \\
 & MLE-SGL & 33.78 & 2.57 & 34.52 & 3.22 \\
 & GC-CGD & 33.54 & 2.45 & 34.73 & 3.05 \\ 
% \rowcolor{ours}
& \textbf{Ours*} & \textbf{30.42} & \textbf{2.25} & \textbf{32.25} & \textbf{2.68} \\
\hline
\end{tabular}
\label{tab:syn_data_vary_beta_prediction}
\end{table}

\subsection{Scalability Experiments}
\label{appendix_subsection:scalability}

To evaluate the scalability of our proposed model, we vary the dimensionality over $\left \{ 2, 3, 5, 7, 9 \right \}$ and the sample size over $\left \{ 2500, 5000, 7500, 10000, 12500, 15000 \right \}$. Our model demonstrates high efficiency and good scalability, converging within 1.2 hours even in the most complex scenario, which uses 15000 training samples with 9 dimensions. As shown in Figure \ref{fig:scalability}, as the training sample size increases, the training time increases while the converged negative log-likelihood decreases, and the distribution learning accuracy increases accordingly. As the dimensionality of the Hawkes process increases, the distribution learning accuracy of our model may decrease slightly but remains satisfactory. Encouragingly, as the training sample size grows, the learning performance becomes stable.

\begin{figure*}[htb] 
\centering 
\includegraphics[width=1.0\textwidth]{UAI 2025 Camera-Ready/Fig/scalability.pdf}
\caption{Scalability experiments with varying training samples and dimensions. We take uni-modal distribution datasets as examples. All the experiments are conducted over 5 random runs and the standard errors are reflected in the shaded areas.}
\label{fig:scalability} 
\end{figure*}

\section{More Real-World Dataset Experiments}
\label{appendix:more_real_world_data_experiment}
\subsection{Healthcare Data Experiments -- MIMIC-IV} 
\label{appendix_subsection:mimic_iv}
MIMIC-IV\footnote{\url{https://mimic.mit.edu/}} is a publicly available database sourced from the electronic health records of the Beth Israel Deaconess Medical Center~\citep{johnson2023mimic}. Available information includes patient measurements, orders, diagnoses, procedures, treatments, and deidentified free-text clinical notes. Sepsis is a leading cause of mortality in the ICU, particularly when it progresses to septic shock. Septic shock is a critical medical emergency, and timely recognition and treatment are crucial for improving survival rates. In the real-world healthcare data experiments on the MIMIC-IV dataset, we aim to uncover the delay effects of the treatments related to septic shock across the whole patient sample.

\paragraph{Patients} We select 1943 patients who satisfy the following criteria: (i) the patient is diagnosed with sepsis~\citep{saria2018individualized}; (ii) for the diagnosed patient, the timestamps of all clinical tests and medication administrations, together with the corresponding dosages, are not missing.

\paragraph{Treatment} Following \citet{komorowski2018artificial}, we extract 21 treatments associated with sepsis that are consistent with expert consensus. Based on their distinct clinical characteristics, these treatments can be categorized into three groups, shown in Table \ref{appendix:tab_mimic_iv_description}. Vasopressor therapy is a fundamental treatment for septic-shock-induced hypotension, as it aims to correct the depressed vascular tone and thereby improve organ perfusion pressure. Antibiotics should also be given within a few hours of the diagnosis of sepsis. Some auxiliary treatments, such as packed red blood cells and invasive ventilation, are also necessary in the ICU.

\paragraph{Outcome} We treat real-time urine output as the outcome indicator, since low urine output is a direct indicator of poor circulatory function and a signal of septic shock. In contrast, normal urine output reflects the effect
of the drugs and treatments and the improvement of the patient's physical condition. Some treatments have a rapid effect on urine output, while others may take longer to exert an effect.

\paragraph{Preprocessing} Due to the frequent fluctuations in urine output within the ICU setting, we consider only those instances in which urine output became normal after remaining at an abnormal level for at least 24 hours. These instances are regarded as valid target events that are significant for prediction and explanation. For each patient, we extract all periods that meet this criterion. We also record all intake time points within the 24 hours leading up to the transition of urine output from abnormal to normal during clinical treatment. The processed dataset has 7377 records in total, which we split into training, evaluation, and testing sets with proportions of $80\%$, $10\%$, and $10\%$.
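
As a worked illustration of the split, and assuming it is applied at the record level with rounding to the nearest integer (an assumption; the exact rounding is not specified here), the $7377$ records divide as
\begin{align*}
0.8 \times 7377 &= 5901.6 \approx 5902, \\
0.1 \times 7377 &= 737.7 \approx 738, \\
7377 - 5902 - 738 &= 737,
\end{align*}
that is, roughly $5902$ training, $738$ evaluation, and $737$ testing records.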

\begin{table*}[t]
\centering
\caption{Description of the treatments extracted from the MIMIC-IV dataset.}
\begin{tabular}{c|l}
\hline
\multicolumn{1}{c|}{\textbf{Category}} & \multicolumn{1}{c}{\textbf{Treatment}}  \\
\hline
\multirow{7}{*}{\textbf{Vasoconstrictor}} & Epinephrine \\
\cline{2-2}
~  & Phenylephrine  \\
\cline{2-2}
~  & Norepinephrine \\
\cline{2-2}
~  & Dobutamine  \\
\cline{2-2}
~ & Dopamine  \\
\cline{2-2}
~  & Vasopressin  \\
\cline{2-2}
~  & Angiotensin II (Giapreza)  \\

\hline
\multirow{8}{*}{\textbf{Antibiotic}} & Vancomycin   \\
\cline{2-2}
~ &  Caspofungin  \\
\cline{2-2}
~ & Cefepime   \\
\cline{2-2}
~ & Ceftriaxone   \\
\cline{2-2}
~ & Gentamicin   \\
\cline{2-2}
~ & Micafungin  \\
\cline{2-2}
~ & Tobramycin  \\
\cline{2-2}
~ & Piperacillin/Tazobactam   \\

\hline
\multirow{6}{*}{\textbf{Auxiliary Treatment}} & Furosemide (Lasix)   \\
\cline{2-2}
~ & Heparin Sodium   \\
\cline{2-2}
~ & Invasive Ventilation  \\
\cline{2-2}
~ & Packed Red Blood Cells  \\
\cline{2-2}
~ & IV Immune Globulin (IVIG)  \\
\cline{2-2}
~ &  Acetaminophen-IV  \\

\hline
\end{tabular}
\label{appendix:tab_mimic_iv_description}
\end{table*}


\subsection{Healthcare Data Experiments -- Covid Policy} 
\paragraph{Policy Information}
\label{appendix_subsection:covid_policy_tracker}
The descriptions of the policies of the two countries considered in our experiments (Australia and France) are summarized in Table \ref{appendix:tab_policy_code_description}. In Table \ref{appendix:tab_sumarize_7_nations results}, we tick the policy for these two countries if it appears in the datasets. 

\paragraph{Preprocessing} We aim to investigate the impact of different categories of policies on the daily average number of confirmed cases. To capture the trend of the epidemic spread while suppressing daily noise, we tally the cumulative confirmed cases over 7 consecutive days. To understand the waiting time for each policy to take effect, we mark the date on which confirmed cases start decreasing as a ``dropping infection event''. We conduct experiments on data from Australia and France for the years 2021-2022, selected based on the severity of the COVID-19 situation and the effectiveness of governmental policies. For each country's dataset, we split the data into training, evaluation, and testing sets with proportions of $80\%$, $10\%$, and $10\%$.
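
One way to formalize this preprocessing is sketched below. The symbols $c_t$ and $S_t$ and the exact turning-point condition are assumptions introduced for illustration. Let $c_t$ denote the number of newly confirmed cases on day $t$. The 7-day cumulative count is
\begin{equation*}
S_t = \sum_{k=0}^{6} c_{t-k},
\end{equation*}
and day $t$ is marked as a dropping infection event when the smoothed series starts to decrease, for example when
\begin{equation*}
S_{t} < S_{t-1} \quad \text{and} \quad S_{t-1} \ge S_{t-2}.
\end{equation*}
Since $S_t - S_{t-1} = c_t - c_{t-7}$, this condition compares each day with the same weekday one week earlier, which is why weekly reporting fluctuations have little influence on the marked events.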


\paragraph{Experiment Results}
In Figure \ref{fig:delay_effect_covid_policy_Australia} and Figure \ref{fig:delay_effect_covid_policy_France}, we visualize the learned distributions and samples from well-trained models for Australia and France. Overall, the lag before government policies take effect appears shorter in Australia than in France. In both countries, the positive impact of Containment and Closure policies is approximately normally distributed and larger than that of nearly all other policy categories. In Australia, when the government enforces Containment and Closure policies, the population generally splits into two groups, giving two peaks in the delay distribution. One group promptly complies with isolation measures, leading to a decrease in new COVID-19 cases around 5 days after policy implementation. The other group responds more slowly, requiring approximately 6.5 days for the policies to take effect. The pattern in France is different: the delay roughly follows a normal distribution with a mean of 7.5 days. For Health System policies, the delay effects in Australia are also bimodal, with peaks at around 7 and 9 days; in France the onset times are longer, with peaks at approximately 7.5 and close to 11 days, but the variance is smaller. In both countries, the impact of Vaccination and Economic policies is smaller than that of the two categories above. In Australia, the time lag for Vaccination policies typically peaks at 10 and 11 days, while the mean time lag for Economic policies is 14.5 days. In France, the mean time lags for these two categories are approximately 11.5 and 15 days, respectively.

\begin{figure}[ht] 
\centering 
\includegraphics[width=0.6\textwidth]{UAI 2025 Camera-Ready/Fig/delay_effect_covid_19_Australia_new.pdf}
\caption{Learned impact $\boldsymbol{\alpha}$ and delay effect $\boldsymbol{\delta}$ distributions for Covid Policy dataset in Australia (\textbf{left and bottom bar plots}) and 2D scatter plot for samples of these two parameters (\textbf{top-right}).}
\label{fig:delay_effect_covid_policy_Australia} 
\end{figure}

\begin{figure}[ht] 
\centering 
\includegraphics[width=0.6\textwidth]{UAI 2025 Camera-Ready/Fig/delay_effect_covid_19_France_new.pdf}
\caption{Learned impact $\boldsymbol{\alpha}$ and delay effect $\boldsymbol{\delta}$ distributions for Covid Policy dataset in France (\textbf{left and bottom bar plots}) and 2D scatter plot for samples of these two parameters (\textbf{top-right}).}
\label{fig:delay_effect_covid_policy_France} 
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%
\begin{table*}[t]
\centering
\caption{Description of each policy code.}
\begin{tabular}{l|c|l}
\hline
\multicolumn{1}{c|}{\textbf{Category}} & \textbf{Code} & \multicolumn{1}{c}{\textbf{Description}} \\
\hline
\multirow{8}{*}{\textbf{Containment \& closure policies}} & C1 & School closing.\\
\cline{2-3}
 & C2 & Workplace closing.\\
\cline{2-3}
 & C3 & Cancel public events.\\
\cline{2-3}
 & C4 & Restrictions on gatherings.\\
\cline{2-3}
 & C5 & Close public transport.\\
\cline{2-3}
 & C6 & Stay at home requirements.\\
\cline{2-3}
 & C7 & Restrictions on internal movement.\\
\cline{2-3}
 & C8 & International travel controls.\\
\hline
\multirow{4}{*}{\textbf{Health system policies}} & H1 & Public information campaigns.\\
\cline{2-3}
 & H2 & Testing policy.\\
\cline{2-3}
 & H3 & Contact tracing.\\
\cline{2-3}
 & H4 & Emergency investment in healthcare.\\
% \cline{2-3}
%  & H5 & Investment in vaccines.\\
% \cline{2-3}
%  & H6 & Facial coverings.\\
% \cline{2-3}
%  & H7 & Vaccination policy.\\
% \cline{2-3}
%  & H8 & Protection of elderly people.\\
\hline
\multirow{4}{*}{\textbf{Vaccination policies}} & V1 & Vaccine prioritisation.\\
\cline{2-3}
 & V2 & Vaccine eligibility/availability.\\
\cline{2-3}
 & V3 & Vaccine financial support.\\
\cline{2-3}
 & V4 & Mandatory Vaccination.\\
\hline
\multirow{4}{*}{\textbf{Economic policies}} & E1 & Income support.\\
\cline{2-3}
 & E2 & Debt/contract relief.\\
 \cline{2-3}
 & E3 & Fiscal measures.\\
 \cline{2-3}
 & E4 & International support.\\
 \hline
% \textbf{Miscellaneous policies} & M1 & Record other policy announcements.\\
% \hline
\end{tabular}

\label{appendix:tab_policy_code_description}
\end{table*}


%%%%%%%%%%%%%%%%%%%%%%%%
\begin{table*}[ht]
\centering
% \footnotesize
\caption{The implemented policies for Australia (AUS) and France in 2021-2022.}
% \footnotesize % set the font size to small
\setlength{\tabcolsep}{5pt}
\begin{tabular}{c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c}
\hline
\multirow{3}{*}{\textbf{Nations}} & \multicolumn{20}{c}{\textbf{Policies}}\\
\cline{2-21}
 &\multicolumn{8}{c|}{Containment \& closure}&\multicolumn{4}{c|}{Health system}&\multicolumn{4}{c|}{Vaccination}&\multicolumn{4}{c}{Economic}\\
\cline{2-21}
 & C1 & C2 & C3 & C4 & C5 & C6 & C7 & C8 & H1 & H2 & H3 & H4 & V1 & V2 & V3 & V4 & E1 & E2 & E3 & E4\\
\hline
% \textbf{China} &\checkmark  & \checkmark & \checkmark & \checkmark &\checkmark & \checkmark & \checkmark & \checkmark & \checkmark &\checkmark & \checkmark &\checkmark & \checkmark & \checkmark &\checkmark &\checkmark &  & \\
% \hline
\textbf{AUS} & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark  & \checkmark &\checkmark & \checkmark & \checkmark & \checkmark & \checkmark &\checkmark & \checkmark &\checkmark & \checkmark &\checkmark &  & \checkmark &  &\\
\hline
% \textbf{SK} & \checkmark & \checkmark & \checkmark & \checkmark &  &  & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark \\
% \hline
% \textbf{UK} & \checkmark & \checkmark & \checkmark &\checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & & \checkmark \\
% \hline
% \textbf{US} & \checkmark & \checkmark & \checkmark &\checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark\\
% \hline
% \textbf{Italy} & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark &\checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark \\
% \hline
\textbf{France} & \checkmark & \checkmark & \checkmark & \checkmark & & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & \checkmark  & \checkmark &  & \\
\hline
% \multicolumn{19}{l}{{\textit{Note}: \textbf{AUS} for Australia}}\\
% \hline
\end{tabular}
\label{appendix:tab_sumarize_7_nations results}
\end{table*}


\subsection{Other Real-World Dataset Experiments}
\label{appendix_subsection:other_real_world_dataset_experiments}

We further extend our evaluation to additional datasets (StackOverflow, Taobao, and Taxi) by predicting both the next event type and its time. In addition to RMSE, we report the per-event negative log-likelihood (NLL; lower is better) and the event type prediction error rate (lower is better). As shown in Table~\ref{tab:experiment_other_real_world_dataset}, our model outperforms the baselines on the StackOverflow and Taobao datasets, but not on the Taxi dataset. This suggests that the StackOverflow and Taobao datasets may inherently exhibit delayed event-triggering effects (while the Taxi dataset may not), which is consistent with the motivation of our approach.
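
For reference, the three metrics can be written as follows. This sketch assumes their standard definitions; the symbols below, and the choice to compute RMSE on predicted event times rather than inter-event intervals, are assumptions for illustration. Let the test set contain $N$ events with true times and types $(t_i, k_i)$, predictions $(\hat{t}_i, \hat{k}_i)$, and history $\mathcal{H}_{t_i}$ up to each event. Then
\begin{align*}
\mathrm{NLL} &= -\frac{1}{N} \sum_{i=1}^{N} \log p\left(t_i, k_i \mid \mathcal{H}_{t_i}\right), \\
\mathrm{ER} &= \frac{100\%}{N} \sum_{i=1}^{N} \mathbbm{1}\left[\hat{k}_i \neq k_i\right], \\
\mathrm{RMSE} &= \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\hat{t}_i - t_i\right)^2}.
\end{align*}
For all three metrics, lower values are better: NLL measures how well the model fits the observed events, ER measures the share of wrongly predicted event types, and RMSE measures the typical size of the timing error.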


\begin{table}[ht]
\centering
\caption{Experiments on three additional real-world datasets. We report the per-event negative log-likelihood (NLL $\downarrow$), the event type prediction error rate (ER$\%$ $\downarrow$), and the event time prediction root mean square error (RMSE $\downarrow$).}
% \footnotesize
\setlength{\tabcolsep}{4pt}
\begin{tabular}{ccccccccccc}
\hline 
  \multicolumn{2}{c}{Real-World Datasets} & \multicolumn{3}{c}{StackOverflow} & \multicolumn{3}{c}{Taobao} & \multicolumn{3}{c}{Taxi} \\
\hline
Category & Method & NLL $\downarrow$ & ER $\%$ $\downarrow$ & RMSE $\downarrow$ & NLL $\downarrow$ & ER $\%$ $\downarrow$ & RMSE $\downarrow$ & NLL $\downarrow$ & ER $\%$ $\downarrow$ & RMSE $\downarrow$ \\
\cline{1-11} 
 & GM-NLF & 2.88 & 59.20 & 1.39 & 1.68 & 57.81 & 0.84 & 0.50 & 23.62 & 0.68 \\
Non-Param. & MMEL & 2.95 & 58.74 & 1.66 & 1.52 & 59.23 & 0.82 & 0.55 & 19.92 & 0.70 \\
  & Gibbs-Hawkes & 2.90 & 58.81 & 1.62 & 1.40 & 60.08 & 0.84 & 0.53 & 21.14 & 0.73 \\
\hline
 & RMTPP & 2.83 & 56.85 & 1.38 & 1.64 & 57.20 & 0.76 & \underline{0.35} & 16.43 & 0.53 \\
 & THP & \underline{2.68} & 52.73 & 1.38 & \underline{1.22} & 53.38 & 0.73 & 0.48 & 13.28 & \underline{0.46} \\ 
 & PromptTPP & 2.71 & \underline{51.53} & 1.37 & 1.25 & 54.26 & \underline{0.67} & 0.44 & \textbf{13.15} & \textbf{0.43} \\
Param. & HYPRO & -- & 51.70 & \underline{1.35} & -- & \underline{52.37} & 0.69 & -- & \underline{13.26} & 0.47 \\
 & MLE-SGL & 3.12 & 58.34 & 1.43 & 1.26 & 58.40 & 0.79 & 0.38 & 17.95 & 0.68 \\
 & GC-CGD & 3.04 & 57.36 & 1.40 & 1.38 & 55.89 & 0.73 & \textbf{0.32} & 17.22 & 0.64 \\
% \rowcolor{ours}
& \textbf{Ours*} & \textbf{2.64} & \textbf{51.22} & \textbf{1.33} & \textbf{1.18} & \textbf{52.06} & \textbf{0.64} & 0.52 & 16.08 & 0.56 \\
\hline
\end{tabular}
\label{tab:experiment_other_real_world_dataset}
\end{table}
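
For concreteness, the three metrics can be written as follows. This is a sketch under the conventions commonly used in temporal point process benchmarks, which we assume here; the symbols $N$, $T$, $(t_i, k_i)$, $(\hat{t}_i, \hat{k}_i)$, and $\lambda_k$ are introduced only for this illustration. Let a test sequence on $[0, T]$ contain $N$ events with times $t_i$ and types $k_i$, let $\hat{t}_i$ and $\hat{k}_i$ be the predicted time and type of the $i$-th event given its history, and let $\lambda_k(t)$ denote the conditional intensity of event type $k$. Then
\begin{align*}
\mathrm{NLL} &= -\frac{1}{N}\Bigl[\sum_{i=1}^{N} \log \lambda_{k_i}(t_i) - \sum_{k}\int_{0}^{T} \lambda_k(t)\,\mathrm{d}t\Bigr], \\
\mathrm{ER} &= \frac{100}{N}\sum_{i=1}^{N} \mathbbm{1}\bigl[\hat{k}_i \neq k_i\bigr], \\
\mathrm{RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N} \bigl(\hat{t}_i - t_i\bigr)^2}.
\end{align*}
For example, misclassifying the type of $512$ out of $1000$ test events gives $\mathrm{ER} = 51.2\%$.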



\section{Limitations and Broader Impacts}
\label{appendix:limitations_broader_impacts}
We assume throughout the paper that the decay parameter $\beta$ is shared across all event types. Estimating the decay parameter of a Hawkes process is inherently challenging: only a limited number of studies address this task \citep{santos2023bayesian}, and they show that the estimation difficulties stem from the noisy, non-convex shape of the log-likelihood of the Hawkes process as a function of the decay. Our proposed model can nonetheless be extended to estimate the distribution of the decay parameter, although the stability of this estimation needs to be improved. Future work could also apply our method to other forms of triggering kernel for the Hawkes process, such as the Gamma, Weibull-based, and power-law kernels, to validate its performance.

Our proposed model infers time-lag distributions, which are scientifically meaningful and help trace the original causal time, thereby supporting root cause analysis. In healthcare, inferring the distributions of time lags and of other parameters that affect drug efficacy can help clinicians identify when a drug takes effect and thus develop more effective treatment strategies for patients. In a pandemic, our model can help decision-makers and citizens understand governmental responses in a consistent way, aiding efforts to fight the pandemic. These uses, however, require our algorithm to be highly accurate. In clinical applications, our method can serve as a reference for inexperienced doctors, providing them with valuable guidance.



\end{document}
