% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%The UAI 2022 paper style is based on a custom \textsf{uai2022} class.
%The class file sets the page geometry and visual style.\footnote{%
%    The class uses the packages \textsf{adjustbox}, \textsf{environ}, \textsf{letltxmacro}, \textsf{geometry}, \textsf{footmisc}, \textsf{caption}, \textsf{textcase}, \textsf{titlesec}, \textsf{titling}, \textsf{authblk}, \textsf{enumitem}, \textsf{microtype}, \textsf{lastpage}, and \textsf{kvoptions}.
%}
%The class file also loads basic text fonts.\footnote{%
%    Fonts loaded are \textsf{times} (roman), \textsf{helvet} (sanserif), \textsf{courier} (fixed-width), and \textsf{textcomp} (common symbols).
%}
%\emph{You may not modify the geometry or style in any way, for example, to squeeze out a little bit of extra space.}
%(Also do not use \verb|\vspace| for this.)
%Feel free to use convenience functionality of loaded packages such as \textsf{enumitem}.
%The class enables hyperlinking by loading the \textsf{hyperref} package.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{amsmath, amsfonts, amsthm, amssymb, here, dsfont, hyperref}
\usepackage{graphicx,color,subfigure,multirow, here}
\usepackage[utf8]{inputenc} 
\usepackage[T1]{fontenc}
\usepackage{enumitem}
\usepackage{url}
\usepackage{algorithm} 
\usepackage{algorithmic}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newtheorem{theo}{Theorem}[section]
\newtheorem{definition}[theo]{Definition}
\newtheorem{prop}[theo]{Proposition}
\newtheorem{propri}[theo]{Property}
\newtheorem{coro}[theo]{Corollary}
\newtheorem{lemme}[theo]{Lemma}
\newtheorem{rem}[theo]{Remark}
\newtheorem{ex}[theo]{Example}
\newtheorem{ass}[theo]{Assumption}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\argmax}[1]{\underset{#1}{\operatorname{arg}\!\operatorname{max}}\;}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{arg}\!\operatorname{min}}\;}
\newcommand{\n}{\noindent }
\newcommand{\w}{\widehat}
\newcommand{\wt}{\widetilde}
\newcommand{\one}{\mathds{1}}
\newcommand{\cA}{\mathcal{A}}
\newcommand{\cB}{\mathcal{B}}
\newcommand{\B}{\mathbb{B}}
\newcommand{\cC}{\mathcal{C}}
\newcommand{\cD}{\mathcal{D}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\cE}{\mathcal{E}}
\newcommand{\cF}{\mathcal{F}}
\newcommand{\F}{\mathbb{F}}
\newcommand{\cH}{\mathcal{H}}
\newcommand{\cP}{\mathcal{P}}
\renewcommand{\P}{\mathbb{P}}
\newcommand{\cR}{\mathcal{R}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\cS}{\mathcal{S}}
\newcommand{\cT}{\mathcal{T}}
\newcommand{\cY}{\mathcal{Y}}
\newcommand{\dd}{\text{{\rm d}}}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


















%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Multiclass Classification for Hawkes Processes}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:christophe.denis@univ-eiffel.fr}{Christophe Denis}{}}
\author[2]{\href{mailto:charlotte.dion_blanc@sorbonne-universite.fr}{Charlotte Dion-Blanc}}
\author[3]{\href{mailto:aure.sansonnet@agroparistech.fr}{Laure Sansonnet}}
% Add affiliations after the authors
\affil[1]{%
    LAMA\\
    Université Gustave Eiffel\\
    France
}
\affil[2]{%
    LPSM\\
    Sorbonne Université\\
    France
}
\affil[3]{%
    AgroParisTech, MIA-Paris-Saclay\\
    Université Paris-Saclay\\
    France
}

\begin{document}
\maketitle

\begin{abstract}
We investigate the multiclass classification problem where the features are event sequences. More precisely, the data are assumed to be generated by a mixture of simple linear Hawkes processes.
In this new setting, the classes are distinguished by their triggering kernels. The challenge is then to build an efficient classification procedure. We derive the Bayes rule and propose a two-step procedure to estimate the Bayes classifier. In the first step, the weights of the mixture are estimated; in the second step, an empirical risk minimization procedure is performed to estimate the parameters of the Hawkes processes.
We establish the consistency of the resulting procedure and derive rates of convergence.
Finally, the numerical properties of the data-driven algorithm are illustrated through a simulation study where the triggering kernels are assumed to belong to the popular parametric exponential family. The study highlights the accuracy and the robustness of the proposed algorithm. In particular, even if the underlying kernels are misspecified, the procedure exhibits good performance.
\end{abstract}

%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
\label{sec:intro}
%%%%%%%%%%%%%%%%%%%%

A crucial challenge in multiclass learning is to provide algorithms designed to handle temporal data. In the present paper, we tackle the multiclass classification problem where the features are time event sequences. 
More precisely, we assume that the data come from a mixture of Hawkes processes and we focus on the classification per trajectory (and not per event).


In neuroscience, we can consider event sequences as recorded spike trains on several neurons from different populations (healthy or sick subjects, for instance). The goal is then to predict the status (healthy or not) of a new subject from the associated recording~\citep{lambert}.

%%%%% Hawkes et classif
Hawkes processes, originally introduced by \citet{HAWKES71}, model event sequences in which
past events influence future events.
%\sout{In many applications, the sequences can be considered as repetitions from different classes. Indeed, in neurosciences, from recorded time sequences of spiking, for different neurons on different healthy or sick subjects, we propose to learn and predict from a new recording if the patient is sane or not.}
%Another example for this mixture model is the activity of users on social network. 
%\sout{Indeed, Hawkes processes have now an important place on the literature on models for neurosciences} (see \emph{e.g.} \cite{Hansen15}, \cite{ditlevsen2017multi}, \cite{nature})
%\sout{but also in mathematical finance} (see \emph{e.g.} \cite{bacryfinance} for a complete review)
%\sout{or in social network comprehension} (see \emph{e.g.} \cite{zhou}, \cite{twitter}).
%An example for this mixture model is the activity of users on Linkedin network. Indeed, from recorded observations of job hopping events, some classes can be distinguish in the population as senior position, intern position, research job etc. Then, a classification strategy can lead to 
Hawkes processes arise in a wide variety of fields, ranging from neuroscience to finance. In mathematical finance, see {\it e.g.} \citep{bacryfinance} for a complete review; in the social network literature, see {\it e.g.} \citep{classiftwitter} and \citep{qu}. In neuroscience, Hawkes processes are of statistical interest for modeling neuron spike occurrences, see~{\it e.g.} \citep{Hansen15}, \citep{ditlevsen2017multi}, \citep{nature}.

%%%%% Hawkes rapide
%\citep{HO1974},
A seminal work on the properties of Hawkes processes is \citep{BM1996}. Furthermore, there are numerous statistical inference methods for Hawkes processes. For instance, one can cite \citep{Hansen15}, \citep{BM2016} and more recently \citep{BGM}, or, in a Bayesian framework, \citep{bayesian}. Besides, \citet{favetto} focus on parameter estimation for Hawkes processes from repeated observations in the context of electricity market modeling.
%(and usefulness for modeling electricity market). 

However, the aim of this paper is multiclass classification rather than parameter inference. To the best of our knowledge, except for~\citet{classiftwitter}, no existing work deals with supervised classification for Hawkes processes. \citet{classiftwitter} propose to use multivariate Hawkes processes for classifying sequences of temporal textual data, with an application to rumors coming from Twitter datasets. They highlight that a model based on Hawkes processes is a competitive approach which takes into account the temporal dynamics of the data. However, they do not provide any theoretical guarantees.

More recently, \citet{Dutta, tondulkar2020, ram} focus on classification per event. For instance, on the classical PHEME Twitter dataset, used for rumor stance classification, their models assign a label to each tweet (that is, to each event time). This classification setting, with temporal and textual data, is thus somewhat different from our framework. Besides, as in~\citep{classiftwitter}, the authors do not provide theoretical properties to support their procedures.


In this work, we repeatedly observe jump times coming from a mixture of Hawkes processes on a fixed time interval $[0,T]$.
The classes are characterized by different triggering kernels. We first formally define the model and provide the
explicit form of the Bayes classifier in Section~\ref{sec:framework}.
The expression of the Bayes classifier suggests a plug-in 
approach to estimate the optimal predictor. Section~\ref{sec:Plug-inClass} is devoted to the definition of plug-in type classifiers and to the study of their properties. We show how the misclassification error of any plug-in predictor is linked to the estimation error of the process parameters.
We propose in Section~\ref{sec:ClassProcedure} a two-step procedure
to build a plug-in type classifier. The first step is dedicated to the estimation of the weights of the mixture. In the second step, the parameters of the process are estimated through an empirical risk minimization procedure, using ideas similar to those of~\citet{DDM}.
The resulting algorithm benefits from the attractive properties of the empirical risk minimizer: it is computationally efficient and offers good theoretical properties. In particular, under mild assumptions, we show that the proposed procedure performs as well as the Bayes classifier.
Section~\ref{sec:NumExp} illustrates the performance and the robustness of the method in the case where the triggering kernels are assumed to belong to the parametric exponential family. Finally, Section~\ref{sec:Dicussion} discusses some directions for future work.

%A previous paper \citep{DDM} develops analogous ideas in the discrete diffusion paths context.  
%We aim to retrieve its intensity function associated among $K$ classes characterized by different triggering kernels.
%We consider the sample $\big((\cT_T^1,Y^1), \ldots, (\cT_T^n,Y^n)\big)$, where $(\cT_T^i,Y^i)$ are independent copies of $(\cT_T,Y)$ with $\cT_T$ a sequence of jump times and $Y\in\cY$ its label. The asymptotic framework is that $n$ goes to infinity, while the observation time $T$ is assumed fixed. 
%, \citep{reynaud2013spike}
%Nevertheless, inference parameters is not our purpose here. The aim of the paper is the classification task. How can we guess the label of a subject from times recording? 
%The first point it that we assume that the observations come from a simple linear Hawkes process with various classes which are discriminated by their kernel function. The second point to notice is that we propose a \textit{plug-in} approach to answer this issue, it means that the classifier relies on estimations of the unknown parameters. The strength of the method is that one step one step is done through empirical risk minimization. A previous paper \citep{DDM} develops analogous ideas in the discrete diffusion paths context. Finally, Section~\ref{sec:NumExp} illustrates the the, together with the robustness of the method.

%To the best of our knowledge, this is the first time that this question is investigated theoretically and numerically for any 
%(or close) 
%kernel functions.
%%%%% Clustering
%For example \citep{twitter} propose to  use (multivariate) Hawkes Processes for classifying sequences of temporal textual data, with an application on rumours %stance classification 
%on Twitter datasets. They highlight that a model based on Hawkes processes is a competitive approach which takes into account the temporal dynamic.
%Precisely, here we observe repetitions of sequences of jump times coming from a simple linear Hawkes process. We aim to retrieve its intensity function associated among $K$ classes characterized by different triggering kernels.
%We consider the sample $\big((\cT_T^1,Y^1), \ldots, (\cT_T^n,Y^n)\big)$, where $(\cT_T^i,Y^i)$ are independent copies of $(\cT_T,Y)$ with $\cT_T$ a sequence of jump times and $Y\in\cY$ its label. The asymptotic framework is that $n$ goes to infinity, while the observation time $T$ is assumed fixed. 
%Firstly, in Section \ref{sec:framework} we introduce the model, the main notations and we derive the Bayes rule in this context. Then in Section \ref{sec:Plug-inClass} we show how the missclassification error, for any plug-in classifier is linked to the error of estimation of the Hawkes parameters.
%Section \ref{sec:ClassProcedure} gives the classification procedure, which is based on minimization of the convexified empirical risk, and we derive the rates of convergence. 
%Finally, Section \ref{sec:NumExp} illustrates the matching between theory and practice, together with the robustness of the method. 


%%%%%%%%%%%%%%%%%%%%
\section{General framework}
\label{sec:framework}
%%%%%%%%%%%%%%%%%%%%

Section~\ref{subsec:modelNotations} introduces the model and some notation, and states the objective of the paper. In Section~\ref{subsec:BayesClassifier}, we provide an explicit formula for the optimal predictor.

%%%%%%%%%%%%%%%%%%%%
\subsection{Statistical setting}
\label{subsec:modelNotations}
%%%%%%%%%%%%%%%%%%%%
Let $Y$ be a random variable taking values in $\mathcal{Y} = \{1,\ldots,K\}$, with $K \geq 2$, representing the label of the observations. The distribution of $Y$ is denoted by ${\bf p}^* = (p^*_k)_{k \in \mathcal{Y}}$ and is unknown.
We assume that the observations come from a mixture $N$ of simple linear Hawkes processes observed on the time interval $[0,T]$. Precisely, conditionally on $Y$, $N$ is a  simple linear Hawkes process.
The number of points that lie in $[0,t]$ is denoted by $N_t$ and the corresponding counting process is $(N_t)_{0\leq t \leq T}$. The jump times of $N$ are denoted $T_1, \ldots, T_{N_T}$. The filtration (or history) at time $t^-$ is denoted $\mathcal{F}_{t^-}$ and contains all the necessary information for generating the next point of $N$.

\paragraph{Conditional intensity}
The intensity of the process $N$ at time $t \geq 0$, with respect to the filtration $(\mathcal{F}_t)_{t \geq 0}$, is defined as
\begin{equation}
\label{def:lambdastar}
\lambda^*_Y(t) := \lambda^{(\mu^*,h^*_Y)}(t)
:= \mu^* + \sum_{T_i < t} h^*_Y (t-T_i),
\end{equation}
where the first term $\mu^*>0$ is the baseline, or exogenous intensity, and the second term is a weighted sum over past events. For each class $k \in \mathcal{Y}$, the function $h^*_k$ is the triggering kernel which is
nonnegative and supported on $\R_{+}$. 
Besides, both parameters $\mu^*$ and ${\bf h}^*= (h^*_1, \ldots, h^*_K)$ are assumed to be unknown. 

Note that the baseline intensity is assumed to be common to all classes. This assumption is nevertheless consistent with the neuronal experimental setting described in Section~\ref{sec:intro}. Indeed, if the spike trains are recorded on the same type of neurons ({\it e.g.}~neurons which play the same role), it seems relevant to assume that the exogenous intensity is the same across classes.
%Also, we have taken the side of set the same baseline for all classes for mathematical computations. This assumption is coherent with the neuronal example that one can have in mind. Indeed, this also called exogenous rate coefficient could be the same if biologist measure the same kind of neurons (same role). However, one generalisation could be to consider a common time-inhomogeneous baseline.

\paragraph{Objective}
Given a sequence $\mathcal{T}_{T} = \{T_1, \ldots, T_{N_T}\}$ of observed jump times of $N$ over the fixed interval $[0,T]$, the goal is to build a classifier $g$, that is, a measurable function such that $g(\mathcal{T}_T)$ is a prediction of the associated label $Y$.
The performance of a classifier $g$ is then measured through its misclassification risk
\begin{equation*}
\mathcal{R}(g) := \mathbb{P}\left(g(\mathcal{T}_T) \neq Y\right).
\end{equation*}  
In the following, we denote by $\mathcal{G}$ the set of classifiers.

%%%%%%%%%%%%%%%%%%%%
\subsection{Bayes rule}
\label{subsec:BayesClassifier}
%%%%%%%%%%%%%%%%%%%%
The unknown minimizer of $\cR$ over $\mathcal{G}$ is the so-called  Bayes classifier, denoted by $g^*$, and is characterized by 
\begin{equation*}
%\label{eq:bayesclassif}
g^{*} \left(\mathcal{T}_T\right) \in \argmax{k \in \mathcal{Y}} {\pi^*_{k}(\mathcal{T}_T)},
\end{equation*}
with $\pi^*_k\left(\mathcal{T}_T\right)= \mathbb{P}\left(Y = k \mid \mathcal{T}_T\right)$.
The following proposition gives the expression of the conditional probabilities $\pi^*_k$ and then provides a closed form of the Bayes classifier.

\begin{prop}
\label{prop:prop1}
Let $T \geq 0$. For each $k \in \mathcal{Y}$,
we define,
\begin{align}
\label{def:F}
& \hspace{-1em} F^*_k(\mathcal{T}_T) = F^{(\mu^*,h^*_k)}(\mathcal{T}_T) \\
& := - \int_0^T \lambda^{(\mu^*,h^*_k)}(s) \;\dd s +  \sum_{T_i \in \mathcal{T}_T} \log(\lambda^{(\mu^*,h^*_k)}(T_i)). \nonumber
\end{align}
Then, the conditional probabilities satisfy
\begin{equation*}
\pi^*_k\left(\mathcal{T}_T\right) = \phi^{\bf p^*}_k({\bf{F}}^*(\mathcal{T}_T)) \quad \P\text{-a.s.},
\end{equation*}
where 
${\bf F}^*=(F^*_1, \ldots, F^*_K)$
and the functions $\displaystyle \phi^{\bf p^*}_k: (x_1, \dots, x_K) \mapsto \frac{ p^*_k{\rm e}^{x_k}}{\sum_{j=1}^K p^*_j {\rm e}^{x_j}}$, $k \in \mathcal{Y}$, are softmax functions weighted by ${\bf p}^*$.
\end{prop}

Note that, conditionally on the event $\{Y=k\}$, $F^*_k(\mathcal{T}_T)$ is the log-likelihood of the sequence $\mathcal{T}_T$.
Proposition~\ref{prop:prop1} makes explicit how the Bayes classifier depends on the unknown parameters ${\bf p}^*$, $\mu^*$ and ${\bf h}^*$. In the following, for a given
classifier $g \in \mathcal{G}$, we define its excess risk as
\begin{equation*}
\cE\left(g\right) := \cR(g)-\cR(g^*).    
\end{equation*}
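
\paragraph{Sketch of the argument}
The following informal sketch, which is not a substitute for a complete proof, indicates where the softmax form in Proposition~\ref{prop:prop1} comes from. Assume that, conditionally on $\{Y=k\}$, the distribution of $\mathcal{T}_T$ admits the density $\exp\left(T + F^*_k(\mathcal{T}_T)\right)$ with respect to the distribution of a homogeneous Poisson process with unit intensity on $[0,T]$ (this reference measure is chosen here for illustration only). Bayes' formula then yields
\begin{align*}
\pi^*_k(\mathcal{T}_T) & = \frac{p^*_k \,{\rm e}^{T + F^*_k(\mathcal{T}_T)}}{\sum_{j=1}^K p^*_j \,{\rm e}^{T + F^*_j(\mathcal{T}_T)}} \\
& = \frac{p^*_k \,{\rm e}^{F^*_k(\mathcal{T}_T)}}{\sum_{j=1}^K p^*_j \,{\rm e}^{F^*_j(\mathcal{T}_T)}},
\end{align*}
since the factor ${\rm e}^{T}$, which does not depend on the class, cancels.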

%Interestingly, Proposition~\ref{prop:prop1} suggests a plug-in approach to build consistent classification procedures. To be more specific, 
%
%% REMETTRE L'EXEMPLE EN TRAVAILLANT NOTAION ?
%\begin{ex}[\bf Notations \`a changer ensuite]
%Neurons. $K=2$ Two classes: healthy patients $K=0$, unhealthy patients $K=1$.\\
%Generalize on $D$ dimensions: coordinate number $j$ for $j=1, \ldots, D$
%$$
%\lambda_t=\mu_Y +\sum_{\ell =1}^D \int_{-\infty}^{t-} h^Y_{j,\ell} (t-u)dN_{\ell}(u)   , ~ t \in [0,T]. 
%$$
%where $h_{j, \ell}$ decay function that also quantify the influence of $\ell$ on $j$. 
%For example in \citep{BGM} the author consider instead 
%$h_{j,\ell}= a_{j ,\ell} h_{j \ell}$ is order to distinguish a coefficient of influence, later called \textbf{adjacency matrix} $A=(a_{j,j'})$, to the kernel $h$, fixed and known. 
%\end{ex}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Plug-in type classifier}
\label{sec:Plug-inClass}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We first introduce assumptions related to the model in Section~\ref{subsec:Assumptions} and then define a set of classifiers which relies on the plug-in principle in Section~\ref{subsec:setClassifier}. Finally, the main properties of the plug-in classifier are provided in Section~\ref{subsec:propertiesPlug-in}. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Assumptions}
\label{subsec:Assumptions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We first make the following assumptions on the triggering kernels. 

\begin{ass}[Stability condition]
\label{ass:h}
For each $k \in \mathcal{Y}$, $h_k: \R_+ \rightarrow \R_+$ is bounded and satisfies $\int h_k(t) \;\dd t < 1$.
%pour hawkes linéaire (donc $h$ positive) dès que $h$ est localement intégrable on a $E(N_t)<\infty$.
\end{ass}

\begin{ass}\label{ass:mu}
There exist $0 < \mu_0 < \mu_1$ such that $\mu_0 \leq \mu^* \leq \mu_1$.
\end{ass}

\begin{ass}\label{ass:prob}
There exists a positive constant $p_0$ such that $\min({\bf p}^*) > p_0$. 
\end{ass}

Assumption~\ref{ass:h} guarantees that $N_T$ admits finite exponential moments, that is, there exists $a>0$ such that $\E[\exp(a|N_T|)] < \infty$, see for instance~\citep{roueffsansonnet}.
%Indeed, from~\citep{roueffsansonnet}, we have 
%\begin{equation*}
%\exists a>0 \; \mbox{s.t.} \; \E[\exp(a|N_t|)] < \infty.
%\end{equation*}
In particular, exponential and power-law kernels satisfy this assumption under suitable conditions on their parameters. 
%Note that under Assumption~\ref{ass:h}, we have $\E[ \lambda_t] \leq \mu t / (1- \|h\|_1)$, and moreover one can show the following result on the moments of the intensity process
 %$$\exists a>0, ~\text{s.t.}~ \E[\exp(a|N_t|)] < \infty$$
 %see \citep{roueffsansonnet}.
Assumption~\ref{ass:mu} is a technical assumption,
and Assumption~\ref{ass:prob} ensures that every component
of the mixture occurs with a probability bounded away from zero. 
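
\begin{ex}
As an illustration of Assumption~\ref{ass:h}, consider an exponential kernel $h(t) = \alpha\, {\rm e}^{-\beta t}$, $t \geq 0$, where $\alpha \geq 0$ and $\beta > 0$ are parameters introduced only for this example (other parametrizations of the exponential kernel are possible). Then $h$ is bounded by $\alpha$ and
\begin{equation*}
\int_0^{+\infty} h(t) \;\dd t = \frac{\alpha}{\beta},
\end{equation*}
so that, with this parametrization, the kernel satisfies Assumption~\ref{ass:h} as soon as $\alpha < \beta$.
\end{ex}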

We denote by $\cP_{p_0}$ the following set of probability weights:
$$\cP_{p_0}:= \Big\{{\bf p}\in \R^K_+: \; \sum_{i=1}^K p_i=1, \; \min({\bf p})>p_0\Big\}.$$
%The constant $p_0$ does not need to be known.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Definitions}
\label{subsec:setClassifier}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% pool ou set ? 
In this section, we present the construction of the plug-in type classifiers.

First we introduce a set $\mathcal{H}$ of nonnegative functions supported on $\mathbb{R}_{+}$.
%For a vector of functions ${\bf h}= (h^1, \ldots, h^K) \in \mathcal{H}^K$,
%we introduce the supremum norm
%$$\|{\bf h}\|_{\infty,T} =\max_{j \in \mathcal{Y}}  \sup_{t \in [0,T]}|h^{(j)}(t)|.$$
%We also make the following assumption on the set $\mathcal{H}^K$
%\begin{ass}
%\label{ass:assOnH}
%There exists $A >0$ such that
%\begin{equation*}
%\sup_{{\bf h} \in \cH^K} \|{\bf h}\|_{\infty,T} \leq A.
%\end{equation*}
%\end{ass}
With a $K$-tuple ${\bf h} = (h_1, \ldots, h_K)$ in $\cH^K$, we associate a vector of probability weights $\mathbf{p}$ and a baseline intensity $\mu > 0$. For each $k \in \mathcal{Y}$, we then define 
%we consider $\mathbf{p} \in \cP_{p_0}$, $\mu \in [\mu_0,\mu_1]$, and . For each $k \in \mathcal{Y}$, we then define 
$$
\lambda_k(t) = \lambda^{(\mu,h_k)}(t) = \mu + \sum_{T_i < t} h_k (t-T_i), \quad t \in [0,T].
$$
Hence, the random functions $(\lambda_k)_{k=1,\ldots,K}$ are approximations of the conditional intensities $\lambda^*_k$ defined by~\eqref{def:lambdastar}. Besides, similarly to the definition~\eqref{def:F} of $F^*_k(\mathcal{T}_T)$, we define
%\begin{eqnarray*}
%F_k(\cT_T) &=& F^{(\mu,h_k)}(\cT_T) \\
%& =& - \int_0^T \lambda_k(s) \;\dd s +  \sum_{T_i \in \cT_T} \log(\lambda_k(T_i)).
%\end{eqnarray*}
\begin{align*}
F_k(\cT_T) & = F^{(\mu,h_k)}(\cT_T) \\
& = - \int_0^T \lambda_k(s) \;\dd s +  \sum_{T_i \in \cT_T} \log(\lambda_k(T_i)).
\end{align*}
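
\begin{ex}
To illustrate how $F_k(\cT_T)$ can be evaluated, assume an exponential kernel $h_k(t) = \alpha_k {\rm e}^{-\beta_k t}$, where $\alpha_k \geq 0$ and $\beta_k > 0$ are parameters introduced only for this example, and assume that the jump times are ordered, $T_1 < \cdots < T_{N_T} \leq T$. The integral term is then explicit,
\begin{equation*}
\int_0^T \lambda_k(s) \;\dd s = \mu T + \frac{\alpha_k}{\beta_k} \sum_{i=1}^{N_T} \left(1 - {\rm e}^{-\beta_k (T - T_i)}\right),
\end{equation*}
and the intensity at the jump times satisfies $\lambda_k(T_i) = \mu + \alpha_k A_i$, where $A_1 = 0$ and, for $i \geq 2$,
\begin{equation*}
A_i = {\rm e}^{-\beta_k (T_i - T_{i-1})} \left(1 + A_{i-1}\right).
\end{equation*}
Consequently,
\begin{align*}
F_k(\cT_T) = {} & - \mu T - \frac{\alpha_k}{\beta_k} \sum_{i=1}^{N_T} \left(1 - {\rm e}^{-\beta_k (T - T_i)}\right) \\
& + \sum_{i=1}^{N_T} \log\left(\mu + \alpha_k A_i\right),
\end{align*}
which requires a number of operations proportional to $N_T$.
\end{ex}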

We also consider
\begin{equation}\label{eq:pi}
\pi^k_{{\bf p},\mu,{\bf h}}(\cdot) := \phi^{{\bf p}}_k\left({\bf F}^{(\mu,{\bf h})}(\cdot)\right), 
\end{equation}
where ${\bf F}^{(\mu,{\bf h})} := (F_1, \ldots, F_K)$ and the $\phi^{{\bf p}}_k$'s are defined as the $\phi^{{\bf p^*}}_k$'s of Proposition~\ref{prop:prop1}, with ${\bf p}^*$ replaced by ${\bf p}$.
%\begin{equation*}
%\phi^{{\bf p}}_k : (x_1, \dots, x_K)  \mapsto  \frac{ p_k{\rm e}^{x_k}}{\sum_{j=1}^K p_j {\rm e}^{x_j}}.
%\end{equation*}
Finally, we denote 
${\boldsymbol \pi}_{{\bf p},\mu,{\bf h}}(\cdot) = \left(\pi^k_{{\bf p},\mu, {\bf h}}(\cdot)\right)_{k \in \cY}$
and, to lighten the notation, $\pi := {\boldsymbol{\pi}}_{{\bf p},\mu,{\bf h}}$, with components $\pi^k$, $k \in \cY$.

A plug-in type classifier $g_{\pi}$ is naturally  defined as 
\begin{equation}
\label{eq:classifier}
g_{\pi}(\cT_T) =  \argmax{k \in \cY} \pi^k(\cT_T).
\end{equation}
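
For instance, in the binary case $K=2$ with $p_1, p_2 > 0$ (a special case spelled out here for illustration), the softmax reduces to a logistic function of the difference $F_1 - F_2$:
\begin{equation*}
\pi^1(\cT_T) = \frac{1}{1 + \frac{p_2}{p_1}\,{\rm e}^{F_2(\cT_T) - F_1(\cT_T)}},
\end{equation*}
and, up to the choice made in case of ties, $g_{\pi}(\cT_T) = 1$ if and only if
\begin{equation*}
F_1(\cT_T) - F_2(\cT_T) \geq \log\left(\frac{p_2}{p_1}\right).
\end{equation*}
In this case, the classifier compares the log-likelihood ratio of the two candidate models with a threshold determined by the weights of the mixture.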

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Properties}
\label{subsec:propertiesPlug-in}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this section, we establish important properties of plug-in type classifiers. 
For a vector of functions ${\bf h} \in \mathcal{H}^K$,
we define the supremum norm
$$\|{\bf h}\|_{\infty,T} =\max_{k \in \mathcal{Y}}  \sup_{t \in [0,T]}|h_k(t)|.$$
For a positive constant $A$, we introduce the set
%We also make the following assumption on the set $\mathcal{H}^K$
%\begin{ass}
%label{ass:assOnH}
%there exists $A >0$ such that
%\begin{equation*}
%\sup_{{\bf h} \in \cH^K} \|{\bf h}\|_{\infty,T} \leq A.
%\end{equation*}
%\end{ass}
\begin{equation*}
\mathcal{H}_A^K := \left\{{\bf h }\in\mathcal{H}^K \; \mbox{s.t.} \; \|{\bf h}\|_{\infty,T} \leq A  \right\}    
\end{equation*}
and the set of vectors of conditional probabilities
\begin{equation}
\label{def:Pi}
\Pi = \left\{{\boldsymbol \pi}_{{\bf p},\mu,{\bf h}}: \; \mathbf{p} \in \cP_{p_0}, \; \mu \in (\mu_0,\mu_1), \;{\bf h} \in \cH_A^K\right\}.
\end{equation}
The first result is a key step to obtain the consistency of the classification procedure presented in Section~\ref{sec:ClassProcedure}.

\begin{prop}
\label{prop:distPi}
Let $\pi$ and $\pi^{'}$ be two vector-valued functions belonging to the set $\Pi$ defined by~\eqref{def:Pi}, with respective parameters $( {\bf p}, \mu, {\bf h})$ and $( {\bf p}^{'}, \mu^{'}, {\bf h}^{'})$.
Under Assumptions~\ref{ass:h}, \ref{ass:mu} and~\ref{ass:prob},
the following holds:
\begin{align*}
\mathbb{E}\left[ \left\|\pi -  \pi^{'}\right\|_1\right] \leq {} & C \left(\left|\mu- \mu^{'}\right| + \left\|{\bf h} - {\bf h}^{'}\right\|_{\infty,T} \right.\\
&  \left.+ \left\|{\bf h} - {\bf h}^{'}\right\|^2_{\infty,T} + \left\|{\bf p}-{\bf p}^{'}\right\|_{1}  \right),
\end{align*}
where $C$ is a constant depending on $K$, $T$, ${\bf h}^*$, $\mu_0$, $\mu_1$, $p_0$ and $A$.
\end{prop}

Proposition~\ref{prop:distPi} bounds the expected $L_1$-distance between two elements of the set $\Pi$ by the distance between the parameters of the associated models. From this result, we can easily deduce a bound on the excess risk of a plug-in type classifier.

\begin{coro}
\label{coro:excessRiskPi}
For all $\pi= {\boldsymbol \pi}_{{\bf p},\mu,{\bf h}} \in \Pi$, we have
\begin{align*}
\mathcal{E}\left(g_{\pi}\right) \leq {} & C \left(\left|\mu- \mu^*\right| + \left\|{\bf h} - {\bf h}^*\right\|_{\infty,T} \right.\\
&  \left.+ \left\|{\bf h} - {\bf h}^*\right\|^2_{\infty,T} + \left\|{\bf p}-{\bf p}^*\right\|_{1}  \right),
\end{align*}
where $C$ is a constant depending on $K$, $T$, ${\bf h}^*$, $\mu_0$, $\mu_1$, $p_0$ and $A$.
\end{coro}

An important consequence of this result is that a plug-in type classifier which relies on consistent estimators of ${\bf p^*}, \mu^*$ and ${\bf h}^*$
is itself consistent {\it w.r.t.} the misclassification risk.
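
The passage from Proposition~\ref{prop:distPi} to Corollary~\ref{coro:excessRiskPi} can be sketched as follows, under the assumption that $\pi^* := (\pi^*_k)_{k \in \cY}$ itself belongs to $\Pi$. Write $k^* = g^*(\cT_T)$ and $k_{\pi} = g_{\pi}(\cT_T)$ (notation used only in this sketch), all functions below being evaluated at $\cT_T$. Since $\cR(g) = 1 - \E\left[\pi^*_{g(\cT_T)}\right]$ for any classifier $g$, and since $\pi^{k_{\pi}} \geq \pi^{k^*}$ by definition of $g_{\pi}$, we get
\begin{align*}
\cE(g_{\pi}) & = \E\left[\pi^*_{k^*} - \pi^*_{k_{\pi}}\right] \\
& \leq \E\left[\left(\pi^*_{k^*} - \pi^{k^*}\right) + \left(\pi^{k_{\pi}} - \pi^*_{k_{\pi}}\right)\right] \\
& \leq \E\left[\left\|\pi - \pi^*\right\|_1\right].
\end{align*}
Applying Proposition~\ref{prop:distPi} with $\pi^{'} = \pi^*$ then gives a bound of the stated form.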


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Classification procedure}
\label{sec:ClassProcedure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This section is devoted to the presentation and the study of the proposed data-driven procedure that mimics the Bayes classifier.
Our estimation method is presented
in Section~\ref{subsec:classProcUniDim}, and theoretical guarantees for the procedure are derived in Section~\ref{subsec:Rates}. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Estimation strategy}
\label{subsec:classProcUniDim}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Based on the results of Section~\ref{sec:Plug-inClass},
we propose a hybrid classification procedure which involves both plug-in and empirical risk minimization (E.R.M.) principles.
To this end, we introduce a learning sample $\mathcal{D}_n = \{(\mathcal{T}^{i}_T, Y^{i}), i = 1, \ldots,n\}$, which consists of $n$ independent copies of $(\cT_T, Y)$.

%\subsubsection\paragraph{{Estimation strategy}
%\paragraph{Estimation strategy}

We propose a two-step procedure. In the first step, we estimate the vector ${\bf p}^*$ by its empirical counterpart $\w{{\bf p}} = (\w{p}_k)_{k \in \cY}$, where $\w{p}_k = \frac{1}{n}\sum_{i=1}^n \one_{\{Y^i = k\}}$. 
The second step relies on empirical risk minimization over a suitable set.
In view of the results obtained in Section~\ref{subsec:propertiesPlug-in}, 
we introduce the following approximation of the set $\Pi$:
\begin{equation}
\label{eq:eqPiHat}
\w{\Pi} = \left\{{\boldsymbol \pi}_{\w{\bf p},\mu,{\bf h}}: \; \mu \in (\mu_0,\mu_1), \;{\bf h} \in \cH_A^K\right\}
\end{equation}
and
the corresponding set of classifiers: $$\mathcal{G}_{\w{\Pi}} = \{g_{\pi} : \; \pi \in \w{\Pi}\}.$$
Since $g^*$ is the minimizer of the misclassification risk,
a natural estimator of $g^*$ would be the empirical risk minimizer over the family $\mathcal{G}_{\w{\Pi}}$
\begin{equation*}
\hat{g} \in \argmin{g \in \mathcal{G}_{\w{\Pi}}}\;\frac{1}{n}\sum_{i=1}^n \one_{\{g(\cT^i_T) \neq Y^i\}}.
\end{equation*}
Nevertheless, as the solution of a nonconvex minimization problem, this estimator is known to be computationally intractable. 

\paragraph*{Convexification}
To avoid computational issues, it is natural to replace the classical 0-1 loss with a convex surrogate \citep[see][]{Zhang04}.
We denote the set of score functions by 
\begin{equation*}
\cF:= \{ {\bf f}=(f^1, \ldots, f^K): \cdot \rightarrow \R^K\}.
\end{equation*}
As a convex surrogate, we consider the square loss and define, for a score function ${\bf f}$, the following risk
measure
\begin{equation*}
\cR({\bf f}) := \E \left[\sum_{k=1}^K \left(Z_k - f^{k}(\cT_T)\right)^2\right],
\end{equation*}
with $Z_k= 2 \one_{\{Y=k\}}-1$.

The choice of the square loss as a convex surrogate is motivated by the fact that, if we define $g(\cdot) = \argmax{k \in \mathcal{Y}} f^k(\cdot)$, then
\begin{equation}\label{eq:lienrisques}
{\E}\left[\cR(g)-\cR(g^*)\right] \leq \frac{1}{\sqrt{2}} \big( \E \left[ \cR({\bf f})- \cR({\bf f}^*) \right]\big)^{1/2},
\end{equation}
with ${ f}^{*k}(\cT_T)= 2 \pi^{*}_k(\cT_T) -1$ which satisfies  ${\bf f}^* \in  \argmin{{\bf f} \in \mathcal{F}} \cR({\bf f})$.
Hence, a procedure that is consistent {\it w.r.t.} the $L_2$-risk yields a classification procedure that is consistent {\it w.r.t.}
the misclassification risk. 
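
To see where the form of ${\bf f}^*$ comes from, recall that the conditional expectation minimizes the squared loss. The following short derivation is a sketch; it assumes that $\pi^*_k(\cT_T)$ denotes the conditional probability $\mathbb{P}(Y = k \mid \cT_T)$ and that ${\bf f}$ is a fixed score function with finite risk. We have
\begin{equation*}
\E\left[Z_k \mid \cT_T\right] = 2\,\mathbb{P}(Y = k \mid \cT_T) - 1 = 2\pi^*_k(\cT_T) - 1 = f^{*k}(\cT_T),
\end{equation*}
so that, expanding the square and noting that the cross term vanishes,
\begin{equation*}
\cR({\bf f}) - \cR({\bf f}^*) = \E\left[\sum_{k=1}^K \left(f^{k}(\cT_T) - f^{*k}(\cT_T)\right)^2\right] \geq 0.
\end{equation*}
In other words, under these assumptions, the excess $L_2$-risk of ${\bf f}$ is exactly its squared $L_2$-distance to ${\bf f}^*$.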

\paragraph*{Resulting estimator}
As suggested by the form of the optimal score function ${\bf f}^*$, we then consider the set of score functions 
\begin{equation*}
\w{\cF} = \{2\pi-1: \; \pi \in \w{\Pi}\},  
\end{equation*} 
and the empirical risk minimizer over $\w{\cF}$:
\begin{equation}
\w{{\bf f}} \in \argmin{{\bf f} \in \w{\cF}} \w{\cR}({\bf f}), \label{eq:eqErm}
\end{equation}
with
\begin{equation}
\w{\cR}({\bf f}) := \dfrac{1}{n} \sum_{i=1}^n \sum_{k = 1}^K \left(Z_k^i-f^{k}(\cT_T^i)\right)^2. \label{eq:eqErmR}    
\end{equation}
Finally, the resulting classifier $\w{g}$ is the plug-in type classifier associated with $\w{{\bf f}}$, defined as
\begin{equation}\label{eq:finalclassifier}
\w{g}=\argmax{k \in \mathcal{Y}} \w{{\bf f}}^k.
\end{equation}
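
As a brief remark on this definition (the notation $\w{\boldsymbol \pi}$ is introduced here only for illustration): every element of $\w{\cF}$ can be written as $2\pi - 1$ with $\pi \in \w{\Pi}$, and the map $t \mapsto 2t - 1$ is increasing. Writing $\w{{\bf f}} = 2\w{\boldsymbol \pi} - 1$ with $\w{\boldsymbol \pi} = (\w{\pi}_1, \ldots, \w{\pi}_K) \in \w{\Pi}$, the classifier can therefore equivalently be expressed as
\begin{equation*}
\w{g} = \argmax{k \in \mathcal{Y}} \left(2\w{\pi}_k - 1\right) = \argmax{k \in \mathcal{Y}} \w{\pi}_k,
\end{equation*}
that is, $\w{g}$ predicts the class with the largest estimated conditional probability.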

Note that, in order to reduce the computational burden, we have chosen not to include the estimation of the probability weights ${\bf p}^*$ in the minimization problem~\eqref{eq:eqErm}. Nevertheless, this remains a possible strategy. 
%but to plug the empirical counterparts in the procedure. Nevertheless it remains a possible strategy. 

In the next section, we establish rates of convergence of our classification procedure.
 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Rates of convergence}
\label{subsec:Rates}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
The study of the statistical performance of $\w{g}$ defined by~\eqref{eq:finalclassifier} relies on the following assumption.

\begin{ass}
\label{ass:assOnHNet}
Let $\varepsilon > 0$. We assume that there exists an $\varepsilon$-net $\cH_{\varepsilon} \subset \cH_A^K$, 
{\it w.r.t.} the sup-norm $\|\cdot \|_{\infty, T}$, such that
\begin{equation*}
\log(\cC_{\varepsilon}) \leq C\log\left(\varepsilon^{-d}\right),
\end{equation*}
where $\cC_{\varepsilon}$ is the number of elements of $\cH_{\varepsilon}$, $d \geq 1$ and $C$ is a positive constant which does not depend on $\varepsilon$.
\end{ass}
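For instance, taking $\varepsilon = 1/n$ (a choice made here only for illustration), the bound of Assumption~\ref{ass:assOnHNet} reads
\begin{equation*}
\log(\cC_{1/n}) \leq C\log\left(n^{d}\right) = C\, d\log(n),
\end{equation*}
so that the net contains at most $n^{Cd}$ elements. This matches the $d\log(n)$ factor that appears in the rates below.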
%
%Under the above assumption, taking $\varepsilon \propto n^{-1}$ leads to the following result.
%
\begin{theo}
\label{thm:riskERM1}
Grant Assumptions~\ref{ass:h}, \ref{ass:mu}, \ref{ass:prob} and~\ref{ass:assOnHNet}.
If ${\bf h}^* \in \cH_A^K$, the following holds
\begin{equation*} 
\E\left[ \cR(\w{g}) - \cR(g^*)\right] \leq C \left(\dfrac{d\log(n)}{n}\right)^{1/4},
 \end{equation*}
where $C > 0$ depends on $K$, $T$, ${\bf h}^*$, $\mu_0$, $\mu_1$, $p_0$ and $A$.
\end{theo}
%
Theorem~\ref{thm:riskERM1} establishes that, when $n$ goes to infinity, the proposed classification procedure is consistent provided that ${\bf h}^*$ belongs to $\cH_A^K$. 
%We also obtain the rate of convergence with the observations number $n$ of the missclassification error of $\w{g}$ to the smallest one which is the Bayes error.
%The obtained rate in $n$ the number of observations, is of the same order as the one obtained in \citep{DDM} for discrete diffusion paths observations. 
If  ${\bf h}^*$ does not belong to $\cH_A^K$, a classical additional bias term appears. 

Note also that Theorem~\ref{thm:riskERM1}
applies to a broad class of functions $\cH$. In particular,
Assumption~\ref{ass:assOnHNet} covers the case where $\cH$ is built from a bounded subset of a finite-dimensional linear space of functions. Let $(\psi_j)_{j \geq 1}$ be an orthonormal basis whose elements are uniformly bounded (the Laguerre basis, for example) and consider, for $\theta_0 > 0$,
\begin{equation*}
\mathcal{H} = \left\{ t \mapsto \left(\sum_{j=1}^d \theta_j \psi_j(t)\right)_{+} : \;\; \|\theta\|_2 \leq \theta_0 \right\}.
\end{equation*}
Another important example is the parametric exponential family
\begin{equation*}
\cH = \{t \mapsto \alpha\beta \exp(-\beta t), \;\; 0 <\alpha < 1, \; 0 < \beta \leq \beta_0\},    
\end{equation*}
with $\beta_0 > 0$.
%Nevertheless, the obtained rate of convergence is worse than expected.
%This is due to the fact that the estimation of the probability weights and the estimation of the function $h$ and parameter $\mu$ are performed on the same dataset.
Finally, it is possible to obtain a better rate of convergence
when the estimation of the probability weights and the estimation of $(\mu^*, {\bf h}^*)$ are performed on two independent datasets. This is the purpose of the next paragraph.

\paragraph{Alternative strategy} Hereafter, we consider an alternative strategy. First, we split the dataset $\mathcal{D}_n$ into two independent samples 
$\mathcal{D}_n^1$ and $\mathcal{D}_n^2$. For the sake of simplicity, we assume that $n$ is even and that the two datasets
$\mathcal{D}_n^1$ and $\mathcal{D}_n^2$ have the same size $n/2$. Based on $\mathcal{D}_n^1$, we estimate ${\bf p}^*$, and based on $\cD_n^2$ we estimate ${\bf f}^*$. 
The resulting classifier $\w{g}$ satisfies the following theorem.
%
%
\begin{theo}
\label{thm:riskERM2}
%Under Assumption~\ref{ass:assOnH}, and Assumption~\ref{ass:assOnHNet}, 
Grant Assumptions~\ref{ass:h}, \ref{ass:mu}, \ref{ass:prob} and~\ref{ass:assOnHNet}.
If ${\bf  h}^* \in \cH_A^K$, we have
\begin{equation*}
\E\left[\cR(\w{g}) - \cR(g^*)\right] \leq   C \left(\dfrac{d\log(n)}{n}\right)^{1/2},
\end{equation*}
with $C>0$ a numerical constant. 
\end{theo}
%
%
Therefore, the classifier $\w{g}$ achieves the parametric rate of convergence up to a logarithmic term. 
Note that, from a practical point of view, the splitting of the sample does not affect the performance of the classifier $\w{g}$. Therefore, we do not consider this strategy in the numerical section. 
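
To give an order of magnitude for the gap between the exponents of Theorems~\ref{thm:riskERM1} and~\ref{thm:riskERM2}, one can evaluate the two rates while ignoring the constants $C$ (which differ between the two theorems) and taking $d = 1$, a value chosen only for illustration, with $\log$ the natural logarithm:
\begin{equation*}
\begin{array}{lll}
n = 100: & \left(\dfrac{\log(n)}{n}\right)^{1/4} \approx 0.46, & \left(\dfrac{\log(n)}{n}\right)^{1/2} \approx 0.21, \\[2ex]
n = 1000: & \left(\dfrac{\log(n)}{n}\right)^{1/4} \approx 0.29, & \left(\dfrac{\log(n)}{n}\right)^{1/2} \approx 0.083.
\end{array}
\end{equation*}
These quantities are upper bounds on the excess risk up to constants, not predictions of the error rates observed in practice.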

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Comments}
\label{subsec:commentsOnProc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this section, we comment on the proposed procedure.


 \paragraph*{Parameter $\mu$}
 Contrary to the parameter $p_0$, the estimation procedure requires the  knowledge of $\mu_0$ and $\mu_1$. This assumption is important to obtain the consistency property. However, we shall show in Section~\ref{sec:NumExp} that the procedure has good performance if we only assume that $\mu^* > 0$.
 
 
 \paragraph*{Estimation of the weights ${\bf p}^*$}
For the estimation of the mixture weights, another approach is to include the estimation of ${\bf p}^*$ in the minimization procedure. In this case, the rate of convergence of the classification procedure is the same as the one provided in Theorem~\ref{thm:riskERM2}. However, we do not consider this approach since it significantly increases the computational cost of the procedure, especially if the number of classes is large.

 
 %An issue in Hawkes processes is the estimation of the baseline parameter $\mu$. Nevertheless, if an estimator is available and can be compute on a previous sample, this should lighten the procedure of risk minimization. For example, \citep{BacryCumulant} focus on the estimation of the matrix of integrated kernels for a multivariate Hawkes process, and get in the process an estimator of $\mu$. However, the algorithmic cost for the computation of this estimator is big.  
 
    
\paragraph*{Other approach}
%Another strategy is possible. Indeed, Proposition \ref{prop:distPi} supports this idea. 
Another possible strategy is motivated by Proposition~\ref{prop:distPi}.
For example, if the triggering kernels are assumed to belong to the exponential kernel family, classical estimators of the parameters can be used, and a plug-in type classifier can be computed from these estimators. For this task, the methods implemented in the \texttt{tick} library, such as the maximum likelihood or least-squares estimators, can be used. In the next section, we illustrate this strategy with the least-squares estimator.


    % The case of observations made from a multidimensional Hawkes process should be very interesting. 
    %In the neuronal data framework, we easily imagine that the measure we have for each patient is not a single \textit{spike train} but measurements of a network of neurons (see for \emph{e.g.} \citep{Hansen15}, \citep{rousseau}).
    %A procedure of classification can adapted from the present work, to get labels from minimization of empirical risk. This should simultaneously capture the nature of interactions. For example, when we work under the assumption of exponential kernels, 
    %the plug-in technique should benefit from algorithm as ADM4 for high dimension (see \citep{BGM}). Indeed, the high dimension will be a real challenge here whether the kernel collection is assumed parametric or not.
    %\citep{Hansen15}, \citep{BGM}
    %The will be the object of further works. 
    
     



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Numerical experiments}
\label{sec:NumExp}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In this section, we present numerical experiments to illustrate the performance of the procedure described in Section~\ref{subsec:classProcUniDim}, and we refer to the resulting algorithm as \texttt{ERM}.
We focus on the case where the set $\cH$ is the parametric exponential family. Our method is then compared to the plug-in strategy presented in Section~\ref{subsec:commentsOnProc}, which is referred to as \texttt{PI}. 

% new 
We also include a comparison with a Long Short-Term Memory (\texttt{LSTM}) network. Indeed, these recurrent neural networks are used in time series forecasting and are a natural tool for capturing time dependency in data.
The main practical limitation is that the user needs to choose a fixed length for the data, whereas, for point processes, the length differs from one sequence to another. This length therefore has to be chosen large enough not to lose information on the test sample. We use the \texttt{tensorflow.keras} library of Python with tuning parameters set as follows: \texttt{batch\_size=10}, \texttt{epochs=100} and \texttt{learning\_rate=0.01}.



The details of the implementation of the \texttt{ERM} estimator are given in Section~\ref{subsec:Implementation}. 
Then, we describe the experimental setting in Section~\ref{subsec:SyntheticData} and discuss the obtained results in Section~\ref{subsec:results}.
The source code we used to perform the experiments can be found at~\url{https://github.com/charlottedion/HawkesClassification}. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Implementation}
\label{subsec:Implementation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We present the implementation of our classification procedure in the case where the set of kernel functions $\cH$ is the parametric exponential family defined as
\begin{equation*}
\cH = \{t \mapsto \alpha\beta \exp(-\beta t), \;\; 0< \alpha <1, \; \beta > 0\}.    
\end{equation*}
We define for $\alpha, \beta \in \mathbb{R}$ the function
\begin{equation*}
h_{\alpha,\beta}(t) = {\rm expit}(\alpha) \exp(\beta) \exp(-\exp(\beta) t),     
\end{equation*}
where ${\rm expit}$ denotes the inverse-logit function.
Then,  we can write $\cH$ as $\cH = \{t \mapsto h_{\alpha,\beta}(t), \;\; \alpha, \beta \in \mathbb{R}\}$.
%We use this transformation to ease the optimization part of the procedure.
For $\boldsymbol{\alpha}$ and $\boldsymbol{ \beta}$ in  $\mathbb{R}^K$, we denote by ${\bf h}_{\boldsymbol{\alpha},\boldsymbol{\beta}}$ the corresponding function of $\cH^K$. 
Therefore the set $\w{\Pi}$ defined in Equation~\eqref{eq:eqPiHat} can be rewritten as
\begin{equation*}
\w{\Pi} = \{\boldsymbol \pi_{\w{\bf p}, \exp(\mu), {\bf h}_{\boldsymbol{\alpha}, \boldsymbol{ \beta}}}, \;\; \mu \in \mathbb{R}, \;\; \boldsymbol{\alpha} , \boldsymbol{\beta} \in \mathbb{R}^K\}.
\end{equation*}
Hence the minimization step is performed {\it w.r.t.} $\mu$, ${\boldsymbol \alpha}$, and $\boldsymbol {\beta}$.  
Note that the formulation of the above set $\w{\Pi}$ shows that the optimization part of our classification procedure does not require any constraint on the parameters.
The minimization is performed with the \texttt{Python} function \texttt{minimize}, using the \texttt{BFGS} method. % and optional argument \texttt{maxiter}$=10$.
%\textcolor{red}{Voir ICML 2013 papier biblio, page 4}
%In the whole section we assumed to be under the exponential model. This means that the kernels have the shape
%$h_Y(t)= \alpha_Y \beta_Y \exp(-\beta_Y t)$ and the optimization is done on parameters $(\mu_Y,\alpha_Y, \beta_Y)$. 
% We denote ERM for the classifier $\w{g}$ obtained under this assumption. 
%The, we denote NP-ERM for the classifier $\w{g}$ obtained in the simple linear Hawkes process case with a general kernel $h$. In this case the minimization is done over a class of function. We investigate is this case the Laguerre basis and the (modifier) Fourier basis. 
Algorithm~\ref{alg:algo1} sums up the main steps of the procedure. 
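
To make the change of variables above explicit, the following worked example maps parameters on the original scale to the unconstrained scale. The tilde notation is introduced here only to distinguish the two scales. A kernel $t \mapsto \tilde{\alpha}\tilde{\beta}\exp(-\tilde{\beta} t)$ with $0 < \tilde{\alpha} < 1$ and $\tilde{\beta} > 0$ coincides with $h_{\alpha,\beta}$ for
\begin{equation*}
\alpha = {\rm logit}(\tilde{\alpha}) = \log\left(\frac{\tilde{\alpha}}{1-\tilde{\alpha}}\right), \qquad \beta = \log(\tilde{\beta}),
\end{equation*}
since ${\rm expit}$ is a bijection from $\mathbb{R}$ onto $(0,1)$ and $\exp$ is a bijection from $\mathbb{R}$ onto $(0,+\infty)$. For instance, the values $(\tilde{\alpha}, \tilde{\beta}) = (0.7, 1.3)$, used for class $Y=1$ of Model~1 in Section~\ref{subsec:SyntheticData}, correspond to $(\alpha, \beta) \approx (0.847, 0.262)$. Similarly, a baseline intensity $\tilde{\mu} > 0$ corresponds to $\mu = \log(\tilde{\mu})$, so that $\tilde{\mu} = 1$ gives $\mu = 0$.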

\begin{algorithm}
   \caption{Classification algorithm}
   \label{alg:algo1}
\begin{algorithmic}
   \STATE {\bfseries Input:} $T$,  $\cD_n$, and new observation $\cT_{n+1}$ %end time $T$
   %\STATE Split $\cD_n$ in $\cD_{n_1}$ $\cD_{n_2}$
   \STATE \hspace*{0.5cm}  Estimate ${\bf p^*}$ on $\cD_{n}$
   \STATE \hspace*{0.5cm} Solve the minimization problem~\eqref{eq:eqErm} based on $\cD_n$
   %\FOR{each $(\mu,\alpha, \beta) \in \Theta$ } 
   %\FOR{each $k \in \cY$}
   %FOR{each $i \in 1:n$}
   %\STATE Compute $\w{\pi}_k(\cT_i)$
   %\ENDFOR
   %\ENDFOR
   %\ENDFOR
   \STATE \hspace*{0.5cm} Compute $\w{g}$ the resulting classifier~\eqref{eq:finalclassifier}
   \STATE \hspace*{0.5cm} Compute $ \hat{Y}_{n+1} = \w{g}(\cT_{n+1})$
   \STATE {\bfseries Output:} Predicted label $\hat{Y}_{n+1}$
\end{algorithmic}
\end{algorithm}

For the procedure \texttt{PI}, we use the \texttt{tick} function \texttt{HawkesExpKern} with argument \texttt{gofit = least-squares} for the parameter inference.
%As it as been said before, another strategy is to use an estimator of the parameters obtained on the learning sample, and to plug it in the Bayes classifier to obtain a %\textit{plug-in} classifier. We investigate here  available from the \texttt{tick} library: the Least-Squares learner. We name it PG in the following. 

%Let us give some details on the code. 

%To ensure that the estimated parameters remains nonnegative, we apply an exponential transformation an write the parameters $\exp(\alpha'), \exp(\beta')$ instead in the algorithm.\textbf{A CHANGER}.\\
%Finally, we use the BFGS optimization method as argument of the function \texttt{minimize} in Python with option \texttt{maxiter}$=2$.
%, with the option 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Experimental setting}
\label{subsec:SyntheticData}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We consider $K=2$ or $K = 3$ classes in the following.
We propose two different models for the experiments that we refer to as Model~1 and Model~2.
For Model~1, we consider the case where the triggering kernel belongs to the parametric exponential family. For Model~2, we investigate a more general form for the kernels (see below). We set the baseline intensity $\mu =1$. We use the library \texttt{tick} to generate the sequence of jump times of the Hawkes processes.

\paragraph*{Synthetic data}
The label $Y$ is drawn from a uniform distribution on $\{1, \ldots,K\}$. Conditionally on $Y$, we simulate the jump times according to Model~1 and Model~2 which are defined as follows:
\begin{description}
    \item[Model~1] exponential kernels $h(t)= \alpha \beta \exp{(-\beta t)}$, with $(\alpha, \beta)= (0.7, 1.3)$ for class $Y = 1$, 
    $(0.2, 3)$ for class $Y=2$, and if $K=3$, $(0.5, 5)$ for class $Y =3$. 
    \item[Model~2]  interpolation function kernels with parameters $(a,b,c)$: 
    $$h(t)= \begin{cases}
    \dfrac{b}{a}\,t, & t \in [0, a], \\[1ex]
    \dfrac{b-c}{a-1}\,t+ \left(b-\dfrac{b-c}{a-1}\, a\right), & t \in (a,1), \\[1ex]
    0, & t  \geq 1,
    \end{cases}
   $$
    with $(a, b,c)=( 0.2, 0.8, 0.2)$ for class $Y=1$, $(0.1, 0.4, 0.2)$ for $Y=2$, and if $K=3$, $(0.8, 0.3, 0.7)$ for class $Y=3$. 
\end{description}
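As a quick check on these parameter choices, one can compute the $L_1$-norm of each kernel, which is required to be smaller than $1$ by the usual stationarity condition for linear Hawkes processes. The computation below is a sketch that uses only the definitions of the two models above. For Model~1,
\begin{equation*}
\int_0^{+\infty} \alpha\beta \exp(-\beta t)\, {\rm d}t = \alpha,
\end{equation*}
which gives $0.7$, $0.2$ and $0.5$ for classes $Y=1$, $Y=2$ and $Y=3$, respectively. For Model~2, the kernel is piecewise linear: it goes from $0$ to $b$ on $[0,a]$ and from $b$ to $c$ between $a$ and $1$, so that
\begin{equation*}
\int_0^{1} h(t)\, {\rm d}t = \frac{ab}{2} + \frac{(1-a)(b+c)}{2},
\end{equation*}
which gives $0.48$, $0.29$ and $0.22$ for classes $Y=1$, $Y=2$ and $Y=3$, respectively. All these values are below $1$.
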
As an illustration, Figure~\ref{fig:kernels} displays the considered kernels for both models. For Model~1, the kernel of class $Y=1$ differs markedly from the kernels of classes $Y=2$ and $Y=3$, which are closer to each other. Hence, it should be easy to discriminate between observations from class $Y=1$ and observations from classes $Y \in \{2,3\}$. On the contrary, observations from class $Y=2$ and class $Y=3$ are expected to overlap. Similar comments can be made for Model~2,
with observations from classes $Y \in \{1,2\}$ on the one hand and observations from class $Y = 3$ on the other. 

\begin{figure}
    \centering
    \includegraphics[scale= 0.3]{explKernelExpoK3.pdf}
    \includegraphics[scale= 0.3]{explKernelInterpol.pdf}
    \caption{Kernel functions for Model~1 (top) and Model~2 (bottom), for class $Y=1$ (left), class $Y=2$ (middle) and class $Y=3$ (right).}
    \label{fig:kernels}
\end{figure}

We also investigate the role of the parameter $T$ in the difficulty of the classification problem. To this end, Figure~\ref{fig:expoBayes} displays
the error rate of the Bayes classifier as a function of $T$ for Model~1 and $K=3$. This error quickly decreases from $0.3$ to $0.05$ as $T$ goes from $10$ to $40$.
In the following, we shall give results for $T=20$.

\begin{figure}
    \centering
    \includegraphics[scale= 0.4]{influenceTBayesK3.pdf}
    \caption{Error rate of the Bayes classifier as a function of $T$ for $K=3$, $n=100$.}
    \label{fig:expoBayes}
\end{figure}

\begin{table*}
\caption{Error rates of Bayes, \texttt{ERM}, \texttt{PI}, and \texttt{LSTM} classifiers for $n=100$, $T=20$.}
\label{table1}
\vskip 0.15in
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{lcccr}
\toprule
Classifier: & Bayes & \texttt{ERM} & \texttt{PI} & \texttt{LSTM} \\
\midrule
$K=2$, model 1 & 0.07 (0.01) & 0.08 (0.01) & 0.08 (0.01) & 0.09 (0.01) \\
$K=2$, model 2 & 0.27 (0.01) & 0.29 (0.02) & 0.29 (0.01) & 0.33 (0.02) \\
$K=3$, model 1 & 0.17 (0.01) & 0.18 (0.02) & 0.19 (0.02) & 0.36 (0.03) \\
$K=3$, model 2 & 0.39 (0.01) & 0.46 (0.02) & 0.45 (0.02) & 0.54 (0.03) \\
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\end{table*}


\begin{table*}
\caption{Error rates of Bayes, \texttt{ERM}, \texttt{PI}, and \texttt{LSTM} classifiers for $n=1000$, $T=20$.}
\label{table2}
\vskip 0.15in
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{lcccr}
\toprule
Classifier: & Bayes & \texttt{ERM} & \texttt{PI}  & \texttt{LSTM} \\
\midrule
$K=2$, model 1 & 0.07 (0.01) & 0.08 (0.01) & 0.08 (0.01) & 0.08 (0.01) \\
$K=2$, model 2 & 0.27 (0.01) & 0.28 (0.01) & 0.28 (0.02) &  0.30 (0.01)\\
$K=3$, model 1 & 0.17 (0.01)& 0.17 (0.01) & 0.18 (0.01)  & 0.33 (0.02) \\
$K=3$, model 2 & 0.39 (0.01) & 0.43 (0.01) & 0.44 (0.01) & 0.49 (0.02) \\
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\vskip -0.1in
\end{table*}


\paragraph*{Simulation scheme}

In order to assess the performance of our procedure, we evaluate the misclassification risk of the Bayes classifier, \texttt{ERM}, \texttt{PI}, and \texttt{LSTM}
through Monte-Carlo repetitions.
%The misclassification risk of the Bayes classifier $\E[\cR(g^*)]$ and $\E[\cR(\w{g})]$ for the proposed classifier, are approximated through Monte-Carlo simulations, with $50$ repetitions. 
More precisely, for $n \in \{100,1000\}$ and $n_{{\rm test}} = 1000$, we repeat independently $50$ times the following steps:
\begin{enumerate}
    \item simulate two datasets $\cD_n$ and $\cD_{n_{\rm test}}$, %%with $n_{\rm test}= 1000$, 
    \item from $\cD_n$ compute the classifiers \texttt{ERM}, \texttt{PI}, and \texttt{LSTM}, and
    \item based on $\cD_{n_{\rm test}}$, compute the empirical error rate of the four classifiers.
\end{enumerate}
The obtained results are presented in Table~\ref{table1} for $n=100$ and Table~\ref{table2} for $n=1000$.
Note that, for the \texttt{ERM} algorithm, the following initial guess is used in the optimization step: $\mu=0.5$, $\alpha=1$ and $\beta=1$ for all classes.
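
In formulas, the quantity computed in the last step can be sketched as follows. The index notation is introduced here only for illustration. Writing $\cD_{n_{\rm test}} = \{(\cT_T^{j}, Y^{j}), \; j = 1, \ldots, n_{\rm test}\}$ for the test sample of the $r$-th repetition, the empirical error rate of a classifier $g$ is
\begin{equation*}
\w{{\rm err}}_r(g) = \frac{1}{n_{\rm test}} \sum_{j=1}^{n_{\rm test}} \one_{\{g(\cT_T^{j}) \neq Y^{j}\}},
\end{equation*}
and a natural Monte-Carlo estimate of the misclassification risk of $g$ is the average $\frac{1}{50}\sum_{r=1}^{50} \w{{\rm err}}_r(g)$ over the $50$ repetitions.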



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Results}
\label{subsec:results}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

From the obtained results, we make several comments.
For Model~1, the \texttt{ERM} and \texttt{PI} procedures achieve similar performance to the Bayes classifier for $n \in \{100 ,1000\}$, and $K \in \{2,3\}$. Similar comments can be made for Model~2 and $K=2$.
For $K=3$, the difference between the error rates of the Bayes classifier and the \texttt{ERM} and \texttt{PI} classifiers is larger.
%However, note that for model~2
%For $K=2,3$,  the \texttt{ERM} and \texttt{PI} procedures achieve similar performance which as the Bayes classifier for any model and $n \in \{100,1000\}$.
%The case $K=3$ is more interesting.
%First for Model~1, the \texttt{ERM} error rate is almost equal to the Bayes error for  $n \in \{100 ,1000\}$, while \texttt{PI} has worst performance.
%Interestingly, in this case it seems that our procedure benefits from the fact that the model is well-specified. 
%Second, for Model~2, we can see the influence of parameter $n$. 
Note that, as $n$ increases, the error rate of \texttt{ERM} gets closer to the error rate of the Bayes classifier.
%Besides, in this case, \texttt{ERM} outperforms \texttt{PI}.


Finally, the \texttt{LSTM} algorithm
has the worst performance in every scenario except Model~1 with $K=2$, where the three learned classifiers are comparable. Hence, our
classifier \texttt{ERM} is competitive for classifying event sequences and can be recommended for future work.

%We see that for Model~1 the \texttt{ERM} error rate is almost equal to the Bayes error for  $n \in \{100 ,1000\}$, and $K \in \{2,3\}$.
%Same comment can be made for Model~2 and $K=2$. For $K=3$
%the difference between the error rates of the Bayes classifier and the {\texttt ERM} classifier is larger for $K=2$
%Under Model~2, the gap between the Bayes and the \texttt{ERM} is very thin for $K=2$. When $K=3$ we can see clearly the influence of parameter $n$: the gap between the Bayes error and the \texttt{ERM} error reduces. 
%Finally, the \texttt{PG} classifier performs well also, except for model 1 (exponential) and $K=3$. Indeed in this case it seems that our procedure benefits more from the exponential case. Besides, the \texttt{ERM} is the best one in all cases. 



%\subsection{Implementation}\text{}
%\label{subsec:ERMImplementation}
%bacry work for law rank and high dimension \citep{BGM}

Note that our procedure also outputs estimates of the parameters $(\mu,\alpha, \beta)$. Although the estimation task is not our main purpose,
it is interesting to evaluate the accuracy of the obtained estimators. 
Figure~\ref{fig:boxplots} displays boxplots of the obtained estimates for $n \in \{100,1000\}$ for Model~1 with observations coming from the class $Y =1$.
%the light-blue boxplots are done with $n=100$ and the dark-blue once with $n=1000$. 
Again, we can see the impact of the parameter $n$. For $n = 1000$, the estimates of the three parameters are clearly more accurate than for $n= 100$.
Furthermore, for $n = 1000$, the resulting estimates are close to the true parameters.
 
\begin{figure}
    \centering
    \includegraphics[scale= 0.4]{BoxplotK2Class1.pdf}
    \caption{Boxplots of estimates of $(\mu,\alpha, \beta)$ of Model 1 for class $Y=1$ for $50$ repetitions. True parameters are $(1, 0.7, 1.3)$.}
    \label{fig:boxplots}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Discussion}
\label{sec:Dicussion}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We investigate the multiclass classification setting where the features come from
a mixture of simple linear Hawkes processes. In this framework, we derive the optimal predictor and provide a classification procedure tailored to this problem. The resulting algorithm relies on both plug-in and empirical risk minimization principles. We establish theoretical guarantees and illustrate the good performance of the method through a numerical study.    

In future work, we plan to extend our classification procedure to the
case where the observations come from a mixture of multidimensional Hawkes processes.
Indeed, in neuroscience, multivariate neuron spike data are modeled in order to take into account potential interactions between neurons \citep[see, e.g.,][]{Hansen15,rousseau}. Such an extension should therefore capture the interactions between neurons. In this framework, a challenge is to handle the high dimension of the parameter space. For example, with exponential kernels, a plug-in type classifier should benefit from algorithms such as ADM4, which is designed for the high-dimensional setting~\citep{BGM}.
%In the neuronal data framework, we easily imagine that the measure we have for each patient is not a single \textit{spike train} but measurements of a network of neurons (see for {\it e.g.} \citep{Hansen15}, \citep{rousseau}).

%A classification procedure can be adapted from the present work to get labels from empirical risk minimization. This should another capture the nature of interactions. 
%For example, we can consider exponential kernels, then
%plug-in type classifier should benefit from algorithm as ADM4 for high dimension (see \citep{BGM}). Indeed, the high dimension will be a real challenge here whether the kernel collection is assumed parametric or not.
%The will be the object of further works. 

Another possible development is the case of nonlinear Hawkes processes. A few works focus on this subject \citep[see, e.g.,][]{BM1996,lemonnier,Chi}. This would allow us to consider kernels which can take negative values to model an inhibitory behavior. The proposed algorithm should remain efficient. Nevertheless, it will be trickier to establish rates of convergence.
%for our classification  procedure.

Finally, we could also extend our method to a model with a common time-inhomogeneous baseline. This idea is considered in many applications \citep[see, e.g.,][]{changepoint} and could be an improvement of the present algorithm. 

%Finally, one can think about setting with a common time-inhomogeneous baseline. This idea is considered in many applications (see \emph{e.g.} \citep{changepoint}). Our method can be extended to this setting.







%%%%%%%%%%%%%%%%%%%%
\bibliography{BIB_UAI}
%%%%%%%%%%%%%%%%%%%%
%

\end{document}
