% \documentclass{uai2022} % for initial submission

\documentclass[accepted]{uai2022}

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{float} % added by Yu Chen
% \usepackage{subcaption} % added by Yu Chen

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
% \newcommand{\swap}[3][-]{#3#1#2} % just an example

% If you use natbib package, activate the following three lines:
% \usepackage[round]{natbib}
% \usepackage{natbib}
% \renewcommand{\bibname}{References}
% \renewcommand{\bibsection}{\subsubsection*{References}}

% use Times
\usepackage{times}
% For figures
\usepackage{graphicx} % more modern
%\usepackage{epsfig} % less modern
\usepackage{subfigure}

% % For citations
% \usepackage{natbib}

% For algorithms
\usepackage{algorithm}
\usepackage{algorithmic}

\usepackage{hyperref}

\renewcommand{\theHalgorithm}{\arabic{algorithm}}

\newcommand{\csize}{
\fontsize{8}{8}\selectfont
}

\newcommand{\csizenine}{
\fontsize{9}{9}\selectfont
}

\newcommand{\csizenineplus}{
\fontsize{9.5}{9.5}\selectfont
}

\newcommand{\csizeten}{
\fontsize{10}{10}\selectfont
}

\newcommand{\tabsize}{
\fontsize{7}{7}\selectfont
}

\renewcommand\algorithmiccomment[1]{
  {
  	{
% 	\csizenine    
  	{\textit{\%\ #1}}
  	}
  }
}

% \frenchspacing

\newcommand{\ug}[1]{{\color {magenta} #1}}
\usepackage{url}  %Required
% \frenchspacing  %Required
% \usepackage{amsmath}
\usepackage{verbatim}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{epstopdf}
\usepackage{lipsum}
\usepackage{color}
\usepackage[normalem]{ulem}

\newcommand{\cA}{{\mathcal{A}}}
\newcommand{\cB}{{\mathcal{B}}}
\newcommand{\cC}{{\mathcal{C}}}
\newcommand{\cD}{{\mathcal{D}}}
\newcommand{\cG}{{\mathcal{G}}}
\newcommand{\cI}{{\mathcal{I}}}
\newcommand{\cN}{{\mathcal{N}}}
\newcommand{\cM}{{\mathcal{M}}}
\newcommand{\cO}{{\mathcal{O}}}
\newcommand{\cP}{{\mathcal{P}}}
\newcommand{\bP}{{\mathbf{P}}}
\newcommand{\cR}{{\mathcal{R}}}
\newcommand{\cS}{{\mathcal{S}}}
\newcommand{\cH}{{\mathcal{H}}}
\newcommand{\cK}{{\mathcal{K}}}
\newcommand{\cT}{{\mathcal{T}}}
\newcommand{\cU}{{\mathcal{U}}}
\newcommand{\cV}{{\mathcal{V}}}
\newcommand{\cY}{{\mathcal{Y}}}
\newcommand{\cZ}{{\mathcal{Z}}}
\newcommand{\newsetminus}{{\!-\!}}
\newcommand{\cVmA}{{\cV\newsetminus\cA}}
\newcommand{\cX}{{\mathcal{X}}}
\newcommand{\cs}{s}
\newcommand{\cVms}{{\cV-\cs}}

\newcommand{\ba}{{\mathbf{a}}}
\newcommand{\bb}{{\mathbf{b}}}
\newcommand{\bu}{{\mathbf{u}}}
\newcommand{\bx}{{\mathbf{x}}}
\newcommand{\resid}{\cR}

\newcommand{\NP}{{\mathbf{NP}}}

% \DeclareMathOperator{\MIF}{MI} 

\newcommand{\bs}[1]{\boldsymbol{#1}}

\newcommand{\mb}[1]{\mathbf{#1}}

\newcommand{\mhk}{\cM^h_k}

\newcommand{\thmref}[1]{Theorem~\ref{#1}}
\newcommand{\tabref}[1]{Table~\ref{#1}}
\newcommand{\figref}[1]{Fig.~\ref{#1}}
\newcommand{\eqnref}[1]{Eq.~\ref{#1}}
\newcommand{\secref}[1]{Sec.~\ref{#1}}
\newcommand{\appref}[1]{Appendix~\ref{#1}}
\newcommand{\prcref}[1]{Procedure~\ref{#1}}
\newcommand{\assmref}[1]{Assumption~\ref{#1}}
\newcommand{\crlref}[1]{Corollary~\ref{#1}}
\newcommand{\algoref}[1]{Alg.~\ref{#1}}
\newcommand{\prpref}[1]{Proposition~\ref{#1}}
\newcommand{\cnjref}[1]{Conjecture~\ref{#1}}
\newcommand{\axmref}[1]{Axiom~\ref{#1}}
\newcommand{\lmaref}[1]{Lemma~\ref{#1}}

\newtheorem{lemma}{Lemma}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[lemma]{Corollary}
\newtheorem{procedure}[lemma]{Procedure}
\newtheorem{assumption}[lemma]{Assumption}
\newtheorem{claim}[lemma]{Claim}
\newtheorem{conclusion}[lemma]{Conclusion}
\newtheorem{proposition}[lemma]{Proposition}
\newtheorem{conjecture}[lemma]{Conjecture}
\newtheorem{axiom}[lemma]{Axiom}
\newtheorem{algo}[lemma]{Algorithm}
\newtheorem{definition}{Definition}
\newtheorem{remark}{Remark}


%additions suggested by Sahil
\newcommand{\s}[1]{\textcolor{magenta}{#1}}

% %deletions suggested by Sahil
% \newcommand{\sd}[1]{\textcolor{orange}{#1}}

%additions suggested by Sahil
\newcommand{\todo}[1]{\textcolor{blue}{Sahil's todo: #1}}

% \newcommand{\uai}[1]{\textcolor{brown}{#1}}

\newcommand{\te}{TE }
\newcommand{\tes}{TE}

\definecolor{shadecolor}{gray}{0.95}
\newcommand{\algshade}[1]{
    \hspace*{-\fboxsep}
    %\vspace*{-\fboxsep}
    \colorbox{shadecolor}{
        \parbox{\linewidth}{#1}
    }
}

% If your paper is accepted and the title of your paper is very long,
% the style will print as headings an error message. Use the following
% command to supply a shorter title of your paper so that it can be
% used as headings.
%
%\runningtitle{I use this title instead because the last one was very long}

% If your paper is accepted and the number of authors is large, the
% style will print as headings an error message. Use the following
% command to supply a shorter version of the authors names so that
% they can be used as headings (for example, use only the surnames)
%
%\runningauthor{Surname 1, Surname 2, Surname 3, ...., Surname n}

% \twocolumn[

% %% Self-defined macros
% \newcommand{\swap}[3][-]{#3#1#2} % just an example


\title{Estimating Transfer Entropy under Long Ranged Dependencies}

% Add authors

% {\href{mailto:<sahil.garg.cs@gmail.com>?Subject=Your UAI 2022 paper}

\author[1,*]{Sahil Garg}
\author[2]{Umang Gupta}
\author[1,$\dagger$]{Yu Chen}
\author[1,$\dagger$]{Syamantak Datta Gupta}
\author[1,$\dagger$]{Yeshaya Adler}
\author[1]{Anderson Schneider}
\author[1]{Yuriy Nevmyvaka}

\affil[1]{%
    Department of Machine Learning Research\\
    Morgan Stanley\\
    New York, NY, USA\\
}

\affil[2]{%
    Department of Computer Science\\
    University of Southern California\\
    Los Angeles, CA, USA
}
    
\affil[*]{Corresponding Author: sahil.garg.cs@gmail.com, sahil.garg@morganstanley.com}

\affil[$\dagger$]{Equal Contributions}

\begin{document}

\maketitle

\begin{abstract}
Estimating Transfer Entropy~(\tes) between time series is a highly impactful problem in fields such as finance and neuroscience. The well-known nearest neighbor estimator of \te potentially fails if temporal dependencies are \emph{noisy and long ranged}, primarily because it estimates \te \emph{indirectly}, relying on the estimation of \emph{joint entropy} terms in high dimensions, which is a hard problem in itself. Other estimators, such as those based on Copula entropy or conditional mutual information, have similar limitations. Leveraging the successes of modern discriminative models that operate in high dimensional~(noisy) feature spaces, we express \te as a difference of two \emph{conditional entropy} terms, which we \emph{directly} estimate from conditional likelihoods computed in-sample from any discriminator~(timeseries forecaster) trained per the maximum likelihood principle. To ensure that the in-sample log likelihood estimates are not overfit to the data, we propose a novel perturbation model based on locality sensitive hash~(LSH) functions, which regularizes a discriminative model to have smooth functional outputs within local neighborhoods of the input space. Our estimator is consistent, and its variance decreases linearly with the sample size. We also demonstrate its superiority w.r.t. state-of-the-art estimators through empirical evaluations on a synthetic dataset as well as real-world datasets from the neuroscience and finance domains.
\end{abstract}

\section{Introduction}
% 
Information theory plays a central role in modern machine learning for tasks like clustering, feature selection, representation learning, autoencoding, generative modeling, fairness, etc.~\citep{shannnon1948mathematical,cover1999elements,cicalese2019new,kingma2019introduction,song2019learning,song2021train}. A relatively new concept in information theory, introduced by \cite{schreiber2000measuring}, is \emph{transfer entropy}~(\tes) that quantifies the reduction in uncertainty about one time series given another~(see \figref{fig:transfer_entropy}). \te is theoretically and practically appealing for various domains, including finance and neuroscience~\citep{vicente2011transfer,jizba2012renyi,verSteeg2012WWW,ursino2020transfer,restrepo2020transfer,sipahi2020improving}. 
    
% barnett2009granger
% he2017comparison

\begin{figure}[t!]
\centering
\includegraphics[width=\columnwidth]{transfer_entropy.png}
\caption{Three different time series are shown in order to illustrate \emph{transfer entropy}~(\tes). There is a clear pattern of \te from the second time series~(green) to the first one (yellow), i.e. predictability of the observations in the first time series given the knowledge of the second one.
% 
Similarly, there is \te from the third time series~(navy) to the second one, though the dependencies are relatively complex.
% 
% Furthermore, observations to which the red arrows point to are explainable by only some of the many past observations since the temporal \emph{dependencies are long ranged and noisy}. This makes the problem of \te estimation challenging.
}
\label{fig:transfer_entropy}
\end{figure}
    
With the recent rapid advances in simultaneous high-density recordings of neural activity across multiple brain areas \citep{siegle2021survey, steinmetz2021neuropixels}, it is essential to have a scalable and robust model for estimating TE between neural ensembles in the presence of sparse signal and large noise due to ubiquitous neuron-to-neuron or trial-to-trial variance \citep{steinmetz2018challenges, kass2018computational}. In the finance domain, given the low signal-to-noise ratio, empirical models leverage advances in TE estimation by filtering out very weak explanatory time series~\citep{dimpfl2013,sandoval2014}.
    
\begin{figure}[!t]
% 
\centering
% 
% \subfigure[Estimation via CE.]{
% \includegraphics[
% width=0.66\columnwidth]{te_ours.png}
% \label{fig:te_ours}
% }
% 
\subfigure[kNN or Copula Estimator.]{
\includegraphics[
width=0.45\columnwidth]{te_joint_entropy.png}
\label{fig:te_joint_entropy}
}
% 
\subfigure[ITENE Estimator.]{
\includegraphics[
width=0.45\columnwidth]{te_cmi.png}
\label{fig:te_cmi}
}
% 
\caption{
%
% \te can be expressed in various forms, corresponding to its many estimators, which are mathematically equivalent but drastically different in practice.
% 
% We show the fundamental difference between \te estimators.
% 
% In all the figures, referring back to \figref{fig:transfer_entropy}, 
% 
Limitations of Estimators.
% 
The yellow dots represent the past~($\bs{\cY}_{t-1}$) and the present~($\cY_t$) of timeseries Y and the green ones are for the past~($\bs{\cX}_{t-1}$) of timeseries X.
% 
The present observation at timestep $t$ has noisy dependencies w.r.t. its own past and that of X.
% 
% For instance, the present observation of Y can not be explained from its own past, and many of the past observations of X are also irrelevant.
% 
% Considering the challenge of dependency based noise especially when estimating \te under long ranged dependencies, we advocate for estimating \te via estimation of two conditional entropy terms as shown in \figref{fig:te_ours}.
% 
In \ref{fig:te_joint_entropy}, kNN based estimation of \te requires estimation of four joint entropy terms in high dimensions which is a harder problem, unwarranted, and susceptible to dependency based noise.
% 
% , and doesn't account for dependency based noise; same applies for Copula based estimation of \tes.
% 
The same limitation applies to ITENE in \ref{fig:te_cmi}, as it computes \te as a difference of two mutual information terms in high dimensions.
% 
% In contrast, estimating conditional entropies should be a simpler problem leveraging upon expressive and robust discriminative models in modern machine learning.
% 
% \todo{try to explain most aptly why estimating joint entropies is problematic.}
}
\label{fig:te_forms}
\end{figure}
    
In practice, it is challenging to estimate \te especially under \emph{long-ranged \& noisy temporal dependencies}~\citep{lindner2011trentool,barnett2012transfer,zhang2019itene}.
        % 
A popular technique for estimating \te is based on the $k$ nearest neighbors~(kNN) method~\citep{kraskov2004estimating,lindner2011trentool}. However, this measures \te indirectly, through joint entropy~\citep{Kozachenko1987statistical,singh2016analysis}, which becomes problematic when temporal dependencies are noisy. A related approach is to estimate TE via Copula joint entropy or conditional mutual information~\citep{ma2019estimating,zhang2019itene}, which is also susceptible to dependency based noise; see \figref{fig:te_forms}.
% 
% \todo{clarify through a figure how it becomes problematic, what we do we mean by dependency based noise exactly.}
        
Noting that \te is represented by two conditional entropy components, in this paper, we propose a discriminative learning approach to the problem. Specifically, we obtain \emph{in-sample} estimates of conditional likelihoods from a discriminative model to estimate the conditional entropy terms directly. This allows us to exploit modern machine learning methods to predict a low dimensional variable, conditioned on a very high dimensional variable, even when many of the dimensions are noisy. Any discriminative model, \emph{trained as per the maximum likelihood principle}, can obtain \emph{in-sample} estimates of the conditional log likelihood. 
% 
% \todo{Why in-sample?}
    
For instance, deep neural nets trained with mean squared error as the loss function provide estimates of the conditional log likelihood for free, if errors are assumed to be Gaussian distributed. In our approach, one can also employ any probabilistic regression model for obtaining the conditional likelihoods~\citep{dabney2018implicit,alexandrov2020gluonts,guen2020probabilistic,rasul2021autoregressive,gouttes2021probabilistic,tang2021probabilistic,pal2021rnn,yoon2022robust,das2022top}. However, we advocate for the simpler approach mentioned above, as it accommodates a large variety of time series forecasters~\citep{bai2018empirical,oreshkin2019n,kitaev2019reformer,benidis2020neural,zeng2021topological,fan2021depts,gu2021efficiently,challu2022n}. For discrete-valued time series, one simple and generic choice is to obtain the conditional log likelihood from any classifier trained with cross-entropy loss.
    
We must ensure that the in-sample estimates of conditional log likelihood are not overfit to the time series. This is particularly relevant when the estimate is used to quantify whether additional information from another time series improves the~(in-sample) predictability of a given time series. Intuitively, a highly expressive discriminator~(timeseries forecaster) tends to learn a non-smooth function that overfits the training data points, which are sparsely populated in high-dimensional space. To address this, we take inspiration from the literature on adversarial learning for deep neural nets, where high susceptibility of models to small noise, imperceptible to humans, is a common problem. Such a phenomenon is observed despite standard regularization techniques, such as weight decay, dropout, batch normalization, etc. Quoting \cite{yoshida2017spectral}, ``adversarial training is designed to achieve insensitivity to the perturbation of training data''~\citep{goodfellow2014explaining,zhao2020maximum,dong2020adversarial}. While for the problem of \te estimation, there is neither an ``adversary'', nor a need to generalize to unseen domains~\citep{volpi2018generalizing}, the \emph{important takeaway} is that the functional outputs of an expressive discriminator must be regularized to be consistent w.r.t. the perturbations of its input, thus ensuring local Lipschitz-like properties of the output function~\citep{yang2020closer,jiang2020robust}. This allows a safe and robust in-sample estimation of \tes.
% 
% \todo{explain this through a figure}
    
The challenge is to select an appropriate perturbation model. Adding Gaussian noise is a popular choice. We argue in favor of an even more general perturbation model, which may be agnostic to the data distribution while respecting the underlying data manifold locally (since the desired smoothness of model outputs is \emph{local} in input space), and preferably non-stationary w.r.t. input space. Accordingly, we propose a novel perturbation model, based on \emph{locality sensitive hashing~(LSH)}, which is a well known technique for finding nearest neighbors in high dimensions~\citep{indyk1998approximate,kulis2009kernelized,grauman2013learning,zhao2014locality,wang2017survey}. As per the theory of LSH, a hashcode represents a local neighborhood in the input space, characterizing the underlying data manifold locally. We propose that the outputs of a discriminative model should be consistent within a hashcode bin~(one can think of it as a histogram bin in high dimensions), which is accomplished by generating perturbations local to a bin. 
    
Sampling perturbations from a hashcode bin does not introduce any bias, since the hashcode bins correspond to data-driven histograms capable of characterizing the underlying true data distribution~\citep{lugosi1996consistency}, as we show in our theoretical analysis of the estimator~(\secref{sec:theory}). 
% 
For practical purposes, perturbations are generated from a convex combination of the existing data points from the same bin~(i.e. sampling from within a convex hull of datapoints); see \figref{fig:hashcodes_on_manifold}.
% 
Furthermore, we define a simple yet effective information theoretic measure to ensure the consistency of the model outputs, i.e. \emph{minimize conditional entropy} of the model output given the locality sensitive hashcodes of perturbed inputs. 
    
The rest of this paper is organized as follows. After discussing the basics of \te and related works in \secref{sec:background}, we present a novel \te estimator in \secref{sec:robust_te_adv_reg_hash}, along with theoretical guarantees~(\secref{sec:theory}). A thorough empirical analysis using a synthetic dataset, a neuroscience dataset of activity in different brain regions, and two financial datasets of high frequency trading activity in US stocks is provided in \secref{sec:experiments}. Code is available at \url{github.com/morganstanley/MSML/papers/Direct_Estimate_Transfer_Entropy}.

\section{Background}
\label{sec:background}

Transfer entropy, introduced originally by \cite{schreiber2000measuring}, refers to the reduction in uncertainty for forecasting a time series given the knowledge of another time series. 
    
Let $X$, $Y$ be two discrete- or real-valued time series.
% 
% defined over some probability space $(\Omega, \mathcal{F}, \mathbb{P})$. 
% 
Let $\cX_t$, $\cY_t$ be the random variables denoting $X$, $Y$ at time $t$, respectively, where $t \in \{0, 1, \dots\}$, and let $x_t$, $y_t$ be their observed values. Furthermore, let $\bs{\cY}_{t}$ denote the $(t+1)$-dimensional vector $\bs{\cY}_{t}\equiv (\cY_0, \cdots, \cY_{t})$, and let $\mb{y}_{t}$ be a realization of $\bs{\cY}_t$. Then, the conditional entropy of $\cY_t$ given its past observations, i.e., the uncertainty in forecasting $Y$ for the current timestep conditioned on its history, is represented as
% below.
% 
\begin{align}
% \nonumber
% \cH\left(
% \cY_{t} | \cY_0, \cdots, \cY_{t-1}
% \right)
% \equiv
\cH(\cY_t|\bs{\cY}_{t-1})
% \\
% \nonumber
% \cH(\cY_t|\bs{\cY}_{t-1})
% =
% \int_{\bs{\cY}_t} 
% p(\bs{y}_t)
% \log p(y_{t}|\bs{y}_{t-1}) d\bs{y}_t
% 
% \cH(\cY_t|\bs{\cY}_{t-1})
% =
\equiv
-
\mathbb{E}_{\mb{y}_{t} \sim \bs{\cY}_{t}}
\left[
\log p(y_{t}|\mb{y}_{t-1})
\right].
% 
\end{align}
% 
Uncertainty when forecasting time series Y, given the knowledge of both its own past as well as X's previous realizations, is expressed as the following conditional entropy:
% 
$
% \nonumber
\cH(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1}).
$
% 
Mathematically, \te is the difference between the two conditional entropy terms, measuring the additional information on $\cY_t$ available in the past realizations of $X$, that is not already present in the past of $Y$.
% 
\begin{align}
\label{eqn:transfer_entropy_ce_expr}
\cT_{X \to Y}
=
\cH(\cY_t|\bs{\cY}_{t-1})
-
\cH(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1})
& \geq 0 
\end{align}
% 
% Next we discuss limitations of the existing estimators of \tes.
        % 
% \subsection{Limitations of \te Estimators}
% 
A popular approach for estimating \te is based on $k$ nearest neighbors~\citep{lindner2011trentool,zhu2015contribution}, which measures TE indirectly through joint entropy~\citep{Kozachenko1987statistical}, so \eqnref{eqn:transfer_entropy_ce_expr} must be rewritten as
% 
\begin{align}
% \label{eqn:transfer_entropy_je_expr}
&
\cT_{X \to Y} 
=
\cH(\cY_t|\bs{\cY}_{t-1})
-
\cH(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1})
\nonumber
\\
&=
\cH(\bs{\cY}_{t})
\!-\! 
\cH(\bs{\cY}_{t-1})
\!-\!
\cH(\bs{\cY}_{t}, \bs{\cX}_{t-1})
\!+\!
\cH(\bs{\cY}_{t-1}, \bs{\cX}_{t-1}).
\nonumber
\end{align}
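% 
For completeness, the second equality is an instance of the chain rule of entropy. As a brief worked step in the notation above, using $\bs{\cY}_{t} = (\bs{\cY}_{t-1}, \cY_t)$,
% 
\begin{align}
\cH(\cY_t|\bs{\cY}_{t-1})
&=
\cH(\bs{\cY}_{t})
-
\cH(\bs{\cY}_{t-1}),
\nonumber
\\
\cH(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1})
&=
\cH(\bs{\cY}_{t}, \bs{\cX}_{t-1})
-
\cH(\bs{\cY}_{t-1}, \bs{\cX}_{t-1}),
\nonumber
\end{align}
% 
and subtracting the second identity from the first yields the four joint entropy terms, each of which is defined over a high dimensional space when the histories are long.
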
% 
Due to significantly different scales of distances across these four terms, the error biases do not cancel each other. Attempts to correct the compounding of biases by estimating joint entropy terms together, using nearest neighbors in the joint space of all the variables, cannot adequately address vulnerability to the dependency based noise~\citep{kraskov2004estimating,lindner2011trentool}. The above formulation is particularly problematic when $Y$ has a long memory, and consequently, the conditioning random variable, $\bs{\cY}_{t-1}$, is high dimensional, with noisy dependencies w.r.t. the target variable $\cY_t$; for example, when only a few of the many dimensions in $\bs{\cY}_{t-1}$ explain $\cY_t$, while the other ones are noise components. The same applies to noisy dependencies of $\bs{\cX}_{t-1}$ w.r.t. $\cY_t$, when estimating $\cH(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1})$. When estimating \te between time series from real world domains like finance and neuroscience, it is natural to expect such long ranged, noisy temporal dependencies. In a recent work, \cite{ma2019estimating} show \te to be equivalent to Copula entropy, but their estimator also relies upon estimating joint entropy terms in high dimensions.
        % 
Kernel density or histogram-based estimators~\citep{verSteeg2012WWW,zuo2013adaptive} also suffer from dependency based noise. Even without noise, these estimators are efficient only in low dimensions.
% , which is a hard problem on its own.
    
\te can also be expressed as conditional mutual information: 
% 
\begin{align}
% &
\cT_{X \to Y}
% =
% \cI(\bs{\cX}_{t-1}:\cY_t | \bs{\cY}_{t-1}) 
% \nonumber
% \\
% & 
= 
\cI(\bs{\cX}_{t-1}: \bs{\cY}_{t}) - \cI(\bs{\cX}_{t-1} : \bs{\cY}_{t-1}).
\label{eqn:cmi_te}
\end{align}
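% 
To see why the difference in \eqnref{eqn:cmi_te} is a conditional mutual information, one can apply the chain rule of mutual information with $\bs{\cY}_{t} = (\bs{\cY}_{t-1}, \cY_t)$; a short sketch in the notation above is
% 
\begin{align}
\cI(\bs{\cX}_{t-1}: \bs{\cY}_{t})
&=
\cI(\bs{\cX}_{t-1} : \bs{\cY}_{t-1})
+
\cI(\bs{\cX}_{t-1} : \cY_t | \bs{\cY}_{t-1}),
\nonumber
\\
\cI(\bs{\cX}_{t-1} : \cY_t | \bs{\cY}_{t-1})
&=
\cH(\cY_t|\bs{\cY}_{t-1})
-
\cH(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1}).
\nonumber
\end{align}
% 
Hence the right-hand side of \eqnref{eqn:cmi_te} equals $\cI(\bs{\cX}_{t-1} : \cY_t | \bs{\cY}_{t-1})$, which coincides with \eqnref{eqn:transfer_entropy_ce_expr}.
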
% 
While theoretically appealing, it is notoriously difficult to estimate mutual information between two high dimensional variables. \cite{zhang2019itene} propose to estimate \te through \eqnref{eqn:cmi_te} using the deep neural nets based estimator of mutual information~(MINE) due to \cite{belghazi2018mutual}. \cite{mcallester2020formal} show that MINE has high variance which increases with true mutual information value itself, which renders it unsuitable for estimating TE.
% 
Next, we introduce a novel estimator of \te in \secref{sec:robust_te_adv_reg_hash}.

\begin{figure}[!t]
\centering
\subfigure[2-D space.]{
\includegraphics[
width=0.54\columnwidth]{adv_gen_reg.png}
\label{fig:2d_hashcodes}
}
\subfigure[2-D manifold.]{
\includegraphics[
width=0.35\columnwidth]{hashcodes_on_manifold.png}
\label{fig:3d_hashcodes}
}
\caption{This figure illustrates the concept of hashcode-based perturbations to regularize the output function. In \ref{fig:2d_hashcodes}, data points~(\emph{Red} or \emph{Yellow} dots) are dispersed in 2-D space; the  lines represent hash functions, and their intersections correspond to \emph{hashcode bins}. Each bin represents a local neighborhood in the input space, data points in a bin being neighbors of each other. 
% 
Owing to the locality of a hashcode bin w.r.t. the data manifold, perturbations are generated~(\emph{Blue} dots) in a bin from randomly sampled convex combinations of the existing data points.
% 
We propose that model outputs for the perturbations within a hashcode bin should be consistent w.r.t. each other.
% 
In \ref{fig:3d_hashcodes}, hashcode bins from \ref{fig:2d_hashcodes} are shown on a manifold embedded in 3-D space, illustrating how hashcode bins represent the data manifold locally, thereby leading to a non-stationary perturbation function. For instance, the bin corresponding to yellow dots has a data distribution of highest entropy, thus implying perturbations of the greatest magnitude in the bin.}
\label{fig:hashcodes_on_manifold}
\end{figure}

\section{Direct Estimate of \tes}
% 
\label{sec:robust_te_adv_reg_hash}
% 
We propose a direct empirical estimate of \te that leverages highly expressive timeseries forecasting models, such as those in deep learning. While the key idea is simple and general, we discuss how and why this approach should work well in practice despite potential concerns such as model mis-specification, mis-calibration, etc. Moreover, we introduce a novel perturbation based regularization model for ensuring a robust estimate of \tes. In \secref{sec:theory}, we establish that the estimator is consistent with low variance.
% 
% the perturbations based regularization do not introduce any bias into the estimator, and the estimator has zero bias of its own owing to universal approximation property of neural nets.
            
Referring back to \eqnref{eqn:transfer_entropy_ce_expr} in \secref{sec:background}, we propose to estimate \te via the direct estimation of the \emph{conditional entropy} terms.
% 
% A direct empirical estimate of the conditional entropy term, 
% 
A direct empirical estimate of the conditional entropy term $\cH(\cY_t|\bs{\cY}_{t-1})$ is expressed as
% 
\begin{align}
\hat{\cH}(\cY_t|\bs{\cY}_{t-1})
=
- \frac{1}{n} \sum_{i=1}^n \log p(y_t^{(i)}|\mb{y}_{t-1}^{(i)}).
% \nonumber
\end{align}
% 
Identical estimation logic applies for $\cH(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1})$. Here, we make an important observation: a discriminative model that is trained by \emph{maximizing the conditional log likelihood} of the target variable given the input variable can be employed as an estimator of conditional entropy.
    
For discrete-valued time series with support set $Z$, a classifier trained by cross entropy loss can provide an empirical estimate of conditional entropy itself.
% 
\begin{align}
\hat{\cH}_q(\cY_t|\bs{\cY}_{t-1}) 
&=
\frac{1}{n}
\sum_{i=1}^n 
-
\sum_{z \in Z} 
\mathbb{I}_{y_t^{(i)}=z}
\log 
f_z(\mb{y}_{t-1}^{(i)})
% \nonumber
\\
&=
\frac{1}{n} 
\sum_{i=1}^n 
-
\log 
q(y_t^{(i)} | \mb{y}_{t-1}^{(i)}), 
% \nonumber
% 
\end{align}
% 
where $\mathbb{I}_{(.)}$ is an indicator function; $f(\mb{y}_{t-1}^{(i)})$ is the output of the classifier, a multinomial vector of inferred class probabilities with components $f_z$; and $\log q(y_t^{(i)} | \mb{y}_{t-1}^{(i)})$ is an estimate of the conditional log likelihood. It is well known that, by Jensen's inequality, a proposal distribution $q(.)$ for $p(.)$ upper bounds the corresponding entropy function, $\cH_q \geq \cH_p$, with the error bias being the KL-divergence between the two distributions, $\cD_{KL}(p(.|.)||q(.|.)) \geq 0$~\citep{cover1999elements}.
    
For continuous valued time series, any regression model trained with mean squared error as the loss function can be employed to estimate conditional entropy if  errors are assumed to be Gaussian distributed:
% 
$y_t \sim \cN(f(\mb{y}_{t-1}), \sigma)$.
% 
$f(\mb{y}_{t-1})$ can be a highly expressive deep neural net which can essentially approximate any functional relationship between target $y_t$ and input $\mb{y}_{t-1}$.
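% 
As a worked illustration of this connection, the following is a sketch under the stated Gaussian assumption, writing $\sigma^2$ for the error variance, assumed here to be constant across inputs, and $\hat{\sigma}^2$ for its maximum likelihood estimate (both symbols are introduced only for this illustration). The negative conditional log likelihood of a single observation is
% 
\begin{align}
- \log q(y_t|\mb{y}_{t-1})
=
\frac{1}{2} \log (2 \pi \sigma^2)
+
\frac{(y_t - f(\mb{y}_{t-1}))^2}{2 \sigma^2}.
\nonumber
\end{align}
% 
Averaging over $n$ samples and plugging in $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_t^{(i)} - f(\mb{y}_{t-1}^{(i)}))^2$, i.e., the in-sample mean squared error, gives
% 
\begin{align}
\hat{\cH}_q(\cY_t|\bs{\cY}_{t-1})
=
\frac{1}{2} \log (2 \pi e \hat{\sigma}^2),
\nonumber
\end{align}
% 
so that, under this assumption, a lower mean squared error translates directly into a lower conditional entropy estimate.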

Besides the above two simple and generic choices, which are prevalent in the supervised deep learning literature, any deterministic discriminator~(timeseries forecaster) trained with a maximum likelihood objective, or any probabilistic discriminator, is equally applicable here.
% for estimating \tes.

% \todo{not limited to mean squared error for regression with conditional likelihood.}

% \todo{Heteroskedastic noise in continuous-valued timeseries.}

As per the above, a discriminator can act as a conditional entropy estimator with the advantage of being efficient even in very high dimensional noisy feature spaces.
% 
The empirical estimate of \te is directly expressible in terms of the ratio of the two conditional likelihood terms, $q(y_t|\mb{y}_{t-1}, \mb{x}_{t-1})$ and $q(y_t|\mb{y}_{t-1})$.
% 
The estimator has an \emph{error bias} naturally inherited from the model bias in the discriminator. 
% 
The choice of model $q(.|.)$, including its hyperparameters, is the same for estimating the two terms. This should help reduce the model bias, since the empirical estimate relies only on the ratio of the two terms. 
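% 
Concretely, combining the two empirical conditional entropy estimates, the plug-in estimate can be sketched as follows, where $\hat{\cT}_{X \to Y}^q$ is shorthand introduced here for illustration:
% 
\begin{align}
\hat{\cT}_{X \to Y}^q
&=
\hat{\cH}_q(\cY_t|\bs{\cY}_{t-1})
-
\hat{\cH}_q(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1})
\nonumber
\\
&=
\frac{1}{n}
\sum_{i=1}^n
\log
\frac{
q(y_t^{(i)} | \mb{y}_{t-1}^{(i)}, \mb{x}_{t-1}^{(i)})
}{
q(y_t^{(i)} | \mb{y}_{t-1}^{(i)})
}.
\nonumber
\end{align}
% 
For example, if both regressors assume Gaussian errors with a constant variance set to their respective in-sample mean squared errors, denoted here by the hypothetical symbols $\hat{\sigma}^2_{Y}$ and $\hat{\sigma}^2_{YX}$, each entropy estimate equals $\frac{1}{2} \log (2 \pi e \hat{\sigma}^2)$ and the expression reduces to $\frac{1}{2} \log (\hat{\sigma}^2_{Y} / \hat{\sigma}^2_{YX})$.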

Furthermore, the goal is to quantify the decrease in uncertainty, not to maximize the forecasting accuracy of the original model. Hence, among the many choices of neural architectures for timeseries forecasting, such as TCNs, RNNs, and Transformers, one that is known to be more robust to overfitting and model mis-calibration is preferred.
% 
% Since our goal is to quantify the decrease in uncertainty, and not to maximize the accuracy of the original model for forecasting, simpler architectures are preferred. 
% 
One can even use more lightweight versions of neural architectures than those used for forecasting, and employ standard generalization techniques such as weight decay, dropout, early stopping, small batch size, etc.
        
In theory, the error bias for estimating \te with conditional likelihood $q(.|.)$ stemming from a discriminator, as opposed to true conditional likelihood $p(.|.)$, is as below.
% 
\begin{align}
\cT_{X \to Y}^q
-
\cT_{X \to Y}
=
\cD_{KL}(p(y_t|\mb{y}_{t-1})||q(y_t|\mb{y}_{t-1}))
\nonumber
\\
-
\cD_{KL}(p(y_t|\mb{y}_{t-1}, \mb{x}_{t-1})||q(y_t|\mb{y}_{t-1}, \mb{x}_{t-1}))
% \nonumber
% 
\label{eqn:te_q_error}
\end{align}
% 
The two terms on the r.h.s. are KL-divergence terms, which are the error biases of $\cH^q(\cY_t|\bs{\cY}_{t-1})$ and $\cH^q(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1})$, respectively. Since both terms are non-negative, $\cD_{KL}(.||.) \geq 0$, they \emph{counteract} each other, leading to a smaller magnitude of the overall error bias of TE. Theoretically, one should expect the error bias of $\cH^q(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1})$ to be larger than or equal to its counterpart due to the conditioning upon a higher number of dimensions; this should lead to a net negative error bias. In addition, there is also a bias due to the finite sample size for both of the conditional entropy terms, which could be positive or negative; we analyze the variance of the estimates due to finite sample size in \secref{sec:theory}.
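
The decomposition in \eqnref{eqn:te_q_error} follows from the standard identity that a cross entropy equals the entropy plus a KL-divergence. As a brief sketch, with the expectation taken under the true distribution $p$ and the KL-divergence understood as averaged over the conditioning variable,
% 
\begin{align}
\cH^q(\cY_t|\bs{\cY}_{t-1})
&=
- \mathbb{E}_{p}
\left[
\log q(y_t|\mb{y}_{t-1})
\right]
\nonumber
\\
&=
\cH(\cY_t|\bs{\cY}_{t-1})
+
\cD_{KL}(p(y_t|\mb{y}_{t-1})||q(y_t|\mb{y}_{t-1})),
\nonumber
\end{align}
% 
and analogously for $\cH^q(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1})$. Subtracting the two expansions and using \eqnref{eqn:transfer_entropy_ce_expr} leaves exactly the difference of the two KL-divergence terms.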
    
% $q(.|.)$ is a discriminator that is optimized in the expression cited in the remark under Corollary 1. $y_t$ is the ground truth data point (not a variable under optimization), $q(.|.)$ is learned by maximizing the conditional likelihood of $y_t$.
% 
% Note that the choice of model, $q(.|.)$, is the same for estimating either of the two conditional likelihood terms.
% 
% Optimization of weights (through backpropagation) is performed separately for the two, since their inputs differ. Using a single model (including the hyperparameters) for estimating the two conditional likelihood terms especially 
% 
% This also helps \emph{reduce model bias} given that the TE estimator relies only on the ratio of the two terms.
    
% Furthermore, from many choices of neural architectures for timeseries forecasting, one can choose one which is known to be more robust to overfitting, model mis-calibration, etc.
% % 
% Since our goal is to quantify the decrease in uncertainty, and not to maximize the accuracy of the original model for forecasting, simpler architectures are preferred. 
% % 
% One can even use more lightweight versions of neural architectures than those used for forecasting, and employ standard generalization techniques such as weight decay, dropout, early stopping, small batch size, etc.
% % 
% % Further, we decided on architectural choices and hyperparameters such that it is less sensitive to the initialization of weight parameters.
    
For practical purposes, we suggest normalizing \te with the first conditional entropy term in \eqnref{eqn:transfer_entropy_ce_expr}, quantifying relative decrease in uncertainty. This measure is more robust to a potential issue of model mis-calibration.
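
As an illustration, one way to write this normalized quantity is given below. The symbol $\bar{\cT}_{X \to Y}$ is hypothetical (introduced only here), and the sketch assumes that \te is the difference of the two conditional entropies and that $\cH(\cY_t|\bs{\cY}_{t-1}) > 0$:
%
\begin{align}
\bar{\cT}_{X \to Y}
=
\frac{\cT_{X \to Y}}{\cH(\cY_t|\bs{\cY}_{t-1})}
=
1 - \frac{\cH(\cY_t|\bs{\cY}_{t-1}, \bs{\cX}_{t-1})}{\cH(\cY_t|\bs{\cY}_{t-1})}.
\nonumber
\end{align}
%
For discrete-valued $\cY_t$, this ratio lies in $[0, 1]$: it is $0$ when the past of $X$ removes no uncertainty about $\cY_t$, and $1$ when it removes all of it.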
           
To ensure a robust estimate of \tes, we also propose a perturbation-based regularizer.
            
\subsection{Regularize via LSH-Perturbations}

Machine learning literature provides a plethora of discriminative models for time series modeling which can be employed in our TE estimator. The challenge, however, is that an expressive discriminator like a deep neural net, with an ability to learn any function, can overfit to training data, even under standard generalization techniques such as weight decay, dropout, etc. In sparsely populated regions of the underlying manifold of model inputs, the output function may be non-smooth. 

% \todo{Model mis-specification and mis-calibration.}

Inspired by recent works in adversarial training of deep neural nets~\citep{goodfellow2014explaining,zhao2020maximum,dong2020adversarial}, we propose to accomplish Lipschitz-like smoothness of the output function by ensuring that model outputs are consistent w.r.t. perturbations in inputs.  A good choice for a perturbation model should be able to characterize the underlying data manifold locally, since the perturbations are supposed to be local w.r.t. inputs.
    
The general concept of data perturbations based regularization can be formalized as below.
% 
\begin{align}
\bar{\mb{y}} \sim g(\mb{y})
\label{eqn:perturb_func}
\end{align}
% 
Here, $g(.)$ is a model that generates perturbations for a given input $\mb{y}$. An explicit way to ensure that the model output function, $f(.)$, is consistent w.r.t. perturbations in input, i.e.,  $f(\mb{y}) = f(\bar{\mb{y}})$, is to define a regularization penalty for inconsistent model outputs on perturbations. Later in this section, we present an information theoretic regularization criterion which expects model outputs on perturbations of a given input to be of low entropy. One can even employ a non-invasive regularization by tuning the hyper-parameters of a model. An implicit way to ensure consistency  is to augment the training data with perturbed data points. Data augmentation is not used here to generalize to unseen domains, but to  ensure that the learned model output function is smooth in the local vicinity of training data points, especially if those points were sparsely populated in the input space.
        
A perturbation model challenges the primary (discriminative) model by perturbing its inputs locally in the data manifold. It can be parametric or non-parametric: perturbations based on Gaussian noise are described in~\citep{rothfuss2019conditional,maaten2013learning,bishop1995training}. A good choice of perturbation model should characterize the data manifold locally, whereas the primary model may be inefficient at modeling the local manifold. While not necessary for practical purposes, if the perturbation model can also characterize the underlying true data distribution, $p(\mb{y}_{t-1}, y_t)$, it would theoretically ensure that there is no error bias from using the perturbations based regularization or data augmentations for estimating \te as we show in \secref{sec:theory}.
        
We propose a perturbation model based upon \emph{locality sensitive hashing}~(LSH), which perturbs inputs in local neighborhoods of input space. Such neighborhoods are defined from the hashcodes that split the input space into different regions. LSH is a randomized algorithm that is proven to be efficient in finding nearest neighbors in very high dimensions~\citep{indyk1998approximate,zhao2014locality,wang2017survey}. The core idea is that data points that are similar according to some distance metric are assigned the same hashcode with a probability that decreases as the distance between them grows.
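
As a concrete illustration (an example family, not necessarily the hash functions used in the experiments), consider random-hyperplane hashing, where the $k^{\text{th}}$ bit of the hashcode is $1$ if $\mb{w}_k^{\top}\mb{y} \geq 0$ and $0$ otherwise, with $\mb{w}_k$ drawn from an isotropic Gaussian. The symbols $\mb{w}_k$, $h_k$, and $\theta$ are introduced only for this sketch. For two inputs $\mb{y}$ and $\mb{y}'$ separated by an angle $\theta$, and a hashcode $\mb{h}(.)$ built from $H$ independently drawn hyperplanes,
%
\begin{align}
\Pr[h_k(\mb{y}) = h_k(\mb{y}')] &= 1 - \frac{\theta}{\pi},
\nonumber
\\
\Pr[\mb{h}(\mb{y}) = \mb{h}(\mb{y}')] &= \left(1 - \frac{\theta}{\pi}\right)^{H}.
\nonumber
\end{align}
%
The collision probability thus decays with the angular distance, and more sharply so for longer hashcodes, which is what makes a hashcode bin behave like a local neighborhood.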
            
This theoretical property of hashcodes implies that a hashcode bin represents a local neighborhood~(manifold) in input space. We propose a hashcode-based regularization such that model outputs are smooth w.r.t. perturbations of inputs within a hashcode bin. 
% 
% Since a hashcode bin represents the data manifold \emph{locally}, 
% 
We generate perturbations from randomly sampled \emph{convex combinations} of the existing data points in a bin; see \figref{fig:hashcodes_on_manifold} for an illustration. In essence, LSH plays the role of histograms in high dimensions.
% 
% Note that any LSH model should give the same hashcode for all perturbations generated within the convex hull. 
% 
With histogram bins, the bin boundaries are explicit, so sampling from within a bin is easy.
In a high-dimensional setting, in contrast, the boundaries of a bin can only be estimated, for example from the convex hull of the data points in that bin.
% 
% On a related note, there can be unobserved data points outside the convex hull but within the unknown boundaries of the hashcode bin; from that perspective, perturbations from the convex hull can lead to a sampling bias in practice.
    
One advantage of this approach is that the perturbation model is non-stationary w.r.t. the input space,  since the perturbations are generated locally w.r.t. hashcode bins. Furthermore, the perturbation model does not make any parametric assumption about the global distribution of data or the data manifold. Mathematically, \eqnref{eqn:perturb_func} can be re-expressed for the hashcodes based perturbation function as follows.
% 
\begin{align}
\bar{\mb{y}} \sim g(\mb{c})
\quad \text{s.t.} \quad
\mb{h}(\mb{y}) = \mb{h}(\bar{\mb{y}}) = \mb{c},
% \nonumber
\end{align}
% 
where $\mb{h}(.)$ is an LSH function, represented by a set of $H$ hash functions, each outputting one bit, mapping an input $\mb{y}$ to its hashcode $\mb{c} \in \{0, 1\}^H$. The perturbation model $g(.)$ samples a perturbation w.r.t. a hashcode bin, and not a single input. In practice, it is efficient to sample all the perturbations together across all the hashcode bins.
    
Pseudocode is presented in \algoref{alg:perturb_lsh}. The input data points $(\mb{y}^{(1)}, \cdots, \mb{y}^{(n)})$ are the inputs of the model that we regularize, i.e. a discriminator for our problem. First, we compute hashcodes for all the data points and arrange the data into their corresponding unique hashcode bins. For each bin, a Dirichlet distribution is then initialized with hyperparameter $\alpha$, and of dimension equal to the number of data points in the bin. For $n_i$ data points in the $i^{\text{th}}$ bin, we sample $n_i b$ perturbations from that bin. Each perturbation is sampled in turn by randomly drawing a probability vector from the Dirichlet distribution, which provides the weights of a random convex combination of all the points in the bin.
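
In symbols, a single perturbation in the $i^{\text{th}}$ bin can be sketched as follows, where $\mb{y}^{(i,k)}$ is a shorthand (introduced only here) for the $k^{\text{th}}$ data point of that bin:
%
\begin{align}
\mb{\pi} &\sim \mathrm{Dir}(\alpha \mb{1}_{n_i}),
\qquad
\pi_k \geq 0,
\quad
\sum_{k=1}^{n_i} \pi_k = 1,
\nonumber
\\
\bar{\mb{y}} &= \sum_{k=1}^{n_i} \pi_k \, \mb{y}^{(i,k)}.
\nonumber
\end{align}
%
By construction, $\bar{\mb{y}}$ lies in the convex hull of the points in the bin. For $\alpha < 1$, Dirichlet samples tend to concentrate most of their mass on a few coordinates, so a perturbation typically stays close to a small number of existing points; larger values of $\alpha$ push the samples toward the centroid of the bin.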
            
% \todo{Convex combinations of points mapped to the same hashcode are still mapped to the same hashcode. Is this a trivial fact?}
    
Since we are interested in local smoothness given by hashcodes bins representing local neighborhoods, we propose a regularization criterion based on information theory. In particular, we minimize the conditional entropy of the model outputs given the hashcode representation of the inputs:
% 
\begin{align}
\min_{f(.)} \cH(f(\bs{\bs{\cY}})|\bs{h}(\bs{\bs{\cY}})).
\label{eqn:hash_perturb_reg}
% \nonumber
\end{align}
% 
Although $f(.)$ \& $\mb{h}(.)$ are deterministic, both $f(\bs{\bs{\cY}})$ \& $\bs{h}(\bs{\bs{\cY}})$ are stochastic given their dependence on $\bs{\cY}$. An empirical estimate of this regularization term is easy and cheap to compute. For each hashcode bin, we compute the model outputs for the existing data points as well as the perturbations sampled via \algoref{alg:perturb_lsh}, and then we compute an empirical estimate of the entropy of those outputs within the bin. Since the target variable is one-dimensional in our problem, i.e. observation $y_t$ for timestep $t$, computing the entropy of the model outputs is easy even for the case of conditional densities; one can, for example, use non-parametric estimators like histograms. This way, we iterate through all the bins to finally compute the conditional entropy term, $\cH(f(\bs{\bs{\cY}})|\bs{h}(\bs{\bs{\cY}}))$.
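
A minimal sketch of this computation, assuming a plug-in histogram estimator with $K$ cells, is as follows; the symbols $N_i$, $N$, $K$, and $\hat{p}_{ik}$ are introduced here only for illustration. Let $N_i = n_i + \bar{n}_i$ be the number of original and perturbed points in the $i^{\text{th}}$ bin, $N = \sum_{i=1}^{m} N_i$, and $\hat{p}_{ik}$ the fraction of model outputs in bin $i$ that fall into histogram cell $k$. Then
%
\begin{align}
\hat{\cH}(f(\bs{\cY})|\bs{h}(\bs{\cY}))
=
\sum_{i=1}^{m} \frac{N_i}{N}
\left(
- \sum_{k=1}^{K} \hat{p}_{ik} \log \hat{p}_{ik}
\right),
\nonumber
\end{align}
%
with the convention $0 \log 0 = 0$. The estimate is zero exactly when, within every hashcode bin, all outputs fall into a single cell, i.e., when the model output is constant on each bin up to the histogram resolution.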

We use the regularization criterion in \eqnref{eqn:hash_perturb_reg} in a non-invasive manner, either to tune the vast space of hyperparameters of a model like GBM, or for data augmentation to regularize the model with LSH perturbations. As for tuning the LSH model, both the type and the number of hash functions can be tuned independently of the TE estimation problem.
% 
It is valuable to keep the mean (or minimum) number of data points per bin above a certain threshold. The threshold values can be decided intuitively so as to impose the desired regularization upon the neural output function.
        
% \todo{Eq. 12 in Sec. 3.1, how does it enter optimization?}

\begin{algorithm}[tp!]
\caption{Generate Perturbations via LSH}
% 
% \csizenine
% 
\begin{algorithmic}[1]
%
\REQUIRE{$\{ \mb{y}^{(1)}, \cdots, \mb{y}^{(n)} \}$, $\alpha$, $b$}\\
% 
\STATE $\mb{c}^{(1)}, \cdots, \mb{c}^{(n)} \gets $ computeHashcode($ \mb{y}^{(1)}, \cdots, \mb{y}^{(n)} $)
% 
\STATE $\mb{Y}^{(1)}, \cdots, \mb{Y}^{(m)} \gets$ hashcodeBin($\{ (\mb{y}^{(i)}, \mb{c}^{(i)})\}_{i=1}^n$) \COMMENT{$\mb{Y}^{(i)}$ has all the inputs from the $i^{\text{th}}$ hashcode bin}
% 
\FOR{$i=1 \to m$}
% 
\STATE $n_i \gets$ countSamplesInBin($\mb{Y}^{(i)}$)
% 
\STATE $\bar{n}_i \gets n_i \cdot b$ \COMMENT{no. of perturbations in the $i^{\text{th}}$ bin}
% 
% 
\FOR{$j=1 \to \bar{n}_i$}
% 
\algshade{
% 
\STATE $\mb{\pi}_j \sim \mathrm{Dir}(\alpha \mb{1}_{n_i})$ \COMMENT{sample convex combination}
% 
\STATE $\mb{\bar{y}}^{(i)}_j \gets \mb{\pi}_j^{\top} \mb{Y}^{(i)}$ \COMMENT{perturbation in the bin}
% 
}
% 
\ENDFOR
% 
\ENDFOR
% 
\STATE \textbf{Return} $\{ ( \mb{\bar{y}}_j^{(i)}, \mb{c}_j^{(i)} ) \}_{i=1, j=1}^{m,\bar{n}_i}$
% 
\end{algorithmic}
% 
\label{alg:perturb_lsh}
\end{algorithm}
        
While our approach admits \emph{any} LSH algorithm, we also propose a novel~(greedy) algorithm for unsupervised learning of locality sensitive hash functions and use it for our experiments. See supplementary material for details.
        
\subsection{Theoretical Analysis}
\label{sec:theory}

As discussed above, we use locality sensitive hashing~(LSH) based regularization for learning the conditional log likelihood estimates by perturbing the inputs within the same hashcode bin. Perturbation may lead to a different distribution than the data distribution, and thus the conditional likelihood estimates derived from this distribution may be biased. We establish the conditions under which the perturbed distribution yields consistent estimates. 
% 
Let $g_{n,H}(.)$ denote the histogram distribution obtained by using $H$ locality sensitive hash functions and $n$ samples. 
% 
We will see LSH based data generation as sampling from a data-driven histogram.
% 
Using this insight, results from \citet{lugosi1996consistency} and a proof technique similar to \citet{rothfuss2019conditional}, we demonstrate consistency of our sampling approach under some regularity conditions. 
% 
\begin{theorem}
\label{thm:consistency}
Let $\frac{2^H}{n} \to 0$ and $\frac{tH\log n}{n} \to 0$ as $n \to \infty$, and let the input space, $\mb{y} \in \mathbb R^{t}$, be bounded. Consider any function $f: \mb{y} \rightarrow (0, \infty)$ with $\log f$ having finite second-order moment w.r.t. $p$ and $g_{n,H}$. Then,  
% 
\begin{align}
\lim_{n\rightarrow \infty}
\left| 
\mathbb E_p \left[-\log f(\mb y)\right]
-
\mathbb E_{g_{n,H}} \left[-\log f(\mb y )\right] 
\right| = 0.
% \nonumber
\end{align}
\end{theorem}
% 
The above result establishes that the perturbed distribution will yield the same estimates in expectation as $n$ becomes large.
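
As an illustration of the growth conditions in Thm.~\ref{thm:consistency} (the schedule below is a hypothetical example, not a setting used in the experiments), take $H = \lfloor \frac{1}{2} \log_2 n \rfloor$ with the lag $t$ held fixed. Then
%
\begin{align}
\frac{2^H}{n} &\leq \frac{\sqrt{n}}{n} = \frac{1}{\sqrt{n}} \to 0,
\nonumber
\\
\frac{tH\log n}{n} &\leq \frac{t \, \log_2 n \, \log n}{2n} \to 0,
\nonumber
\end{align}
%
so both conditions hold. Intuitively, the number of bins, $2^H$, must grow more slowly than $n$, so that the average number of samples per bin keeps increasing.
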
% 
% We provide a formal proof for this in Appendix~\ref{subsec:consistency}.
% 
As a corollary, we can establish the consistency of our \te estimator. We can write the \te under the model $q$ as
% 
\begin{align}
\mathcal T^q_{X\rightarrow Y} = 
\mathbb E_p 
\log 
\frac
{q(y_t | \mb{y}_{t-1}, \mb{x}_{t-1})} 
{q(y_t | \mb{y}_{t-1})},
% \nonumber
\end{align} 
% 
and let the \te estimated using $g_{n,H}$ be 
% 
\begin{align}
\mathcal {T}^{q,g}_{X\rightarrow Y} = \mathbb E_g \log 
\frac
{q(y_t | \mb{y}_{t-1}, \mb{x}_{t-1})} 
{q(y_t | \mb{y}_{t-1})}.
% \nonumber
\end{align} 
% 
\begin{corollary}
% 
Suppose the conditions in Thm.~\ref{thm:consistency} hold and the model distribution satisfies $q>0$ everywhere. Then, letting $f(\mb{y}_{t}, \mb{x}_{t-1})= \frac
        {q(y_t | \mb{y}_{t-1}, \mb{x}_{t-1})} 
        {q(y_t | \mb{y}_{t-1})}$, we have 
% 
\begin{align}
\lim_{n\rightarrow \infty} \left | 
    \mathcal T^q_{X\rightarrow Y} 
    - 
    \mathcal {T}^{q,g}_{X\rightarrow Y} 
\right| = 0.  
% \nonumber
\end{align}
% 
\end{corollary}
% 
\textbf{Remark:} To find a good discriminator $q$, we optimize the LSH regularized MLE objectives, i.e.,
$\min_q - \mathbb E_{g_{n,H}} \log q(y_t| . ).$ As $n$ becomes large, this is the same as computing the expectation over the data distribution $p$ due to Thm.~\ref{thm:consistency}, 
% 
$\min_q - \mathbb E_{p} \log q(y_t| . ).$
% 
If the function class is expressive enough, such as a neural network for which the universal approximation theorem holds, the optimal discriminator would correspond to the correct conditional distribution derived from population distribution~\citep{lu2020universal}.
    
% \todo{In the remark under Corollary 1, why  enters  as a variable? In Eq. 9, why the same symbol  for two different estimators?}
    
The above results characterize the distribution of perturbed samples and the behaviour of estimates under that distribution. Another aspect of our approach is that it relies on finite sample size, and thus, we next characterize sample complexity of our estimator to obtain high-confidence estimates.
% 
\begin{theorem}
\label{thm:confidence}
% 
Consider a data distribution $p$ and a conditional model distribution $q$ such that $-\log q(y_t|.) \in [-Q, Q]$. Let ${\hat \cT}^q_{X\rightarrow Y}$ denote the $n$-sample estimate of transfer entropy. Then, with probability at least $1-\delta$ ($\delta>0$), we have
% 
\begin{align}
| 
{ \mathcal {\hat T}^q_{X\rightarrow Y}} - \mathcal T^q_{X\rightarrow Y}  
| 
\leq 
2 Q \sqrt {(2/n) \ln(4 / \delta)}.
% \nonumber
\end{align}
% 
\end{theorem} 
% 
As a consequence of the above result, we can bound the error variance as below: 
% 
\begin{align}
\mathbb E( { \mathcal {\hat T}^q_{X\rightarrow Y}} - \mathcal T^q_{X\rightarrow Y}  )^2 
\leq 
4Q^2
\left(
(1-\delta) \frac{2}{n} \ln \frac{4}{\delta} + \delta
\right).
% \nonumber
\end{align}
% 
The first term is the dominant term in the above expression, and thus the variance of the estimator reduces at the rate of $\frac{1}{n}$. 
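
To give some intuition for the form of the bound in Thm.~\ref{thm:confidence}, the following sketch assumes i.i.d.\ samples and uses Hoeffding's inequality; the symbols $\hat{A}$, $\hat{B}$, $A$, $B$, and $\epsilon$ are introduced here only for illustration, and the argument is not meant to replace the full proof. Let $\hat{A}$ and $\hat{B}$ be the $n$-sample averages of $-\log q(y_t|\mb{y}_{t-1})$ and $-\log q(y_t|\mb{y}_{t-1}, \mb{x}_{t-1})$, with expectations $A$ and $B$ under $p$, so that ${\hat \cT}^q_{X\rightarrow Y} = \hat{A} - \hat{B}$ and $\cT^q_{X\rightarrow Y} = A - B$. Since each summand lies in $[-Q, Q]$,
%
\begin{align}
\Pr\left[ |\hat{A} - A| \geq \epsilon \right]
\leq
2 \exp\left( - \frac{2 n \epsilon^2}{(2Q)^2} \right)
=
2 \exp\left( - \frac{n \epsilon^2}{2 Q^2} \right),
\nonumber
\end{align}
%
and the same holds for $\hat{B}$. Setting the right-hand side to $\delta/2$ gives $\epsilon = Q \sqrt{(2/n) \ln(4/\delta)}$. By a union bound, both deviations are at most $\epsilon$ with probability at least $1-\delta$, in which case
%
\begin{align}
| {\hat \cT}^q_{X\rightarrow Y} - \cT^q_{X\rightarrow Y} |
\leq
|\hat{A} - A| + |\hat{B} - B|
\leq
2 Q \sqrt{(2/n) \ln(4/\delta)}.
\nonumber
\end{align}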
    
% See Appendix for the proofs.
    
% \subsection{Practical Considerations}
% 
% \todo{on Model Mis-specification and Mis-calibration}
        
\section{Empirical Evaluation}
\label{sec:experiments}
% 
% We present empirical results to illustrate the performance of our algorithm in estimating \tes. 
% 
First, to demonstrate the efficacy of the proposed estimator when the ground truth is known, we evaluate on a synthetic dataset~(\secref{sec:syn_data}). We also perform extensive analysis on two real world examples: a neuroscience dataset~(\secref{sec:neuro_data}) and a dataset representing US stock market activity~(\secref{sec:nyse_data}).
    
\paragraph{Estimators for Comparison}

We evaluate four different baseline estimators of \te from the literature: 
(i) kNN estimator:~\emph{kNN};
(ii) Conditional kernel density estimation:~\emph{CKDE};
(iii) Copula entropy based estimator of \cite{ma2019estimating}, referred to as \emph{Copent};
(iv) Conditional mutual information based estimator of \cite{zhang2019itene}, referred to as \emph{ITENE}.
% 
To estimate \te in terms of conditional entropies directly using a discriminative model, we employ two neural models, Temporal Convolutional Networks~(TCN) and \emph{N-Beats}, as well as Gradient Boosted Machines~(GBM). The prefix LSH indicates a perturbation model based on locality sensitive hashing~(LSH): LSH-RC imposes a regularization penalty for inconsistent model outputs, while LSH-A involves data augmentation. The GBM model uses LSH-RC for tuning a large set of hyper-parameters~(GBM-LSH-RC*). For neural models, we use LSH-A~(TCN-LSH-A* \& NBeats-LSH-A*). For the baseline of Gaussian noise as a perturbation model, we likewise have GN-RC and GN-A. If no perturbation model is used, it is referred to as ``No Reg.'', another baseline. Standard regularization techniques like weight decay and dropout are used for all models, including ``No Reg.''. For continuous time series, the models are used as regressors trained with mean squared error loss, and for discrete-valued time series as classifiers with cross entropy loss. Each model has its own strength depending on the nature of the data, so in some cases, we present results for the best of the three discriminative models~(GBM, TCN, NBeats) accordingly. 
    
\paragraph{Parameter Settings}
% 
In regards to tuning a discriminator~(timeseries forecaster), hyperparameters are tuned for $q(y_t|\mb{y}_{t-1})$ alone, which is then used for $q(y_t|\mb{y}_{t-1}, \mb{x}_{t-1})$ as well. For instance, if we want to compute transfer entropy from every timeseries $j$ to a given timeseries $i$, we tune the hyperparameters of the discriminator only once, just using timeseries $i$.
% 
In the perturbation model, we generate new samples numbering 10 times the original number of samples, i.e. $b=10$ in \algoref{alg:perturb_lsh}~(for LSH-A, $b=3$). For sampling from the Dirichlet distribution in \algoref{alg:perturb_lsh}, $\alpha=0.1$. The number of hash functions for LSH is $H=15$. These parameters do not require fine-tuning, so they were set manually.
% 
We explored various values of $k$ for the kNN estimator; $k = 1, 3, 5$ performed equally well across all experiments, relative to higher values of $k$.

% \todo{how to tune the discriminator?}

% \todo{mention value of $k$ in kNN estimator.}

% \todo{How to make the estimator work from a practical perspective? For instance, how to select the number of hash functions, the type of hash function, and the type of neural network?}
    
\begin{figure}[!t]
% 
\centering
% 
\includegraphics[width=0.98\columnwidth]{legend.png}
% 
\subfigure[Noisy Dependency, t=10]{
\includegraphics[
width=0.5\columnwidth]{te_synthetic_t50_k10_n3000.png}
\label{fig:expr_syn_50d_t10_noisy}
}
% 
% 
\subfigure[Noisy Dependency, t=20]{
\includegraphics[
width=0.435\columnwidth]{te_synthetic_t50_k20_n3000.png}
\label{fig:expr_syn_50d_t20_noisy}
}
% 
\subfigure[t=5]{
\includegraphics[
width=0.5\columnwidth]{te_synthetic_t50_k5_n3000_all_dependent.png}
\label{fig:expr_syn_50d_t5}
}
% 
% 
\subfigure[t=10]{
\includegraphics[
width=0.435\columnwidth]{te_synthetic_t50_k10_n3000_all_dependent.png}
\label{fig:expr_syn_50d_t10}
}
% 
\caption{Estimates of \te are plotted w.r.t. the ground truth for a synthetic dataset; $t$ refers to the time lag for how far back we look into the past to forecast for the current timestep. 
% 
In \figref{fig:expr_syn_50d_t10_noisy} and \figref{fig:expr_syn_50d_t20_noisy}, $y_t$ has dependence w.r.t. only one of the dimensions of $\mb{x}_{t-1}$ whereas in \figref{fig:expr_syn_50d_t5} and \figref{fig:expr_syn_50d_t10}, all the dimensions of $\mb{x}_{t-1}$ have dependence w.r.t. $y_t$.
% 
The suffix ``*'' refers to the proposed TE estimator.}
\label{fig:expr_syn}
\end{figure}

\subsection{Evaluation on Synthetic Data}
\label{sec:syn_data}

\begin{figure*}[!t]
% 
\centering
%
\subfigure[NBeats-LSH-A*]{
\includegraphics[width=0.47\columnwidth]{te__nbeats_lsh.pdf}
}
% 
%
\subfigure[kNN]{
\includegraphics[width=0.47\columnwidth]{te__knn.pdf}
}
% 
%
\subfigure[Copent]{
\includegraphics[width=0.47\columnwidth]{te__copent.pdf}
}
% 
% 
\subfigure[ITENE]{
\includegraphics[width=0.47\columnwidth]{te__itene.pdf}
}
%
\caption{Estimates of \te between mouse visual areas. The matrices show the \te values $\mathcal{T}_{\text{Source}\to\text{Target}}$ with source regions along columns and target regions along rows. The brain regions are sorted by ascending hierarchical order from left to right and top to bottom. V1 can be seen as the gateway of the visual system. The result of our method in (a) reveals a brain structure that matches current knowledge of the visual system, while the others do not.}
\label{fig:expr_neuro}
\end{figure*}


Our algorithm for generating binary valued synthetic data is as follows.
First, we draw 3000 samples for the target variable, $y_t$, s.t. $p(y_t=1)=0.5$. We assume zero temporal dependencies within time series $Y$, i.e. $\cH(\cY_t|\bs{\cY}_{t-1})=\cH(\cY_t)=\log(2)$.
% 
Next, we randomly select one of the $t$ dimensions of the conditioned variable $\mb{x}_{t-1}$, denoted by $x_r$, such that it depends on the target $y_t$, and the rest of the $t-1$ dimensions are independent of $y_t$; $p(x_r=1|y_t=1)=q, p(x_r=1|y_t=0)=1-q$. This implies $\cH(\cY_t|\bs{\cX}_{t-1}, \bs{\cY}_{t-1})= - q \log q - (1-q)\log(1-q)$, which is the entropy of a biased coin with probability of heads equal to $q$. Overall, $\cT_{X \to Y} = \log(2) + q \log q + (1-q)\log(1-q)$. For $q=0$ or $q=1$, $\cT_{X \to Y}$ attains its maximum value of $\log(2)$. For $q=0.5$, that is, using an unbiased coin to sample $x_r$ given $y_t$, there is no transfer of entropy, $\cT_{X \to Y} = 0$. 
% 
We generate synthetic data for varying values of $q$, from $q=0$ to $q=0.5$. 
% 
The most important aspect of this data generation step is that only \emph{one} of the many dimensions in $\mb{x}_{t-1}$ is dependent on $y_t$, while the others are not. In \secref{sec:background}, we referred to this as \emph{noisy dependency}.
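
As a quick sanity check of these expressions, consider the following worked example with natural logarithms; the shorthand $\cH_b$ for the binary entropy is introduced here only for illustration. With $p(y_t=1)=0.5$, Bayes' rule gives $p(y_t=1|x_r=1) = \frac{0.5\,q}{0.5\,q + 0.5\,(1-q)} = q$, so the remaining uncertainty about $y_t$ is the binary entropy $\cH_b(q) = - q \log q - (1-q)\log(1-q)$, and the transfer entropy is $\log(2) - \cH_b(q)$. For example,
%
\begin{align}
q = 0.1: \quad
&\cH_b(0.1) \approx 0.325,
\quad
\cT_{X \to Y} \approx 0.693 - 0.325 = 0.368 \text{ nats},
\nonumber
\\
q = 0.5: \quad
&\cH_b(0.5) = \log(2) \approx 0.693,
\quad
\cT_{X \to Y} = 0.
\nonumber
\end{align}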
    
In \figref{fig:expr_syn}, we present experimental results for the estimation of \te in the synthetic dataset.\footnote{
The Copent estimator is excluded from the figure since its estimates are far outside the range of the true TE.} Here, we only use the GBM model as a discriminator. The figures differ by the dimension of the conditioned variables, $\mb{x}_{t-1}$ or $\mb{y}_{t-1}$~($t=20$, $t=10$, $t=5$), i.e. how far back we look into the past for forecasting time series $Y$.
    
In \figref{fig:expr_syn_50d_t10_noisy} and \figref{fig:expr_syn_50d_t20_noisy}, the kNN-based estimator severely overestimates \tes, and is almost agnostic to the dependency of $x_r$ on $y_t$, driven significantly by the noisy signal from all the other $2t-1$ dimensions. Its error reduces when the dependency between $x_r$ and $y_t$ approaches the highest value. Despite the popularity of the estimator, the results are unsurprising for the aforementioned technical reasons. CKDE estimates correlate with the ground truth values of \tes, but with a significant error bias. ITENE obtains the \te estimates with very low error bias.
% 
% \footnote{We observed that variance of the \te estimates from ITENE increases if we were to consider longer time lag~(say, $t=100$).}
% 
Our approach of LSH-RC* also has very low estimation errors, although there are a few instances where the error is high. One challenge is to optimize the trade-off between the log likelihood objective and the regularization term. In contrast, the baseline approaches of directly using a discriminator without regularization~(No Reg.), and the Gaussian noise based data augmentation are both completely unsuccessful. Regularization using a Gaussian noise based data perturbation model~(GN-RC) seems to work for the smaller time lag of $t=10$.
        
Besides the above settings of noisy dependency, in \figref{fig:expr_syn_50d_t5} and \figref{fig:expr_syn_50d_t10}, we present results for the case of $y_t$ being dependent w.r.t. all the dimensions of $\mb{x}_{t-1}$. In this setting, while our model obtains the best estimates, the baseline models perform relatively better than in the former setting.
            
It is worth noting that the above described process of generating synthetic data is not artificial, but recreates highly noisy temporal dependencies observed between and within time series from domains such as neuroscience and finance.

\subsection{Evaluation on Neuroscience Data}
\label{sec:neuro_data}
% 

Next, we applied the method to a neuroscience dataset, Allen Brain Observatory--Visual Coding Neuropixels~\citep{siegle2021survey}. Here, transfer entropy offers an information-theoretic metric for discovering the structure of the mouse brain, and we verify whether the results agree with current findings about the visual system.
Both anatomical and functional studies have shown that the brain's visual system is hierarchically organized and that visual information propagates across the cortical areas in the corresponding order~\citep{siegle2021survey, harris2019hierarchical}. During early visual processing, the activities of low-order regions are expected to drive those of high-order regions.
    
Fig. \ref{fig:expr_neuro} presents the results, showing estimated TEs between regions. The columns indicate the source regions and the rows indicate the target regions.
The hierarchical order of the brain regions, from low to high, is V1, LM, RL, AL, and AM \citep{harris2019hierarchical, siegle2021survey}; the rows and columns are sorted in this order. 
A larger value indicates that the source region contributes more significantly to the target region's entropy, which indicates the direction of information flow and that the source region has an impact on the activity of the downstream target region. 
In Fig. \ref{fig:expr_neuro}(a), all large values concentrate in the bottom left triangle, which means that the low-order regions impact the high-order regions; this conclusion agrees with the hierarchical order found by other anatomical or functional methods.
In contrast, the other methods in Fig. \ref{fig:expr_neuro}(b), (c), and (d) do not properly reveal the hierarchical relationships among the visual areas; in particular, they present many large positive TE values in the top right triangle of the matrix. For example, these methods show $\mathcal{T}_{\text{AM}\to\text{V1}} > 0$, meaning AM impacts V1, which is not reasonable biophysically.
% 
% More analysis can be found in Appendix \ref{sec:appendix_neuro}.

\begin{figure*}[!t]
\centering
\includegraphics[width=2.0\columnwidth]{te_legend.png}
\subfigure[
% \csize
Liquidity: Influenced
% ~(row-wise consistency)
]{
\includegraphics[
width=0.47\columnwidth]{te_orderbook_influenced.png}
\label{fig:fig:expr_nyse_influenced}
}
% 
% 
\subfigure[
\csize
Mid-Price: Influenced
]{
\includegraphics[width=0.47\columnwidth]{te_orderbook_midprice_influenced.png}
\label{fig:fig:expr_nyse_influenced_mp}
}
% 
% 
\subfigure[
\csize
Liquidity: Influencer
]{
\includegraphics[
width=0.47\columnwidth]{te_orderbook_influencers.png}
\label{fig:fig:expr_nyse_influencers}
}
% 
% 
\subfigure[\csize
Mid-Price: Influencer
]{
\includegraphics[
width=0.47\columnwidth]{te_orderbook_midprice_influencers.png}
\label{fig:fig:expr_nyse_influencers_mp}
}
\caption{US Stocks: Average precision is computed as a measure of consistency between transfer entropy estimates from two adjacent time windows. The suffix ``*'' refers to the proposed TE estimators.}
\label{fig:expr_nyse}
\end{figure*}
    
\subsection{Evaluation on US Stock Data}
\label{sec:nyse_data}
% 
We consider the 64 most actively traded stocks in the US, and define two sets of time series for \te estimation. The first concerns the frequency of order arrivals during regular intervals: for a fixed interval of 100 milliseconds, we count the number of all orders for each stock over the course of 6 trading days. The second set of time series constitutes the observed mid-price~(MP) changes over 1 second intervals for each of the same 64 stocks, over a 14-day period. In both settings, we consider a context size of 120 timesteps to forecast the current timestep~($t=120$).
    
We use a window of 3000 timesteps~(5 minutes) for estimating transfer entropy between each pair of the 64 stocks; for the case of mid-price temporal dynamics, a window is one hour long~(3600 timesteps). We construct many such windows of identical size during the period of 6 days~(or 14 days), so that the time gap between two subsequent windows increases exponentially in time. We expect transfer entropy estimates from two windows close in time to be similar, i.e., consistent with each other, and significantly different, or inconsistent, as they are further apart in time (due to the inherent heteroskedasticity of competitive markets).
    
In this context, transfer entropy can be represented as a sequence of 64-by-64 matrices. 
We use two criteria for such an evaluation of consistency versus inconsistency. For a given security, we evaluate if the top 8 securities influenced by it are consistent between two adjacent windows~(row-wise consistency). Denoting transfer entropy matrices from two adjacent windows as $\bs{T}^{(1)}$ and $\bs{T}^{(2)}$, we set the top 8 values in each row of $\bs{T}^{(1)}$ as positive labels and the rest as negative labels, and then we use $\bs{T}^{(2)}$ as scores w.r.t. the labels from $\bs{T}^{(1)}$, so as to compute the Average Precision~(AP) of the top 8 influenced securities. We expect this score to drop as we increase the time gap between two adjacent windows, as the list of top securities influenced by a security should change as market conditions evolve. 
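
For concreteness, one standard definition of average precision is sketched below; which exact variant is used is an assumption here, and the symbols $N$, $\mathrm{rel}(k)$, and $P(k)$ are introduced only for illustration. For a given row, rank its $N$ entries of $\bs{T}^{(2)}$ in decreasing order, let $\mathrm{rel}(k) \in \{0, 1\}$ indicate whether the security at rank $k$ is among the top 8 of the same row in $\bs{T}^{(1)}$, and let $P(k)$ be the fraction of positives among the top $k$ ranked securities. Then
%
\begin{align}
\mathrm{AP} = \frac{1}{8} \sum_{k=1}^{N} P(k) \, \mathrm{rel}(k).
\nonumber
\end{align}
%
Under this definition, $\mathrm{AP} = 1$ when the top 8 securities under $\bs{T}^{(2)}$ coincide with those under $\bs{T}^{(1)}$, and the score decreases as the positives are pushed further down the ranking.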
    
In \figref{fig:fig:expr_nyse_influenced} and \figref{fig:fig:expr_nyse_influenced_mp}, we present the results for this evaluation criterion. For a slide size of up to 1000 timesteps between two adjacent windows, we expect a very high AP score~(high consistency), since it is reasonably small compared to the window size of 3000 timesteps~(or 3600 timesteps). As we increase the slide size to a large multiple of the window size, average precision should drop, and then it may remain small or decrease further. NBeats-LSH-A* outperforms all other estimators, following the expected pattern of consistency for the criterion explained above.
% 
(We exclude CKDE from this evaluation since it doesn't scale to high dimensions, and performs poorly on the synthetic datasets.)
% 
Estimates with no regularization or with Gaussian noise based perturbation~(GN-RC \& GN-A) lead to low consistency regardless of the slide size; in \figref{fig:fig:expr_nyse_influenced}, GN-A seems to provide high consistency for all the slide sizes, which is not a desired pattern of consistency either. The consistency scores for the kNN estimator remain high for small slide sizes but then drop sharply, although sometimes it provides high consistency even at large slide sizes. Copent and ITENE estimators have consistency scores almost constant w.r.t. slide size, indicating their vulnerability to noise. The choice of the underlying discriminator in our estimation approach depends upon the phenomenon of interest; for instance, GBM-LSH-RC* performs well for modeling the order activity (discrete valued time series) whereas TCN-LSH-A* is better suited for modeling the temporal dynamics of mid-prices of stocks~(continuous valued time series).
    
Another criterion for evaluation is to see if the top 8 securities influencing a given security are consistent between two adjacent windows. Similar to the previous exercise, we compute average precision, but column-wise instead. Results shown in \figref{fig:fig:expr_nyse_influencers} and \figref{fig:fig:expr_nyse_influencers_mp} exhibit similar patterns of superiority of our proposed estimator w.r.t. the baselines.
    
Overall, the experimental results suggest that transfer entropy estimation can be unreliable when dealing with long ranged and noisy temporal dependencies, as observed in real world domains like finance and neuroscience. Our proposed estimator, regularized using an LSH based perturbation model, shows robustness in the selected experiments.

% \todo{lesser sample size, and may be more dependencies}

% \todo{numerical results showing how the estimator works for varying sample sizes in comparison to the other baselines and also for dependencies with and without the long-range property}
        
\section{Conclusions}
% 
We established that empirical estimation of transfer entropy between time series is a challenging problem, especially if temporal dependencies are long ranged and noisy. Such noise is common in domains such as finance and neuroscience, though difficult to characterize. We explained theoretical reasons why well known estimators are prone to such noise, and proposed a novel method: a discriminator regularized using a perturbation model based on locality sensitive hashing. We proved consistency of the estimator, and that its variance decreases at the rate of $1/n$ in the sample size $n$. It is also shown to be effective empirically in synthetic as well as real world settings from the neuroscience and finance domains.

% \clearpage

{
\bibliography{references}
}

% \appendix

% \input{proofs}

% \input{neuro_sec}

\end{document}
