

\subsection{MURGS for model selection}\label{sec:murgs}

Having obtained a causal order \(\pi\), we construct a super-DAG \(\mathcal{G}^{\pi}\) in which every node receives all of its predecessors in the causal order as parents.
Formally, the set of parents of node \(j\in [p]\) in \(\mathcal{G}^{\pi}\) is \(pa_{\pi}(j) = \{g \in [p]: \pi(g) < \pi(j)\}\). The goal of \textsc{Phase II} is to find a reasonably sparse super-DAG \(\mathcal{G}\) of \(\mathcal{G}_0\) with \(\mathcal{G} \subseteq \mathcal{G}^{{\pi}}\). We advocate using sparse model selection techniques for this step. To that end, we tailor multi-task sparse additive models to the given setting. The resulting model class is of independent interest, as it generalizes the sparse group lasso to the multi-task setting. In this paper, we use a common design matrix across all tasks, which renders the models multi-response in nature.
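
As a small illustration with hypothetical values, take \(p = 3\) and a causal order with \(\pi(2) = 1\), \(\pi(3) = 2\) and \(\pi(1) = 3\). The definition above then yields
\begin{equation*}
  pa_{\pi}(2) = \emptyset, \qquad pa_{\pi}(3) = \{2\}, \qquad pa_{\pi}(1) = \{2, 3\},
\end{equation*}
so that \(\mathcal{G}^{\pi}\) contains the edges \(2 \to 3\), \(2 \to 1\) and \(3 \to 1\). \textsc{Phase II} can only delete edges from this set; it never adds or reverses one.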

In the original RESIT, \citet{Peters2014} propose to iteratively remove nodes from the potential parent set by greedily cycling through two steps: first, remove a potential parent node from the regression set; second, test whether the residuals are still independent, and restore the node if they are not. Thus, the significance level of the test acts as a tuning parameter for the model selection procedure. As pointed out by \citet{Peters2014}, such a procedure strongly depends on the order in which the independence tests are carried out. Moreover, type-I errors lead to extraneous edges in the final DAG estimate.
Instead, we use feature selection via sparse additive models to prune edges in
\({\mathcal{G}}^{\pi}\).

\begin{algorithm}
  \caption{GroupRESIT algorithm \textsc{Phase II}}
  \SetKwInOut{Input}{Input}
  \SetKwInOut{Output}{Output}
  \SetKwInOut{Initialize}{Initialize}
  \Input{\(\pi\)}
  \Output{Learned DAG \({\mathcal{G}}\)}
  \Initialize{\(pa_\pi\).}
  \For{\(j \in \pi\)}{
    \begin{itemize}
      \item Use MURGS and regress \(\mathbf{X}_j\) on \(\{\mathbf{X}_g\}_{g \in pa_\pi(j)}\)
      \item Obtain the parent set \(pa_{\mathcal{G}}(j) \coloneqq \{g\in pa_\pi(j): f^{(k)}_{j,g,h} \neq 0 \text{ for some } k \in [d_j],\, h \in [d_g]\}\)
    \end{itemize}
  }
  \label{alg:gRESIT_phase2}
\end{algorithm}

More explicitly, we are interested in the best sparse approximation
of the regression function \(\mathbb{E}[\mathbf{X}_{j} \mid \mathbf{X}_{pa_{\mathcal{G}^{\pi}}(j)} = \mathbf{x}]\) of the form
\begin{equation*}
  \mathbb{E}[{X}_{k}^{(j)} \mid \mathbf{X}_{pa_{\mathcal{G}^{\pi}}(j)} = \mathbf{x}] \approx \sum_{g \in pa_{\mathcal{G}^{\pi}}(j)} \sum_{h \in [d_g]} f_{j,g,h}^{(k)}(x^{(g)}_h),
\end{equation*}
for \(k \in [d_j]\). The first index \(j\) identifies the node whose incoming edges are subject to pruning. The second index \(g\) corresponds to the parent groups, and the third index \(h\) to the individual entries within each parent group. Finally, the superscript \((k)\) selects the respective entry within the response group. To facilitate feature selection, we introduce a regularization functional that encourages a common sparsity pattern shared by both predictor and response group elements.

\begin{figure}
  \centering
  \input{./Figures/notation.tex}
  \caption{Overview of MURGS notation.}
  \label{fig:notation}
\end{figure}

Provided we have found such a functional, we remove spurious edges among the possible parents \(g \in pa_{\mathcal{G}^{\pi}}(j)\) to obtain the set of relevant edges
\begin{equation*}
  {E}^{\pi} = \{(g,j): f_{j,g,h}^{(k)} \neq 0 \text{ for some } k \in [d_j] \text{ and } h\in[d_g]\}.
\end{equation*}
Algorithm~\ref{alg:gRESIT_phase2} summarizes the procedure.

\subsection{Simultaneous Sparse Backfitting}

Before stating the problem more formally, we introduce some relevant notation.
If a random variable \(Z\) has distribution \(P_Z\) and \(f\) is a function of \(z\), we
denote its squared \(L_2(P_Z)\) norm by \(\norm{f}^2 \coloneqq \int_\mathcal{Z} f^2(z)\, dP_Z(z) = \mathbb{E}[f^2(Z)]\).
For an \(n\)-dimensional vector \(\mathbf{\nu} = (\nu_1,
\ldots, \nu_n)\in\mathbb{R}^n\) we define \(\norm{\mathbf{\nu}}_2^2 = \frac{1}{n} \sum_{i=1}^n \nu_i^2\) and
\(\norm{\mathbf{\nu}}_\infty = \max_{i\in[n]} \abs{\nu_i}\). Consider the \(p\)-dimensional random vector
\(\mathbf{Z} = (Z_1, \ldots, Z_p)^T\). Denote by \(\mathcal{H}_i\), \(i\in[p]\), the Hilbert subspace
of \(L_2(P_{Z_i})\) consisting of \(P_{Z_i}\)-measurable functions \(f_i(z_i)\) of the scalar variable \(Z_i\)
with \(\mathbb{E}[f_i(Z_i)] = 0\). The space \(\mathcal{H}_i\) is equipped with the inner product
\(\langle f_i,g_i \rangle = \mathbb{E}[f_i(Z_i)\cdot g_i(Z_i)]\). Whenever a quantity is estimated from a finite sample of size \(n\), we denote this estimate with a hat.
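
To make the normalization of the vector norms concrete, consider the purely illustrative vector \(\mathbf{\nu} = (3, 4)^T\) with \(n = 2\). Then
\begin{equation*}
  \norm{\mathbf{\nu}}_2^2 = \frac{1}{2}\left(3^2 + 4^2\right) = 12.5
  \qquad \text{and} \qquad
  \norm{\mathbf{\nu}}_\infty = \max\{3, 4\} = 4.
\end{equation*}
In particular, \(\norm{\cdot}_2\) as defined here differs from the usual Euclidean norm by the factor \(1/\sqrt{n}\), which makes it the empirical counterpart of the \(L_2(P_Z)\) norm.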

Furthermore, to ease notation in what follows, we take a closer look at node \(j \in [p]\) and its parents \(pa_j \coloneqq pa_{\mathcal{G}^{\pi}}(j)\). We define \(Y^{(k)} \coloneqq X_{k}^{(j)}\) such that we may drop the corresponding subscripts, i.e. \(f_{g,h}^{(k)} \coloneqq f_{j,g,h}^{(k)}\). Further, we define \(f^{(k)}(x) = \sum_{g \in pa_j}\sum_{h \in [d_g]}
f_{g,h}^{(k)}(x^{(g)}_h)\) for \(k\in [d_j]\). Denote by \(\mathcal{L}_{f^{(k)}}(x,y) = (y^{(k)} - f^{(k)}(x))^2\) the quadratic loss.
As shorthand, we often write \(f_{g,h}^{(k)} \coloneqq f_{g,h}^{(k)}(X^{(g)}_h)\). For group \(g \in pa_j\), we denote by \(\mathbf{f}_g^{(k)} = (f_{g,1}^{(k)}, \ldots, f_{g,d_g}^{(k)})^T\) the vector of group component functions and let \(\norm{\mathbf{f}_g^{(k)}} = \sqrt{\sum_{h=1}^{d_g} \norm{f_{g,h}^{(k)}}^2 }\). Figure~\ref{fig:notation} visualizes the notation.

The main objective is to perform feature selection among groups of variables. To this end, we formulate a regularization scheme that encourages joint functional sparsity: the component functions are allowed to vary among response group members and predictor groups, but they share a common sparsity pattern. To achieve this, consider the following regularization functional
\begin{equation*}
  \Phi^j(f) = \sum_{g \in pa_j} \sqrt{d_g} \max_{k \in [d_j]} \norm{\mathbf{f}_g^{(k)}}.
\end{equation*}
The functional \(\Phi^j(f)\) combines sum-of-sup-norms regularization with the
functional version of the \(\ell_1 / \ell_2\) norm. As in the group lasso
\citep{Yuan2006}, the \(\ell_1 / \ell_2\) norm induces sparsity at the group level. In turn, the
sup-norm penalty couples the \(d_j\) response group components so that they share this sparsity pattern. A group of
component functions across \(k \in [d_j]\) is removed if and only if all involved smooth functions are
estimated to be zero. On the other hand, if a component function group \(g\) is important with a
positive sup-norm for some response \(k\), no additional penalty is imposed on the \(d_j-1\) remaining
ones. For feature selection purposes, this is desirable as all function groups remain in the model
as long as the sup-norm is positive. This is in contrast to imposing the sum of \(\ell_2\) norms
across the \(d_j\) responses which is often applied in multi-task settings \citep[see
e.g.][]{Argyriou2006, Liu2009, Wang2011, Li2020}.
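
The difference between the two penalties can be seen in a small example with made-up numbers. Suppose there is a single candidate parent group \(g\) with \(d_g = 4\), and \(d_j = 2\) responses with \(\norm{\mathbf{f}_g^{(1)}} = 3\) and \(\norm{\mathbf{f}_g^{(2)}} = 1\). Then
\begin{equation*}
  \Phi^j(f) = \sqrt{4}\,\max\{3, 1\} = 6,
  \qquad \text{whereas} \qquad
  \sqrt{4}\,(3 + 1) = 8
\end{equation*}
would be the value of a sum-of-norms penalty across responses. Moreover, \(\Phi^j(f)\) stays at \(6\) for any value of \(\norm{\mathbf{f}_g^{(2)}}\) in \([0, 3]\): once the group is active for one response, the remaining responses incur no additional penalty as long as their norms do not exceed the maximum.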

If the response is a scalar random variable, MURGS reduces to the group sparse additive model treated in~\citet{Yin2012}. In turn, if all groups \(g \in pa_j\) are singletons, the model reduces to the sparse additive model proposed by~\citet{Liu2007,Ravikumar2009}. If only the response variable is vector-valued and the remaining groups are singletons, the model reduces to the multi-response sparse additive model discussed by~\citet{Liu2008}.

MURGS can be cast as a penalized \(M\)-estimator~\citep{Negahban2012} through the following optimization problem
\begin{equation*}\label{eq:m_estimator}
  \hat{\mathbf{f}} = \argmin_{\mathbf{f} : f_{g,h}^{(k)} \in \mathcal{H}_{g,h}}
  \Bigg\{ \frac{1}{2n} \sum_{\substack{k\in[d_j], \\ i\in[n]}}\mathcal{L}_{f^{(k)}}(\mathbf{x}_i, y_i^{(k)})
  + \lambda \Phi^j(f) \Bigg\}
\end{equation*}
where \(\lambda > 0\) is a regularization parameter and \(\mathcal{H}_{g,h}\) denotes the Hilbert space of centered functions of \(X_h^{(g)}\), defined analogously to \(\mathcal{H}_i\) above.

\begin{algorithm}
  \caption{SoftThresholding update for group \(g\)}\label{alg:soft_thresholding}
  \SetKwInOut{Input}{Input}
  \SetKwInOut{Output}{Output}
  \Input{Partial residual \(\hat{R}_g^{(k)}\) for \(k \in [d_j]\), smoother matrices \(\{S_{g,h} : h\in [d_g]\}\), and tuning parameter \(\lambda\).}
  \Output{\(\mathbf{\hat{f}}_g^{(k)}\) = \((\hat{f}_{g,h}^{(k)})_{h \in [d_g]}\) for \(k \in [d_j]\).}
  Estimate \(P_h R_g^{(k)}\) by smoothing: \(\hat{P}_h^{(k)} = S_{g,h}\hat{R}_g^{(k)}\). \\
  Estimate \(s_g^{(k)} = \norm{\mathbf{Q}R_g^{(k)}}\) by:
  \(\hat{s}_g^{(k)} = \Big({1}/{n} \sum_{h \in [d_g]} \norm{\hat{P}_h^{(k)}}^2\Big)^{1/2} \). \\
  \eIf{\(\sum_{k \in [d_j]} \hat{s}_g^{(k)} \leq \sqrt{d_g}\lambda \)}
  {Set \(\mathbf{\hat{f}}_g^{(k)} = 0\) for all \(k \in [d_j]\).}{Order the indices according to
    \(\hat{s}_g^{(k_1)} \geq \hat{s}_g^{(k_2)} \geq \cdots \geq \hat{s}_g^{(k_{d_j})}\). \\
    Set \(m^* = \argmax_{m \in [d_j]} \frac{1}{m} \left( \sum_{l = 1}^{m} \hat{s}_g^{(k_{l})} - \sqrt{d_g}\lambda \right)\). \\
    \begin{equation*}
      \hat{f}_{g,h}^{(k_i)} =
      \begin{cases}
        \hat{P}_h^{(k_i)} & \text{for } i > m^*, \\
        \dfrac{1}{m^*}\Bigg[ \sum_{l = 1}^{m^*} \hat{s}_g^{(k_{l})} - \sqrt{d_g}\lambda \Bigg]_{+} \dfrac{\hat{P}_h^{(k_i)}}{\hat{s}_g^{(k_{i})}} & \text{otherwise.}
      \end{cases}
    \end{equation*}
  }
  Center \(\hat{f}_{g,h}^{(k)}\) by subtracting its mean.
\end{algorithm}
\paragraph{Block-Coordinate Descent Algorithm}

In order to solve the optimization problem above, we employ a block-coordinate descent algorithm~\citep[see e.g.][]{Hastie2015}.
First, we derive the population version of the estimation problem. This leads to a range of sub-problems in each iteration that can be solved by means of a soft-thresholding update. Similar solutions in multi-task scalar settings have been found in the linear case as well as for sparse additive models~\citep{Liu2009a,Liu2008}. As is common in backfitting algorithms, we obtain a finite sample version of the algorithm by replacing the conditional expectations with nonparametric smoothers.

Consider the partial residual \(R_g^{(k)} = Y^{(k)} - \sum_{g' \neq g} \sum_{h \in [d_{g'}]} f_{g',h}^{(k)}\) and hold the functions in all other groups \(g' \neq g\) fixed. Then the optimization problem on the population level reduces to
\begin{multline}\label{eq:backfitting}
  \mathbf{f}_g =
  \argmin_{\mathbf{f}_g : f_{g,h}^{(k)} \in \mathcal{H}_{g,h}} \Bigg\{ \frac{1}{2}
    \mathbb{E}\Big[ \sum_{k=1}^{d_j} \big(R_g^{(k)} - \sum_{h\in [d_g]} f_{g,h}^{(k)}\big)^2 \Big] \\
  + \lambda \sqrt{d_g} \max_{k \in [d_j]} \norm{\mathbf{f}_g^{(k)}}\Bigg\}.
\end{multline}
Now, we are ready to state the population block update.
\begin{theorem}\label{thm:backfitting_update}
  Denote by \(P_h = \mathbb{E}[\ \cdot \mid X_h^{(g)}]\) the conditional expectation operator, and let
  \(\mathbf{Q} = (P_h)_{h \in [d_g]}\) and \(s_g^{(k)} = \norm{\mathbf{Q}R_g^{(k)}}\). Assume that
  \(\mathbb{E}[f_{g,h'}^{(k)} \mid X_{h}^{(g)}] =  0\) for all \(h' \neq h\), i.e., the covariance
  among the component functions within groups is zero. Order the indices according to \(s_g^{(k_1)}
  \geq s_g^{(k_2)} \geq \cdots \geq s_g^{(k_{d_j})}\). Then the solution to Eq.~\eqref{eq:backfitting} has coordinate functions given by
  \begin{equation*}
    f_{g,h}^{(k_i)} = P_h R_g^{(k_i)}
  \end{equation*}
  if \(i > m^*\) and by
  \begin{equation*}
    f_{g,h}^{(k_i)} = \frac{1}{m^*}\Bigg[ \sum_{l=1}^{m^*} s_g^{(k_{l})} - \sqrt{d_g}\lambda \Bigg]_{+} \frac{P_h R_g^{(k_i)}}{s_g^{(k_i)}}
  \end{equation*}
  if \(i \leq m^*\). Here, \(h \in [d_g]\) and
  \begin{equation*}
    m^* = \argmax_{m \in [d_j]} \frac{1}{m} \left( \sum_{l=1}^m s_g^{(k_{l})} - \sqrt{d_g}\lambda \right),
  \end{equation*}
  with \([\,\cdot\,]_+\) denoting the positive part function.
\end{theorem}

The proof involves calculus of variations in Hilbert spaces and is given in Appendix~\ref{sec:backfitting_update}.
The zero covariance assumption is crucial to obtain a closed form update~\citep[see][]{Foygel2010}. However, for feature
selection we are primarily interested in the case where the sup-norm subdifferential evaluated at
\((\norm{\mathbf{f}_g^{(1)}}, \ldots, \norm{\mathbf{f}_g^{(d_j)}})^T\) is the zero vector. It turns
out that the condition for this case holds in general without any assumptions on the conditional
expectation. The following result makes this explicit:
\begin{proposition}\label{prop:all_zeros}
  \(\norm{\mathbf{f}_g^{(k)}} = 0\) for all \(k \in [d_j]\) if and only if \(\sum_{k=1}^{d_j} \norm{\mathbf{Q}R_g^{(k)}} \leq \lambda \sqrt{d_g}\).
\end{proposition}
Once the stationarity condition is derived, the proof of the proposition is straightforward; it can also be found in Appendix~\ref{sec:backfitting_update}.
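
To illustrate Theorem~\ref{thm:backfitting_update} and Proposition~\ref{prop:all_zeros}, consider a small example with made-up values that are not taken from our experiments: \(d_j = 3\), \(d_g = 4\) and \(\lambda = 1\), so that \(\sqrt{d_g}\lambda = 2\), together with ordered norms \((s_g^{(k_1)}, s_g^{(k_2)}, s_g^{(k_3)}) = (6, 5, 1)\). Since \(6 + 5 + 1 = 12 > 2\), the group is not set to zero. For \(m = 1, 2, 3\), the objective defining \(m^*\) evaluates to
\begin{equation*}
  \frac{6 - 2}{1} = 4, \qquad
  \frac{6 + 5 - 2}{2} = 4.5, \qquad
  \frac{6 + 5 + 1 - 2}{3} = \frac{10}{3} \approx 3.33,
\end{equation*}
hence \(m^* = 2\). The two leading responses are rescaled to the common norm \(\norm{\mathbf{f}_g^{(k_1)}} = \norm{\mathbf{f}_g^{(k_2)}} = 4.5\), whereas the third response keeps its unpenalized fit with norm \(1\). The total reduction, \((6 - 4.5) + (5 - 4.5) = 2\), equals \(\sqrt{d_g}\lambda\). If instead \(\sqrt{d_g}\lambda \geq 12\), all three function vectors would be set to zero.
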
Based on Theorem~\ref{thm:backfitting_update}, Algorithms~\ref{alg:soft_thresholding} and~\ref{alg:backfitting} detail the backfitting algorithm for MURGS in the finite sample setting.
\begin{algorithm}
  \caption{Backfitting algorithm}\label{alg:backfitting}
  \SetKwInOut{Input}{Input}
  \SetKwInOut{Output}{Output}
  \SetKwInOut{Initialize}{Initialize}
  \Input{Data \(\mathbf{X} = \{\mathbf{X}_g \in \mathbb{R}^{n\times d_g}: g \in pa_j\}\), \(Y \in \mathbb{R}^{n \times d_j}\), regularization parameter \(\lambda\).}
  \Output{Fitted functions \(\hat{\mathbf{f}} = (\hat f_{g,h}^{(k)})_{g \in pa_j, h \in [d_g], k \in
  [d_j]}\).}
  \Initialize{\(\hat{\mathbf{f}} = \mathbf{0}\); pre-compute the smoother matrices \(\{S_{g,h} \in \mathbb{R}^{n\times n} : h \in [d_g], g\in pa_j\}\).}
  % \While{\(t \leq \text{max\_iter}  \ \& \  \text{incr} > \text{tol}\)}{
  \For{\(g \in pa_j\) until convergence}{

    \begin{itemize}
      \item[(i)] Update partial residual \(\hat{R}_g^{(k)} = Y^{(k)} - \sum_{g' \neq g} \sum_{h \in [d_{g'}]} \hat{f}_{g',h}^{(k)}\)
      \item[(ii)] \(\mathbf{\hat{f}}_g^{(k)} \gets \text{SoftThresholding}(\hat{R}_g^{(k)}, S_{g,:}, \lambda)\)
    \end{itemize}

  }
\end{algorithm}
We choose the regularization parameter \(\lambda\) by the generalized cross-validation (GCV) criterion proposed by \citet{Liu2007} and adapted to the multi-response setting by \citet{Liu2008}. In the multi-response case, the GCV criterion is given by
\begin{equation*}
  \text{GCV}(\lambda) = \frac{1}{n} \sum_{i=1}^n \frac{\sum_{k=1}^{d_j}
  \mathcal{L}_{\hat{f}^{(k)}}(\mathbf{x}_i, y_i^{(k)})}{(n^2d_j^2 - (nd_j)\text{df}(\lambda))^2},
\end{equation*}
where \(\text{df}(\lambda) = d_j\sum_{g\in pa_j} \nu_g I(\sum_{k=1}^{d_j}\norm{
\hat{\mathbf{f}}_{g}^{(k)}} \neq 0)\), \(\nu_g = \sum_{h \in [d_g]} \text{tr}(S_{g,h})\), and \(\text{tr}(S_{g,h})\) is the effective degrees of freedom of the local linear smoother \(S_{g,h}\).
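
As an illustration with hypothetical numbers, let \(d_j = 2\) and \(pa_j = \{1, 2, 3\}\) with \(\nu_1 = 5\), \(\nu_2 = 8\) and \(\nu_3 = 3\). If, for a given \(\lambda\), only groups \(1\) and \(3\) have at least one nonzero fitted component function, then
\begin{equation*}
  \text{df}(\lambda) = 2\,(5 + 3) = 16.
\end{equation*}
Group \(2\) does not contribute because all of its fitted functions are zero, so a larger \(\lambda\) that removes more groups lowers \(\text{df}(\lambda)\).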
