In general, one cannot identify the true underlying DAG from observational data alone when the data are generated by the model in Definition~\ref{def:SEM}~\citep[see][for an overview]{Peters2014}. However, as we will see, restricting the right-hand side of~\eqref{eq:sem} to be additive in the noise vectors renders the model identifiable under the additional conditions stated below.

\begin{definition}\label{def:ANM}
  Let \((\mathbf{X},\mathbf{N}, \mathcal{F},\mathcal{G}_0)\) be a GSEM. If the functions in
  \(\mathcal{F}\) are additive in the noise term, i.e., if
  \[\mathbf{X}_g = f_g(\mathbf{X}_{pa(g)}) + \mathbf{N}_g,\]
  if \(\mathbf{N}_g\) has a strictly positive density with respect to the
  Lebesgue measure for all \(g\in[p]\), and if \(\mathcal{G}_0\) is acyclic,
  then \((\mathbf{X},\mathbf{N}, \mathcal{F},\mathcal{G}_0)\) is a \emph{group
  additive noise model} (GANM).
\end{definition}

In Definition~\ref{def:SEM}, the noise vectors \(\mathbf{N}_g\) and the variables \(\mathbf{X}_g\) are permitted to have different dimensions, whereas in Definition~\ref{def:ANM} their dimensions must coincide. Moreover, the graph underlying the SEM in Definition~\ref{def:SEM} is not required to be acyclic, while acyclicity is assumed in Definition~\ref{def:ANM}. Figure~\ref{fig:ganm_example} illustrates a GANM with three groups of varying sizes. The joint independence of the noise terms implicitly encodes causal sufficiency, i.e., the absence of latent variables. In addition to this assumption, the GANM framework requires the notion of causal minimality.

\begin{definition}
  Let \((\mathbf{X}, \mathbf{N}, \mathcal{F}, \mathcal{G}_0)\) be a GANM.
  We say that \(P_\mathbf{X}\) satisfies \emph{causal minimality} if every function in
  \(\mathcal{F}\) is non-constant in each of its arguments.
\end{definition}
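
\begin{remark}
  As a minimal illustration of why causal minimality is needed (the specific model is chosen here purely for exposition), consider a bivariate GANM with graph \(1 \to 2\) in which \(f_2 \equiv \mathbf{c}\) for some constant vector \(\mathbf{c} \in \mathbb{R}^{d_2}\):
  \[\mathbf{X}_1 = \mathbf{N}_1, \qquad \mathbf{X}_2 = \mathbf{c} + \mathbf{N}_2.\]
  Since \(\mathbf{N}_1 \indep \mathbf{N}_2\), we have \(\mathbf{X}_1 \indep \mathbf{X}_2\), and the same joint distribution \(P_{\mathbf{X}}\) is generated by the GANM with the empty graph and noise vectors \(\mathbf{N}_1\) and \(\mathbf{c} + \mathbf{N}_2\). The edge \(1 \to 2\) therefore cannot be recovered from \(P_{\mathbf{X}}\). Causal minimality excludes this case because \(f_2\) is constant in its argument.
\end{remark}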

Suppose that the function class \(\mathcal{F}\subseteq C^3(\R^{d_{pa(g)}},\R^{d_g})\), where \(d_{pa(g)}=\sum_{j\in pa(g)}d_j\), consists of nonlinear functions. More precisely, for each output coordinate \(k\in [d_g]\) and each input dimension \(i\in[d_{pa(g)}]\), there exists some \(\mathbf{x} \in \mathbb{R}^{d_{pa(g)}}\) such that \(\partial^2 f_k/\partial x_i^2(\mathbf{x}) \neq 0\) or \(\partial^3 f_k/\partial x_i^3(\mathbf{x}) \neq 0\). Then, under causal minimality, identifiability for the bivariate scalar case has been established by~\citet{Hoyer2009}. Consider a
bivariate GANM in which both groups have size~\(1\), i.e.,
\begin{equation}\label{eq:scalar_bivariate_anm}
  \begin{split}
    X_1 &= N_1 \\
    X_2 &= f_2(X_1) + N_2,
  \end{split}
\end{equation}
where \(X_1 \indep N_2\). Identifiability follows from observing that the regression \(\mathbb{E}[X_2 \mid
X_1] = f_2(X_1)\) along the causal direction yields residuals that are independent of the predictor
\(X_1\). For general nonlinear functions \(f_2\), the regression in the anti-causal direction does not yield residuals that are independent of \(X_2\). Indeed, \citet{Hoyer2009} derive a specific differential equation that the triple \((f_2, P_{X_1}, P_{N_2})\) needs to satisfy for the backward model to exist. A similar differential equation can be obtained in the bivariate group case.

\begin{condition}\label{cond:identifiability}
  The triple \((f_g, P_{\mathbf{X}_j}, P_{\mathbf{X}_g})\) does not satisfy the following differential equation:
  \begin{multline}\label{eq:tensor_differential_eq}
    D_{\mathbf{x}_j}\mathbf{H}_\xi(\mathbf{x}_j) \Big[(D_{\mathbf{x}_j \mathbf{x}_j}\pi_1(\mathbf{x}_j,\mathbf{x}_g))^{-1} D_{\mathbf{x}_j \mathbf{x}_g}\pi_1(\mathbf{x}_j,\mathbf{x}_g)\Big]
    \\
    \begin{aligned}
      = &D_{\mathbf{x}_j} D_{\mathbf{x}_j \mathbf{x}_g}\pi_1(\mathbf{x}_j,\mathbf{x}_g) - \Big[
        D_{\mathbf{x}_j}\big( \mathbf{H}_{f_g}(\mathbf{x}_j)[\nabla \nu(\mathbf{u})] \big) \\
        - &D_{\mathbf{x}_j}\big( \mathbf{J}_{f_g}(\mathbf{x}_j)^\top \mathbf{H}_{\nu}(\mathbf{u}) \mathbf{J}_{f_g}(\mathbf{x}_j) \big)
      \Big] \\
      \Big[(&D_{\mathbf{x}_j \mathbf{x}_j}\pi_1(\mathbf{x}_j,\mathbf{x}_g))^{-1} D_{\mathbf{x}_j \mathbf{x}_g}\pi_1(\mathbf{x}_j,\mathbf{x}_g)\Big],
    \end{aligned}
  \end{multline}
  where \(\mathbf{u} \coloneqq \mathbf{x}_g-f_g(\mathbf{x}_j)\) and further \(\nu \coloneqq \log p_{\mathbf{N}_g}\), \(\xi \coloneqq \log p_{\mathbf{X}_j}\), with arguments \(\mathbf{u}, \mathbf{x}_j\), respectively. Additionally, \(\pi_1({\mathbf{x}_j,\mathbf{x}_g}) \coloneqq \log p_{\mathbf{X}_j,\mathbf{X}_g}(\mathbf{x}_j, \mathbf{x}_g)\) and
  \begin{align*}
    D_{\mathbf{x}_j \mathbf{x}_j}\pi_1(\mathbf{x}_j,\mathbf{x}_g) &= \mathbf{H}_{\xi}(\mathbf{x}_j) - \mathbf{H}_{f_g}(\mathbf{x}_j)[\nabla \nu\bigl(\mathbf{u}\bigr)] \\
    &+ \mathbf{J}_{f_g}(\mathbf{x}_j)^\top \mathbf{H}_{\nu}\bigl(\mathbf{u}\bigr) \mathbf{J}_{f_g}(\mathbf{x}_j),
  \end{align*}
  and \(D_{\mathbf{x}_j \mathbf{x}_g}\pi_1(\mathbf{x}_j,\mathbf{x}_g) = -\mathbf{J}_{f_g}(\mathbf{x}_j)^\top \mathbf{H}_{\nu}\bigl(\mathbf{u}\bigr)\). Here, \(\mathbf{J}\) and \(\mathbf{H}\) denote Jacobian and Hessian matrices or tensors, respectively.
\end{condition}
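
\begin{remark}
  As a sanity check on the notation, consider the scalar case \(d_j = d_g = 1\) and write \(x = x_j\), \(y = x_g\), \(f = f_g\), and \(u = y - f(x)\). This shorthand is introduced here only for illustration, and we assume that \(\xi\) and \(\nu\) are twice differentiable. Since \(\mathbf{N}_g \indep \mathbf{X}_j\), the joint log-density is \(\pi_1(x,y) = \xi(x) + \nu(y - f(x))\), and differentiating gives
  \begin{align*}
    \frac{\partial \pi_1}{\partial x}(x,y) &= \xi'(x) - f'(x)\,\nu'(u), \\
    \frac{\partial^2 \pi_1}{\partial x^2}(x,y) &= \xi''(x) - f''(x)\,\nu'(u) + f'(x)^2\,\nu''(u), \\
    \frac{\partial^2 \pi_1}{\partial x\,\partial y}(x,y) &= -f'(x)\,\nu''(u).
  \end{align*}
  The last two lines are the scalar counterparts of the expressions for \(D_{\mathbf{x}_j \mathbf{x}_j}\pi_1\) and \(D_{\mathbf{x}_j \mathbf{x}_g}\pi_1\) in Condition~\ref{cond:identifiability}: the Hessian contraction \(\mathbf{H}_{f_g}[\nabla\nu]\) becomes \(f''\nu'\), and \(\mathbf{J}_{f_g}^\top \mathbf{H}_\nu \mathbf{J}_{f_g}\) becomes \(f'^2\nu''\).
\end{remark}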

\begin{remark}
  Note that when \(\mathbf{J}_{f_g}(\mathbf{x}_j)\) is full rank and \(\mathbf{H}_{\nu}(\mathbf{u})\) is positive definite, we can fully isolate the third-order derivative tensor \(D_{\mathbf{x}_j}\mathbf{H}_\xi(\mathbf{x}_j)\):
  \begin{equation*}
    \begin{split}
      D_{\mathbf{x}_j}\mathbf{H}_\xi(\mathbf{x}_j) &= D_{\mathbf{x}_j}(D_{\mathbf{x}_j\mathbf{x}_g}\pi_1)(D_{\mathbf{x}_j\mathbf{x}_g}\pi_1)^{-1}D_{\mathbf{x}_j\mathbf{x}_j}\pi_1 \\
      &- \Big[D_{\mathbf{x}_j}(\mathbf{H}_{f_g}(\mathbf{x}_j)[\nabla \nu(\mathbf{u})]) \\
      &- D_{\mathbf{x}_j}(\mathbf{J}_{f_g}(\mathbf{x}_j)^\top \mathbf{H}_{\nu}(\mathbf{u}) \mathbf{J}_{f_g}(\mathbf{x}_j))\Big],
    \end{split}
  \end{equation*}
  where we have suppressed the arguments \(\mathbf{x}_j\) and \(\mathbf{x}_g\) of \(\pi_1\). Observe
  that \(\mathbf{H}_\xi(\mathbf{x}_j)\) enters only via the term
  \(D_{\mathbf{x}_j\mathbf{x}_j}\pi_1\) on the right-hand side. This equation closely resembles the scalar form derived by~\citet{Hoyer2009}, which indicates that our result is a natural generalization of the scalar case. In general, Eq.~\eqref{eq:tensor_differential_eq} describes a directional projection of \(D_{\mathbf{x}_j}\mathbf{H}_\xi(\mathbf{x}_j)\) onto the directions defined by the columns of the matrix \((D_{\mathbf{x}_j\mathbf{x}_j}\pi_1)^{-1} D_{\mathbf{x}_j\mathbf{x}_g}\pi_1\). The dimensions \(d_j\) and \(d_g\) determine the range of the resulting tensor contraction.
  A detailed exploration of the role of the group sizes and implications for the form of the triple \((f_g, P_{\mathbf{X}_j}, P_{\mathbf{X}_g})\) is deferred to future work.
\end{remark}
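
\begin{remark}
  To see why some restriction on the triple is needed, it may help to revisit the scalar model~\eqref{eq:scalar_bivariate_anm} in the linear Gaussian case. The following parametrization is an assumption made here for illustration: let \(f_2(x) = bx\) with \(b \neq 0\), \(X_1 \sim \mathcal{N}(0,\sigma_1^2)\), and \(N_2 \sim \mathcal{N}(0,\sigma_2^2)\) with \(X_1 \indep N_2\). Then \((X_1, X_2)\) is jointly Gaussian with
  \begin{align*}
    \operatorname{Var}(X_2) &= b^2\sigma_1^2 + \sigma_2^2, &
    \operatorname{Cov}(X_1, X_2) &= b\sigma_1^2.
  \end{align*}
  Setting \(c \coloneqq b\sigma_1^2/(b^2\sigma_1^2 + \sigma_2^2)\) and \(\tilde{N}_1 \coloneqq X_1 - cX_2\), we obtain
  \[\operatorname{Cov}(\tilde{N}_1, X_2) = b\sigma_1^2 - c\,(b^2\sigma_1^2 + \sigma_2^2) = 0.\]
  Because \((\tilde{N}_1, X_2)\) is jointly Gaussian, zero covariance implies \(\tilde{N}_1 \indep X_2\). Hence \(X_1 = cX_2 + \tilde{N}_1\) is a valid additive noise model in the backward direction, and the two directions cannot be distinguished from \(P_{\mathbf{X}}\). This case is ruled out by the nonlinearity requirement on \(\mathcal{F}\), since the second and third derivatives of a linear function vanish everywhere.
\end{remark}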

\begin{definition}
  Consider a \emph{bivariate GANM} given by the equations
  \begin{equation*}
    \mathbf{X}_j = \mathbf{N}_j,\quad\mathbf{X}_g = f_g(\mathbf{X}_j) + \mathbf{N}_g,
  \end{equation*}
  for \(\{j,g\} = \{1,2\}\). If the corresponding triple \((f_g, P_{\mathbf{X}_j}, P_{\mathbf{X}_g})\) satisfies Condition~\ref{cond:identifiability}, we call this model an \emph{identifiable bivariate} GANM.
\end{definition}

\begin{theorem}\label{theorem:bivariate_identifiability}
  Let \(P_{\mathbf{X}}\) be the joint distribution of \(\mathbf{X}\) generated by an
  \emph{identifiable bivariate} GANM \((\mathbf{X}, \mathbf{N}, \mathcal{F}, \mathcal{G}_0)\)
  and suppose that causal minimality holds. Then, the graph \(\mathcal{G}_0\) is
  identifiable from \(P_{\mathbf{X}}\).
\end{theorem}

All proofs for this section are provided in Appendix~\ref{sec:proofs_A}. As demonstrated by \citet{Peters2014}, extending the analysis from two to multiple variables can be achieved by appropriately constraining the involved distributions and functions so that, locally, the problem reduces to the bivariate case.
\begin{corollary}\label{corollary:multivariate_identifiability}
  Let \((\mathbf{X}, \mathbf{N}, \mathcal{F}, \mathcal{G}_0)\) be a GANM with \(p\) groups. Suppose that for each node \(g \in [p]\), each parent \(j \in pa(g)\), and every set \(S\) satisfying
  \begin{equation*}
    pa(g) \setminus \{j\} \subseteq S \subseteq nd(g) \setminus \{g,j\},
  \end{equation*}
  there exists a realization \(\mathbf{x}_S \in \mathbb{R}^{d_S}\), where \(d_S = \sum_{s\in S} d_s\), for which the conditional distribution of \(\mathbf{X}_S\) has strictly positive density with respect to the Lebesgue measure, and the triple
  \begin{equation*}
    \left(f_g(\mathbf{x}_{pa(g)\setminus \{j\}}, \mathbf{x}_j), P_{\mathbf{X}_j,\mathbf{X}_g\mid \mathbf{X}_S=\mathbf{x}_S}, P_{\mathbf{X}_j}\right)
  \end{equation*}
  satisfies Condition~\ref{cond:identifiability}. Here, the arguments \(\mathbf{x}_{pa(g)\setminus \{j\}}\) are held fixed, making the function \(f_g(\mathbf{x}_{pa(g)\setminus \{j\}}, \mathbf{x}_j)\) depend solely on \(\mathbf{x}_j\).

  Then the graph \(\mathcal{G}_0\) is identifiable from the joint distribution \(P_{\mathbf{X}}\).
\end{corollary}

Clearly, fixing all arguments of \(f_g\) except \(\mathbf{x}_j\) also leads to a bivariate GANM. Corollary~\ref{corollary:multivariate_identifiability} indicates that this alone is not enough: we additionally need to restrict the conditional distribution of \(\mathbf{X}_j\). The following result guides the estimation strategy described in the ensuing section.

\begin{lemma}\label{lemma:ancestor_independence}
  Let \((\mathbf{X}, \mathbf{N}, \mathcal{F}, \mathcal{G}_0)\) be a GSEM. For all \(S \subseteq
  nd_{\mathcal{G}_0}(g)\), we have \(\mathbf{N}_g \indep \mathbf{X}_S\).
\end{lemma}

In particular, if \(g \in [p]\) is a sink node, i.e., a node without any descendants, its
non-descendants are all other nodes in \(\mathcal{G}_0\). Thus, its corresponding noise
vector \(\mathbf{N}_g\) is independent of all other variables \(\mathbf{X}_{[p]\setminus
\{g\}}\).
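
\begin{remark}
  As a small worked example (the chain graph below is a hypothetical choice made for illustration), consider a GANM with three groups and graph \(1 \to 2 \to 3\):
  \begin{equation*}
    \mathbf{X}_1 = \mathbf{N}_1, \qquad
    \mathbf{X}_2 = f_2(\mathbf{X}_1) + \mathbf{N}_2, \qquad
    \mathbf{X}_3 = f_3(\mathbf{X}_2) + \mathbf{N}_3.
  \end{equation*}
  Node \(3\) is the only sink, and its non-descendants are \(\{1,2\}\), so Lemma~\ref{lemma:ancestor_independence} yields \(\mathbf{N}_3 \indep (\mathbf{X}_1, \mathbf{X}_2)\). For node \(2\), the lemma only yields \(\mathbf{N}_2 \indep \mathbf{X}_1\). Substituting the second equation into the third gives
  \[\mathbf{X}_3 = f_3\bigl(f_2(\mathbf{X}_1) + \mathbf{N}_2\bigr) + \mathbf{N}_3,\]
  so \(\mathbf{X}_3\) is a function of \(\mathbf{N}_2\) and, for non-constant \(f_3\), is typically not independent of it. Independence between a node's noise vector and all remaining variables is thus a property one can expect of sink nodes but not, in general, of non-sink nodes.
\end{remark}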
