\section{Introduction}
\label{sec:introduction}

Many real-world regression problems involve modeling complex conditional distributions of an outcome $Y$ given an input vector $\mathbf{x} = (x_1,\ldots,x_U)^\top$.
In such instances, a mere prediction of the mean would be insufficient to capture the aleatoric uncertainty inherent in the data-generating process, and a comprehensive understanding requires estimating the conditional distribution $Y|\mathbf{x}$~\citep{Kneib2021}.
This becomes more challenging for multivariate outcomes $\mathbf{Y} \in \mathbb{R}^J$, both in terms of estimation and model interpretation.
While \acrfull{NF}~\citep{Papamakarios2018} models can represent such complex conditional multivariate distributions, their deep neural network architectures often hinder interpretability.
Statistical methods for flexible density estimation in high dimensions have largely focused on graphical models, often involving copulas \citep[e.g.,][]{Liu2011,Bauer2016,Nagler2016}.
A notable exception is the \gls{MCTM}, a transformation model approach~\citep{Hothorn2018} that shares similarities with \glspl{NF} and models the joint relationship similarly to copulas.
While less flexible than \glspl{NF}, \glspl{MCTM} preserve the interpretability of the feature-outcome relationship.

\begin{figure}[t!]
  \centering
  \includegraphics[width=0.7\columnwidth]{gfx/mctm_samples_moons.pdf}
  \caption{An \gls{MCTM} (orange) applied to a complex data distribution (blue).
    The model captures marginals well but fails to capture the dependence structure.}
  \label{fig:mctmfail}
\end{figure}

\paragraph{Our Contribution}
To bridge the gap between flexible black-box and rigid but interpretable statistical models, we propose a hybrid approach combining the transparency of \glspl{MCTM} with the flexibility of \glspl{NF}.
We thereby provide a method that can be applied if the understanding of feature effects on each response variable is indispensable. At the same time, our method allows complex modeling of the dependency structure of the outcome dimensions using the idea of masking from autoregressive flows.
%Our method addresses limitations of \gls{MCTM} (\Cref{fig:mctmfail}) and enhances the interpretability of \glspl{NF}.
We evaluate its effectiveness against \glspl{MCTM} and other density estimation models on simulated and real-world datasets.

\paragraph{Notation} We refer to random variables with a capital Latin letter, e.g., $Y$, and their realizations (observed or sampled) with a lowercase Latin letter, e.g.\ $y$.
The corresponding \gls{CDF} is referred to as $F_Y$ and the \gls{PDF} as $f_Y$.
For sets and vectors, we use bold-faced variables, e.g.\ $\mathbf{Y}=\{Y_1,\dots,Y_J\}$.
For individual parameters optimized during training, we use lowercase Greek letters, e.g.\ $\vartheta$, and bold-faced Greek letters for sets or vectors of parameters, e.g.\ $\bm{\vartheta}=\left(\vartheta_1,\ldots,\vartheta_M\right)^\top$.
Bold capital Greek letters indicate tuples, e.g.\ $\bm{\Theta}=\left(\bm{\theta}_1,\bm{\theta}_2,\alpha,\beta\right)$.

\section{Background and Related Work}
\label{sec:background}

Several methods have emerged for modeling complex conditional distributions.
Among these methods, transformation-based models have proven to be highly flexible and have been extensively used since their introduction by~\citet{Box1964}.

\subsection{Probability Transformation}

Transformation models are based on the \emph{probability-transformation theorem}, using a bijective and continuously differentiable \emph{transformation function} $h$ to map a complex distribution $F_Y$ to a simpler one $F_Z$ (often standard normal)~\citep{Kneib2021} via $F_Y(y) = F_Z(h(y))$. This allows density estimation without restrictive shape assumptions~\citep{Hothorn2018}.
The density $f_Y$ is calculated from the base density $f_Z$ by
\begin{equation}\label{eq:cov}
  f_Y(y) = f_Z\left(h(y)\right) \left|\det\nabla{h}(y)\right|,
\end{equation}
where the absolute Jacobian determinant $\left|\det\nabla{h}(y)\right|$ reflects the change in density induced by the transformation.
To sample from $F_Y$, we can draw samples $z$ from $F_Z$ and apply the inverse transformation $h^{-1}$ to obtain $y$.
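As a simple illustration (an example added here for intuition, with $\varphi$ denoting the standard normal density, $\mu\in\mathbb{R}$ and $\sigma>0$), consider the affine transformation $h(y) = (y-\mu)/\sigma$ together with $f_Z=\varphi$.
\Cref{eq:cov} then gives
\begin{equation*}
  f_Y(y) = \varphi\left(\frac{y-\mu}{\sigma}\right)\left|\frac{\mathrm{d}h(y)}{\mathrm{d}y}\right| = \frac{1}{\sigma}\,\varphi\left(\frac{y-\mu}{\sigma}\right),
\end{equation*}
which is the density of $\mathcal{N}(\mu,\sigma^2)$, and sampling amounts to $y = h^{-1}(z) = \mu + \sigma z$ with $z\sim\mathcal{N}(0,1)$.
More flexible choices of $h$ follow the same recipe, only with a less trivial derivative and inverse.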
%Therefore, $h$ must be bijective and continuously differentiable.

\subsection{Normalizing Flows and Transformation Models}

\glspl{NF} apply the probability-transformation theorem using a series of $K$ simple bijective transformation functions $h_k$ to form more expressive transformations $h(z)= (h_K \circ h_{K-1} \circ \dots \circ h_1)(z)$ \citep[see, e.g.,][for a comprehensive review]{Papamakarios2021}.
Other methods use a single flexible transformation like sum-of-squares polynomials~\citep{Jaini2019} or splines~\citep{Durkan2019}.
\Glspl{CTM} also focus on a single transformation function, which, however, can be easily interpreted~\citep[see, e.g.,][]{Hothorn2018}.
This idea has also recently been combined with neural networks~\citep{Baumann2021,Sick2021,Arpogaus2023,Kook2024}.

A fundamental difference between \glspl{NF} and \glspl{CTM} is the definition of the transformation direction~\citep{Kook2024}:
\glspl{NF} transform a simple base density $f_Z(z)$ into a more complex target density $f_Y(y)$ through a series of transformations.
\glspl{CTM} proceed in the opposite direction, using only a single transformation function, often based on flexible Bernstein polynomials (see~\Cref{sec:bernstein_poly} for details).

\glspl{CTM} allow interpretability through the use of \glspl{SAP}~\citep{Klein2022,Rugamer2023} (see \Cref{sec:sap,sec:interpretation_details}).
The black-box character of neural networks in \glspl{NF} hinders interpretability; to mitigate this, variants using decision trees~\citep{Papastefanopoulos2025} and graphical models~\citep{Wehenkel2021a} have been explored in recent years.

The previous concepts can be further extended to multivariate objects $\mathbf{y} = (y_1,\ldots,y_J)^\top$ and can also be applied conditional on $U$ features $\mathbf{x} = (x_1,\ldots,x_U)^\top$ as described in the following.

\subsection{Multivariate and Conditional Models}

To model a potentially multivariate conditional distribution $F_{\mathbf{Y}|\mathbf{X}=\mathbf{x}}$,
%the parameters of the transformation function $\theta_i$ have to depend on the features $\mathbf{x}\in\mathbf{X}$.
%Hence,
we define
$z_j=h(y_j, \bm{\theta}_j) \text{ with } \bm{\theta}_j=c_j(\mathbf{y},\mathbf{x})$
where $h$ is a bijective \emph{transformation function} for the element $y_j$ with parameters $\bm{\theta}_j$ obtained from the \emph{conditioner} $c_j$~\citep{Papamakarios2021}.
The latter can be an arbitrarily complex function that depends on the features $\mathbf{x}$ and, in the multivariate case, on some variables of the response $\mathbf{y}$.
The limiting factor here is the computational tractability of the Jacobian in \Cref{eq:cov}.
%Numerous computationally efficient approaches for implementing the conditioner to extend transformation-based models to multivariate data have been proposed.

\paragraph{Multivariate Conditional Transformation Models} To make computations tractable and provide a model that is easier to interpret, \glspl{MCTM} define a triangular transformation $H(\bm{y},\bm\Theta(\bm{x}))=\left(h_1(y_1|\bm{\theta}_1),\ldots,h_J(y_J|y_1,\ldots,y_{J-1},\bm{\theta}_1,\dots,\bm{\theta}_J)\right)^\top$.
This transformation is characterized by a set of linearly combined marginal basis transformations $\tilde{h}_j(y_j|\bm{\theta}_j)$, scaled with a $(J \times J)$ triangular coefficient matrix $\bm{\Lambda}$ to encode structural information: $h_j(y_j|y_1,\ldots,y_{j-1},\bm{\theta}_1,\dots,\bm{\theta}_j) = \lambda_{j,1} \tilde{h}_{1}(y_1|\bm{\theta}_1)+\ldots+\lambda_{j,j-1} \tilde{h}_{j-1}(y_{j-1}|\bm{\theta}_{j-1}) + \tilde{h}_j(y_j|\bm{\theta}_j)$.
The function $\bm\Theta: \bm{x}\mapsto(\bm\Lambda,\bm\theta_1,\ldots,\bm\theta_J)$ accounts for feature effects on the parameters~\citep{Klein2022}.
Hence,~\glspl{MCTM} can only model linear dependencies between the response variables $y_j$, making them an inadequate choice for complex joint distributions as shown in \Cref{fig:mctmfail}.
This limitation motivates the development of more flexible approaches, such as the hybrid model proposed in this paper.
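For intuition, the bivariate case $J=2$ can be written out explicitly.
The following is a sketch that assumes increasing marginal transformations $\tilde{h}_j$, a base density that factorizes over its components, and suppresses the dependence on $\mathbf{x}$.
The transformation reads $z_1 = \tilde{h}_1(y_1|\bm{\theta}_1)$ and $z_2 = \lambda_{2,1}\tilde{h}_1(y_1|\bm{\theta}_1) + \tilde{h}_2(y_2|\bm{\theta}_2)$, so the Jacobian is lower triangular with diagonal entries $\tilde{h}_1'$ and $\tilde{h}_2'$, and \Cref{eq:cov} yields
\begin{equation*}
  \log f_{\mathbf{Y}}(\mathbf{y}) = \log f_Z(z_1) + \log f_Z(z_2) + \log \tilde{h}_1'(y_1|\bm{\theta}_1) + \log \tilde{h}_2'(y_2|\bm{\theta}_2),
\end{equation*}
where $\tilde{h}_j'$ denotes the derivative with respect to $y_j$.
The coefficient $\lambda_{2,1}$ enters only through $z_2$ and does not appear in the Jacobian determinant, which illustrates that the dependence between $y_1$ and $y_2$ is linear on the transformed scale.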

\paragraph{Normalizing Flows} In the~\gls{NF} literature, various approaches are used to deal with computations involving the aforementioned Jacobian. Among these, autoregressive models are one of the most prevalent ones. Due to the autoregressive structure, these \glspl{NF} yield a triangular Jacobian and thus the determinant simplifies to the product of the Jacobian's diagonal entries~\citep{Papamakarios2021,Kobyzev2021}.
These models factorize a multivariate distribution based on the chain rule of probability. They are typically implemented by conditioning the transformation $h_j$ of the $j$th element of $\mathbf{y}$ on all previous elements $\mathbf{y}_{<j} = (y_1, \dots, y_{j - 1})$ or a subset of those.
This can be seen as a non-linear generalization of the triangular coefficient matrix $\bm{\Lambda}$ used in~\glspl{MCTM}~\citep{Kobyzev2021}.
There are two widely adopted neural-network architectures for implementing an autoregressive conditioner $c_j$:~\glspl{MAF}~\citep{Papamakarios2018} and \glspl{CF}~\citep{Dinh2017}.

\glspl{MAF} generalize this idea by implementing the conditioner with neural networks that incorporate autoregressive constraints through parameter masking, inspired by the \gls{MADE} architecture \citep{Germain2015}.
The main advantage is that all parameters can be obtained in a single forward pass of the neural network, $\left(\bm{\theta}_1,\ldots,\bm{\theta}_J\right)^\top=\left(c_1(),c_2(y_1),\ldots,c_J(\mathbf{y}_{<J})\right)^\top$. Further, if $h$ and the conditioning network are expressive enough, they remain universal approximators capable of transforming between any two distributions~\citep{Papamakarios2021}.
In practice, however, the results highly depend on the ordering of the input variables.
Because some orderings can be extremely difficult to learn, multiple~\gls{MAF} layers are often stacked with permutations in between.

% A major disadvantage is the computational cost of inverting~\glspl{MAF}.
% Mathematically equivalent is a formulation conditioning on the latent representation $\mathbf{z}$, which are known as~\gls{IAF} and allow for more efficient sampling, since the autoregressive recursion is offloaded to the training procedure~\citep{Kingma2016}.

\glspl{CF} can provide fast sampling and density evaluation by splitting the response vector $\mathbf{y}$ into two subsets
\[
  \mathbf{y}=(\underbrace{y_1,\ldots,y_d}_{\mathbf{y}_{\leq d}},\underbrace{y_{d+1},\ldots,y_{J}}_{\mathbf{y}_{> d}})^\top
\]
and then applying the transformation $h$ only to one subset conditioned on the other.
Typically, these subsets are chosen to contain half of the variables, and the result of the transformation on the first subset is then permuted and another transformation is applied to the remaining subset: $\mathbf{z}=\left(h(\mathbf{y}_{\leq d},c_1(\mathbf{y}_{> d},\mathbf{x})), h(\mathbf{y}_{> d},c_2(\mathbf{y}_{\leq d}, \mathbf{x}))\right)^\top$.

\paragraph{Graphical Transformation Models} These models are an extension of \glspl{MCTM} and are closely related to the methodology proposed in this paper.
\citet{Herp2025} replaced the Gaussian copula by parameterizing the $\lambda_{i,j}$ entries of $\bm{\Lambda}$ using B-splines, dependent on the previous variables, resulting in an additive \gls{CF}: $h_j(y_j|y_1,\ldots,y_{j-1},\bm{\theta}_1,\dots,\bm{\theta}_j) = \lambda_{j,1}\left(\tilde{h}_{1}(y_1|\bm{\theta}_1)\right)+\ldots+\lambda_{j,j-1}\left(\tilde{h}_{j-1}(y_{j-1}|\bm{\theta}_{j-1})\right) + \tilde{h}_j(y_j|\bm{\theta}_j)$.
To enhance expressiveness of the dependencies while preserving marginals, multiple \glspl{CF} are chained via permutations.
P-spline penalties mediate between the flexible \glspl{CF} and the baseline \gls{MCTM}.
This framework improves interpretability of nonlinear conditional dependencies through local conditional pseudo-correlations and enables sparse undirected graphical models via a LASSO penalty promoting conditional independence.
However, as a decorrelating \gls{NF} it is less flexible than the approach proposed here, and it has not yet been extended to the conditional setting.

Further background is given in \Cref{sec:mctm,sec:ar_models}.

\section{Hybrid Conditional Masked Autoregressive Bernstein Flows}
\label{sec:method}

To bridge the gap between the limited expressiveness of~\glspl{MCTM} and the black-box nature of general~\glspl{NF}, we propose a combined approach that leverages the strengths of both methodologies.
This hybrid approach allows us to capture complex dependencies in the joint distribution while retaining interpretability for the marginal distributions.

\subsection{Model Specification}

Let $\mathbf{y} = (y_1, \dots, y_J)^\top \in \mathbb{R}^J$ be a $J$-dimensional response vector and $\mathbf{x} = (x_1, \dots, x_U)^\top \in \mathbb{R}^U$ be a vector of $U$ features.
Our goal is to learn the conditional joint density $f_{\mathbf{Y} | \mathbf{X}}(\mathbf{y} | \mathbf{x})$.

\paragraph{Step 1: Modeling Marginal Distributions}

First, we model the marginal distributions of each response variable $Y_j$ given $\mathbf{X}$ using a transformation model:
\begin{equation*}
  \begin{aligned}
    H_1(\mathbf{y},\mathbf{\Theta}\left(\mathbf{x}\right))
     & = \left(h_1(y_1,\bm{\theta}_{1,\mathbf{x}}),\ldots,h_1(y_J,\bm{\theta}_{J,\mathbf{x}})\right)^\top \\
     & =\mathbf{w} = (w_1,\ldots,w_J)^\top,
  \end{aligned}
\end{equation*}
where each element of $\mathbf{W} = (W_1, \dots, W_J)^\top$ follows the base distribution $F_Z$ (\Cref{fig:h1}).
%For $j = 1, \dots, J$,
%
\begin{figure}[t]
  \centering%
  \begin{subfigure}[b]{.45\linewidth}%
    \includegraphics[width=\linewidth]{moons_y.png}%
  \end{subfigure}
  \tikzmark{moons-1}%
  \hfill %
  \tikzmark{moons-h1}%
  \begin{subfigure}[b]{.45\linewidth}%
    \includegraphics[width=\linewidth]{moons_w.png}%
  \end{subfigure}
  \caption{The first transformation $H_1$ maps the marginals to the base distribution $F_Z$.}\label{fig:h1}%
  \tikz[remember picture, overlay, color=mpl_blue]{
    \draw[-latex, line width=1pt]
    ([yshift=.1\linewidth, xshift=-.03\linewidth]pic cs:moons-1)
    --([yshift=.1\linewidth, xshift=.05\linewidth]pic cs:moons-h1)
    node[midway,below] () {$H_1(\mathbf{y}|\mathbf{x})$};
  }%
  \tikz[remember picture, overlay, color=mpl_blue]{
    \draw[-latex, line width=1pt]
    ([yshift=.35\linewidth, xshift=.05\linewidth]pic cs:moons-h1)
    --([yshift=.35\linewidth, xshift=-.03\linewidth]pic cs:moons-1)
    node[midway,above] () {$H^{-1}_1(\mathbf{w}|\mathbf{x})$};
  }%
  \vspace{-12pt}
\end{figure}

Specifically, for each $j = 1, \dots, J$, we assume:
\begin{equation*}%\label{eq:marginal_tm}
  \mathbb{P}(Y_j \leq y_j | \mathbf{X} = \mathbf{x}) = F_{Y_j | \mathbf{X}}(y_j | \mathbf{x}) = F_Z(h_1(y_j, \bm{\theta}_{j,\mathbf{x}})),
\end{equation*}
where $F_Z$ is a known base distribution
%, with log-concave density function (e.g., the standard normal~\gls{PDF}) 
and $h_1(\cdot)\colon \mathbb{R} \rightarrow \mathbb{R}$ is a strictly monotonic marginal transformation function, with potentially feature-dependent parameters $\bm{\theta}_{j,\mathbf{x}}$ obtained from the conditioning function $\mathbf{\Theta}\left(\mathbf{x}\right)$.

For interpretability, a (shifted) Bernstein polynomial can be used:
\begin{equation*} %\label{eq:shifted_bernstein}
  \begin{split}
    h_1(y_j, \bm{\theta}_{j,\mathbf{x}}) & =  \bm\alpha_j(y_j)^\top \bm{\vartheta}_j + \beta_{j,\mathbf{x}},           \\
    \bm\alpha_j(y_j)^\top \bm{\vartheta}_j  & =  \frac{1}{M+1} \sum_{i = 0}^M \text{Be}_i(\tilde{y}_j) \vartheta_{ji},
  \end{split}
\end{equation*}
where $\text{Be}_i(\tilde{y}_j)$ is the density of a Beta distribution with parameters $i+1$ and $M-i+1$ evaluated at the normalized response $\tilde{y}_j = (y_j - l_j)/(u_j - l_j) \in [0, 1]$, with $u_j > l_j$ defining the support of $Y_j$\footnote{The transformation is generally unbounded, since we apply linear extrapolation outside the bounds of the polynomial, as in all our experiments. However, other transformations may be considered if expressiveness in the outer tails is a requirement.}.
The vector $\bm{\vartheta}_j = (\vartheta_{j0}, \dots, \vartheta_{jM})^\top$ contains the Bernstein coefficients, which are constrained to be increasing for monotonicity (see~\Cref{sec:bernstein_poly} for details).
The shift term $\beta_{j,\mathbf{x}}$ allows the marginal distribution of $Y_j$ to vary with the features.
Instead of a feature-dependent shift term, we can also let $\bm{\vartheta}_j$ depend on $\mathbf{x}$ to change the shape of the transformation $h_1$ (or combine feature-dependent shift with feature-dependent $\bm{\vartheta}_j$s).
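The monotonicity constraint can be made explicit through the derivative of the polynomial.
Writing $b_{i,M}(\tilde{y}) = \binom{M}{i}\tilde{y}^{i}(1-\tilde{y})^{M-i} = \text{Be}_i(\tilde{y})/(M+1)$ for the Bernstein basis (a notation introduced here for illustration only), the standard derivative identity for Bernstein polynomials gives, for $y_j \in [l_j,u_j]$ and $M \geq 1$,
\begin{equation*}
  \frac{\partial h_1(y_j, \bm{\theta}_{j,\mathbf{x}})}{\partial y_j}
  = \frac{M}{u_j - l_j} \sum_{i=0}^{M-1} \left(\vartheta_{j,i+1} - \vartheta_{j,i}\right) b_{i,M-1}(\tilde{y}_j).
\end{equation*}
Since $b_{i,M-1}(\tilde{y}_j) \geq 0$ and the basis functions sum to one, strictly increasing coefficients are sufficient for $h_1$ to be strictly increasing, while the shift term $\beta_{j,\mathbf{x}}$ does not affect the derivative.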

\paragraph{Step 2: Modeling Dependency Structures}

We model the dependencies between elements of $\mathbf{W}$ using an autoregressive flow $H_2(\mathbf{w},\mathbf{\Psi}(\mathbf{w}, \mathbf{x})): \mathbf{W}\to\mathbf{Z}, \mathbf{w}\mapsto\mathbf{z}$:
\begin{equation*}
  \begin{split}
    z_{1} & = h_2(w_1 | \bm{\psi}_{1, \mathbf{x}}),                                         \\
    z_{j} & = h_2(w_j | \bm{\psi}_{j, \mathbf{w}_{<j}, \mathbf{x}}), \quad j = 2, \dots, J,
  \end{split}
\end{equation*}
where $h_2(\cdot)$ is an increasing transformation function, with parameters $\bm{\psi}_{j, \mathbf{w}_{<j}, \mathbf{x}}$ depending on previous elements of $\mathbf{w}$ and features $\mathbf{x}$.
%
\begin{figure}[t]
  \centering%
  \begin{subfigure}[b]{.45\linewidth}%
    \includegraphics[width=\linewidth]{moons_w.png}%
  \end{subfigure}
  \tikzmark{moons-2}%
  \hfill %
  \tikzmark{moons-h2}%
  \begin{subfigure}[b]{.45\linewidth}%
    \includegraphics[width=\linewidth]{moons_z.png}%
  \end{subfigure}
  \caption{The second transformation $H_2$ removes the dependency structure.}\label{fig:h2}%
  \tikz[remember picture, overlay, color=mpl_blue]{
    \draw[-latex, line width=1pt]
    ([yshift=.1\linewidth, xshift=-.03\linewidth]pic cs:moons-2)
    --([yshift=.1\linewidth, xshift=.05\linewidth]pic cs:moons-h2)
    node[midway,below] () {$H_2(\mathbf{w}|\mathbf{x})$};
  }%
  \tikz[remember picture, overlay, color=mpl_blue]{
    \draw[-latex, line width=1pt]
    ([yshift=.35\linewidth, xshift=.05\linewidth]pic cs:moons-h2)
    --([yshift=.35\linewidth, xshift=-.03\linewidth]pic cs:moons-2)
    node[midway,above] () {$H^{-1}_2(\mathbf{z}|\mathbf{x})$};
  }%
\end{figure}
%
For $h_2$, flexible transformation functions like Bernstein polynomials or splines can be used.

We combine ideas from \glspl{CF} and \glspl{MAF} to construct the conditioner.
Similar to \glspl{CF}, we do not apply any transformation to $w_1$ but pass it as a conditional input to the masked neural network that estimates the parameters of the subsequent transformations $\bm{\Psi}(\mathbf{w},\mathbf{x})=\left(\bm{\psi}_{1,\mathbf{x}},\bm{\psi}_{2,\mathbf{w}_{<2}, \mathbf{x}},\ldots,\bm{\psi}_{J,\mathbf{w}_{<J}, \mathbf{x}} \right)^\top$.
Binary masking, similar to \gls{MADE}, ensures that the parameters of the $j$th transformation do not depend on $\mathbf{w}_{\geq j}$.
Autoregressive transformations can be stacked if a single transformation is not expressive enough.
Through this autoregressive transformation, we aim to ``de-correlate'' the elements of $\mathbf{W}$ such that $\mathbf{Z}$ follows a multivariate standard normal distribution $\mathcal{N}_{J}(\mathbf{0},\mathbf{I})$.

The joint density of the original response vector $\mathbf{Y}$ is then:\footnote{We use the short notation $H_1\left(\mathbf{y},\mathbf{\Theta}\left(\mathbf{x}\right)\right) = H_1(\mathbf{y}|\mathbf{x})$ and $H_2\left(\mathbf{w}, \mathbf{\Psi}\left(\mathbf{w}, \mathbf{x} \right) \right) = H_2(\mathbf{w}|\mathbf{x})$.}
\begin{equation*}
  \begin{aligned}
    f_{\mathbf{Y} | \mathbf{X}}\left(\mathbf{y} | \mathbf{x}\right)
     & = f_{Z}\left(H\left(\mathbf{y} | \mathbf{x} \right) \right)
    \left| \det \nabla H\left(\mathbf{y} | \mathbf{x} \right) \right|             \\
     & = f_{Z}\left(
    H_2\left(
      H_1\left(\mathbf{y}|\mathbf{x}\right) | \mathbf{x}
      \right)
    \right)                                                                  \\
     & \quad  \cdot \left| \det \left(
    \nabla H_2\left(
    H_1\left(\mathbf{y}|\mathbf{x}\right)
    |\mathbf{x}
    \right) \nabla H_1\left(\mathbf{y}|\mathbf{x}\right)
    \right) \right|.
  \end{aligned}
\end{equation*}
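Because $H_1$ acts elementwise and $H_2$ is autoregressive, both Jacobians are triangular (that of $H_1$ is even diagonal), so the determinant factorizes into a product of univariate derivatives.
As a sketch, assuming increasing $h_1$ and $h_2$ and a base density that factorizes over its components, with $f_Z$ also denoting the univariate base density, the log-density can be written as
\begin{equation*}
  \log f_{\mathbf{Y} | \mathbf{X}}\left(\mathbf{y} | \mathbf{x}\right)
  = \sum_{j=1}^{J} \left[
    \log f_Z(z_j)
    + \log \frac{\partial h_2(w_j | \bm{\psi}_{j, \mathbf{w}_{<j}, \mathbf{x}})}{\partial w_j}
    + \log \frac{\partial h_1(y_j, \bm{\theta}_{j,\mathbf{x}})}{\partial y_j}
    \right],
\end{equation*}
where $w_j = h_1(y_j, \bm{\theta}_{j,\mathbf{x}})$, $z_j = h_2(w_j | \bm{\psi}_{j, \mathbf{w}_{<j}, \mathbf{x}})$, and the parameters for $j=1$ depend on $\mathbf{x}$ only.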
%

\paragraph{Model Training and Inference}

We train the hybrid model by minimizing the conditional negative log-likelihood of the data $\mathcal{D}$,
\begin{equation*}
  \begin{aligned}
%    \hat{\bm{\omega}} & = \argmin_{\bm{\omega}\in\bm{\Omega}}\left\{
   \nll(\bm \omega) &= -\sum_{(\mathbf{y}, \mathbf{x})\in\mathcal{D}}
    \log f_{\mathbf{Y} | \mathbf{X}}\left(\mathbf{y} | \mathbf{x}\right)\\
                       &=
    -\sum_{(\mathbf{y}, \mathbf{x})\in\mathcal{D}}
    \log \left[ f_Z\left(H\left(\mathbf{y} | \mathbf{x}, \bm{\omega} \right) \right)
    \left| \det \nabla H\left(\mathbf{y} | \mathbf{x}, \bm{\omega} \right) \right| \right],
   % \right\}                                                                   %                  \\
                  %    & = \argmin_{\bm{\omega}\in\bm{\Omega}}\left\{ \nll(\bm{\omega}) \right\},
  \end{aligned}
\end{equation*}
to optimize the weights $\bm{\omega}$ of the conditioning functions $\bm{\Theta}(\mathbf{x})$ and $\bm{\Psi}(\mathbf{w},\mathbf{x})$ parameterizing the normalizing flow $H\left(\mathbf{y} | \mathbf{x}, \bm{\omega} \right) = (H_2 \circ H_1)(\mathbf{y} | \mathbf{x}, \bm{\omega})$.

This involves optimizing the parameters of the marginal transformation functions $H_1$ and the parameters of the autoregressive flow $H_2$, which are controlled by masked neural networks. Optimization can be done using any suitable gradient-based optimization algorithm, such as Adam~\citep{Kingma2017a}.
%
Once the model is trained, we can sample from the joint distribution of $\mathbf{Y} | \mathbf{X}$ by:
\begin{enumerate}
  \item Sample $\mathbf{z}$ from the base distribution $F_Z$.
  \item Apply the inverse autoregressive flow to obtain $\mathbf{w} = H_2^{-1}(\mathbf{z}|\mathbf{x})$.
  \item Apply the inverse marginal transformation to obtain $\mathbf{y} = H_1^{-1}(\mathbf{w}| \mathbf{x})$.
\end{enumerate}
In short: $\mathbf{y}=H_1^{-1}(H_2^{-1}(\mathbf{z}|\mathbf{x})|\mathbf{x})$ with $\mathbf{z} \sim F_Z$.
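Written out elementwise, and assuming that $h_1$ and $h_2$ are invertible in their first argument, the inverse pass can be sketched as
\begin{equation*}
  \begin{split}
    w_1 & = h_2^{-1}(z_1 | \bm{\psi}_{1, \mathbf{x}}),                                          \\
    w_j & = h_2^{-1}(z_j | \bm{\psi}_{j, \mathbf{w}_{<j}, \mathbf{x}}), \quad j = 2, \dots, J, \\
    y_j & = h_1^{-1}(w_j, \bm{\theta}_{j,\mathbf{x}}), \quad j = 1, \dots, J.
  \end{split}
\end{equation*}
The second line has to be evaluated sequentially, because the parameters for $w_j$ require $w_1,\ldots,w_{j-1}$, whereas the inversion of $H_1$ can be carried out independently for each $j$.
Depending on the choice of transformation function, the inverses may have to be computed numerically.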

\subsection{Interpretability}
Our hybrid approach combines the interpretability of \glspl{CTM} with the flexibility of autoregressive \glspl{NF}:
The marginal transformation step ($H_1$) may use shifted Bernstein polynomials, which allows quantifying feature effects on the marginal distribution, similar to coefficients in linear models or \glspl{GAM} \citep{Wood2017}.
For instance, with a logistic base distribution, linear effect coefficients in the \gls{SAP} can be interpreted as linear changes on the level of log-odds ratios (see \Cref{sec:sap,sec:interpretation_details} for more details).
%While the masked autoregressive flow in step two ($H_2$) models complex dependencies, it lacks the direct interpretability of marginal transformations.
%
Our model prioritizes marginal interpretability while accepting a trade-off for the dependence structure to gain flexibility via the masked autoregressive flow in step two ($H_2$).
It is suitable for scenarios requiring understanding individual feature effects while modeling complex relationships between response variables.
Understanding how features affect the dependence structure in this step is challenging and remains an open research question.
%For example, one could analyze how the learned flow parameters vary with different feature values or employ techniques from~\gls{xAI} to gain insights into the model's behavior.

\subsection{Relation to Copula Methods}
\label{sec:copula}

%The autoregressive flow ($H_2$) implicitly models this copula density by transforming the normalized responses into a multivariate normal distribution.
%Traditional \glspl{MCTM} use a linear transformation for the copula, while our approach uses a more flexible non-linear transformation.
%Further insights on the relation to copula methods is given in \Cref{sec:copula}.

\begin{figure*}[t]%
\centering 
  \begin{subfigure}[t]{0.3\linewidth}
    \centering
    \includegraphics[width=\linewidth]{gfx/moons_y.png}
    \caption{Original Data}
  \end{subfigure}
  \hfil%
  \begin{subfigure}[t]{0.3\linewidth}
    \centering
    \includegraphics[width=\linewidth]{moons_w.png}
    \caption{Normalized Marginals}
  \end{subfigure}
  \hfil%
  \begin{subfigure}[t]{0.3\linewidth}
    \centering
    \includegraphics[width=\linewidth]{moons_pit.png}
    \caption{Uniform Marginals}
  \end{subfigure}
  \caption{Illustration of the hybrid approach on the Moons dataset.
    (a) The original data exhibits a non-linear dependency structure.
    (b) After applying $H_1$, the marginal distributions follow a normal distribution.
    (c) The autoregressive flow $H_2$ further transforms the data to obtain approximately independent uniform marginals, implicitly modeling the copula function.}
  \label{fig:copula_illustration}
\end{figure*}

Copulas model multivariate distributions by separating marginal distributions from the dependence structure.
Sklar's theorem states that any multivariate \gls{CDF} can be expressed in terms of its marginal \glspl{CDF} and a copula function.
Let $F_{\mathbf{Y} | \mathbf{X}}(\mathbf{y} | \mathbf{x})$ be the joint \gls{CDF} of the response vector $\mathbf{Y} = (Y_1, \dots, Y_J)^\top$ given features $\mathbf{X}$.
By Sklar's theorem, there exists a copula function $C(u_1, \dots, u_J | \mathbf{x})$ such that:
\begin{align*}
  &F_{\mathbf{Y} | \mathbf{X}}(y_1, \dots, y_J | \mathbf{x})\\
  &= C(F_{Y_1 | \mathbf{X}}(y_1 | \mathbf{x}), \dots, F_{Y_J | \mathbf{X}}(y_J | \mathbf{x}) | \mathbf{x}),
\end{align*}
where $u_j = F_{Y_j | \mathbf{X}}(y_j | \mathbf{x})$ are the values of the marginal \glspl{CDF}, which are uniformly distributed on $[0,1]$.
The copula $C$ is a multivariate \gls{CDF} on $[0,1]^J$ with uniform marginals.
Our hybrid approach directly relates to this.
The first step ($H_1$) models marginals $F_{Y_j|\mathbf{X}}$ and transforms them to the base distribution $F_Z$.
Applying the \gls{PIT}, $u_j = F_Z(w_j)$, yields uniform marginals.



The copula density $c$ of $\bm{u}=(u_1, \dots, u_J)^\top$ is the ratio of the joint density and the product of the marginal densities:
\begin{equation*}
  c(\bm{u} | \mathbf{x}) = \frac{f_{\mathbf{Y} | \mathbf{X}}(F^{-1}_{Y_1 | \mathbf{X}}(u_1 | \mathbf{x}), \dots, F^{-1}_{Y_J | \mathbf{X}}(u_J | \mathbf{x}) | \mathbf{x})}{\prod_{j=1}^J f_{Y_j | \mathbf{X}}(F^{-1}_{Y_j | \mathbf{X}}(u_j | \mathbf{x}) | \mathbf{x})}.
\end{equation*}
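Under the assumption that $H_1$ captures the marginals exactly, so that each $W_j$ has \gls{CDF} $F_Z$, this ratio can be expressed on the scale of $\mathbf{w}$, because copulas are invariant under strictly increasing transformations of the margins.
As a sketch, with $w_j = F_Z^{-1}(u_j)$, $f_Z$ denoting both the joint and the univariate base density, and $f_{\mathbf{W} | \mathbf{X}}$ denoting the density of $\mathbf{W}$ implied by the flow $H_2$ (a notation introduced here for illustration),
\begin{equation*}
  c(\bm{u} | \mathbf{x}) = \frac{f_{\mathbf{W} | \mathbf{X}}(\mathbf{w} | \mathbf{x})}{\prod_{j=1}^J f_Z(w_j)},
  \qquad
  f_{\mathbf{W} | \mathbf{X}}(\mathbf{w} | \mathbf{x}) = f_Z\left(H_2(\mathbf{w} | \mathbf{x})\right) \left| \det \nabla H_2(\mathbf{w} | \mathbf{x}) \right|.
\end{equation*}
The marginal derivatives of $h_1$ cancel between numerator and denominator, so in this reading the autoregressive flow $H_2$ alone determines the copula density.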

\section{Numerical Experiments}
\label{sec:experiments}

Next, we evaluate our method using simulated and real-world datasets.
%Implementations are available on GitHub\footnote{\url{https://github.com/MArpogaus/hybrid_flows}}.
All models are implemented using \texttt{TensorFlow} (v2.15.1) and \texttt{TensorFlow Probability} (v0.23.0) and trained using Adam~\citep{Tensorflow,TensorflowProbability,Kingma2017a}.
In our experiments, hyperparameters were chosen based on our previous experience and settings suggested in related works.
However, we performed some automated hyperparameter tuning (using Optuna) for some models applied to the simulated 2D data, with performance monitored on a held-out validation set. This was mainly done to understand the influence of certain parameters; it was not necessary to obtain a good model for each dataset.
We found that high-order Bernstein polynomials are needed for some datasets, suggesting that the fixed knot placement of Bernstein polynomials could limit their flexibility~\citep[see][for an ablation on the sensitivity to the Bernstein order]{Hothorn2018}.
%The difference in smoothness warrants further investigation and could be related to the inherent properties of Bernstein polynomials as approximators, with equidistant knots.

Splines, on the other hand, might be more suitable for representing highly nonlinear or discontinuous functions, leading to a better fit on certain datasets.
All hyperparameters used to generate the results presented are documented in the supplementary material and in the GitHub repository\footnote{\url{https://github.com/MArpogaus/hybrid_flows}}.
This includes the code to perform the hyperparameter tuning along with the defined search spaces.

\subsection{Benchmark Datasets}

We evaluate our method on five common benchmark datasets: POWER, GAS, HEPMASS, MINIBOONE, and BSDS300 (see Appendix \Cref{sec:benchmark_data}).
We follow the preprocessing steps as in \citet{Papamakarios2018} and compare a \gls{MAF} and our proposed \gls{HMAF}.
Both models utilize masked autoregressive transformations.
Each \gls{MAF} layer uses a masked affine transformation, parameterized by a masked neural network followed by an invertible linear transformation ($1\times 1$ convolution) initialized with a random (non-trainable) permutation.
For the \gls{HMAF}, we use a marginal transformation step with Bernstein polynomials before the \gls{MAF} layers.

% 

For both \gls{MAF} and \gls{HMAF}, we employ rational quadratic splines (RQS) as the transformation functions within the \gls{MAF} layers of $H_2$. Hyperparameters, including the number of \gls{MAF} layers, the number of hidden units in the \gls{MADE} networks, and the number of bins for the RQS transformations, were chosen based on the values reported in \citet{Durkan2019} and are detailed in the supplementary material and the GitHub repository.
All models were trained using the Adam optimizer with early stopping after $50$ epochs without improvements and a cosine learning rate schedule.

\begin{table*}[htb!]
  \caption{
    Test negative log-likelihood comparison against the state-of-the-art on real-world datasets (lower is better).\\
    Values are averaged over 20 trials, and their spread is reported as two standard deviations.
  }
  \label{tab:benchmark-nll}
\resizebox{0.99\textwidth}{!}{
    \begin{tabular}{l|llllll}
      \toprule
      model                          & loss            & BSDS300              & GAS                 & HEPMASS            & MINIBOONE          & POWER              \\
      \midrule
      \multirow[c]{3}{*}{\gls{HMAF}} & test loss       & -153.663 $\pm$ 0.037 & -11.625 $\pm$ 0.209 & 18.103 $\pm$ 0.058 & 12.057 $\pm$ 0.092 & -0.527 $\pm$ 0.003 \\
                                     & train loss      & -165.274 $\pm$ 0.124 & -11.790 $\pm$ 0.230 & 17.856 $\pm$ 0.063 & 8.715 $\pm$ 0.415  & -0.573 $\pm$ 0.004 \\
                                     & validation loss & -168.769 $\pm$ 0.035 & -11.621 $\pm$ 0.210 & 18.123 $\pm$ 0.057 & 11.519 $\pm$ 0.070 & -0.537 $\pm$ 0.003 \\
      \midrule
      \multirow[c]{3}{*}{\gls{MAF}}  & test loss       & -155.057 $\pm$ 0.065 & -11.781 $\pm$ 0.032 & 18.090 $\pm$ 0.039 & 12.030 $\pm$ 0.073 & -0.541 $\pm$ 0.003 \\
                                     & train loss      & -167.694 $\pm$ 0.293 & -11.881 $\pm$ 0.032 & 17.868 $\pm$ 0.041 & 9.605 $\pm$ 0.048  & -0.606 $\pm$ 0.003 \\
                                     & validation loss & -170.188 $\pm$ 0.085 & -11.779 $\pm$ 0.031 & 18.102 $\pm$ 0.040 & 11.546 $\pm$ 0.048 & -0.551 $\pm$ 0.003 \\
      \bottomrule
    \end{tabular}
}
\end{table*}

\Cref{tab:benchmark-nll} presents the negative log-likelihoods achieved by our models.
Overall, \gls{HMAF} demonstrates competitive performance compared to \glspl{MAF}.
%The results suggest that incorporating marginal transformations can improve performance.
%However, the need for high-order Bernstein polynomials in some cases highlights the potential limitations of using fixed knot placement.
Further investigation into more flexible marginal transformations could lead to even better performance and broader applicability of hybrid flow-based models.
See \Cref{sec:benchmark_data} for a visualization of individual model runs.

\subsection{Simulated Data}
\label{sec:simulated_data}

We start with two classical bivariate (i.e., $J=2$) benchmark datasets for \glspl{NF}, \emph{moons} and \emph{circles}, which exhibit non-linear dependencies.
Each dataset has $16{,}384$ data points (25\% of which are reserved for validation), generated using \texttt{scikit-learn}~\citep{scikit-learn}.
A binary feature $x$ based on spatial location is introduced.
For \emph{moons}, $x=1$ indicates the lower-right moon, and $x=0$ the upper-left.
For \emph{circles}, $x=1$ denotes the inner circle, and $x=0$ the outer.
This setup tests whether the models can capture the conditional density $f(\mathbf{y}|x)$.
We compare the following models:
\begin{itemize}
  \item \textbf{\acrfull{MVN}:} Assumes a conditional multivariate normal distribution parameterized by $x$ using a fully connected network (two layers, 16 units, ReLU activation).

  \item 
\textbf{\acrfull{MCTM}:} Employs Bernstein polynomials of order $M=300$ for marginal transformations and a triangular matrix $\Lambda$ for the dependency structure. In the conditional case, the marginal shift parameter $\beta_x$ and $\Lambda_x$ are obtained using Bernstein polynomials of order 6, as in \citet{Klein2022}.
%
        % \begin{equation*}
        %   h_j(y_j|\mathbf{x}) = \lambda_{j,\mathbf{x}}\left(\alpha_j(y_j)^\top \bm{\vartheta}_j + \beta_{j,\mathbf{X}}\right) \\
        % \end{equation*}
%
        While this approach offers interpretability through \gls{SAP}, the linear dependence structure constrains its capacity.

  \item \textbf{\acrfull{CF} and \acrfull{MAF}:} These consist of two stacked coupling or masked autoregressive layers, parameterized by fully connected or masked neural networks, respectively (three layers, 128 units, ReLU activation).
        Conditional models incorporate $x$ as input to a fully connected network whose output is added to the first layer's parameters. We use Bernstein polynomials of order $M=300$ (\gls{CF}~(B), \gls{MAF}~(B)) and quadratic splines with 32 bins (\gls{CF}~(S), \gls{MAF}~(S)) as transformation functions.

  \item \textbf{\acrfull{HCF}:} Combines elementwise Bernstein polynomials of order $M=300$ for marginals (similar to \glspl{MCTM}) with a single coupling layer to model dependencies.
        Again, Bernstein polynomials (B) and quadratic splines (S) are compared as transformation functions in the coupling layer.
        As the data exhibit a very complex distribution, simple shift terms are not enough to model the feature effect in the marginals.
        Instead, class-specific coefficients are used to model the feature effect on the Bernstein coefficients.
\end{itemize}
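
As a reminder of the building block shared by several of these models, a Bernstein polynomial of order $M$ can be sketched as follows. Here $\tilde{y} \in [0,1]$ denotes a rescaled outcome and $\vartheta_0,\ldots,\vartheta_M$ the coefficients; the rescaling and the indexing are assumptions made for illustration and may differ from the actual implementation:
\begin{equation*}
  % Illustrative notation: \tilde{y} is the outcome rescaled to the unit interval.
  h(\tilde{y}) = \sum_{k=0}^{M} \vartheta_k \binom{M}{k}\, \tilde{y}^{k} (1-\tilde{y})^{M-k}.
\end{equation*}
The polynomial is strictly increasing whenever $\vartheta_0 < \vartheta_1 < \dots < \vartheta_M$, which is a sufficient condition for $h$ to define a valid transformation.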

All models use the Adam optimizer~\citep{Kingma2017a} with early stopping and cosine learning rate decay~\citep{Loshchilov2017}.
All chosen hyperparameters are reported in the supplementary material.
%Common hyperparameters include epochs (200 for \gls{MVN}, 400 for others), batch size (256 for \gls{MVN}, 512 for others), and initial learning rate (0.01, 0.001).
%For \gls{HCF} (B) models, we performed a hyperparameter search using Optuna~\citep{Optuna} (\Cref{tab:hcfb_hyperparameter_search}).
%Other models used manually selected hyperparameters (Appendix \ref{sec:model_details}).

\newsavebox{\mycolumnbox}
\begin{figure}[htb!]
  \centering
  % Moons Conditional
  \setlength{\tabcolsep}{1pt}
  \begin{tabular}{>{\begin{lrbox}{\mycolumnbox}}%
               l%
               <{\end{lrbox}\rotatebox[origin=c]{90}{\textbf{\unhbox\mycolumnbox}}}%
      cc|cc}
            & Uncond.
            & Cond.
            & Uncond.
            & Cond.
    \\
    MVN     & \pdfplotcolumns{circles}{multivariate_normal_contour}
            & \pdfplotcolumns{moons}{multivariate_normal_contour}                           \\
    MCTM    & \pdfplotcolumns{circles}{multivariate_transformation_model_contour}
            & \pdfplotcolumns{moons}{multivariate_transformation_model_contour}             \\
    CF (S)  & \pdfplotcolumns{circles}{coupling_flow_quadratic_spline_contour}
            & \pdfplotcolumns{moons}{coupling_flow_quadratic_spline_contour}                \\
    CF (B)  & \pdfplotcolumns{circles}{coupling_flow_bernstein_poly_contour}
            & \pdfplotcolumns{moons}{coupling_flow_bernstein_poly_contour}                  \\
    MAF (S) & \pdfplotcolumns{circles}{masked_autoregressive_flow_quadratic_spline_contour}
            & \pdfplotcolumns{moons}{masked_autoregressive_flow_quadratic_spline_contour}   \\
    MAF (B) & \pdfplotcolumns{circles}{masked_autoregressive_flow_bernstein_poly_contour}
            & \pdfplotcolumns{moons}{masked_autoregressive_flow_bernstein_poly_contour}     \\
    \midrule
    HCF (S) & \pdfplotcolumns{circles}{hybrid_coupling_flow_quadratic_spline_contour}
            & \pdfplotcolumns{moons}{hybrid_coupling_flow_quadratic_spline_contour}         \\
    HCF (B) & \pdfplotcolumns{circles}{hybrid_coupling_flow_bernstein_poly_contour}
            & \pdfplotcolumns{moons}{hybrid_coupling_flow_bernstein_poly_contour}           \\
  \end{tabular}
  \caption{Estimated densities from different models fitted to simulated 2D data.
    Left two columns: Circles dataset, Right two columns: Moons dataset.
    Within each pair of columns, the estimated density is shown without conditioning on $x$ (Uncond.) and conditioned on $x$ (Cond.).
    The feature $x$ is 1 for the inner circle and the lower-right moon, respectively.}
  \label{fig:simulated_data}
\end{figure}

\paragraph{Results}
\Cref{fig:simulated_data} visualizes the estimated densities for different models on both the \emph{circles} and \emph{moons} datasets. For each dataset, contour plots of the estimated densities are shown without conditioning on $x$ (Uncond.) and conditioned on $x$ (Cond.).
As expected, the \gls{MVN} and the \gls{MCTM} struggle with the non-linear dependencies.
The \gls{MVN} is restricted to elliptical distributions, while the \gls{MCTM}, despite capturing marginals well, fails to represent the complex joint distribution due to its linear dependence structure.
In contrast, \glspl{CF}, \glspl{MAF}, and \gls{HCF} successfully capture the datasets' non-linear shapes, demonstrating their ability to model complex distributions.
Interestingly, models using Bernstein polynomials (\gls{CF}~(B), \gls{MAF}~(B), \gls{HCF}~(B)) tend to produce slightly blurrier results, indicating a bias towards smoother distributions compared to the spline-based models.
Among the flexible models, quadratic spline transformations generally show a slight performance advantage over Bernstein polynomials for the \glspl{CF} and \glspl{MAF} architectures in terms of capturing fine-grained details of the data distribution, although the \gls{HCF} models seem to perform equally well with both transformations.
%The difference in smoothness warrants further investigation and could be related to the inherent properties of Bernstein polynomials as approximators, with equidistant knots.
%Splines, on the other hand, might be more suitable for representing highly non-linear or discontinuous functions in these specific datasets.

\Cref{tab:simulated_results} (in \Cref{sec:sim-nll}) provides a quantitative comparison, reporting the average test \gls{NLL} across 20 trials.
The results corroborate the visual observations from \Cref{fig:simulated_data}.
\gls{MVN} and \gls{MCTM} have substantially higher \gls{NLL} values, indicating their inadequacy for these complex distributions.
\glspl{CF} and \glspl{MAF}, especially with spline transformations, consistently achieve lower \gls{NLL}.
This difference underscores the importance of flexible dependency modeling.
The \gls{HCF} models %, while not achieving the lowest \gls{NLL} values, still 
perform well and demonstrate the effectiveness of the hybrid approach.
The blurrier densities of the Bernstein polynomial models in \Cref{fig:simulated_data} are directly reflected in higher \gls{NLL} values, suggesting that these models lack some of the flexibility needed to represent the data distribution.
Consistent with the visualizations, including feature information improves model performance across all models, leading to consistently lower \gls{NLL} values in the conditional setting.
This improvement highlights the models' ability to leverage the feature to better capture the conditional structure of the data.
%The relatively large standard deviations in the NLL values, particularly for some \glspl{CF} models, might indicate suboptimal choice of hyper parameters like learning rate for these types of models, which could be investigated by more thorough hyper parameter tuning.

\subsection{Malnutrition Data}

Finally, we evaluate our approach using a real-world dataset on childhood malnutrition in India, with three anthropometric indices (\texttt{stunting}, \texttt{wasting}, and \texttt{underweight}) as response variables.
The goal is to model the joint distribution conditional on the child's age (\texttt{cage}).
We follow the data preprocessing steps from \citet{Klein2022}.

We fit three models, all using Bernstein polynomials of order $M=6$ for the marginal transformations with a linear feature shift.
\begin{itemize}
  \item \gls{MCTM}: Identical to the model specification in \citet{Klein2022}, this model combines marginal transformations with a triangular matrix $\Lambda$.
  We capture the influence of \texttt{cage} on both the marginals (using linear shifts) and the elements of $\Lambda$ using Bernstein polynomials of order $M=6$, i.e., the marginal shift is modeled as a Bernstein polynomial $\bm{\beta}_x = \alpha(\text{cage})^\top \bm{\vartheta}$, where $\alpha$ denotes the Bernstein basis and $\bm{\vartheta}$ the coefficients.

  \item \acrfull{HMAF}: For our approach, the \gls{MAF} for $H_2$ uses either Bernstein polynomials (B) or quadratic splines (S).
        We apply $H_2$ to $\mathbf{y}_{j>1}$, as $y_1$ is already normalized by $H_1$.
        Each \gls{MAF} layer is parameterized by a masked neural network conditional on \texttt{cage}. To capture dependencies between $y_1$ and $\mathbf{y}_{j>1}$, an additional fully connected neural network is used to condition the output of the \gls{MADE} networks on $y_1$.
\end{itemize}
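
The density implied by this two-step construction follows from the change-of-variables formula. The following is a sketch under stated assumptions: a standard normal base density $\varphi$, elementwise marginal transformations $w_j = h_{1,j}(y_j \mid \text{cage})$, and autoregressive transformations $z_j = h_{2,j}(w_j \mid \mathbf{w}_{<j}, \text{cage})$ for $j>1$ with $z_1 = w_1$. The symbols $h_{1,j}$, $h_{2,j}$, and $z_j$ are introduced here for illustration only:
\begin{equation*}
  % Sketch only: h_{1,j}, h_{2,j}, and z_j are illustrative notation, not taken from the model definition.
  \log f(\mathbf{y} \mid \text{cage})
  = \sum_{j=1}^{J} \log \varphi(z_j)
  + \sum_{j=1}^{J} \log \frac{\partial h_{1,j}}{\partial y_j}
  + \sum_{j=2}^{J} \log \frac{\partial h_{2,j}}{\partial w_j}.
\end{equation*}
The Jacobian of $H_1$ is diagonal and that of $H_2$ is triangular, so both log-determinants reduce to sums over the diagonal entries.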

\paragraph{Results}\Cref{tab:malnutrition_results} presents the average test \gls{NLL}.
Both \gls{HMAF} variants outperform the \gls{MCTM} (3.470), most clearly \gls{HMAF}~(S) (0.687) but also \gls{HMAF}~(B) (2.062), indicating that the non-linear dependence modeling of the \gls{MAF} is crucial.
Additionally, this is confirmed by \Cref{fig:malnutrition_samples} in \Cref{sec:malnutrition_samples} via scatter, pairwise density, and marginal density plots of the dataset as well as samples drawn from the three models.
The plots indicate that the \gls{MCTM} captures marginals well, but fails to model dependencies.
The \gls{HMAF} model shows a greater resemblance to the observed data in the pairwise density and scatter plots.

\begin{table}[htb]
  \centering
  \caption{Average test \gls{NLL} on the malnutrition data.
    Lower values indicate better performance. Values are averaged over 20 trials, and their spread is reported as two standard deviations.}
  \label{tab:malnutrition_results}
  \begin{tabular}{ll}
    \toprule
    model & test loss \\
    \midrule
    MCTM & 3.470 $\pm$ 0.026 \\
    HMAF (S) & 0.687 $\pm$ 0.206 \\
    HMAF (B) & 2.062 $\pm$ 0.038 \\
    \bottomrule
  \end{tabular}
\end{table}

\paragraph{Marginal distributions}
%\label{sec:malnutrition_marginals}
To assess whether the model captures the marginal distributions, \citet{Hothorn2014} suggest using \gls{QQ} plots to verify that $\mathbf{W}$ follows the base distribution $F_W=\mathcal{N}(0,1)$. \Cref{fig:qq_w_base} shows the empirical quantile function of marginally normalized samples $\mathbf{W}$ from the validation set plotted against the quantile function of a standard normal distribution evaluated at 200 equidistant probabilities. The results suggest a good fit of the transformation $H_1$, with a slight underestimation of the marginals.
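
Concretely, such a \gls{QQ} plot can be constructed as sketched below. The particular probability grid is an assumption, since only the number of probabilities (200) is stated; $\hat{Q}_{W_j}$ denotes the empirical quantile function of the $j$-th component of $\mathbf{W}$ and $\Phi^{-1}$ the standard normal quantile function, both introduced here for illustration:
\begin{equation*}
  % Assumed grid p_k = k/201; it is equidistant and avoids the endpoints 0 and 1.
  \left(\Phi^{-1}(p_k),\; \hat{Q}_{W_j}(p_k)\right), \qquad p_k = \frac{k}{201}, \quad k = 1,\ldots,200.
\end{equation*}
If $H_1$ normalizes the marginals well, these points lie close to the diagonal.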

\begin{figure}[htb!]
  \centering
  \includegraphics[width=\linewidth]{malnutrition_qq_w_base_seeds.pdf}
  \caption{\gls{QQ} plots of transformed samples against a standard normal distribution.
    Deviations from the diagonal indicate non-normality.
    The solid line represents the mean, while the shaded area indicates the 95\% probability intervals obtained from 20 trials of randomly initialized models.}
  \label{fig:qq_w_base}
\end{figure}

To assess the influence of $H_2$ on the marginals, we plot the empirical quantiles of the dataset against those of an equal number of samples from each of the three models in \Cref{fig:qq_data_samples}.
All three models leave the normalized samples for \texttt{stunting} unchanged while introducing more deviations for \texttt{wasting} and \texttt{underweight}.
This deviation is especially large for the \gls{MCTM}. The \gls{HMAF} models, particularly the spline variant, generally perform well but struggle with the upper tails.

\begin{figure}[htb!]
    \centering
    \includegraphics[width=\linewidth]{malnutrition_qq_data_samples_seeds.pdf}
    \caption{\gls{QQ} plot comparing empirical quantiles of the dataset with those generated by the three models.
    The solid lines represent the mean, while the shaded areas indicate the 95\% probability intervals obtained from 20 trials of randomly initialized models.}
    \label{fig:qq_data_samples}
\end{figure}

\subsection{Interpretability}

A crucial aspect of our proposed hybrid approach is its ability to model marginal distributions in an interpretable manner while simultaneously capturing complex dependencies. Using the malnutrition data, we demonstrate how the learned relationships can be interpreted.

\paragraph{Feature-induced Changes in Marginal Distribution}\Cref{fig:marginal_effects_comparison} shows the estimated marginal \glspl{CDF} $F(y_j|\text{cage})$ and \glspl{PDF} $f(y_j|\text{cage})$.
The plots show non-linear shifts towards lower values as the age of the child increases, indicating a deteriorating nutritional status according to all three malnutrition indicators. This overall trend can be explained by the fact that most children are born with a close-to-normal nutritional status, but the effects of resource-scarce environments become increasingly relevant as children age.
%
\begin{figure}[tb!]
  \centering
  \includegraphics[width=\columnwidth]{malnutrition_conditional_multivariate_transformation_model_distribution.pdf}
  \caption{Comparison of marginal effect on the \glspl{PDF} and \glspl{CDF} of \texttt{stunting}, \texttt{wasting}, and \texttt{underweight} with respect to \texttt{cage}.}
  \label{fig:marginal_effects_comparison}
\end{figure}
%

\begin{figure}[h!]
    \centering
    \includegraphics[width=\columnwidth]{malnutrition_marginal_shift_inv_seeds.pdf}
    \caption{Inverse marginal shift $-\bm{\beta}_x$. The plot demonstrates the relationship between marginal shifts and \texttt{cage}.
    The solid line represents the mean and the shaded area the 95\% probability intervals over 20 trials of randomly initialized models.}
    \label{fig:marginal_shift}
\end{figure}

\paragraph{Inverse Marginal Shift Terms} A more nuanced interpretation can be achieved by depicting the inverse marginal shift terms $-\bm{\beta}_x$ as in \Cref{fig:marginal_shift}.

The plot shows the complex non-linear change in nutritional status with increasing age, and also highlights that the change is more pronounced for stunting and underweight. This coincides with the intuition that chronic malnutrition (as measured by the stunting indicator) accumulates over time, whereas acute malnutrition (as measured by the wasting indicator) reflects short-term deficits and is therefore less strongly tied to age. Underweight represents a mixture of both chronic and acute malnutrition, which again fits with the estimated shift term.
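
The reason for plotting the negated shift can be sketched as follows, assuming a marginal transformation of the shift form $h_j(y_j \mid x) = \alpha_j(y_j)^\top \bm{\vartheta}_j + \beta_{j,x}$ with a standard normal base distribution; the index $j$ on the shift term is illustrative notation:
\begin{equation*}
  % Sketch under the assumed shift form of the marginal transformation.
  F(y_j \mid x) = \Phi\!\left(\alpha_j(y_j)^\top \bm{\vartheta}_j + \beta_{j,x}\right)
  \quad\Longrightarrow\quad
  \alpha_j\!\left(q_p\right)^\top \bm{\vartheta}_j = \Phi^{-1}(p) - \beta_{j,x},
\end{equation*}
where $q_p$ denotes the conditional $p$-quantile of $y_j$. A larger $\beta_{j,x}$ thus raises the \gls{CDF} at every $y_j$ and moves all quantiles towards lower values, so that $-\beta_{j,x}$ points in the direction in which the distribution shifts on the transformed scale.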
%


\section{Discussion}
\label{sec:discussion}
We introduce a hybrid approach for density regression that combines the strengths of~\glspl{MCTM} and autoregressive~\glspl{NF}.
%The resulting \gls{HMAF} allows for flexible modeling of the dependency structure while retaining the interpretability of structured additive predictors for the marginal distributions.
In the first step of our \gls{HMAF} approach, we model the marginal distributions using a transformation model with an interpretable structured additive predictor.
This enables transparent modeling of feature effects on the marginal distribution.
In the second step, we use an autoregressive \gls{NF} to effectively capture complex, non-linear dependencies between the response variables while preserving the marginals.
Our results on simulated and real-world datasets showed that our method is competitive with state-of-the-art methods.

The hybrid approach offers 1) interpretability, by
%\begin{description}
  % \item[Interpretability:] 
  transparently modeling feature effects using structured additive predictors, 2) flexibility, by
  % \item[Flexibility:] 
  capturing complex non-linear dependencies using an autoregressive flow, and 3) efficiency, by 
  % \item[Efficiency:] 
  offering fast computation of log-likelihoods and gradients due to the \gls{MADE} architecture.
% \end{description}

% The reduced interpretability of feature effects on the dependence structure modeled by the autoregressive flow is a limitation.
% Future work may explore methods to interpret the flow parameters or employ \gls{xAI} techniques.
% Future research should focus on methods for directly interpreting feature effects on dependency, extend the approach to discrete outcomes, and investigate different flow architectures.
%%% Local Variables:
%%% mode: LaTeX
%%% TeX-master: "uai_main"
%%% End:
