%\documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
%\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Include other packages here, before hyperref.
\usepackage{graphicx}
%\usepackage{amsmath}
%\usepackage{amssymb}
%-------------------------------------------------------------------------
\usepackage{bm}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage{multirow}
\usepackage{xcolor}
\usepackage[normalem]{ulem}
\usepackage[accsupp]{axessibility}

\usepackage{multicol}
\usepackage{subfigure}
\usepackage{xpatch}
\usepackage{nicefrac}  % compact symbols for 1/2, etc.
\usepackage{tabularx}
\usepackage{colortbl}
\usepackage{pifont}

\newcommand{\eg}{e.g.,~}
\newcommand{\etc}{etc.}
\newcommand{\ie}{i.e.~}
\newcommand{\wrt}{w.r.t.~}
\newcommand{\fst}{\textbf}
\newcommand{\scd}{\textun}
\newcommand{\pp}{p.p.}
%
\newcommand{\concat}{\textsc{Concat}}
\newcommand{\osplit}{\textsc{Split}}
\newcommand{\neuralNet}{\textsc{NN}}
\newcommand{\condneuralNet}{\textsc{CN}}
%
\newcommand{\cmark}{\ding{51}}
\newcommand{\xmark}{\ding{55}}
\newcommand{\textun}[1]{\underline{#1}}
%
\DeclareMathOperator*{\argmax}{arg\,max}
\newcommand{\E}{\mathbb{E}}
\newcommand{\Ls}{\mathcal{L}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\B}{\mathbb{B}}
% Vectors
\def\vlambda{{\bm{\lambda}}}
\def\vtheta{{\bm{\theta}}}
\def\vgamma{{\bm{\gamma}}}
\def\vc{{\bm{c}}}
\def\vu{{\bm{u}}}
\def\vv{{\bm{v}}}
\def\vs{{\bm{s}}}
\def\vt{{\bm{t}}}
\def\vx{{\bm{x}}}
\def\vz{{\bm{z}}}
% Sets
\def\sN{{\mathbb{N}}}
\def\sR{{\mathbb{R}}}
\def\sZ{{\mathbb{Z}}}
% Matrix
\def\mJ{{\bm{J}}}
\def\mM{{\bm{M}}}
\def\mW{{\bm{W}}}
% Graph
\def\gC{{\mathcal{C}}}
\def\gS{{\mathcal{S}}}
\def\gU{{\mathcal{U}}}
\def\gV{{\mathcal{V}}}
\def\gX{{\mathcal{X}}}

%-------------------------------------------------------------------------
\title{ContextFlow++: Generalist-Specialist Flow-based Generative Models
	\\ with Mixed-Variable Context Encoding}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<denis.gudovskiy@us.panasonic.com>?Subject=Your UAI 2024 paper}{Denis Gudovskiy}{}}
\author[2]{Tomoyuki Okuno}
\author[2]{Yohei Nakata}
% Add affiliations after the authors
\affil[1]{%
	Panasonic AI Lab, Mountain View, CA, USA}
\affil[2]{%
	Panasonic Holdings Corporation, Osaka, Japan}


\begin{document}
\maketitle

\begin{abstract}
Normalizing flow-based generative models have been widely used in applications where the exact density estimation is of major importance. Recent research proposes numerous methods to improve their expressivity. 
However, conditioning on a context is a largely overlooked area in bijective flow research. Conventional conditioning with vector concatenation is limited to only a few flow types. 
More importantly, this approach cannot support a practical setup where a set of context-conditioned (\textit{specialist}) models are trained with a fixed pretrained general-knowledge (\textit{generalist}) model. We propose the ContextFlow++ approach to overcome these limitations using additive conditioning with explicit generalist-specialist knowledge decoupling. Furthermore, we support discrete contexts with the proposed mixed-variable architecture with context encoders. In particular, our context encoder for discrete variables is a surjective flow from which the context-conditioned continuous variables are sampled. Our experiments on rotated MNIST-R, corrupted CIFAR-10C, real-world ATM predictive maintenance and SMAP unsupervised anomaly detection benchmarks show that the proposed ContextFlow++ offers faster stable training and achieves higher performance metrics. Our code is publicly available at \href{https://github.com/gudovskiy/contextflow}{github.com/gudovskiy/contextflow}.
\end{abstract}

\section{Introduction}\label{sec:intro}
Recently, probabilistic generative models \citep{NIPS2014_d523773c} have gained attention as a solution for challenges in many fields \eg molecular discovery \citep{molecular} and high-resolution image synthesis \citep{rombach2021highresolution}. An important class of such models is the bijective normalizing flows. Unlike variational autoencoders (VAEs) \citep{Kingma2013AutoEncodingVB} and diffusion models \citep{sohl2015diffusion}, normalizing flows can estimate data likelihoods exactly. Therefore, flows are widely used in semi-supervised prediction \citep{izmailov2020semi}, time series forecasting \citep{rasul2021multivariate}, unsupervised anomaly detection in computer vision \citep{cflow_ad}, molecular graph generation \citep{Kuznetsov2021MolGrowGraphNormalizing}, \etc

\begin{figure}[t]
	\begin{center}
		\centering
		\includegraphics[width=0.92\columnwidth]{figs/fig-problem.pdf}
		\caption{Normalizing flows implement layered bijective transformations $f_{\vtheta_l}$ between a target data distribution $p(\vv)$ and a base distribution $p(\vu)$ using learned parameters $\vtheta_l$. A trained model $f_{\vtheta}$ usually predicts an outcome $p_{\vtheta}(y | \vv)$ (right) or samples data using the learned $p_{\vtheta}(\vv | \vu)$ (left). When additional conditioning is needed to model $p(\vv, \vc)$, the conventional approach with concatenated vectors $[ \vv_l, \vc ]$ is limited in the type of supported bijections and lacks support for the \textit{generalist-specialist} training setup.}
		\label{fig:problem}
	\end{center}
\end{figure}

Current research on normalizing flows mostly aims to improve their likelihood estimation and sampling in various data domains \citep{Kobyzev2020NormalizingFA}. However, the conditioning in these models is a largely overlooked area. In particular, the conditioning is typically limited to concatenation of input data and context vectors \citep{lu2020structured}. At the same time, recent works on diffusion models show benefits of a more sophisticated ControlNet-style context-conditioning \citep{Zhang_2023_ICCV} with the \textit{generalist-specialist} setup.

\begin{table*}[ht!]
	\caption{Previous (top) and the proposed (bottom) conditional bijections. A conditioning network (CN) processes contexts $\vc$. Previous methods either concatenate CN outputs with the internal RealNVP vectors for neural network (NN) processing or use only CN outputs in bijections. ContextFlow decouples CN and NN outputs using the additive operation while preserving the bijection property. Symbols indicate: $\odot$ for element-wise multiplication, $\oslash$ for element-wise division, \cmark~for ``yes'' and \xmark~for ``no''.}
	\label{tab:prior}
	\centering
	\small
	\begin{tabular}{c|c|c|c|c}
		\toprule
		\multirow{2}{*}{\shortstack{Conditional \\ Transformation}} & \multirow{2}{*}{\shortstack{Inverse \\ $ f^{-1} : \vv,\vc \rightarrow \vu $}} & \multirow{2}{*}{\shortstack{Forward \\ $ f : \vu,\vc \rightarrow \vv $}} & \multirow{2}{*}{\shortstack{Bijective}} & \multirow{2}{*}{\shortstack{Generalist \\ -specialist}} \\
		& & & & \\
		\midrule
		\multirow{3}{*}{\shortstack{RealNVP coupling\\ \citep{winkler2019learning}, \\ \citep{ardizzone2019guided}}} & 
		\multirow{3}{*}{\shortstack{$\left[\vv_a, \vv_b \right] = \osplit(\vv)$ \\ 
				$\left[\vs, \vt\right] = \neuralNet(\left[ \vv_b, \condneuralNet(\vc) \right] )$ \\
				$\vu = \left[\vs \odot \vv_a + \vt, \vv_b \right] $}} & 
		\multirow{3}{*}{\shortstack{$\left[\vu_a, \vu_b\right] = \osplit(\vu)$ \\ 
				$\left[\vs, \vt\right] = \neuralNet(\left[ \vu_b, \condneuralNet(\vc) \right] )$ \\
				$\vv = \left[(\vu_a - \vt) \oslash \vs, \vu_b \right] $}}  & \multirow{3}{*}{\cmark} & \multirow{3}{*}{\xmark} \\
		& & & & \\
		& & & & \\
		& & & & \\
		\multirow{2}{*}{\shortstack{Actnorm                \\ \citep{lu2020structured}}} & 
		\multirow{2}{*}{\shortstack{$\left[\vs, \vt\right] = \osplit (\condneuralNet(\vc))$ \\
				$\forall i,j: \vu_{i,j} = \vs \odot \vv_{i,j} + \vt$ }} &
		\multirow{2}{*}{\shortstack{$\left[\vs, \vt\right] = \osplit (\condneuralNet(\vc))$ \\
				$\forall i,j: \vv_{i,j} = (\vu_{i,j} - \vt) \oslash \vs$ }}             & \multirow{2}{*}{\cmark} & \multirow{2}{*}{\xmark} \\
		& & & & \\
		& & & & \\
		\multirow{2}{*}{\shortstack{Conv$^{-1}_{1\times1}$ \\ \citep{lu2020structured}}} &
		\multirow{2}{*}{\shortstack{$\mW_c = \condneuralNet(\vc)$ \\
				$\forall i,j: \vu_{i,j} = \mW_c \vv_{i,j}$ }} &
		\multirow{2}{*}{\shortstack{$\mW_c = \condneuralNet(\vc)$ \\
				$\forall i,j: \vv_{i,j} = \mW_c^{-1} \vu_{i,j}$ }}                      & \multirow{2}{*}{\cmark} & \multirow{2}{*}{\xmark} \\
		& & & & \\
		& & & & \\
		\midrule
		\multirow{3}{*}{\shortstack{RealNVP coupling\\ (ours)}} & 
		\multirow{3}{*}{\shortstack{$\left[\vv_a, \vv_b\right] = \osplit(\vv)$ \\ 
				$\left[\vs, \vt\right] = \neuralNet(\vv_b) + \condneuralNet(\vc)$ \\
				$\vu = \left[ \vs \odot \vv_a + \vt, \vv_b \right] $}} & 
		\multirow{3}{*}{\shortstack{$\left[\vu_a, \vu_b\right] = \osplit(\vu)$ \\ 
				$\left[\vs, \vt\right] = \neuralNet(\vu_b) + \condneuralNet(\vc)$ \\
				$\vv = \left[(\vu_a - \vt) \oslash \vs, \vu_b \right] $}}  & \multirow{3}{*}{\cmark} & \multirow{3}{*}{\cmark} \\
		& & & & \\
		& & & & \\
		& & & & \\
		\multirow{2}{*}{\shortstack{Actnorm\\ (ours)}} &
		\multirow{2}{*}{\shortstack{$\left[\vs, \vt\right] = \osplit ([\vs, \vt]_\vv + \condneuralNet(\vc))$ \\
				$\forall i,j: \vu_{i,j} = \vs \odot \vv_{i,j} + \vt$ }} &
		\multirow{2}{*}{\shortstack{$\left[\vs, \vt\right] = \osplit ([\vs, \vt]_\vv + \condneuralNet(\vc))$ \\
				$\forall i,j: \vv_{i,j} = (\vu_{i,j} - \vt) \oslash \vs$ }}             & \multirow{2}{*}{\cmark} & \multirow{2}{*}{\cmark} \\
		& & & & \\
		& & & & \\
		\multirow{2}{*}{\shortstack{Conv$^{-1}_{1\times1}$\\ (ours)}} &
		\multirow{2}{*}{\shortstack{$\mW_{g,c} = \mW_g + \mW_\condneuralNet(\vc)$ \\
				$\forall i,j: \vu_{i,j} = \mW_{g,c} \vv_{i,j}$ }} &
		\multirow{2}{*}{\shortstack{$\mW_{g,c} = \mW_g + \mW_\condneuralNet(\vc)$ \\
				$\forall i,j: \vv_{i,j} = \mW_{g,c}^{-1} \vu_{i,j}$ }}                  & \multirow{2}{*}{\cmark} & \multirow{2}{*}{\cmark} \\
		& & & & \\
		& & & & \\
		\bottomrule
	\end{tabular}
\end{table*}

Consider the practical setup in Figure~\ref{fig:problem}, where a \textit{generalist} model $f_\vtheta$ is trained on large-scale data $\vv$ and incorporates general knowledge about the data distribution $p(\vv)$. Assume the task is to implement a probabilistic classifier $p_{\vtheta}(y | \vv, \vc)$ or a conditional sampler $p_{\vtheta}(\vv | \vu, \vc)$, where the context $\vc$ carries additional context-specific knowledge. Then, a set of \textit{specialists} can be learned using small-scale data from the empirical distribution $p(\vv, \vc)$.

The conventional approach \citep{winkler2019learning} concatenates intermediate representations and the context $\vc$ inside the RealNVP coupling blocks \citep{45819}. This preserves RealNVP invertibility, but limits the type of supported bijections. In addition, this method cannot fully support a generalist-specialist setup where the context vector $\vc$ is missing during the generalist learning phase and is introduced only later for domain-specific specialist training. Hence, it is unable to explicitly decouple the general and domain knowledge, which is desired for complexity optimizations \citep{hu2022lora} and practical applications \citep{Zhang_2023_ICCV}.

To address the above limitations, we approach the tasks in the Figure~\ref{fig:problem} setup as follows. First, a generalist model is trained with a large-scale dataset to approximate $p(\vv)$ \textit{without a-priori assumptions on the conditioning context}. Second, a set of specialist models with the defined context representations are learned with the \textit{fixed generalist parameters} using small-scale training sets for each specialist. Hence, we explicitly \textit{decouple the generalist knowledge and a set of context-specific specialists} in the proposed ContextFlow++ model. Practical contexts $\vc$ are usually represented by discrete or mixed-precision variables, which poses a difficulty because the conventional flow framework supports only continuous variables.
To overcome this difficulty, we use either embedding-based or variational dequantization methods implemented as sampling from an introduced encoding flow model. In summary, our contributions are as follows:
\begin{itemize}
	\itemsep0em
	\item We propose a general approach to support additive context-conditioning for the generalist-specialist setup in bijective normalizing flow transformations.
	\item We address mixed-variable input data and contexts that are common in practical applications using the proposed ContextFlow++ architecture.
	\item Experiments show advantages of our ContextFlow++ approach in image classification, time series predictive maintenance and unsupervised anomaly detection.
\end{itemize}

\section{Related Work}\label{sec:related}
\textbf{Normalizing flow architectures.} Normalizing flows \citep{Kobyzev2020NormalizingFA, JMLR:v22:19-1028} have largely developed in the direction of increasing their expressivity and, hence, improving density estimation or sampling. For example, recent works propose continuous flows with the mappings obtained by solving a neural ordinary differential equation (ODE) \citep{node, grathwohl2018scalable} or process data with the manifold assumptions \citep{Postels, pmlr-v162-cunningham22a} or both \citep{chen2023riemannian}. Our work is complementary to these more advanced models since we only consider the problem of context-conditioning.

The Flow++ architecture \citep{flowpp} proposes a bijective transformation that models the cumulative distribution function of a mixture with a fixed number of components. Each component is a distribution parameterized by neural network outputs. Such an approach is partially related to ours because it implicitly models a mixture of densities at each bijection. Another related work proposes a semi-supervised learning setup using a latent Gaussian mixture (FlowGMM) \citep{izmailov2020semi}. We employ a FlowGMM-type method to predict an outcome $p(y | \vv, \vc)$ for a discrete class $y$ that is independent of our context modeling goal.

\textbf{Context-conditioned flows.} In practice, conditioning is a very important feature of normalizing flow models, but research in that area has been scarce. The seminal works \citep{winkler2019learning, ardizzone2019guided} propose to concatenate internal representations for a specific type of flow bijection \ie the RealNVP couplings \citep{45819} with invertible conditioning. \citet{lu2020structured} extend conditioning to Glow bijections \citep{NIPS2018_8224} \ie the conditional actnorm and Conv$^{-1}_{1\times1}$ layers, where conditioning is performed by a separate discriminative neural network applied to the context vector.

The above approaches have been widely adopted in many popular applications. For example, super-resolution images with rescaling can be generated using a hierarchical conditional flow model \citep{liang21hierarchical} with feature-extracted context vectors. Unsupervised anomaly detection with segmentation can be improved by conditioning on positional encoding vectors \citep{cflow_ad}. Time series forecasting is performed by conditioning a flow model on the outputs of a recurrent network \citep{rasul2021multivariate} or by multi-scale transformer-based attention with positional encoding \citep{10265130}. Because the conventional conditioning methods cannot support the generalist-specialist setup, we aim to introduce a \textit{generic and principled alternative} to the effective yet limited concatenation-style conditioning.

\textbf{Discrete distribution modeling.} In Figure~\ref{fig:problem} we can have two distinct cases: the context vector $\vc$ is represented by continuous variables or discrete variables. The former can be directly supported by the ContextFlow conditioning. However, the latter is more common in practice and requires additional (ContextFlow++) processing. 
Discrete densities can be converted to continuous ones by adding noise \citep{NIPS2013_53adaf49, Theis2015d} or by using variational dequantization \citep{flowpp}. The $\argmax$ variational method \citep{hoogeboom2021argmax} additionally compresses discrete variables. Recent Voronoi dequantization \citep{chen2022semidiscrete} learns quantization boundaries with an exact likelihood.

Another line of research processes discrete-only variables \citep{NIPS2019_9612} and models continuous to discrete mappings \citep{sidheekh2022vqflows}. In this paper, we support mixed-variable contexts by the conventional flow framework, where categorical variables are mapped to continuous ones and added to the overall context. More complex contexts such as relational graphs with linked discrete variables \citep{relbench} can be an avenue for future research.

\textbf{Discrete representations in other models.} In contrast to the specific task of discrete distribution modeling by normalizing flows in this paper, the general topic of using discrete representations has been pioneered by \citet{Bengio2013EstimatingOP}, who introduce the straight-through gradient estimator to learn discrete (quantized) representations in discriminative models. Then, the vector-quantized VAE (VQ-VAE) \citep{NIPS2017_7a98af17} with a discretized encoder latent space employs such an estimator to avoid posterior collapse in the generative VAE model. Later, this approach was widely applied to other generative models such as generative adversarial networks \citep{Esser_2021_CVPR} and diffusion models \citep{hoogeboom2021argmax}. Recent methods \citep{bond2021unleashing, Chang_2022_CVPR} rely on the transformer architecture to learn a codebook that is indexed by discrete indices, which enables controllable data (\eg image) synthesis and manipulation. Though recent continuous normalizing flows \citep{lipman2023flow} can compete with diffusion models in synthesis, we, for simplicity, consider the task of mixed-variable density estimation using finite normalizing flow architectures in this paper. In that task, VAEs and diffusion models are unable to estimate exact data likelihoods even for continuous data variables.

\section{Preliminaries}\label{sec:theory}

\subsection{Normalizing flow framework}\label{sec:pcont}
Normalizing flows \citep{rezende15} relate a target density $p_V$ of data vectors $\vv \in \gV = \sR^D$ to a base density $p_U$ of vectors $\vu \in \gU = \sR^D$ through a \textit{bijective} and \textit{differentiable} transformation $f : \gU \rightarrow \gV $. By the change-of-variable formula, at any point
\begin{equation} \label{eq:f1}
	p_{V}(\vv) = p_{U}(\vu) \left| \det \partial \vu / \partial \vv^{T} \right|,~\textrm{and}~\vu = f^{-1}_{\vtheta}(\vv),
\end{equation}
where a base random variable $\vu$ can be from a standard Gaussian or a parameterized distribution. The normalizing flow model $f_{\vtheta}$ with $\vtheta$ parameters is typically implemented as a sequence of tractable transformations.

The closed-form expression in (\ref{eq:f1}) allows exact density functions to be learned. However, this conventional flow framework is restricted to \textit{continuous} $p_{V} (\vv)$ densities only.
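
As a minimal worked illustration of (\ref{eq:f1}), assume a single element-wise affine layer with a scale $\vs \in \sR^D$ (all $s_i \neq 0$) and a shift $\vt \in \sR^D$; this layer is chosen here only as an example. The inverse map is $\vu = \vs \odot \vv + \vt$ with element-wise multiplication $\odot$, its Jacobian is diagonal, and the $\log$-density follows as
\begin{equation*}
	\log p_{V}(\vv) = \log p_{U}(\vs \odot \vv + \vt) + \sum\nolimits_{i=1}^{D} \log \left| s_i \right| .
\end{equation*}
For a stack of such layers the $\log$-determinant terms add up, which is what keeps the likelihood of a layered flow tractable.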

\subsection{Dequantization methods}\label{sec:pdeq}
To learn densities $P(\vx)$ of \textit{discrete} variables, it is common to apply a \textit{surjective} \citep{nielsen2020survae} transformation $g : \gX \rightarrow \gV $ that is deterministic in one direction $(\vx = g^{-1}_{\vlambda}(\vv))$ and stochastic in the other $(\vv \sim q_\vlambda(\vv | \vx))$ using a dequantization distribution $q_\vlambda(\vv | \vx)$ with parameters $\vlambda$~\ie~the dequantization model. Then, the discrete density can be written using Dirac $\delta$-function as
\begin{equation} \label{eq:d1}
	P (\vx) = \int P (\vx | \vv) p(\vv) d\vv, \quad P (\vx | \vv) = \delta (\vx = g^{-1}_{\vlambda}(\vv)).
\end{equation}

With a \textit{surjective} encoder $g_{\vlambda}(\vx)$ in (\ref{eq:d1}), the $\log$-likelihood is bounded from below by the evidence lower bound (ELBO) under $q_\vlambda(\vv | \vx)$ as
\begin{equation} \label{eq:d2}
	\log P_\vlambda (\vx) \ge \E_{\vv \sim q_\vlambda(\vv | \vx)} \left[ \log p(\vv) - \log q_\vlambda (\vv | \vx) \right],
\end{equation}
where the ELBO holds for the support $\gS=\{ \vv \in \R^D : \vx = g^{-1}_{\vlambda}(\vv)\}$ such that $P (\vx | \vv) = 1$ in that support region.
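
The bound in (\ref{eq:d2}) can be recovered from (\ref{eq:d1}) in one step. As a sketch, assuming that the support of $q_\vlambda(\vv | \vx)$ is exactly $\gS$, importance weighting followed by Jensen's inequality gives
\begin{align*}
	\log P(\vx) &= \log \E_{\vv \sim q_\vlambda(\vv | \vx)} \left[ \frac{P(\vx | \vv)\, p(\vv)}{q_\vlambda(\vv | \vx)} \right] \\
	&\ge \E_{\vv \sim q_\vlambda(\vv | \vx)} \left[ \log P(\vx | \vv) + \log p(\vv) - \log q_\vlambda(\vv | \vx) \right],
\end{align*}
and the first term inside the expectation vanishes because $P(\vx | \vv) = 1$ on $\gS$.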

Typical surjection choices are the rounding operation $\lfloor \vv \rfloor$ \citep{NIPS2013_53adaf49, Theis2015d, flowpp}, the $\argmax (\vv)$ operation \citep{hoogeboom2021argmax} or a set identification function $\R^D \rightarrow \{1,\dots,K\}$ \citep{chen2022semidiscrete}.

Then, \textit{discrete variables} can be processed using the generic framework in (\ref{eq:d1},\ref{eq:d2}) by choosing the appropriate $g_{\vlambda}$ with a corresponding dequantization model $q_\vlambda(\vv | \vx)$, at the expense of obtaining an ELBO estimate rather than the exact likelihood. The dequantization model can simply add noise from a uniform distribution \citep{NIPS2013_53adaf49, Theis2015d}, or sample from a flow model that implements the variational distribution $q_\vlambda(\vv | \vx)$ \citep{flowpp, hoogeboom2021argmax}. Unlike the variational approaches with the ELBO estimate of the discrete density, the use of disjoint subsets provides an exact likelihood \citep{chen2022semidiscrete}.
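
As a worked special case of (\ref{eq:d2}), consider rounding, $g^{-1}(\vv) = \lfloor \vv \rfloor$, with uniform noise $\vv = \vx + \bm{\epsilon}$, where the noise symbol $\bm{\epsilon} \sim \mathcal{U}[0,1)^D$ is introduced here only for illustration. The dequantization density then equals one on its support, so $\log q(\vv | \vx) = 0$ and the bound reduces to
\begin{equation*}
	\log P(\vx) \ge \E_{\bm{\epsilon} \sim \mathcal{U}[0,1)^D} \left[ \log p(\vx + \bm{\epsilon}) \right].
\end{equation*}
No dequantization parameters are learned in this case, which explains its low complexity compared to the variational alternatives.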

\subsection{Conventional conditional flows}\label{sec:pcond}
In the conditional setting, there is an additional context vector $\vc \in \gC$ with the $ p_C $ density. Often, the context is given by discrete variables, which can be addressed by the methods in Section~\ref{sec:pdeq} and the framework proposed in Section~\ref{sec:prop}.

Then, assuming continuous data and context vectors, we can rewrite (\ref{eq:f1}) for the joint $\log$-likelihood as
\begin{equation} \label{eq:c1}
	\log p_{\vtheta} (\vv, \vc) = \log p_{\vgamma} (\vu) + \sum\nolimits^L_{l=1} \log \left| \det \mJ_l \right|,
\end{equation}
where $\vu = f^{-1}_{\vtheta}(\vv;\vc)$, the Jacobian matrices $\mJ_l$ are sequentially calculated for the $l^{\textrm{th}}$ transformation $f^{-1}_{\vtheta_l}$, and $\vgamma$ represents parameters of the base distribution.

Next, we formally summarize previously proposed conditional bijections for (\ref{eq:c1}) in Table~\ref{tab:prior} (top). Currently, they are limited to RealNVP coupling bijections, Glow's activation normalization and Conv$^{-1}_{1\times1}$ layers. A conditional neural network (CN) with the context input estimates bijection parameters. These parameters are either directly used (Actnorm and Conv$^{-1}_{1\times1}$) or concatenated with the intermediate vectors (RealNVP) to calculate bijection's output. 
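
To make the Jacobian terms in (\ref{eq:c1}) concrete, consider the conditional Actnorm row of Table~\ref{tab:prior}. As a sketch, assume a feature map with $H \times W$ spatial positions $(i,j)$ and $n$ channels; these sizes are notation introduced only for this example. The same scale $\vs$ is applied at every position, hence
\begin{equation*}
	\log \left| \det \mJ_l \right| = H W \sum\nolimits_{k=1}^{n} \log \left| s_k \right|, \quad \left[\vs, \vt\right] = \osplit(\condneuralNet(\vc)),
\end{equation*}
and, analogously, $\log \left| \det \mJ_l \right| = H W \log \left| \det \mW_c \right|$ for the conditional Conv$^{-1}_{1\times1}$ layer. In both cases the $\log$-determinant depends on the data only through the context $\vc$.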

Conventional methods are viable when assumptions about $\vc$ are known and there is access to $p(\vv, \vc)$ data. However, often practitioners employ context information after learning on $p(\vv)$ data. For example, a pretraining step with large-scale data can be performed to extract \textit{general} knowledge followed by context-conditional \textit{specialist} training with fixed generalist parameters. Moreover, in some situations we cannot anticipate what type of context information can be useful for a task or metric. Hence, we propose a framework where data and context modeling is explicitly decoupled.

\section{Proposed Method}\label{sec:prop}

\subsection{Additive context for specialists}\label{sec:additive}
Consider a modified setup for the task in (\ref{eq:c1}), where the generalist model $f^{-1}_{\vtheta_g}$ is pretrained using the conventional objective (\ref{eq:f1}) to estimate $\log p_{\vtheta_g} (\vv)$. We are interested in improving the generalist likelihood estimates for each specific context without modifying its $\vtheta_g$ parameters.

With the exception of masked autoregressive flows \citep{maf}, it is common to model elements $v_i$ of an input data vector $\vv= [ v_1, \ldots, v_D ]$ as independent variables in a single bijection layer \citep{JMLR:v22:19-1028}. Then, it is presumed that a sufficient number of bijection layers with $v_i$ permutations leads to an accurate joint likelihood estimate for $\vv$. Similarly, we can assume that the context vector $\vc$ is independent of the data vector $\vv$ within a bijection and can rewrite the joint $\log$-likelihood (\ref{eq:c1}) as a sum
\begin{equation} \label{eq:a1}
	\log p_{\vtheta_{g,c}} (\vv, \vc) = \log p_{\vtheta_g} (\vv) + \log p_{\vtheta_c} (\vc),
\end{equation}
where the generalist parameters $\vtheta_g$ are fixed after the pretraining and only the specialist parameters $\vtheta_c$ are learned.

\begin{figure}[t!]
	\centering
	\includegraphics[width=0.9\columnwidth]{figs/fig-cond-scheme.pdf}
	\caption{Our high-level scheme. Mixed-variable inputs and contexts are represented by vectors $\vx_{g,c}^{\sR,\sZ}$. First, the data encoder $g_{\vlambda_g}$ and decoder $f^{-1}_{\vtheta_g}$ are learned during the large-scale generalist pretraining step. Next, the specialist context encoder $g_{\vlambda_c}$ and the extended decoder parameters $f^{-1}_{\vtheta_c}$ are learned with small-scale data. Generative encoders convert discrete variables into continuous data $\vv$ and context $\vc$ vectors. A distributional model $h(\vgamma_{g,c})$ also supports such two-step training and outputs likelihood $p(y | \vx)$ estimates.}
	\label{fig:scheme}
\end{figure}

With the (\ref{eq:a1}) assumption for (\ref{eq:c1}), we propose a density transformation approach with additive $\log$-likelihood contributions. In particular, consider a generalized bijection
\begin{equation} \label{eq:a2}
	\vu = \mM \left(\mW_g + \mW_c \right) \vv,~\vv = \left(\mM \left(\mW_g + \mW_c \right)\right)^{-1} \vu,
\end{equation}
where matrices $\mW_{g,c}$ are parameters for any flow type \citep{Kobyzev2020NormalizingFA} \eg element-wise or linear. RealNVP and autoregressive couplings with an appropriate binary mask $\mM$ can be implemented with $\mW_g = \neuralNet(\vv)$ and $\mW_c = \condneuralNet(\vc)$. Since (\ref{eq:a2}) implements a linear combination, its Jacobian terms are additive: $\partial \vu / \partial \vv^T = \mM \mW_g + \mM \mW_c$ with decoupled likelihood contributions $\left| \det \mJ_g \right|$ and $\left| \det \mJ_c \right|$.

\begin{figure*}[]
	\centering
	\includegraphics[width=0.7\textwidth]{figs/fig-cond-arch.pdf}
	\caption{Our detailed ContextFlow++ architecture with mixed-variable data and context encoders $g_{\vlambda_{g,c}}$ that are implemented as sampling from the surjective flow model with various discrete-variable mapping and distribution options. The bijective flow decoder performs likelihood estimation using the encoder's $\vv$ input during the generalist training step with $\vtheta_g$ parameters. It is followed by context-specific specialist training with $\vtheta_c$ parameters using sampled contexts $\vc$. A distributional model $h(\vgamma_{g,c})$ implements the task's probabilistic classifier and outputs $p(y | \vx)$ likelihood estimates.}
	\label{fig:arch}
\end{figure*}

We apply our additive context approach to several common bijections using (\ref{eq:a2}). Table~\ref{tab:prior} (bottom) contains examples for Glow-type finite flows. The main difference between previous and our transformations is the \textit{explicit separation of generalist and specialist processing}.
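
As a worked instance of the additive RealNVP row in Table~\ref{tab:prior}, write the two network outputs as $\neuralNet(\vv_b) = [\vs_g, \vt_g]$ and $\condneuralNet(\vc) = [\vs_c, \vt_c]$; the subscripted symbols are introduced here only for illustration. Under this notation the coupling becomes
\begin{equation*}
	\vu_a = (\vs_g + \vs_c) \odot \vv_a + (\vt_g + \vt_c), \quad \vu_b = \vv_b,
\end{equation*}
which remains invertible whenever every element of $\vs_g + \vs_c$ is nonzero, because $\vv_b = \vu_b$ and $\vc$ are available in both directions. Setting $\condneuralNet(\vc) = \bm{0}$ recovers the generalist coupling exactly, so the specialist only adds a context-dependent correction to the fixed generalist parameters.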

\subsection{Encoding mixed-variable data}\label{sec:mixv}
As discussed in Section~\ref{sec:pdeq}, input data can be heterogeneous \ie represented by continuous ($\vx^\sR \in \sR$) or discrete ($\vx^\sZ \in \sZ$) variables. This is especially relevant to the context vectors, which often contain information represented by integers such as user preferences, sensor configurations, a product's geographical location \etc~Therefore, we propose to extend the conventional normalizing flow architecture (bijective decoder) by an additional encoding step as shown in Figure~\ref{fig:scheme}. The encoding step for continuous inputs is optional and can contain data preprocessing \eg normalization. However, it is essential for discrete inputs to use either embedding-based \citep{gorishniy2022on} mappings or one of the dequantization methods from Section~\ref{sec:pdeq}.

In this paper, we also investigate trade-offs of various methods for mapping discrete variables to continuous ones when applied to the probabilistic flow framework. First, a common practical solution is to use a differentiable embedding to look up a learnable continuous-space vector by a discrete index. There are many variants of this approach as described by \citet{gorishniy2022on}. This simple deterministic lookup method can be extended to stochastic sampling from a learnable distribution $q_\vlambda(\vv | \vx)$. The latter can be seen as a special case of dequantization.

Second, low-complexity uniform \citep{NIPS2013_53adaf49, Theis2015d} dequantization is a popular choice for images and audio sequences. It typically works well for discrete data with relatively high cardinality (\eg 8-bit variables have a cardinality of 256). Thus, we adopt uniform dequantization in our experiments for input data to avoid high complexity.

Variational dequantization methods with rounding \citep{flowpp} and $\argmax$ operations \citep{hoogeboom2021argmax} offer more accurate parametric mappings for low-cardinality categorical data. Let the $i^\textrm{th}$ variable $\vx^\sZ_i \in \sZ_i = \{ 1, 2, \ldots, K_i \}^D$ represent a discrete vector with $K_i$ categories. Dropping the index $i$ for convenience, $\vx$ is the input to the parametric encoder $g_\vlambda$. The encoder implements a \textit{surjective} mapping $g : \sZ \rightarrow \sR$ between discrete $\vx$ and continuous $\vv$. Then, our encoder outputs $\vv \in \sR^{D}$ for variational dequantization and $\vv \in \sR^{D \times K}$ for the $\argmax$ method.

Furthermore, we experiment with the variational method with rounding that maps $\vx$ to a one-hot binary representation $\vv \in \sR^{D \times K}$ \citep{gorishniy2022on}. Additionally, a na\"ive implementation of the $\argmax$ approach has significant complexity due to the large $\vv \in \sR^{D \times K}$ vectors for each $\vx^D$. To reduce complexity, we apply Cartesian product compression as in \citet{hoogeboom2021argmax}. Then, the number of dimensions needed to encode each discrete variable is lowest for the $\log_2$ (binary) representation. This approach encodes categorical discrete variables to $\vv \in \sR^{D \times \log_2 K}$ outputs.
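
As a rough size comparison under assumed values that are not taken from the experiments, let $D = 8$ discrete variables have $K = 16$ categories each. The encoder output $\vv$ then has
\begin{equation*}
	D = 8, \quad D \times K = 128, \quad D \times \log_2 K = 32
\end{equation*}
elements for variational dequantization, the uncompressed $\argmax$ method, and the binary $\argmax$ representation, respectively. When $K$ is not a power of two, one would presumably need $\lceil \log_2 K \rceil$ dimensions per variable; this rounding is an assumption made here for the example.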

To summarize, we propose a mixed-variable probabilistic architecture that supports various kinds of input and context data, with details shown in Figure~\ref{fig:arch}. Our encoding step for discrete variables can be implemented with different types of vector mappings followed by several dequantization and embedding-based methods as described above. Effectively, we \textit{generalize the encoder as a surjective normalizing flow model} using the stochastic right inverse $g_\vlambda$. Therefore, the encoder implements variational methods by sampling from the parameterized distribution $q_\vlambda(\vv | \vx)$ and, additionally, can contain flow transformations to be more expressive when generating continuous variables. At the same time, our ContextFlow++ decoder extends the conventional bijective flow model that performs the deterministic inverse $f^{-1}_\theta$.

\subsection{Overall architecture}\label{sec:arch}
Section~\ref{sec:mixv} gives details about the mapping and $q_\vlambda(\vv | \vx)$ sampling variants in our context encoder presented in Figure~\ref{fig:arch}. In addition, our context encoder and decoder each contain $B$ bijective blocks. Each block consists of a \textit{squeeze} layer, a sequence of $L$ sub-blocks and an optional \textit{split prior} layer \citep{nielsen2020survae}. The \textit{squeeze} layer reduces each spatial or temporal dimension by a factor of two and, correspondingly, increases the data dimensions. The \textit{split prior} layer reduces the data dimensions by a factor of two and applies a distributional model to one half of them; it is used only for the CIFAR-10C and ATM datasets in Section~\ref{sec:eval}.
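
As an illustrative shape calculation for one block (a sketch assuming a channel-first tensor layout; the symbols $C$, $H$, $W$ and $T$ for channels, height, width and sequence length are introduced here only for illustration), the two layers act as
\begin{equation*}
\begin{aligned}
	\text{squeeze (image):} \quad & (C, H, W) \mapsto (4C, H/2, W/2), \\
	\text{squeeze (sequence):} \quad & (C, T) \mapsto (2C, T/2), \\
	\text{split prior:} \quad & (C, \cdot) \mapsto (C/2, \cdot).
\end{aligned}
\end{equation*}
For example, a $3 \times 32 \times 32$ image becomes $12 \times 16 \times 16$ after the squeeze layer and $6 \times 16 \times 16$ after the split prior layer, with the other $6 \times 16 \times 16$ elements scored by the distributional model.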

Each of the $L$ sub-blocks contains our modified Glow-type transformations from Table~\ref{tab:prior}. To be precise, the sub-blocks function as conventional transformations during the generalist training step. Then, the context-specific processing is added during the specialist learning step, while the generalist parameters are fixed. The neural networks in our coupling layers can be either convolutional or based on the ViT transformer \citep{dosovitskiy2021an}. We also permute the data and temporal axes for the time series ATM dataset in the Section~\ref{sec:eval} experiments using the \textit{permute axes} layer.

Our distributional model $h(\vgamma_{g,c})$ with a diagonal Gaussian base distribution implements the FlowGMM-style probabilistic classifier \citep{izmailov2020semi, flowenedet} with 8 mixture components and $M$ classes for each outcome $y=m~(m = 1 \ldots M)$ and a corresponding set of learnable parameters: means, variances and weights. It also supports separate modeling of generalist and context-specific distributions using two sets of the above parameters. The distributional model outputs $p(\vx, \vc | y)$ likelihood estimates that are used in the loss function. To support the semi-supervised setting, we use a loss that consists of supervised cross-entropy and unsupervised terms, expressed by
\begin{equation} \label{eq:l1}
\begin{split}
	\mathcal{L} = -\frac{1}{|\sN|} \sum\nolimits_{i\in\sN} & [ \log \mathrm{softmax}~\log p(\vx_i, \vc_i | y_i = m ) \\
	& + \alpha \log \sum\nolimits_{m} p(\vx_i, \vc_i | y_i = m ) ],
\end{split}
\end{equation}
where $\sN$ is the training set, $\mathrm{softmax}$ computes the classifier's predictions $p(y_i | \vx_i, \vc_i)$ and the hyperparameter $\alpha=$1e-3. In the unsupervised experiments ($M=1$), the first term in (\ref{eq:l1}) is omitted and only the last term is retained with $\alpha=$1.
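
To make the first term explicit (a short derivation that assumes a uniform class prior $p(y_i = m) = 1/M$, which is not stated above), the softmax over the per-class log-likelihoods is Bayes' rule:
\begin{equation*}
\begin{split}
	\mathrm{softmax}_m \left[ \log p(\vx_i, \vc_i | y_i = m) \right] & = \frac{p(\vx_i, \vc_i | y_i = m)}{\sum\nolimits_{m'} p(\vx_i, \vc_i | y_i = m')} \\
	& = p(y_i = m | \vx_i, \vc_i).
\end{split}
\end{equation*}
Under the same assumption, the logarithm in the second term equals $\log p(\vx_i, \vc_i) + \log M$, \ie the marginal log-likelihood up to an additive constant.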

\begin{table*}[t]
	\caption{Small-scale image classification benchmark using MNIST-R with 64 rotations. Each rotation represents a conditioning context. 
		The \fst{best} and the \scd{second best} top-1 accuracy ($\mu_{\pm\sigma}$, \%) results are highlighted. The generalist model experiences a 2.8 \pp~accuracy drop when adding image rotations. 
		The prior context-conditioned model \citep{lu2020structured} trained from scratch and our ContextFlow++ trained with the fixed generalist parameters show similar accuracy gains.}
	\label{tab:mnistr}
	\centering
	\small
	\begin{tabular}{c|c|cc|cc|cc}
		\toprule
		\multirow{2}{*}{\shortstack{Context Encoder $\rightarrow$ \\ Model $\downarrow$}} & \multirow{2}{*}{\shortstack{Fixed\\Generalist}} & \multicolumn{2}{c|}{Integer} & \multicolumn{2}{c|}{One-hot binary} & \multicolumn{2}{c}{Learned embedding} \\
		& & uniform & $\argmax$ & uniform & variational & deterministic & stochastic  \\
		\midrule
		Generalist$_{\textrm{with rot.}}$ &        & \multicolumn{6}{c}{w/o $\rightarrow$ with rotations: 98.9\tiny$\pm$0.1~\small~$\rightarrow$~96.1\tiny$\pm$0.2~\small~(2.8 \pp~drop)}   \\
		\citet{lu2020structured}          & \xmark & 97.6\tiny$\pm$0.1 & \scd{97.7}\tiny$\pm$0.1 & 97.4\tiny$\pm$0.1 & \fst{97.8}\tiny$\pm$0.1 & \scd{97.7}\tiny$\pm$0.1 & \fst{97.8}\tiny$\pm$0.1 \\
		ContextFlow++ (ours)              & \cmark & 97.7\tiny$\pm$0.1 & \scd{97.8}\tiny$\pm$0.1 & 97.6\tiny$\pm$0.1 & \fst{97.9}\tiny$\pm$0.1 & \fst{97.9}\tiny$\pm$0.1 & \scd{97.8}\tiny$\pm$0.1 \\
		\bottomrule
	\end{tabular}
\end{table*}

\begin{table*}[t]
	\caption{Larger-scale image classification benchmark using CIFAR-10C with corruptions. The corruption type and its severity define a 2-dimensional conditioning context. 
		The \fst{best} and the \scd{second best} top-1 accuracy ($\mu_{\pm\sigma}$, \%) results are highlighted. The generalist model experiences a 6.6 \pp~accuracy drop when adding image corruptions. 
		The prior method \citep{lu2020structured} cannot surpass even the generalist results. Our ContextFlow++ with the fixed general knowledge shows higher classification accuracy, in particular with the more advanced context encoders, \ie variational and embedding-based.}
	\label{tab:cifarc}
	\centering
	\small
	\begin{tabular}{c|c|cc|cc|cc}
		\toprule
		\multirow{2}{*}{\shortstack{Context Encoder $\rightarrow$ \\ Model $\downarrow$}} & \multirow{2}{*}{\shortstack{Fixed\\Generalist}} & \multicolumn{2}{c|}{Integer} & \multicolumn{2}{c|}{One-hot binary} & \multicolumn{2}{c}{Learned embedding}\\
		& & uniform & $\argmax$ & uniform & variational & deterministic & stochastic  \\
		\midrule
		Generalist                        &        & \multicolumn{6}{c}{w/o $\rightarrow$ with corruptions: 61.7\tiny$\pm$1.3~\small~$\rightarrow$~55.1\tiny$\pm$0.3~\small~(6.6 \pp~drop)}   \\
		\citet{lu2020structured}          & \xmark & 49.0\tiny$\pm$0.9 & \scd{51.2}\tiny$\pm$0.4 & 48.3\tiny$\pm$2.6 & 50.8\tiny$\pm$0.5 & \fst{52.5}\tiny$\pm$0.4 & 50.8\tiny$\pm$0.7 \\
		ContextFlow++                     & \cmark & 56.5\tiny$\pm$0.3 & 57.1\tiny$\pm$0.4 & 57.3\tiny$\pm$0.4 & \fst{57.7}\tiny$\pm$0.5 & \scd{57.4}\tiny$\pm$0.3 & 56.8\tiny$\pm$0.3 \\
		\bottomrule
	\end{tabular}
\end{table*}

\section{Experiments}
\label{sec:eval}
\subsection{Experiment Setup}
\label{subsec:setup}
\textbf{Benchmarks.} Although recent continuous flows \citep{lipman2023flow} can compete with diffusion models in $p(\vx | \vu)$ sampling or serve as a latent-space component in the sampling pipeline \citep{Davtyan_2023_ICCV}, we are mostly interested in modeling $p(y | \vx, \vc)$ predictions using well-established finite flow architectures \citep{NIPS2018_8224} from Table~\ref{tab:prior}. In particular, we experiment with discrete contexts and the generalist-specialist setup.

Hence, we select four benchmarks. First, we modify the small-scale MNIST classification task with $M=10$ classes by applying $c \sim \mathcal{U}\{0, 63\}$ random image rotations with $360^\circ/64$ discrete steps to all data splits. The resulting rotated MNIST-R defines a simple yet challenging task for conventional architectures without an inherent rotational invariance property.
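
For concreteness, the rotation angle associated with context $c$ can be written as follows (a sketch assuming that $c = 0$ denotes the unrotated image; the symbol $\phi_c$ is introduced here only for illustration):
\begin{equation*}
\phi_c = c \cdot \frac{360^\circ}{64} = 5.625^\circ \cdot c, \qquad c \in \{0, 1, \ldots, 63\},
\end{equation*}
so that, for example, $c = 16$ corresponds to a $90^\circ$ rotation and $c = 63$ to $354.375^\circ$.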

The second, larger-scale image classification benchmark is the widely used CIFAR-10C \citep{hendrycks2018benchmarking} with synthetic corruptions. We define a 2-dimensional context vector in CIFAR-10C as $\vc \sim \left[ \mathcal{U}\{1, 15\}, \mathcal{U}\{1, 5\} \right]$ that models the discretely sampled image corruption type (15) and its severity level (5), respectively. When applied to image classification, CIFAR-10C corruptions usually cause a significant drop in prediction accuracy.
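
As a quick count based on the ranges above, the context vector can take
\begin{equation*}
15 \times 5 = 75
\end{equation*}
distinct values. If each component were one-hot encoded separately (an assumption made here only for illustration), the context would occupy $15 + 5 = 20$ binary dimensions rather than the $75$ required by a joint one-hot code.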

Lastly, we employ two real-world time series benchmarks: supervised ATM machine failure prediction \citep{vargas2023hybrid} and unsupervised SMAP anomaly detection \citep{smap}. The ATM dataset contains 29,386 sequences collected from 68 deployed ATM machines, where each 144-length sequence has 38 data dimensions. The task is to predict an ATM failure within a one-week time frame using binary labels ($M=2$), and we use the ATM machine ID as a discrete context. The soil moisture active passive satellite (SMAP) dataset contains soil samples and telemetry information from the Mars rover with 135,183 and 427,617 data points in the training (without anomalies, $M=1$) and test sets, respectively. SMAP data has 25 data dimensions collected from 55 entities. We use the entity ID as a discrete context for our ContextFlow++. We follow \citet{omnianomaly} and transform the regression task into a classification task using sliding windows (window size = 8) and replication padding \citep{tranad}. Both datasets are imbalanced with $\approx 10\%$ of positive (failure or anomaly) labels.

% NEW with fixed splits:
\begin{table*}[h]
	\caption{Real-world ATM machine failure prediction with time series sensory data \citep{vargas2023hybrid}. Machine IDs define the conditioning context. The reference flow-based generalist model outperforms other baseline models. Our ContextFlow++ further improves performance metrics. The variational and deterministic embedding-based context encoders achieve the highest metrics. The \fst{best} and the \scd{second best} metric's ($\mu_{\pm\sigma}$, \%) results are highlighted.}
	\label{tab:comp_atm_full}
	\centering
	\small
	\begin{tabular}{c|ccc|c|c|ccc}
		\toprule
		\multirow{2}{*}{\shortstack{Model \\ Metric $\downarrow$}} &
		\multirow{2}{*}{\shortstack{random \\ forests}} &
		\multirow{2}{*}{\shortstack{HYDRA}} & 
		\multirow{2}{*}{\shortstack{XGBoost}} &
		\multirow{2}{*}{\shortstack{Embed. determ. \\ \citet{lu2020structured}}} &
		\multirow{2}{*}{\shortstack{ContextFlows $\rightarrow$ \\ Generalist Flow $\downarrow$}} &
		\multirow{2}{*}{\shortstack{Integer \\ $\argmax$}} & 
		\multirow{2}{*}{\shortstack{One-hot \\ variational}} & 
		\multirow{2}{*}{\shortstack{Embed. \\ determ.}} \\
		& & & & & & &  & \\
		\midrule
		Accuracy  & 94.39\tiny$\pm$0.6 & 81.3\tiny$\pm$0.5 & 96.7\tiny$\pm$0.2 & 96.7\tiny$\pm$0.3 & 97.2\tiny$\pm$0.5 & 98.1\tiny$\pm$0.3 & \scd{98.3}\tiny$\pm$0.2 & \fst{98.5}\tiny$\pm$0.1 \\
		Bal. Acc. & 73.71\tiny$\pm$3.0 & 74.1\tiny$\pm$0.4 & 85.3\tiny$\pm$1.2 & 91.9\tiny$\pm$1.6 & 91.5\tiny$\pm$1.6 & \scd{95.1}\tiny$\pm$1.1 & 95.0\tiny$\pm$0.6 & \fst{95.9}\tiny$\pm$0.2 \\
		AuROC     & 73.71\tiny$\pm$3.0 & 74.1\tiny$\pm$0.4 & 85.3\tiny$\pm$1.2 & 99.0\tiny$\pm$0.2 & 98.7\tiny$\pm$0.2 & \scd{99.4}\tiny$\pm$0.2 & \scd{99.4}\tiny$\pm$0.2 & \fst{99.6}\tiny$\pm$0.1 \\
		AP        & 51.73\tiny$\pm$5.5 & 23.8\tiny$\pm$0.5 & 71.0\tiny$\pm$1.9 & 93.2\tiny$\pm$1.4 & 92.7\tiny$\pm$1.5 & 96.0\tiny$\pm$0.8 & \scd{97.0}\tiny$\pm$0.3 & \fst{97.1}\tiny$\pm$0.5 \\
		F$_1$     & 63.68\tiny$\pm$5.3 & 42.0\tiny$\pm$0.7 & 81.5\tiny$\pm$1.5 & 84.3\tiny$\pm$1.6 & 86.3\tiny$\pm$2.7 & 90.9\tiny$\pm$1.4 & \scd{91.9}\tiny$\pm$0.8 & \fst{92.6}\tiny$\pm$0.6 \\
		MS        & 47.59\tiny$\pm$6.1 & 65.0\tiny$\pm$0.7 & 70.9\tiny$\pm$2.4 & 85.9\tiny$\pm$3.4 & 84.4\tiny$\pm$3.0 & \scd{91.2}\tiny$\pm$2.3 & 90.9\tiny$\pm$1.4 & \fst{92.6}\tiny$\pm$0.4 \\
		\bottomrule
	\end{tabular}
\end{table*}
% subsampled NEW with fixed splits:
\begin{table*}[h]
	\caption{Subsampled ATM machine failure prediction benchmark with the positive/negative data imbalance increased to $100\times$. As a result, the performance gaps between the ContextFlow++ specialists and the other models also increase. Unlike in the previous setup, the variational context encoder outperforms the deterministic embedding-based encoder, which highlights the advantages of a more robust fully-probabilistic approach in real-world applications.} %The \fst{best} and the \scd{second best} metric's ($\mu_{\pm\sigma}$, \%) results are highlighted.}
	\label{tab:comp_atm_semi}
	\centering
	\small
	\begin{tabular}{c|ccc|c|c|ccc}
		\toprule
		\multirow{2}{*}{\shortstack{Model \\ Metric $\downarrow$}} &
		\multirow{2}{*}{\shortstack{random \\ forests}} &
		\multirow{2}{*}{\shortstack{HYDRA}} & 
		\multirow{2}{*}{\shortstack{XGBoost}} &
		\multirow{2}{*}{\shortstack{Embed. determ. \\ \citet{lu2020structured}}} &
		\multirow{2}{*}{\shortstack{ContextFlows $\rightarrow$ \\ Generalist Flow $\downarrow$}} &
		\multirow{2}{*}{\shortstack{Integer \\ $\argmax$}} & 
		\multirow{2}{*}{\shortstack{One-hot \\ variational}} & 
		\multirow{2}{*}{\shortstack{Embed. \\ determ.}} \\
		& & & & & & & & \\
		\midrule
		Accuracy  & 91.56\tiny$\pm$0.4 & 78.5\tiny$\pm$1.6 & \fst{93.0}\tiny$\pm$0.2 & 90.9\tiny$\pm$1.1 & 90.9\tiny$\pm$0.9 & 91.9\tiny$\pm$0.7 & 92.3\tiny$\pm$0.9 & \scd{92.9}\tiny$\pm$0.5 \\
		Bal. Acc. & 59.66\tiny$\pm$1.7 & 66.2\tiny$\pm$1.1 & 66.9\tiny$\pm$0.9       & 73.3\tiny$\pm$1.7 & 73.6\tiny$\pm$2.1 & 75.3\tiny$\pm$1.8 & \fst{77.7}\tiny$\pm$1.1 & \scd{76.1}\tiny$\pm$3.1 \\
		AuROC     & 59.66\tiny$\pm$1.7 & 66.2\tiny$\pm$1.1 & 66.9\tiny$\pm$0.9       & 83.8\tiny$\pm$1.2 & 84.0\tiny$\pm$1.6 & 84.6\tiny$\pm$0.7 & \fst{86.2}\tiny$\pm$1.1 & \scd{85.2}\tiny$\pm$0.9 \\
		AP        & 27.50\tiny$\pm$3.0 & 17.6\tiny$\pm$0.9 & 40.1\tiny$\pm$1.6       & 56.0\tiny$\pm$4.1 & 57.1\tiny$\pm$4.5 & 61.4\tiny$\pm$2.5 & \scd{64.8}\tiny$\pm$3.8 & \fst{64.9}\tiny$\pm$3.4 \\
		F$_1$     & 32.23\tiny$\pm$4.7 & 33.0\tiny$\pm$1.6 & 50.3\tiny$\pm$2.1       & 54.0\tiny$\pm$2.6 & 54.3\tiny$\pm$4.1 & 58.3\tiny$\pm$2.7 & \fst{61.8}\tiny$\pm$2.8 & \scd{61.4}\tiny$\pm$4.0 \\
		MS        & 19.36\tiny$\pm$3.4 & 50.6\tiny$\pm$2.2 & 33.8\tiny$\pm$1.9       & 51.1\tiny$\pm$4.3 & 51.9\tiny$\pm$3.8 & 54.4\tiny$\pm$4.0 & \fst{59.1}\tiny$\pm$2.0 & \scd{54.9}\tiny$\pm$6.6 \\
		\bottomrule
	\end{tabular}
\end{table*}

\textbf{Flow models.} We experiment with the Glow-type models from Table~\ref{tab:prior} with the following dequantization. First, we always apply low-complexity uniform dequantization to the $\vx$ inputs. Second, we employ the surjective context encoders from Section~\ref{sec:mixv} for conditioning. In particular, we experiment with uniform \citep{Theis2015d} and variational \citep{flowpp, hoogeboom2021argmax} dequantization methods as well as trainable embedding-based deterministic and stochastic encoders using the library from \citet{gorishniy2022on}. We apply the dequantization methods both to the original integer contexts and to their one-hot binary representations. The context representation is important due to computational complexity and dequantization considerations. For example, the $\argmax$ method is only applicable to integer contexts, while the variational approach \citep{flowpp} is well-suited to the binary one-hot representation. We apply the same flow architecture in all benchmarks with a variable number of blocks and sub-blocks as presented in Section~\ref{sec:arch}. We select the number of blocks $B$ and sub-blocks $L$ as $(B, L) = (2,2)$ for MNIST-R, $(3,4)$ for CIFAR-10C and ATM, and $(2,4)$ for SMAP. We apply convolutional couplings in the MNIST-R and CIFAR-10C image classification datasets and transformer-based couplings in the ATM and SMAP time series datasets.

\textbf{Training hyperparameters.} We train the generalist and the conventional specialist models \citep{lu2020structured} from scratch for each benchmark. Then, we train our ContextFlow++ model with the pretrained generalist parameters. Since ContextFlow++ explicitly decouples the general and context-specific knowledge, there are two sets of parameters: one fixed set inherited from the generalist and a learnable set for additive conditioning in the context encoder.

Each model is optimized with the following hyperparameters: the AdamW optimizer with a batch size of 256 and an initial learning rate of 1e-3, which is reduced by a factor of 10 every 12 epochs over 48 epochs in total. A warm-up phase with the learning rate gradually increasing from 1e-4 to 1e-3 is applied during the first 4 epochs. %The learning rate have been selected from the \{1e-2, 1e-3, 1e-4\} range.
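
One possible reading of this schedule, written as a formula (a sketch assuming a zero-based epoch index $e \in \{0, \ldots, 47\}$, a linear warm-up and decay steps counted from $e = 0$; these details and the symbol $\eta$ are not specified above):
\begin{equation*}
\eta(e) =
\begin{cases}
	10^{-4} + \left( 10^{-3} - 10^{-4} \right) e / 4, & 0 \leq e < 4, \\
	10^{-3} \cdot 10^{-\lfloor e / 12 \rfloor}, & 4 \leq e < 48,
\end{cases}
\end{equation*}
which gives $10^{-3}$ for epochs 4--11, $10^{-4}$ for epochs 12--23, $10^{-5}$ for epochs 24--35 and $10^{-6}$ for epochs 36--47.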

\textbf{Evaluation.} We use the top-1 accuracy metric for the MNIST-R and CIFAR-10C classification tasks. In addition, the standardized metrics from \citet{vargas2023hybrid} are used for ATM failure prediction: balanced and unbalanced top-1 accuracies, area under the receiver operating characteristic curve (AuROC), average precision (AP), F$_1$-score \citep{f1} and minimum sensitivity (MS). We also rely on the F$_1$-score to compute a binary prediction threshold for ATM. We follow \citet{omnianomaly} and report precision (P), recall (R), AuROC and F$_1$-score for the SMAP dataset.
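
For the binary ATM task, the threshold-based metrics reduce to the following standard definitions (assumed here to match those of \citet{vargas2023hybrid}; TP, FP, TN and FN denote the true positive, false positive, true negative and false negative counts):
\begin{equation*}
\begin{gathered}
	\mathrm{P} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad \mathrm{R} = \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \\
	\mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \quad \mathrm{F}_1 = \frac{2 \, \mathrm{P} \, \mathrm{R}}{\mathrm{P} + \mathrm{R}}, \\
	\text{Bal. Acc.} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}, \quad \mathrm{MS} = \min(\mathrm{TPR}, \mathrm{TNR}).
\end{gathered}
\end{equation*}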

We run each experiment four (MNIST-R, CIFAR-10C, SMAP) or five (ATM) times and report the metric's mean ($\mu$) and, if shown, standard deviation ($\pm \sigma$) on the test split. Unlike the other datasets with fixed training/test splits, we use 5-fold cross-validation splits with a single seed (2) for ATM. Note that, unlike \citet{vargas2023hybrid}, we do not perform context-stratified splits, so that contexts overlap between the training and test splits. This choice increases the performance metrics relative to those reported in their paper.

\subsection{Quantitative Results}\label{subsec:quant_eval}
\textbf{MNIST-R classification.} We report classification results using selected baselines and our ContextFlow++ variants in Table~\ref{tab:mnistr}. As expected, the generalist model with image rotations in the data splits has 2.8 percentage points~(\pp)~lower accuracy because a same-size model without rotational invariance cannot approximate the larger data distribution as successfully. The proposed ContextFlow++ lowers that accuracy gap to 1.0 \pp, and the deterministic embedding-based and variational dequantization variants achieve the highest accuracy.

When compared to the conventional baseline~\citep{lu2020structured}, our context-conditioned variants improve classification accuracy by only 0.1--0.2 \pp, which suggests a lack of useful general knowledge in MNIST-R. An alternative interpretation is that MNIST classification is a relatively simple task with saturated accuracy metrics.

\textbf{CIFAR-10C classification.} Table~\ref{tab:cifarc} presents the same baselines but with a very different outcome. First, the overall accuracy is significantly lower (61.7\%), and image corruptions increase the accuracy gap between the models trained and evaluated on the undistorted CIFAR-10 and the corrupted CIFAR-10C to 6.6 \pp~(61.7\% vs. 55.1\%). 

Second, the conventional conditioning approach is unable to surpass even the generalist model results. At the same time, the proposed ContextFlow++ converges well because the general knowledge is preserved in the fixed generalist parameters, which leads to 2.6 \pp~(57.7\% vs. 55.1\%) higher accuracy. The best results are, again, achieved with the deterministic embedding-based encoder, the variational dequantization variant with the one-hot binary context representation, and the $\argmax$ method with $\log_2$ context compression.

\textbf{ATM failure prediction.} We reproduce the \citet{vargas2023hybrid} baselines in Table~\ref{tab:comp_atm_full} using their public code but with modified data splits. In particular, we evaluate classic non-temporal machine learning methods: random forests \citep{breiman2001random} and XGBoost \citep{XGBoost}. The HYDRA model \citep{hydra} is a hybrid method that uses a convolutional neural network (CNN) for feature extraction with temporal processing, followed by the ridge classifier \citep{pedregosa2011scikit}.

We report the flow-based generalist model and the best ContextFlow++ variants in Table~\ref{tab:comp_atm_full}. Even the generalist model significantly outperforms all baselines from \citet{vargas2023hybrid}, and our ContextFlow++ further improves the failure prediction metrics. For example, ContextFlow++ with more advanced context encoders achieves the highest results and provides up to 6.3 \pp~additional F$_1$ score gain when compared to the generalist model.

To highlight the robustness of our probabilistic models, we conduct additional experiments where we subsample the number of positive (failure) data points. Table~\ref{tab:comp_atm_semi} shows results where the imbalance between positive and negative examples is increased from $10\times$ to $100\times$ by training set subsampling. With the subsampled ATM, we make two important observations. First, the gaps in metrics between the best ContextFlow++ models and the other baselines increase by 1--2 \pp~relative to the original ATM setup. Second, the deterministic embedding-based approach does not perform as well as with MNIST-R, CIFAR-10C and the original ATM data. At the same time, the variational dequantization has the highest overall scores. Thus, a fully-probabilistic model (including the context encoder) can be more robust in real-world application settings.

\begin{table}[t!]
	\caption{Unsupervised anomaly detection on the real-world SMAP dataset with time series sensory data \citep{smap}. Entity IDs (55) define the conditioning context. Unlike ContextFlow++ with a single model (\# = 1), conventional baselines train and evaluate a separate model for each entity (\# = 55). Our ContextFlow++ significantly improves anomaly detection precision (P) and, hence, the F$_1$ score, while recall (R) and AuROC scores are saturated as in other baselines. The \fst{best} and the \scd{second best} results (\%) are highlighted for metrics that are not saturated.}
	\label{tab:smap}
	\centering
	\small
	\begin{tabular}{c|c|cccc}
		\toprule
		Model & \#  & P & R & AuROC & F$_1$ \\
		\midrule
		OmniAnom.   & 55 &       81.30 & 94.19       &       98.89 & 87.28       \\
		MTAD-GAT    & 55 &       79.91 & 99.91       &       98.44 & 88.80       \\
		CAE-M       & 55 &       81.93 & 95.67       &       99.01 & 88.27       \\
		GDN         & 55 &       74.80 & 98.91       &       98.64 & 85.18       \\
		TranAD      & 55 &       80.43 & 99.99       &       99.21 & \scd{89.15} \\
		\midrule
		Generalist   & 1 & \scd{87.40} & 84.93       &       91.55 &      86.05  \\ 
		ContextFlow++& 1 & \fst{88.64} & 99.19       &       98.66 & \fst{93.62} \\
		\bottomrule
	\end{tabular}
\end{table}

\textbf{SMAP unsupervised anomaly detection.} We compare our models to popular baselines: OmniAnomaly \citep{omnianomaly}, MTAD-GAT \citep{mtad_gat}, CAE-M \citep{cae_m}, GDN \citep{gdn} and TranAD \citep{tranad}. It is common in these baselines to train and evaluate a separate model for each SMAP entity (\# = 55). In contrast, our generalist model uses a single model for all entities, which leads to lower performance metrics in Table~\ref{tab:smap}. Then, we finetune our variational ContextFlow++ variant with the context defined as a discrete entity ID. This significantly improves the generalist's metrics (7.6 \pp~gain in F$_1$ score \wrt the generalist result) and outperforms the selected baselines. Our approach leads to a major drop in complexity since we train and keep all additive contexts in a single checkpoint and, additionally, our model learns the decoupled common generalist knowledge.

\begin{figure}[t!]
	\begin{center}
		\centering
		\includegraphics[width=0.96\columnwidth]{figs/fig-qual-cifar.pdf}
		\caption{Top-1 accuracy on the CIFAR-10C test split vs. training epochs. The generalist model experiences a significant accuracy drop when compared to the same model trained on the undistorted CIFAR-10. Our ContextFlow++ with the $\argmax$-based context encoder explicitly decouples general and context-specific knowledge. In comparison with the conventional conditioning method, ours converges faster and reaches higher accuracy on CIFAR-10C.}
		\label{fig:qual_cifar}
	\end{center}
\end{figure}

\subsection{Qualitative experiments}\label{sec:qual}
Figure~\ref{fig:qual_cifar} visually compares top-1 test-set accuracy vs. training epochs for a subset of the Table~\ref{tab:cifarc} models. We plot the accuracy of the generalist model trained and evaluated on the undistorted CIFAR-10 as well as on the corrupted CIFAR-10C. We also show our ContextFlow++ and the conventional conditioning method~\citep{lu2020structured} with exactly the same architectures and variational $\argmax$ context encoders.

Our ContextFlow++ approach has two main advantages, as shown in Figure~\ref{fig:qual_cifar}. First, it starts with the generalist knowledge encoded in its parameters, which significantly increases convergence stability and speed. Second, the added context encoder makes it possible to employ domain-specific knowledge and increase the final performance metric. In practice, this can be useful when thousands or millions of contexts are encoded in multidimensional mixed-variable vectors.

\subsection{Complexity analysis}
\label{sec:comp}

Table~\ref{tab:comp} shows complexity estimates for the flow models from Table~\ref{tab:cifarc} that are applied to the CIFAR-10C dataset. We report the parameter count and the latency on a P100 GPU with a mini-batch size of 256 during the training and evaluation phases.

The low-complexity context encoders with uniform dequantization and deterministic embeddings have latencies comparable to the generalist model (about 140 vs. 84 ms for training and 70 vs. 42 ms at evaluation), but can differ substantially in parameter count (2.9, 3.3 and 5.6 million for the ContextFlow++ variants vs. 2.2 million for the generalist) depending on the context processing. At the same time, the probabilistic context encoders with a generative flow architecture have higher latency (about 400 vs. 84 ms for training and 200 vs. 42 ms at evaluation) and also variable parameter counts (4.0, 5.0 and 27.6 million).
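
The corresponding overheads can be read directly from Table~\ref{tab:comp}. As a worked example for two ContextFlow++ variants relative to the generalist, the ratios of parameter count, training latency and evaluation latency are
\begin{equation*}
\begin{aligned}
	\text{Integer } \argmax: \quad & \tfrac{4.0}{2.2} \approx 1.8, \quad \tfrac{403}{84} \approx 4.8, \quad \tfrac{185}{42} \approx 4.4, \\
	\text{Embed. determ.}: \quad & \tfrac{5.6}{2.2} \approx 2.5, \quad \tfrac{145}{84} \approx 1.7, \quad \tfrac{69}{42} \approx 1.6.
\end{aligned}
\end{equation*}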

The parameter count for ContextFlow++ is lower than for the conventional baseline~\citep{lu2020structured} because it avoids the concatenation that increases the dimensionality of internal vectors. At the same time, the latencies of both methods are very similar due to the nature of the Table~\ref{tab:prior} operations. To conclude, the $\argmax$ variant with the embedded $\log_2$ context compression can be the preferred method, given further encoder architecture optimizations, as a trade-off between complexity and the promising performance gains in our experiments.

\begin{table}[t]
	\caption{Parameter count and P100 GPU latencies (batch size = 256) on CIFAR-10C. Results, reported as \citep{lu2020structured}$\rightarrow$ContextFlow++, show that our method has a lower parameter count and latency similar to the prior method. Our $\argmax$ variant with $\log_2$ context compression is preferable in terms of the parameters-performance trade-off.}
	\label{tab:comp}
	\centering
	\small
	\begin{tabular}{c|ccc}
		\toprule
		\multirow{2}{*}{\shortstack{Metrics $\rightarrow$ \\ Method $\downarrow$}} & \multirow{2}{*}{\shortstack{Parameters,\\ millions}} & \multicolumn{2}{c}{Latency, ms} \\
		& & Train & Eval \\
		\midrule
		Generalist              & 2.2 & 84  &  42 \\
		\midrule
		Integer uniform         & 3.5$\rightarrow$ 2.9 & 138$\rightarrow$141 &  55$\rightarrow$71 \\
		Integer $\argmax$       & 4.1$\rightarrow$ 4.0 & 402$\rightarrow$403 &194$\rightarrow$185 \\
		One-hot uniform         & 4.0$\rightarrow$ 3.3 & 155$\rightarrow$152 & 74$\rightarrow$ 81 \\
		One-hot variational     & 5.1$\rightarrow$ 5.0 & 383$\rightarrow$398 &185$\rightarrow$205 \\
		Embed. determinist.     & 6.2$\rightarrow$ 5.6 & 140$\rightarrow$145 & 55$\rightarrow$ 69 \\
		Embed. stochastic       &27.7$\rightarrow$27.6 & 384$\rightarrow$380 &184$\rightarrow$198 \\
		\bottomrule
	\end{tabular}
\end{table}

\section{Conclusions}\label{sec:conc}
In this paper, we addressed the limitation of previous conditional normalizing flow models. Our additive contexts increased applicability of flow models to setups where flexible and accurate context-specific knowledge modeling is crucial. Then, we explored the related topic of enabling discrete variables in the conventional flow framework and proposed the mixed-variable ContextFlow++ architecture with additional generative flow-based context encoders.

Our experiments with supervised image classification, predictive maintenance and unsupervised anomaly detection showed the advantages of our flow-based architecture with sampling from surjective context encoders followed by likelihood estimation using the modified ContextFlow++ bijective decoder. We believe that this approach can be extended to recent ODE-type continuous flow architectures and other types of contextual information, \eg relational graphs.

\bibliography{bib100}

\end{document}