%\documentclass{uai2023} % for initial submission
 \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


\usepackage[utf8]{inputenc} 
\usepackage[T1]{fontenc}    
\usepackage{hyperref}       
\usepackage{url}            
\usepackage{amsfonts}       
\usepackage{nicefrac}       
\usepackage{microtype}      
\usepackage{algorithm}
\usepackage[noend]{algorithmic}

\usepackage{graphicx}
\usepackage{amsmath,amssymb}
\usepackage{amsthm}
\usepackage{dsfont}

%\usepackage[pdftex,dvipsnames]{xcolor}  %--> appears later already
\usepackage{wrapfig}
\usepackage{subfigure}
\usepackage{colortbl}
\usepackage{color}
\usepackage{xcolor} %--> appeared before already
\usepackage{multirow}

\usepackage{xargs}                      

%\usepackage[colorinlistoftodos,prependcaption,textsize=tiny,disable]{todonotes}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


\newcommand{\MMD}{\ensuremath{\mathrm{MMD}}}
\newcommand{\COS}{\ensuremath{\mathrm{COS}}}
\newcommand{\R}{\ensuremath{{\mathbb R}}}
\newcommand*\rot{\rotatebox{90}}
\definecolor{verylightgray}{gray}{.75}
\definecolor{veryverylightgray}{gray}{.85}

\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\theoremstyle{definition}
\newtheorem{remark}{Remark}
\newtheorem{question}[theorem]{Question}



\title{Validation of Composite Systems by Discrepancy Propagation}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
%\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2023 paper}{Jane~J.~von~O'L\'opez}{}}
%\author[1]{Harry~Q.~Bovik}
%\author[1,2]{Further~Coauthor}
%\author[3]{Further~Coauthor}
%\author[1]{Further~Coauthor}
%\author[3]{Further~Coauthor}
%\author[3,1]{Further~Coauthor}
\author[1]{\href{mailto:david.reeb@de.bosch.com}{David~Reeb}{}}
\author[1]{Kanil~Patel}
\author[1]{Karim~Barsim}
\author[1]{Martin~Schiegg}
\author[1]{Sebastian~Gerwinn}
% Add affiliations after the authors
\affil[1]{%
	Bosch Center for Artificial Intelligence, Robert Bosch GmbH, 71272 Renningen, Germany
}

%\affil[1]{%
%    Computer Science Dept.\\
%    Cranberry University\\
%    Pittsburgh, Pennsylvania, USA
%}
%\affil[2]{%
%    Second Affiliation\\
%    Address\\
%    …
%}
%\affil[3]{%
%    Another Affiliation\\
%    Address\\
%    …
%  }
  
  \begin{document}
\maketitle


\begin{abstract} 
	Assessing the validity of a real-world system with respect to given quality criteria is a common yet costly task in industrial applications due to the vast number of required real-world tests. 
	Validating such systems by means of simulation offers a promising and less expensive alternative, but requires an assessment of the simulation accuracy and therefore end-to-end measurements. 
	Additionally, covariate shifts between simulations and actual usage can cause difficulties for estimating the reliability of such systems. 
	In this work, we present a validation method that propagates bounds on distributional discrepancy measures through a composite system, thereby allowing us to derive an upper bound on the failure probability of the real system from potentially inaccurate simulations. 
	Each propagation step entails an optimization problem, where -- for measures such as maximum mean discrepancy (MMD) -- we develop tight convex relaxations based on semidefinite programs. 
	We demonstrate that our propagation method yields valid and useful bounds for composite systems exhibiting a variety of realistic effects. 
	In particular, we show that the proposed method can successfully account for data shifts within the experimental design as well as model inaccuracies within the simulation.
\end{abstract}


\section{Introduction}\label{sec:introduction}
Industrial products cannot be released without a priori ensuring their validity, i.e.\ the product must be validated to work according to its specifications with high probability.
Such validation is essential for safety-critical systems (e.g.\ autonomous cars, airplanes, medical machines) or systems with legal requirements (e.g.\ limits on output emissions or power consumption of new vehicle types), see e.g.\ \citep{kalra2016driving,koopman2016challenges,belcastro2003validation}.
When relying on real-world testing alone to validate system-wide requirements, one must perform enough test runs to guarantee an acceptable failure rate, e.g.\ on the order of $10^6$ runs or more to guarantee a failure rate below $10^{-6}$. This is costly not only in terms of money but also in terms of time-to-release, especially when a failed system test necessitates further design iterations. 
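
To see where this order of magnitude comes from, consider the following back-of-the-envelope calculation. It is an illustration only, assuming i.i.d.\ test runs; the symbols $f$ (true failure rate), $n$ (number of runs) and $\delta$ (allowed error level) are introduced just for this example. If all $n$ runs are failure-free, a failure rate $f$ remains consistent with this observation at level $\delta$ as long as $(1-f)^n\geq\delta$, so that one can only conclude
\begin{align*}
f \leq 1-\delta^{1/n} \leq \frac{\log(1/\delta)}{n},
\end{align*}
with $\log$ the natural logarithm. Certifying $f\leq10^{-6}$ at $\delta=0.05$ thus takes roughly $n\geq\log(20)\cdot10^{6}\approx3\cdot10^{6}$ failure-free runs.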


System validation is particularly difficult for complex systems which typically consist of multiple components, often developed and tested by different teams under varying operating conditions. 
For example, an advanced driver-assistance system is built from several sensors and controllers, which come from different suppliers but must together guarantee that the vehicle stays safely in its lane. 
Similarly, the powertrain system of a vehicle consists of the engine or battery, a controller and various catalysts or auxiliary components, but as a whole is legally required to keep its output emissions of various gases, or its energy consumption per distance, low. 
In both these examples, the validation of the system can also be viewed as the validation of its control component, when the other subsystems are considered fixed. 
To reduce the costs of real-world testing including system assembly and release delays, one can employ \emph{simulations} of the composite system by combining models of the components, to perform {\emph{virtual validation}} of the system \citep{wong2020testing}.


\begin{figure*}[t]
	\centering
	\def\svgscale{0.4}
	\begin{tiny}
		\begingroup%
		\makeatletter%
		\providecommand\color[2][]{%
			\errmessage{(Inkscape) Color is used for the text in Inkscape, but the package 'color.sty' is not loaded}%
			\renewcommand\color[2][]{}%
		}%
		\providecommand\transparent[1]{%
			\errmessage{(Inkscape) Transparency is used (non-zero) for the text in Inkscape, but the package 'transparent.sty' is not loaded}%
			\renewcommand\transparent[1]{}%
		}%
		\providecommand\rotatebox[2]{#2}%
		\newcommand*\fsize{\dimexpr\f@size pt\relax}%
		\newcommand*\lineheight[1]{\fontsize{\fsize}{#1\fsize}\selectfont}%
		\ifx\svgwidth\undefined%
		\setlength{\unitlength}{919.76916216bp}%
		\ifx\svgscale\undefined%
		\relax%
		\else%
		\setlength{\unitlength}{\unitlength * \real{\svgscale}}%
		\fi%
		\else%
		\setlength{\unitlength}{\svgwidth}%
		\fi%
		\global\let\svgwidth\undefined%
		\global\let\svgscale\undefined%
		\makeatother%
		\begin{picture}(1,0.42068225)%
		\lineheight{1}%
		\setlength\tabcolsep{0pt}%
		\put(0,0){\includegraphics[width=\unitlength,page=1]{sys_validation_illustration.pdf}}%
		\put(0.56125682,0.41242186){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500954}\smash{\begin{tabular}[t]{l}Real system\end{tabular}}}}%
		\put(0.56281325,0.00503096){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}Simulation\end{tabular}}}}%
		\put(0.11927005,0.21992652){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}data-shift (input discrepancy)\end{tabular}}}}%
		\put(0.44034819,0.23536321){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}model misfit\end{tabular}}}}%
		\put(0.58333844,0.22039384){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}model misfit\end{tabular}}}}%
		\put(0.76128626,0.25680023){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}model misfit\end{tabular}}}}%
		\put(0.04728231,0.32857664){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}Field usage\end{tabular}}}}%
		\put(0.04738125,0.07165588){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}Simulation input\end{tabular}}}}%
		\put(0.25785802,0.0987005){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}$x$\end{tabular}}}}%
		\put(0.26527615,0.34906116){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}$x$\end{tabular}}}}%
		\put(0.18679508,0.29227816){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}$p$\end{tabular}}}}%
		\put(0.38603287,0.34815726){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}Component\,$S^1$\end{tabular}}}}%
		\put(0.52427646,0.29891979){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}Component\,$S^2$\end{tabular}}}}%
		\put(0.69888331,0.34856571){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}Component\,$S^3$\end{tabular}}}}%
		\put(0.39233091,0.10326191){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}Model\,$M^1$\end{tabular}}}}%
		\put(0.52950315,0.0533767){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}Model\,$M^2$\end{tabular}}}}%
		\put(0.70904886,0.10379727){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}Model\,$M^3$\end{tabular}}}}%
		\put(0.8411801,0.1254051){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}$M(x)$\end{tabular}}}}%
		\put(0.84088354,0.36545119){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}$S(x)$\end{tabular}}}}%
		\put(0.18763029,0.04074488){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}$q$\end{tabular}}}}%
		\put(0,0){\includegraphics[width=\unitlength,page=2]{sys_validation_illustration.pdf}}%
		\put(0.87465557,0.34838852){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}$\int \mathds{1}_{S\!(\!x\!)\!>\!\tau}dS\!(\!x\!)dp(\!x\!)\leq F_\text{max}$\end{tabular}}}}%
		\put(0.87241147,0.09963788){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}$\int \mathds{1}_{M\!(\!x\!)\!>\!\tau}dM\!(\!x\!)dq(\!x\!) $\end{tabular}}}}%
		\put(0.055,0.332){\includegraphics[width=1.02\unitlength,page=3,angle=340]{sys_validation_illustration.pdf}}%
		\put(0.9689482,0.22097558){\color[rgb]{0,0,0}\makebox(0,0)[lt]{\lineheight{1.87500477}\smash{\begin{tabular}[t]{l}$~$\end{tabular}}}}%
		\end{picture}%
		\endgroup% 
	\end{tiny}
	\caption{Illustration of our validation task: A real, composite system of interest (top) is modeled with corresponding simulation models (bottom). Measurements of the real system are available only for the individual components, while end-to-end simulation data can be generated from the models. The task of the virtual validation method is to estimate the real system performance $S$ based on the simulations $M$, incorporating simulation model misfits w.r.t.\ the real-world components as well as any data-shift between the simulation input distribution and the field usage to be expected in the real system.
		\label{fig:virtual_system_analysis}}
\end{figure*}


However, it is difficult to assess how much such a composite virtual validation can be trusted, because the component models may be inaccurate w.r.t.\ the real-world components ({\emph{simulation model misfits}}) or the simulation inputs may differ from the distribution of real-world inputs ({\emph{data-shift}}). 
Incorporating these inaccuracies within the virtual validation analysis is particularly important for reliability analyses \citep{bect2012sequential,dubourg2013metamodel,wang2016gaussian} in industrial applications with safety or legal relevance such as those described above, where falsely judging a system to be reliable is much more costly than falsely rejecting it. 
For this reason, we desire -- if not an accurate estimate -- then 
%-- if not an accurate estimate of the system's real performance -- then 
at least an upper bound on its true failure probability. 
Existing validation methods in particular lack the composite (multi-component) aspect, where measurement data are available only for each individual component (Sec.\ \ref{sec:related_work}).


To state the problem mathematically, the goal of this work is to estimate an upper bound $F_\text{max}$  on the failure probability $\mathrm{Pr}\big[S(x)>\tau\big]$ of a system over real-world inputs $x\sim p(x)$:
\begin{align}
F_\text{max} \geq \underset{x,S}{\mathrm{Pr}}\big[S(x)>\tau\big] &= \int \mathds{1}_{S(x)>\tau} dS(x)dp(x), 
\label{eq:failure_prob}
\end{align}
where $S(x)$ measures the system performance upon input $x$, and $\tau$ is a critical performance threshold indicating a \emph{system failure}.
In the virtual validation setup, we assume that no end-to-end measurements from the full composite system $S$ are available, and thus the upper bound $F_\text{max}$ is to be estimated from the simulation $M$ composed of models $M^1,M^2,\ldots$, which are assumed to be given. 
This estimate must take into account \emph{model misfits} and \emph{data-shift} in the simulation input distribution (see Fig.\ \ref{fig:virtual_system_analysis}).
To assess the model misfits, we assume validation measurements from the individual components $S^1,S^2,\ldots$ to be given, e.g.\ from component-wise development (for details see Sec.\ \ref{sec:setup}).

In this paper, we develop a method to estimate $F_\text{max}$ from simulation runs by propagating bounds on distributional distances between simulation models and real-world components through the composite system.
This propagation method incorporates model misfits and data-shifts in a pessimistic fashion by iteratively maximizing for the worst-case output distribution that is consistent with previously computed constraints on the input.
Importantly, our method requires models and validation data from the individual components only, not from the full system.

Our main contributions can be summarized as follows:
\begin{enumerate}
	\item We propose a novel, distribution-free bound on the distance between simulation-based and real-world distributions, 
	without requiring end-to-end measurements from the real world (Sec.\ \ref{subsec:validation-method}).
	\item We justify the method theoretically (Prop.\ \ref{prop:convergence}) and show its practicality in reliability benchmarks (Sec.\ \ref{sec:reli-benchm-eval}).
	\item We demonstrate that -- in contrast to alternative methods -- the proposed method can account for data-shifts as well as model inaccuracies (Sec.\ \ref{sec:experiments}).
\end{enumerate}


\section{Related Work}\label{sec:related_work}
Estimating the failure probability of a system is a core task in reliability engineering.
In the reliability literature, one focus is on making this estimation more efficient than naive Monte Carlo sampling by reducing the variance of the estimator of the failure probability.
Such classical methods include importance sampling \citep{rubinstein2004cross}, 
subset simulation \citep{au2001estimation}, 
line sampling \citep{pradlwarter2007application}, and
first-order \citep{hohenbichler1987new,du2012first,zhang2015first} or second-order \citep{kiureghian1991efficient,lee2012novel} Taylor expansions.
While more efficient, these methods still require a large number of end-to-end function evaluations and cannot incorporate more detailed simulations.

Another line of research investigates how to reduce real-world function evaluations through virtualization of this performance estimation task \citep{xu2021machine}.
%Generally, the physical system is replaced by a surrogate model that is designed to be much cheaper to evaluate.
In this line of work, the failure probability is estimated from a surrogate model, and the estimate hence cannot account for mismatches between the system and its surrogate.
\citet{dubourg2013metamodel} proposed a hybrid approach, where the proposal distribution of the importance sampling depends on the learned surrogate model. 
While this approach accounts, to some extent, for model mismatches, the proposal distribution might still be biased by a poor surrogate model.
In summary, none of the approaches that are based on surrogate models provide a reliable bound on the true failure probability.
Furthermore, all these approaches require end-to-end measurements from the real system, 
ignoring the composite structure of the system.


In practice, however, the system output $S(\cdot)$ in Eq.\ \eqref{eq:failure_prob} refers to a complex system that often has a \emph{composite structure}.
That is, global inputs $x$ propagate through an arrangement, oftentimes termed a \emph{function network}, of subsystems or components, see Fig.\ \ref{fig:virtual_system_analysis}.
Exploiting such a structure is expected to have a notable impact on the target task, be it experimental design \citep{marque2019efficient},
calibration and optimization \citep{astudillo2019bayesian,astudillo2021bayesian,kusakawa2021bayesian,xiao2022projection}, 
uncertainty quantification \citep{sanson2019systems}, 
or system validation as presented here.

In the context of Bayesian Optimization (BO), for example, \citet{astudillo2021bayesian} 
construct a surrogate system of Gaussian Processes (GP) that mirrors the compositional structure of the system.
Similarly, \citet{sanson2019systems} discuss similarities of such structured surrogate models to Deep GPs \citep{damianou2013deep}, and extend this framework to local evaluations of constituent components.
However, learning (probabilistic) models of inaccuracies \citep{sanson2019systems,riedmaier2021unified} introduces further modeling assumptions and cannot account for data-shifts.
Instead, we aim at model-free worst-case statements. 


\citet{marque2019efficient} showed that a composite function can be efficiently modeled from local evaluations of constituent components in a sequential design approach.
\citet{friedman2021adaptive} extend this framework to cyclic structures of composite systems for adaptive experimental design.
They derive bounds on the simulation error in composite systems, but assume knowledge of Lipschitz constants as well as uniformly bounded component-wise errors.



\citet{chau2021bayesimp} analyzed how to stitch together different datasets covering the different parts of a larger mechanism without losing the causal relation, and constructed corresponding models; however, they did not analyze the quality with which statements about the real mechanism can be made.

Bounding the test error of models under input data-shift was analyzed empirically by \citet{jiang2021assessing}, who investigated the disagreement between different models. Although they find a correlation between disagreement and test error, the authors do not provide a 
%rigorous 
bound on the test error (Sec.\ \ref{sec:uncertainty-wrapper-method}).
% and also cannot incorporate an existing simulation model into the analysis.


\section{Method}\label{sec:method}
\subsection{Setup: Composite System Validation}\label{sec:setup}
We consider a \emph{(real) system} or \emph{system under test} $S$ that is composed of subsystems $S^c$ ($c=1,2,\ldots,C$), about which we have only limited information. 
The validation task is to determine whether $S$ conforms to a given specification, such as whether the system output $y=S(x)$ stays below a given threshold $\tau$ for typical inputs $x$ -- or whether the system's \emph{probability of failure}, defined as violating the threshold, is sufficiently low, see Eq.\ (\ref{eq:failure_prob}).
Our approach to this task is built on a \emph{model} $M$ (typically a simulation, with no analytic form) of $S$ that is similarly composed of corresponding sub-models $M^c$. 
The main challenge in assessing the system's failure probability lies in determining how closely $M$ approximates $S$, in the case 
where the system data originate from disparate component measurements, which cannot be combined into consistent end-to-end data.


\textbf{Components and signals.} Mathematically, each component of $S$ -- and similarly for $M$ -- is a (potentially stochastic) map $S^c$, which upon input of a signal $x^c$ produces an output signal (sample) $y^c\sim S^c(\cdot|x^c)$ according to the conditional distribution $S^c$.
The stochasticity allows for aleatoric system behavior or unmodeled influences.
We consider the case where all signals are tuples $x^c=(x^c_1,\ldots,x^c_{d^c_{in}})$, such as real vectors.
The allowed ``compositions'' of the subsystems $S^c$ must be such that upon input of any signal (stimulus) $x$, an output sample $y\sim S(\cdot|x)$ can be produced by iterating through the components $S^c$ in order $c=1,2,\ldots,C$. 
More precisely, we assume that the input signal $x^c$ into $S^c$ is a concatenation of some entries $x|_{0\to c}$ of the overall input tuple $x$ and entries $y^{c'}|_{c'\to c}$ of some \emph{preceding} outputs $y^{c'}$ (with $c'=1,\ldots,c-1$); 
thus, $S^c$ is ready to be queried right after $S^{c-1}$. 
We assume the overall system output $y=y^C\in{\mathbb R}$ to be real-valued; multiple technical performance indicators (TPIs) can be considered separately or combined into one, e.g.\ by a weighted mean. 
The simplest example of such a composite system is a \emph{linear chain} $S=S^C\circ\ldots\circ S^2\circ S^1$, where $x\equiv x^1$ is the input into $S^1$ and the output of each component is fed into the next, i.e.\ $x^{c+1}\equiv y^c$. 
Another example is shown in Fig.\ \ref{fig:virtual_system_analysis}, where $x^3$ is concatenated from both outputs $y^1$ and $y^2$. 
We assume the same compositional structure for the model $M$ with components $M^c$.


\textbf{Validation data.} An essential characteristic of our setup is that neither $S$ nor the subsystem maps $S^c$ are known explicitly, and that ``end-to-end'' measurements $(x,y)$ from the full system $S$ are unavailable (see Sec.\ \ref{sec:introduction}).
Rather, we assume that \emph{validation data} are available only for every subsystem $S^c$, i.e.\ pairs $(x^c_v,y^c_v)$ of inputs $x^c_v$ and corresponding output samples $y^c_v\sim S^c(\cdot|x^c_v)$ ($v=1,\ldots,V^c$).
Such validation data may have been obtained by measuring subsystem $S^c$ in isolation on some inputs $x^c_v$, without needing the full system $S$;
note that the inputs $x^c_v$ do not necessarily follow the distribution induced by the preceding components.
In the same spirit, the models $M^c$ may also have been trained from such ``local'' system data; we assume $M^c$, $M$ to be given from the start. 


\textbf{Probability distributions.} We aim at probabilistic validation statements, namely that the system fails or violates its requirements only with low probability. 
For this, we assume that $S$ is repeatedly operated in a situation where its inputs come from a distribution $x\sim p_x$, in an i.i.d.\ fashion.
For the example where $S$ is a car, the input $x$ might be a route that typical drivers take in a given city. 
Importantly, we do not assume much knowledge about $p_x$: 
merely a number of samples $x_v\sim p_x$ may be given, or alternatively its distance to the simulation input distribution $q_x=\frac{1}{n_M}\sum_{n=1}^{n_M}\delta_{x^M_n}$; 
here, $\delta_{x^M_n}$ are point measures on the input signals $x^M_n$ on which $M$ is being simulated.
The input distribution $x\sim p_x$ will induce a (joint) distribution $p$ of all intermediate signals $x^c,y^c$ of the composite system $S$ and importantly the TPI output $y=y^C\sim S(x)$. 
Similarly, $M$ generates a joint model distribution $q$ by starting from $q_x$ and sampling through all $M^c$; 
via this simulation, we assume $q$ and all its marginals on intermediate signals $x^c,y^c$ to be available in sample-based form.
The \emph{(true) failure probability} is given by $p_\text{fail}=\int \mathds{1}_{S(x)>\tau}dS(x)dp(x)$, where in this paper we identify a system failure as the TPI exceeding the given threshold $\tau$.
The model failure probability is $q_\text{fail}=\int \mathds{1}_{M(x)>\tau}dM(x)dq(x) \simeq \frac{1}{n_M}\sum_{n=1}^{n_M}\mathds{1}_{y^M_n>\tau}$, where $y^M_n$ denote sampled model TPI outputs for the inputs $x^M_n$.
It is often useful in our setting to think of a distribution as a set of sample points, and vice versa. 


\textbf{Discrepancies.} To track how far the simulation model $M$ diverges from the true system behavior $S$ in our probabilistic setting, we employ discrepancy measures $D$ between probability distributions. 
Such a measure $D$ maps two probability distributions $p,q$ over the same space to a real number, often having some interpretation of distance. 
We consider MMD distances $D=\MMD_k$ \citep{gretton2012kernel}, defined as the RKHS norm $\MMD_k(p,q)=\|p-q\|_k=\big[\int_{x,x'}(p(x)-q(x))k(x,x')(p(x')-q(x'))dxdx'\big]^{1/2}$ w.r.t.\ a kernel $k$ on the underlying space (e.g.\ a squared-exponential or IMQ kernel \citep{gorham2017measuring}). 
Further possibilities include the cosine similarity $\COS_k(p,q)=\langle p,q\rangle_k/(\|p\|_k\|q\|_k)$ w.r.t.\ a kernel $k$, a Wasserstein distance $D=W_p$ w.r.t.\ a metric on the space, and the total variation norm $D=\mathrm{TV}$ \citep{IPM_paper_gretton_2009,sriperumbudur2010non}; however, the latter cannot be estimated reliably from samples.
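
For sample-based distributions, the MMD reduces to a finite kernel expression. As a brief illustration (the weights $a_i,b_j$ and points $z_i,z'_j$ are notation introduced only for this example), let $p=\sum_i a_i\delta_{z_i}$ and $q=\sum_j b_j\delta_{z'_j}$. Inserting these into the definition above gives
\begin{align*}
\MMD_k(p,q)^2 = \sum_{i,i'}a_ia_{i'}k(z_i,z_{i'}) - 2\sum_{i,j}a_ib_jk(z_i,z'_j) + \sum_{j,j'}b_jb_{j'}k(z'_j,z'_{j'}),
\end{align*}
which for uniform weights coincides with the biased empirical MMD estimate of \citet{gretton2012kernel}.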


Specifically, we assume a discrepancy measure $D^{c'\to c}$ to be given\footnote{We will later address how to choose $D$ from a parameterized family $D_\ell$, e.g.\ with different lengthscales $\ell$.} for those pairs $0\leq c'<c\leq C+1$ for which (a sub-tuple of) the output signal $y^{c'}$ is fed into the input $x^c$ (cf.\ the compositional structure above, and where we define $y^{c'=0}\equiv x$ and $x^{c=C+1}\equiv y:=y^C$). 
This $D^{c'\to c}$ acts on probability distributions over the space of such sub-tuples like $y^{c'}|_{c'\to c}$ (or synonymously, $x^c|_{c'\to c}$), which is defined as the signal entries running from $y^{c'}$ to $x^c$. We denote the marginal of $p$ on these signal entries by $p|_{c'\to c}$, and similarly $q|_{c'\to c}$ for $q$. 
In the simplest case of a linear chain, $D^{c'\to c}$ with $c'=c-1$ acts on probability distributions such as $p|_{c'\to c}$ over the space of the (full) vectors $y^{c'}=x^c$. 
We omit superscripts $D^{c'\to c}\equiv D$ when clear from the context. 


Our method requires (upper bounds on) the discrepancies $D(p|_{0\to c},q|_{0\to c})$ between marginals of the system and model input distributions $p_x,q_x$; 
specifically between the marginal distributions $p|_{0\to c}$ and $q|_{0\to c}$ over those sub-tuples $x|_{0\to c}$ which are input to subsequent components $c$. 
These discrepancies can either be estimated from samples $x_v,x^M_n$ of $p_x,q_x$, using the biased or unbiased MMD estimates of \citet[App.\ A.2, A.3]{gretton2012kernel}, which are accurate up to at most $\sim\sqrt{(1/n_{min})\log(1/\delta)}$ at confidence level $1-\delta$ (where $n_{min}$ denotes the size of the smaller of the two sample sets); or they may be directly given or upper bounded. 
These upper bounds are the quantities $B^{0\to c}$ below in Eq.\ (\ref{general-max-discrepancy-objective}). 
No further knowledge of the real-world input distribution $p_x$ is required.
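
To give a sense of scale for the sample-based estimate, consider the following rough illustration, which ignores the kernel-dependent constant prefactor and uses example values chosen only for concreteness. With $n_{min}=1000$ samples and $\delta=0.05$,
\begin{align*}
\sqrt{\frac{1}{n_{min}}\log\frac{1}{\delta}} = \sqrt{\frac{\log 20}{1000}} \approx 0.055,
\end{align*}
with $\log$ the natural logarithm. Owing to the square-root scaling, reducing this estimation error by a factor of $10$ requires about $100$ times as many samples.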



%\subsection{Validation by discrepancy propagation}
\subsection{Discrepancy Propagation Method}\label{subsec:validation-method}

We now describe the key step in our method to quantify how closely the model's TPI output distribution, which we denote by $q_y\equiv q|_{C\to C+1}$, approximates the actual (but unknown) system output distribution $p_y\equiv p|_{C\to C+1}$. 
We do this by iteratively propagating worst-case discrepancy values through the (directed and acyclic) graph of components $S^c$/$M^c$, using only the available information, in particular the given validation data $(x^c_v,y^c_v)$ on a per-subsystem basis.


\textbf{Discrepancy bound propagation.} The basic idea is to go through the components $c=1,2,\ldots,C$ one-by-one. 
At each step $c$, we consider the ``input discrepancies'' $D(p|_{c'\to c},q|_{c'\to c})$ (for $c'<c$), about which we already have information, and propagate it to gain information about the ``output discrepancies'' $D(p|_{c\to c''},q|_{c\to c''})$ (for $c''>c$). 
Here, we consider ``information'' in the form of inequalities $D(p|_{c'\to c},q|_{c'\to c})\leq B^{c'\to c}$, i.e.\ the information is the value of the (upper) bound $B^{c'\to c}$. 
Given bounds $B^{c'\to c}$ on the input signal of $S^c$, an upper bound on $D(p|_{c\to c''},q|_{c\to c''})$ for each fixed $c''>c$ can be found by maximizing the latter discrepancy over all (unknown) distributions $p$ that satisfy all the input discrepancy bounds:
\begin{align} 
B^{c\to c''} = \text{maximize}_p~&D(p|_{c\to c''},q|_{c\to c''})\label{general-max-discrepancy-objective}\\
\text{subject to}~~&D(p|_{c'\to c},q|_{c'\to c})\leq B^{c'\to c}~\forall c'<c. \nonumber
\end{align}
Note that the (sample-based) model distribution $q$ and its marginals in (\ref{general-max-discrepancy-objective}) are known and fixed after the simulation $M$ has been run on the input samples $x_n^M$ which constitute $q_x$ (see above).
In contrast, as the actual $p$ is not known, we maximize over all possible system distributions $p$ in (\ref{general-max-discrepancy-objective}) according to the bounds from the previous components $c'$. 
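
As an illustration, consider how (\ref{general-max-discrepancy-objective}) specializes to a linear chain with $C=2$ components (an example chosen here only for concreteness). In this case, each component has a single input discrepancy ($c'=c-1$) and a single output discrepancy ($c''=c+1$), so the propagation consists of two successive optimizations:
\begin{align*}
B^{1\to 2} &= \text{maximize}_p~D(p|_{1\to 2},q|_{1\to 2})~~\text{subject to}~~D(p|_{0\to 1},q|_{0\to 1})\leq B^{0\to 1},\\
B^{2\to 3} &= \text{maximize}_p~D(p|_{2\to 3},q|_{2\to 3})~~\text{subject to}~~D(p|_{1\to 2},q|_{1\to 2})\leq B^{1\to 2}.
\end{align*}
Here, $B^{0\to 1}$ is the given bound on the input discrepancy between $p_x$ and $q_x$, and the final value $B^{2\to 3}$ bounds the discrepancy $D(p_y,q_y)$ between the TPI output distributions.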


It remains to optimize over all possible sets of marginals $p|_{c\to c''},p|_{c'\to c}$ (for all $c'<c$) occurring in (\ref{general-max-discrepancy-objective}). 
Ideally, one would consider all distributions $p(x^c)$ over input signals $x^c$, apply $S^c$ to each $x^c$ to obtain all possible joint distributions $p(x^c,y^c)=p(x^c)S^c(y^c|x^c)$ of in- and outputs, and compute from this all possible sets of marginals $p|_{c\to c''},p|_{c'\to c}$. 
However, this is impossible as we do not know the action of $S^c$ on every possible input $x^c$.
Rather, we merely know about the action of $S^c$ on the validation inputs $x^c_v$, namely that $y^c_v\sim S^c(x^c_v)$ is a corresponding output sample.
We thus consider only the joint distributions $p(x^c,y^c)=p_\alpha$ that can be formed from the given validation data (Fig.\ \ref{input-output-distribution})\footnote{$\delta_{z_0}$ denotes a Dirac point mass at $z=z_0$.}:
\begin{align}\label{eq:p-alpha-joint}
p_\alpha=\sum_{v=1}^{V^c}\alpha_v\delta_{x^c_v}\delta_{y^c_v},
\end{align}
such that the optimization variable now becomes a probability vector $\alpha\in{\mathbb R}^{V^c}$, i.e.\ with nonnegative entries $\alpha_v\geq0$ that sum to one, $\sum_v\alpha_v=1$. 
By restricting to this (potentially skewed) set of joint distributions $p_\alpha$, the exact bound $B^{c\to c''}$ turns into an estimate; for further discussion see Prop.\ \ref{prop:convergence}, which also proposes another possible parametrization $p_\alpha$.

\begin{figure}[t]
	\begin{center}
		\includegraphics[width=0.85\columnwidth]{p_alpha_illustration.png}
	\end{center}
	\caption{Illustration of the (marginals of the) joint input-output distribution $p_\alpha$ (\ref{eq:p-alpha-joint}), parameterized by weights $\alpha_v$. Corresponding in-/outputs $x_v,y_v$ have the same weight $\alpha_v$.\label{input-output-distribution}}
\end{figure}

Using ansatz (\ref{eq:p-alpha-joint}), the exact bound propagation (\ref{general-max-discrepancy-objective}) becomes:
\begin{align} 
B^{c\to c''} =\text{max}_\alpha~&D(p_\alpha|_{c\to c''},q|_{c\to c''})\label{max-discrepancy-objective-alpha}\\
\text{s.t.}~~&D(p_\alpha|_{c'\to c},q|_{c'\to c})\leq B^{c'\to c}~~\forall c'<c,\nonumber\\
&\alpha\geq0,~~\sum_v\alpha_v=1. \nonumber
\end{align}
Note that for sample-based distributions like $p_\alpha$ in (\ref{eq:p-alpha-joint}) or $q$, the marginals in this optimization have a similar form, e.g.\ $p_\alpha|_{c\to c''}=\sum_v\alpha_v\delta_{y^c_v|_{c\to c''}}$ or $p_\alpha|_{c'\to c}=\sum_v\alpha_v\delta_{x^c_v|_{c'\to c}}$.

As the discrepancy measures $D$ in (\ref{max-discrepancy-objective-alpha}) are usually convex, this optimization problem is ``almost'' convex: 
All its constraints are convex; however, we aim to \emph{maximize} a convex objective. 
For MMD measures we derive convex (semidefinite) relaxations of (\ref{max-discrepancy-objective-alpha}) by rewriting it with squared MMDs $D(p_\alpha,q)^2$, which are quadratic in $\alpha$ and thus \emph{linear} in a new matrix variable $A=\alpha\alpha^T$; 
this last equality is then relaxed to the semidefinite inequality $A\geq\alpha\alpha^T$ (App.~A). 
% A = \ref{app:semidefinite-relaxation} % Appendix reference hard-coded
While the relaxation is tight in most instances (App.\ E.4), the number of variables increases from $V^c$ to $\sim\!(V^c)^2/2$, restricting the method to $V^c\lesssim10^3$ validation samples per component. 
In our implementation, we solve these SDPs using the CVXPY package \citep{diamond2016cvxpy}.
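The quadratic structure behind this relaxation can be made explicit with a short sketch. The kernel $k$, the generic sample points $z_v$ (weighted by $\alpha$) and $z'_n$ (the $N$ model samples), and the symbols $K$, $\kappa$, $\gamma$ are notation assumed here for illustration only; the precise formulation in App.~A may differ. For $p_\alpha=\sum_v\alpha_v\delta_{z_v}$ and $q=\frac{1}{N}\sum_n\delta_{z'_n}$,
\begin{align*}
\text{MMD}(p_\alpha,q)^2 &= \sum_{v,w}\alpha_v\alpha_w k(z_v,z_w) - \frac{2}{N}\sum_{v,n}\alpha_v k(z_v,z'_n)\\
&\quad+ \frac{1}{N^2}\sum_{n,m}k(z'_n,z'_m)\\
&= \alpha^TK\alpha - 2\kappa^T\alpha + \gamma\\
&= \text{tr}(KA) - 2\kappa^T\alpha + \gamma \quad\text{for}~A=\alpha\alpha^T,
\end{align*}
with $K_{vw}=k(z_v,z_w)$, $\kappa_v=\frac{1}{N}\sum_n k(z_v,z'_n)$, and a constant $\gamma$ that does not depend on $\alpha$. 
Since MMDs and bounds are nonnegative, each constraint $D\leq B$ can equivalently be posed as $D^2\leq B^2$, so that objective and constraints are affine in the pair $(A,\alpha)$. 
By the Schur complement, the relaxed condition $A\geq\alpha\alpha^T$ (in the semidefinite order) is equivalent to the linear matrix inequality
\begin{align*}
\begin{pmatrix}A&\alpha\\ \alpha^T&1\end{pmatrix}\geq0.
\end{align*}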



\textbf{Bounding the failure probability.} The final step of the preceding \emph{bound propagation} yields an upper bound $B^y:=B^{C\to C+1}$ on the discrepancy $D(p_y,q_y)$ between the (unknown) system TPI output distribution $p_y$ and its model counterpart $q_y$, which is given by samples $y^M_n$. 
We now apply an idea similar to (\ref{eq:p-alpha-joint}) to obtain (a bound on) the system failure probability $p_\text{fail}:=\int_{y>\tau}p_y(y)dy$: 
Rather than maximizing $p_\text{fail}$ over all distributions $p_y$ on ${\mathbb R}\ni y$ subject to the constraint $D(p_y,q_y)\leq B^y$, we make the optimization finite-dimensional by selecting grid-points $g_1<g_2<\ldots<g_V\in{\mathbb R}$ and parameterizing $p_y\equiv p_\alpha=\sum_{v=1}^V\alpha_v\delta_{g_v}$, such that $p_\text{fail}=\sum_{v:g_v>\tau}\alpha_v$. 
In practice, we choose an equally-spaced grid in an interval $[g_\text{min},g_\text{max}]\subset{\mathbb R}$ that covers the ``interesting'' or ``plausible'' TPI range, such as the support of $q_y$ as well as sufficient ranges below and above the threshold $\tau$. 
The size of the optimization problem corresponds to the number of grid-points $V$, so $V\!\simeq\!10^3$ is easily possible here.

With this, our final upper bound $p_\text{fail}\leq F_\text{max}$ on the failure probability becomes the following convex program:
\begin{align}\label{max-failure-objective-alpha}
F_\text{max}~ = \text{max}_\alpha~&\sum_{v:g_v>\tau}\alpha_v\\
\text{s.t.}~~&D(p_\alpha,q_y)\leq B^y,~~\alpha\geq0,~~\sum_v\alpha_v=1. \nonumber
\end{align}
One can obtain better (i.e.\ smaller) bounds $F_\text{max}$ by restricting $p_\alpha$ further by plausible assumptions: 
(a) \emph{Monotonicity:} 
When bounding a tail probability, i.e.\ $p_\text{fail}$ is expected to be small, it may be reasonable to assume that $p_y$ is monotonically decreasing beyond some tail threshold $\tau'$. 
For an equally-spaced grid this adds constraints $\alpha_v\leq\alpha_{v-1}$ for all $v$ with $g_v\geq\tau'$ to (\ref{max-failure-objective-alpha}); 
we always assume this with $\tau':=\tau$. 
(b) \emph{Lipschitz condition:} 
To prevent $p_\alpha$ from becoming too ``spiky'', we impose a Lipschitz condition $|\alpha_{v+1}-\alpha_v|\leq\Lambda_\text{max}|g_{v+1}-g_v|$ with a constant $\Lambda_\text{max}$ estimated from the set of outputs $y^M_n$. 
See also App.\ B. 
% B = \ref{app:violation-optimization} % Appendix reference hard-coded
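
With both assumptions added, the program (\ref{max-failure-objective-alpha}) could be written out as in the following sketch. It assumes an equally-spaced grid with spacing $h:=g_{v+1}-g_v$ (the symbol $h$ is introduced here only for illustration) and $\tau'=\tau$; the exact form used in App.\ B may differ:
\begin{align*}
\text{max}_\alpha~&\sum_{v:g_v>\tau}\alpha_v\\
\text{s.t.}~~&D(p_\alpha,q_y)\leq B^y,~~\alpha\geq0,~~\sum_v\alpha_v=1,\\
&\alpha_v\leq\alpha_{v-1}~~\forall v\geq2~\text{with}~g_v\geq\tau,\\
&-\Lambda_\text{max}h\leq\alpha_{v+1}-\alpha_v\leq\Lambda_\text{max}h~~\forall v=1,\ldots,V-1.
\end{align*}
The objective and all added constraints are linear in $\alpha$, so the convexity of the program rests solely on the discrepancy constraint.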

Note that our final bound $F_\text{max}$ is a probability, whose interpretation is independent of the chosen discrepancy measures, kernels, or lengthscales. 
We can thus select these ``parameters'' by minimizing the finally obtained $F_\text{max}$ over them. 
We do this using Bayesian optimization \citep{frohlich2020noisy}.


We summarize our full discrepancy propagation method to obtain a bound $F_\text{max}$ on the system's failure probability in Algorithm \ref{algorithm:DPBound}, which we refer to as \texttt{DPBound}.


\begin{algorithm}
	\caption{\texttt{DPBound}}
	\label{algorithm:DPBound}
	\begin{algorithmic}[1]
		\STATE {\bfseries Input:} compositional structure of $S$;  composite simulation model $M$; discrepancy measures $D$; validation data $(x_v^c,y_v^c)$; simulation input samples $\{x_n^M\}\equiv q_x$; either \textit{(a)} upper bounds $B^{0\to c}$ on input discrepancies or \textit{(b)} samples $x_v\sim p_x$ from the real-world input distribution.
		
		\STATE Run $M$ on all $x_n^M$; collect all intermediate signals $x^c_n,y^c_n$ to build the sample-based marginals $q|_{c'\to c}$ of $q$.
		
		\STATE In case \textit{(b)}, estimate $B^{0\to c}$ from $p_x\simeq\{x_v\}$ and $q_x$.
		
		\FOR{$c=1,\ldots,C$}
		\FOR{every $c''=c+1,\ldots,C+1$ connected to $c$}
		\STATE Compute $B^{c\to c''}$ via Eq.\ (\ref{max-discrepancy-objective-alpha}) (or via App.\ A).
		% A = \ref{app:semidefinite-relaxation} % Appendix reference hard-coded
		\ENDFOR
		\ENDFOR
		
		\STATE Using the thus obtained $B^y:=B^{C\to C+1}$, compute the final bound $F_\text{max}$ via Eq.\ (\ref{max-failure-objective-alpha}) (or via App.\ B).
		% B = \ref{app:violation-optimization} % Appendix reference hard-coded
	\end{algorithmic}
\end{algorithm}
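As a minimal illustration of the loop in Algorithm \ref{algorithm:DPBound}, consider a hypothetical chain of $C=2$ components in which the system input feeds only $S^1$, the output of $S^1$ feeds only $S^2$, and the output of $S^2$ is the TPI (this topology is assumed here purely for illustration). The bounds are then obtained in the order
\begin{align*}
c=1:&\quad B^{0\to1}\;\mapsto\;B^{1\to2}\quad\text{via (\ref{max-discrepancy-objective-alpha})},\\
c=2:&\quad B^{1\to2}\;\mapsto\;B^{2\to3}=:B^y\quad\text{via (\ref{max-discrepancy-objective-alpha})},\\
\text{final step}:&\quad B^y\;\mapsto\;F_\text{max}\quad\text{via (\ref{max-failure-objective-alpha})},
\end{align*}
where each optimization uses only the validation data of the respective component together with the single input bound produced in the preceding step.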


\textbf{Upper bound property.} We replaced the optimization over all possible system distributions $p$ in (\ref{general-max-discrepancy-objective}) by the distributions $p_\alpha$ from (\ref{eq:p-alpha-joint}) due to the limited system validation data and to make the optimization tractable. 
This restricted and possibly skewed $p_\alpha$ can potentially cause $B^{c\to c''}$ and ultimately $F_\text{max}$ from (\ref{max-discrepancy-objective-alpha}),(\ref{max-failure-objective-alpha}) to not be true upper bounds on $D$ or even the system's (unknown) failure probability $p_\text{fail}$, although the worst-case tendency of the maximizations alleviates the issue. 
We investigate this in the experiments (Sec.\ \ref{sec:reli-benchm-eval}), and in the following proposition we state conditions under which (\ref{max-discrepancy-objective-alpha}),(\ref{max-failure-objective-alpha}) are upper bounds:
\begin{proposition}\label{prop:convergence}
	Suppose that for each component $c=1,\ldots,C$: 
	{\it(i)} the validation inputs $x^c_v$ cover the space of occurring inputs into $S^c$;
	{\it(ii)} (necessary only for components $S^c$ having stochastic output) the $\delta_{y^c_v}$ in the defining equation of $p_\alpha$ (Eq.\ (\ref{eq:p-alpha-joint})) is replaced by the system output distribution $S^c(x^c_v)$ (represented e.g.\ by samples or its kernel-mean embedding); 
	{\it(iii)} the grid $\{g_v\}$ covers the occurring TPI values (e.g.\ discrete and bounded). Then $p_\text{fail}\leq F_\text{max}$, where $F_\text{max}$ is defined by the computations in Eqs.\ (\ref{max-discrepancy-objective-alpha}) and (\ref{max-failure-objective-alpha}) (or alternatively, by the convex relaxations and forms in Apps.\ A and B).
\end{proposition}

App.\ C 
% C = \ref{app:proof-proposition} % Appendix reference hard-coded
gives a proof as well as an additional limit statement about the $B^{c\to c''}$ and $F_\text{max}$ in the more realistic setting of increasingly dense inputs $x^c_v$ and approximations of $S^c(x^c_v)$.



\subsection{Failure Bound via Surrogate Model}\label{sec:uncertainty-wrapper-method}
As an alternative, more heuristic but simpler, 
baseline method to estimate an upper bound $F_\text{max}$ on the failure probability $p_\text{fail}$, we introduce a sampling-based method that also operates on the available data only. 
The underlying concept in quantifying model accuracy is similar to the one from \citet{jiang2021assessing} and can also be thought of as one particular form of error modeling \citep{riedmaier2021unified}. 

The general idea is to train an additional ``surrogate'' model $M'^c$ for each system component $S^c$, and employ the resulting composite $M'$ to estimate how far $M$ deviates from the real $S$ in terms of TPI outputs. 
This is possible because for training $M'^c$ we use the available validation data $(x^c_v,y^c_v)$ measured from $S^c$. 
We take Gaussian processes (GPs) for $M'^c$, but other probabilistic models like normalizing flows or deterministic ones like neural networks are possible as well.
Choosing $M'^c$ from a different model class than $M^c$ will generally lead to a more conservative estimate $F_\text{max}$. 

After training the $M'^c$, we run the resulting composite surrogate model $M'$ on the simulation input samples $x^M_n$ (which make up $q_x$) to obtain TPI output samples $y'^M_{n,i}$ (with $k$ repetitions $i=1,\ldots,k$), in the same way as the given model $M$ generates outputs $y^M_{n,i}$ from the given $x^M_n$. 
By comparing the $y'^M_{n,i}$ to the $y^M_{n,i}$ we obtain a heuristic estimate of the error of $M$ in simulating the actual TPI of system $S$, see \citep{jiang2021assessing}. 
Concretely in this paper, we use the averaged outputs $y^M_n:=(1/k)\sum_i y^M_{n,i}$ and similarly $y'^M_n$, taking as the simulation error $\Delta$ a high quantile (here, 95\%) of the (signed or absolute) deviations:
\begin{align}\label{eq:delta-surr-model}
\Delta=\text{quantile}_{0.95}[\{y'^M_n-y^M_n\}].
\end{align}
The 100\%-quantile $\max_n(y'^M_n-y^M_n)$ would lead to more conservative bounds $F_\text{max}$. For an illustration see App.\ E.2.
% E.2 = \ref{sec:signal_propagation_illustration} % Appendix reference hard-coded

Finally, we estimate an upper bound on $p_\text{fail}$ by including a safety margin $\Delta$ before the threshold $\tau$:
\begin{align}\label{eq:Fmax_estimate_UW}
F_\text{max}:=\int_{\tau-\Delta}^{\infty}q(y)dy=\frac{1}{n_M\cdot k}\sum_{n,i}\mathds{1}_{y^M_{n,i}>\tau-\Delta}.
\end{align}
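As a purely hypothetical numerical illustration of (\ref{eq:Fmax_estimate_UW}) (the numbers are invented for this example and not taken from the experiments), suppose $n_M\cdot k=1000$ model outputs, a threshold $\tau=5$, and a margin $\Delta=0.3$. If $10$ outputs exceed $\tau$ and $25$ further outputs fall into $(\tau-\Delta,\tau]=(4.7,5]$, then
\begin{align*}
F_\text{max}=\frac{10+25}{1000}=3.5\%,
\end{align*}
compared with the value $10/1000=1.0\%$ that would result without the safety margin.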
Note that this method \emph{cannot} account for discrepancies between the (unknown) system input distribution $p_x$ w.r.t.\ which we would like to bound the failure probability $p_\text{fail}$, and the given simulation inputs $x^M_n$ that make up $q_x$, see Sec.\ \ref{sec:experiments}. Rather, the method can be expected to work well only when the sample-based $q_x$ is close to the actual $p_x$. Furthermore, if the simulation model and the surrogate model share unjustified modeling assumptions, an agreement between the two models might mask differences from the actual system.




\section{Experiments}\label{sec:experiments}
We evaluate the proposed method on $8$ benchmark systems in Sec.~\ref{sec:reli-benchm-eval}.\footnote{Our organization is carbon neutral. Therefore, all its activities including research activities (e.g., compute clusters) no longer leave a carbon footprint.} 
For each system, we create $4$ configurations where the simulation models and/or the simulation input distributions differ.
These configurations are illustrated on an artificial example in Sec.~\ref{sec:experiment_linear_use-case}. 
%, followed by an evaluation on benchmark systems in Sec.~\ref{sec:reli-benchm-eval}.


%\subsection{Illustrative example: Linear Gaussian models} 
\subsection{Illustration: Gaussian Models}\label{sec:experiment_linear_use-case}
As an exemplary validation problem, consider the following one-component, one-dimensional setup: 
For a linear system  $S:\R\rightarrow\R$ with $S(x)=w_S x + b_S$, we want to assess the failure probability in (\ref{eq:failure_prob}) by means of a linear model $M(x) = w_M x + b_M $. 
Both $S$ and $M$ are stimulated with samples from Gaussian distributions $p_x=\mathcal{N}(\mu_p, \sigma_p^2)$ and $q_x=\mathcal{N}(\mu_q, \sigma_q^2)$, respectively.
We can control the accuracy of $M$ and its input distribution $q_x$ separately, thereby investigating their impacts on the estimated failure probability.
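Because affine maps preserve Gaussianity, the relevant quantities of this example are available in closed form, as the following worked equations make explicit. They assume $w_S,w_M\neq0$; the standard normal CDF $\Phi$ and the symbol $q_\text{fail}$, denoting the failure probability that would be read off naively from the simulation alone, are notation introduced here only for illustration:
\begin{align*}
S(p_x)&=\mathcal{N}(w_S\mu_p+b_S,\,w_S^2\sigma_p^2),\\
M(q_x)&=\mathcal{N}(w_M\mu_q+b_M,\,w_M^2\sigma_q^2),\\
p_\text{fail}&=1-\Phi\Big(\frac{\tau-w_S\mu_p-b_S}{|w_S|\sigma_p}\Big),\\
q_\text{fail}&=1-\Phi\Big(\frac{\tau-w_M\mu_q-b_M}{|w_M|\sigma_q}\Big).
\end{align*}
A misfit in $b_M$ or a bias in $\mu_q,\sigma_q$ thus alters $q_\text{fail}$ while leaving $p_\text{fail}$ unchanged.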



Specifically, we analyze the following two configurations: (a) $M$ and $S$ differ in $b_M\neq b_S$, but they  receive identical inputs $q_x=p_x$ (\emph{Misfit Model \& Perfect Input}); and (b) the model $M$ is identical to $S$, but their respective input distributions $q_x\neq p_x$ differ (\emph{Perfect Model \& Biased Input}).
Under these two configurations, \texttt{DPBound} is illustrated for a single propagation step in Fig.~\ref{fig:linear_use_case_illustration}, where the marginal output distributions of $M$ and $S$ differ (top row marginals in brown resp.\ blue).
However, the output discrepancy bound $B^{1\to2}$ from Eq.\ (\ref{max-discrepancy-objective-alpha}) originates differently in the two configurations: In Fig.\ \ref{fig:b}, the input discrepancy $B^{0\to1}$ 
vanishes due to identical inputs $p_x=q_x$.
The output bound (\ref{max-discrepancy-objective-alpha})  thus directly reflects the difference between the output marginals $S(q_x)$ and $M(q_x)$, without optimization since the constraint forces the weights $\alpha$ to be uniform so that $p_\alpha=q_x$ 
(here, we assumed as validation inputs the $q_x$-samples for simplicity). 

\begin{figure}[t]
	\begin{center}
		\subfigure[]{\label{fig:b}\includegraphics[width=0.235\textwidth]{model_mismatch_input_perfect_without_uw.pdf}}
		\subfigure[]{\label{fig:a}\includegraphics[width=0.235\textwidth]{model_perfect_input_bias_without_uw.pdf}}
	\end{center}
	\caption{Illustration of \texttt{DPBound} for a linear mapping between (samples from) Gaussian signals.
		{\textbf{(a)}} There is model mismatch $M\neq S$, but the input distribution $q_x=p_x$ is perfect.
		{\textbf{(b)}} $M=S$ is a perfect model, but the model input distribution $q_x\neq p_x$ is biased w.r.t.\ the real world.
		The computed weights $\alpha_v$ from Eqs.\ (\ref{eq:p-alpha-joint}),(\ref{max-discrepancy-objective-alpha}) are depicted by the size of the blue $S(x)$-markers ($\alpha$ is uniform in case (a)).\label{fig:linear_use_case_illustration}}
\end{figure}

In Fig.\ \ref{fig:a}, however, the bound on the output discrepancy stems solely 
from the non-zero input discrepancy $B^{0\to1}=\text{MMD}(p_x,q_x)>0$, rather than from any difference between $M$ and $S$:
\texttt{DPBound} finds in Eq.\ (\ref{max-discrepancy-objective-alpha}) the worst-case weighted output distribution $p_\alpha$ consistent with the input discrepancy $B^{0\to1}$. 
The output bound is then the nonzero difference between this $p_\alpha$ and $M(q_x)$, even though both $p_\alpha$ and $M(q_x)$ were built with outputs from $S=M$. 


In this way, our method \texttt{DPBound} can account for both model misfits and biased inputs. 
The latter is not true of the \texttt{SurrModel} method from Sec.\ \ref{sec:uncertainty-wrapper-method}, as it ignores the real-world distribution $p_x$ and can thus fail for biased inputs $q_x\neq p_x$. For further details and illustrations, see App.\ E.
% E = \ref{appendix:illustrations} % Appendix reference hard-coded

\begin{table*}[t]
	\caption{Bounds on the failure probability (in \%, with standard deviations from 5 repetitions) delivered by the three compared methods for each benchmark problem under the four simulation configurations \emph{Perfect vs.\ Misfit Model} and \emph{Perfect vs.\ Biased Input}. Each problem has been normalized such that the ground-truth failure probability is 1\%. Also shown (in bold) is the ratio of invalid bounds (i.e.\ bounds below 1\%) delivered by each method among the 40 runs per configuration.\label{tab:results_benchmark}}
	
	
	\begin{center}
		%\footnotesize
		\resizebox{0.90\textwidth}{!}{ 
			\begin{tabular}{l|l|rrr|rrr}
				&            & \multicolumn{3}{c|}{\textbf{Perfect Model}}  & \multicolumn{3}{c}{\textbf{Misfit Model}}\\
				& \cellcolor{verylightgray} Problem &   \cellcolor{verylightgray} DPBound &       \cellcolor{verylightgray} MCCP &  \cellcolor{verylightgray} SurrModel &     \cellcolor{verylightgray} DPBound &       \cellcolor{verylightgray} MCCP &  \cellcolor{verylightgray} SurrModel \\
				\midrule
				& \multicolumn{7}{c}{\cellcolor{veryverylightgray}{Single Component}}  \\
				\multirow{9}{*}{\textbf{\rot{Perfect Input}}} 
				& Borehole &  1.10 $\pm$ 0.2 & 1.87 $\pm$ 0.3 & 3.04 $\pm$ 1.3 &   1.75 $\pm$ 1.7 & 1.06 $\pm$ 0.4 & 3.48 $\pm$ 1.6 \\
				& Branin & 19.98 $\pm$ 3.3 & 2.86 $\pm$ 0.6 & 1.84 $\pm$ 0.5 &  18.87 $\pm$ 2.8 & 2.76 $\pm$ 0.8 & 1.80 $\pm$ 0.7 \\
				& Four Branch &  23.1 $\pm$ 1.0 & 2.08 $\pm$ 0.4 & 1.48 $\pm$ 0.6 &  22.07 $\pm$ 0.7 & 1.07 $\pm$ 0.2 & 0.84 $\pm$ 0.6 \\
				& \multicolumn{7}{c}{\cellcolor{veryverylightgray}{Multiple Components}}  \\
				& Chained Solvers & 14.92 $\pm$ 2.7 & 2.08 $\pm$ 0.4 & 1.28 $\pm$ 0.6 &  15.21 $\pm$ 2.2 & 2.18 $\pm$ 0.6 & 1.60 $\pm$ 0.4 \\
				& Borehole &  8.94 $\pm$ 6.1 & 2.03 $\pm$ 0.5 & 3.12 $\pm$ 1.9 &   7.51 $\pm$ 2.6 & 2.06 $\pm$ 0.9 & 3.52 $\pm$ 1.6 \\
				& Branin & 17.91 $\pm$ 6.8 & 2.71 $\pm$ 0.5 & 1.56 $\pm$ 0.4 &   16.6 $\pm$ 6.3 & 2.82 $\pm$ 0.2 & 1.88 $\pm$ 0.4 \\
				& Four Branch &  36.4 $\pm$ 3.1 & 2.18 $\pm$ 0.8 & 1.52 $\pm$ 0.7 &  35.82 $\pm$ 3.2 & 0.81 $\pm$ 0.2 & 0.56 $\pm$ 0.3 \\
				& Controlled Solvers & 10.39 $\pm$ 4.6 & 1.97 $\pm$ 0.7 & 0.92 $\pm$ 0.5 &  11.01 $\pm$ 4.3 & 1.91 $\pm$ 0.7 & 0.88 $\pm$ 0.5 \\
				
				\hline
				&    \textbf{\% Invalid Bounds} & \textbf{5.0} & \textbf{0.0} & \textbf{17.5} & \textbf{0.0} & \textbf{22.5} & \textbf{27.5}\\
				\midrule
				\multirow{9}{*}{\textbf{\rot{Biased Input}}} 
				& \multicolumn{7}{c}{\cellcolor{veryverylightgray}{Single Component}}  \\
				&       Borehole & 16.57 $\pm$ 8.3 & 0.80 $\pm$ 0.3 & 0.88 $\pm$ 0.8 & 15.19 $\pm$ 10.5 & 0.60 $\pm$ 0.0 & 0.76 $\pm$ 0.8 \\
				&         Branin &  92.7 $\pm$ 3.3 & 0.73 $\pm$ 0.3 & 0.08 $\pm$ 0.2 &  93.35 $\pm$ 2.9 & 2.01 $\pm$ 0.8 & 1.00 $\pm$ 0.6 \\
				&    Four Branch & 22.69 $\pm$ 0.9 & 1.98 $\pm$ 0.5 & 1.28 $\pm$ 0.5 &  21.96 $\pm$ 1.0 & 0.88 $\pm$ 0.2 & 0.60 $\pm$ 0.3 \\
				& \multicolumn{7}{c}{\cellcolor{veryverylightgray}{Multiple Components}}  \\
				& Chained Solvers & 26.39 $\pm$ 1.7 & 0.74 $\pm$ 0.2 & 0.08 $\pm$ 0.1 &   26.7 $\pm$ 1.7 & 1.39 $\pm$ 0.7 & 0.56 $\pm$ 0.5 \\
				& Borehole & 20.96 $\pm$ 9.0 & 0.93 $\pm$ 0.4 & 0.76 $\pm$ 0.8 &  15.89 $\pm$ 7.5 & 0.60 $\pm$ 0.0 & 0.44 $\pm$ 0.3 \\
				&   Branin & 89.52 $\pm$ 6.1 & 0.81 $\pm$ 0.2 & 0.12 $\pm$ 0.1 &  89.52 $\pm$ 6.1 & 0.91 $\pm$ 0.5 & 0.24 $\pm$ 0.3 \\
				& Four Branch & 35.83 $\pm$ 3.2 & 1.96 $\pm$ 0.9 & 1.28 $\pm$ 0.7 &  35.42 $\pm$ 2.9 & 0.74 $\pm$ 0.2 & 0.44 $\pm$ 0.2 \\
				& Controlled Solvers & 13.36 $\pm$ 4.1 & 0.60 $\pm$ 0.0 & 0.00 $\pm$ 0.0 &  13.51 $\pm$ 5.9 & 0.67 $\pm$ 0.2 & 0.04 $\pm$ 0.1 \\
				\hline
				&    \textbf{\% Invalid Bounds} & \textbf{0.0} & \textbf{67.5} & \textbf{70.0} & \textbf{0.0} & \textbf{67.5} & \textbf{75.0} \\
				\bottomrule
			\end{tabular}
		}
	\end{center}
\end{table*}


\begin{table*}
	\caption{Compared benchmark systems.\label{tab:compared_benchmark_systems}}
	\begin{center}
		%\scriptsize
		\begin{tabular}{ |l|c|c| } 
			\hline
			Benchmark System & Input Dim. & Components \\
			\hline
			{\bf{Controlled Solvers}}~\citep{sanson2019systems} & 16 & 4 \\
			{\bf{Chained Solvers}}~\citep{sanson2019systems} & 1 & 2 \\
			{\bf{Borehole}}~\citep{sim_bench_website} & 8 & 1 / 5 \\
			{\bf{Branin}}~\citep{sim_bench_website} & 2 & 1 / 3 \\
			{\bf{Four Branch}}~\citep{UQworld} & 2 & 1 / 4 \\
			\hline
		\end{tabular}
	\end{center}
\end{table*}



\subsection{Reliability Benchmark Evaluation}\label{sec:reli-benchm-eval}
We now demonstrate the feasibility of our discrepancy propagation method in a reliability benchmark.

\textbf{Compared benchmark systems.} The performance of \texttt{DPBound} is evaluated on 3 single-component and 5 multi-component problems from the reliability and uncertainty propagation literature.
These problems are briefly summarized in Tab.\ \ref{tab:compared_benchmark_systems} below (see App.\ D 
% D = \ref{sec:deta-exper-sect} % Appendix reference hard-coded 
for more details).

For the evaluation, we set the threshold $\tau$ on the scalar output for each of those problems such that the ground-truth failure probability $\textrm{Pr}_{x\sim p_x}[S(x)>\tau]=1\%$ (see Eq.\ (\ref{eq:failure_prob})).

We evaluate each system under four different simulation configurations (cf.\ Sec.\ \ref{sec:experiment_linear_use-case}): As simulation models, we take either \emph{Perfect Models} $M^c=S^c$ or GP-based \emph{Misfit Models} $M^c\neq S^c$; as simulation input distribution, we take either \emph{Perfect Input} $q_x=p_x$ or \emph{Biased Input} $q_x\neq p_x$ (App.\ D).
% D = \ref{sec:deta-exper-sect}) % Appendix reference hard-coded


\textbf{Compared virtual validation methods.} We compare our failure probability bound with two alternative methods:\\
{\bf{\texttt{DPBound} \textbf{(ours):}}}
Failure bound $F_\text{max}$ calculated by propagating MMD-based bounds according to Algorithm \ref{algorithm:DPBound}.\\ 
{\bf{\texttt{MCCP}}\textbf{:}}
95\%-confidence Clopper-Pearson bound on the failure probability, calculated on binary Monte-Carlo samples obtained by thresholding the output of the simulation model. \\ 
{\bf{\texttt{SurrModel}}\textbf{:}}
Bound from Eq.\ (\ref{eq:Fmax_estimate_UW}), obtained by accounting for the difference between the simulation and a GP-based surrogate model learned on the validation data (Sec.\ \ref{sec:uncertainty-wrapper-method}).
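
For reference, a Clopper-Pearson upper bound of the kind used by \texttt{MCCP} can be stated as follows. This is a sketch assuming a one-sided bound at confidence level $1-\delta$ with $\delta=0.05$, computed from $n$ simulated outputs of which $m<n$ exceed $\tau$; the symbols $n$, $m$, $\delta$, $\hat p_\text{up}$ and the one-sided convention are assumptions made here for illustration. The bound $\hat p_\text{up}$ is the solution $p\in(0,1)$ of
\begin{align*}
\sum_{i=0}^{m}\binom{n}{i}p^i(1-p)^{n-i}=\delta,
\end{align*}
which for $m=0$ reduces to $\hat p_\text{up}=1-\delta^{1/n}$. The bound depends on the thresholded simulation outputs only; neither validation data from $S$ nor the input distribution $p_x$ enter.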

\textbf{Experimental results.} The obtained results are summarized in Tab.\ \ref{tab:results_benchmark}. 
We first focus on the validity of the methods (bold numbers in Tab.\ \ref{tab:results_benchmark}): 
One can see that in the ``Perfect Input--Perfect Model'' setting, the methods produce generally valid bounds, with at most 17.5\% invalidness for \texttt{SurrModel}. 
In the more challenging and realistic settings with misfit and/or input bias, however, the invalidness ratios for \texttt{MCCP} and \texttt{SurrModel} increase beyond acceptable levels, especially for \emph{Biased Input} reaching invalidness up to 75\%. 
\texttt{DPBound} on the other hand remains perfectly valid under both \emph{Misfit Model} and \emph{Biased Input} (as in Sec.\ \ref{sec:experiment_linear_use-case}).

To understand why \texttt{MCCP} and \texttt{SurrModel} struggle with the \emph{Biased Input} setting, notice that these methods disregard the actual system inputs $p_x$, instead relying solely on the simulation inputs $q_x$ 
without any means of dealing with a potential discrepancy between the two distributions.
When the simulation input distribution is biased towards significantly lower simulated TPI values, the delivered bounds can then be invalid (for an illustration see App.\ E.2). 
% E.2 = \ref{sec:signal_propagation_illustration}) % Appendix reference hard-coded 
\texttt{DPBound} on the other hand natively accounts for this input discrepancy through the initial bounds $B^{0\to c}$, which in our experiments are estimated via samples from $p_x$, $q_x$ (see end of Sec.\ \ref{sec:setup}). 
In addition to this shortcoming of ignoring input discrepancies, \texttt{MCCP} remains unaware of any potential \emph{Misfit Model}, as it completely ignores the system $S$ and its validation data. 
This explains the jump in invalidness from 0.0\% to 22.5\% when isolating this effect in the \emph{Perfect Input} setting. 
Note, while \texttt{MCCP} is not expected to produce upper bounds in 100\% of cases due to its 95\%-confidence specification, it stays significantly below the 95\% promise (App.\ E.3).
% E.3 = \ref{app:MCCP99} % Appendix reference hard-coded 
%, for the above reasons (see App.\ E.3 
% E.3 = \ref{app:MCCP99} % Appendix reference hard-coded 
%for further discussion). 


Although \texttt{DPBound} provides valid upper bounds $F_\text{max}$ in almost all cases, it can sometimes still underestimate the failure probability, as happened in two validation runs for Borehole (Single) under \emph{Perfect Input--Perfect Model} with bounds close to the 1\% ground-truth.  
To explain how this can happen, note that only under dense sampling conditions is \texttt{DPBound} expected to be perfectly valid (Prop.\ \ref{prop:convergence} and App.\ C). 
% C = \ref{app:proof-proposition}) % Appendix reference hard-coded 
In any case, \texttt{DPBound} shows high validity overall, while the two competing methods' likelihood of falsely positive validation is certainly too high for trustworthy statements.

Summarizing these validity results, of the three compared methods, only \texttt{DPBound} should be considered further as a reliable validation method. 

Despite its high validity, \texttt{DPBound} delivers bounds $F_\text{max}$ that are mostly far from trivial (i.e.\ much below 100\%), also in the challenging misfit and/or biased settings. 
These bounds in conjunction with their high validity thus yield useful information, which can serve as a basis for
extended validation approaches (see Conclusion). 
The fact that \texttt{MCCP} and \texttt{SurrModel} often produce much smaller or ``tighter''\footnote{Note, the ``tightness'' of the bounds can be read off from Tab.~\ref{tab:results_benchmark} by subtracting from each bound the ground-truth value of 1\%.} bounds is no advantage per se, as these bounds are often invalid and thus misleading in validation (see above). 
We did not focus on the tightness evaluation because our main evaluation criterion was the rate of falsely positive validations, which already excluded both competing methods in safety-relevant situations.

While it remains for future work to combine \texttt{DPBound}'s non-trivial and reliable bounds with other validation approaches, our investigation here constitutes the first one into the validity of validation methods in the component-wise setting, establishing \texttt{DPBound} as a viable candidate.




\section{Conclusion}\label{sec:conclusion}
Validating complex composite systems is a notoriously difficult 
task, e.g.\ validating the performance of autonomously driving vehicles.
Instead of expensively testing the system in the real world, simulations can reduce the validation effort \citep{wong2020testing}.
However, it is hard to assess the effect of simulation model inaccuracies or 
%shifts within the input test set 
input data shifts on the validation target, especially for composite systems.
We have developed a method to estimate an upper bound on the system failure probability, as underestimation of failure rates is typically much more costly than overestimation. 
Our method assumes that a simulation model as well as measurement data for each subsystem are available. 
Our evaluations show that 
%the method provides bounds that are 
the obtained bounds are useful and valid in general, with theoretical guarantees in the large-data limit (Prop.\ \ref{prop:convergence}). 

Due to its individual-component nature, our method is especially suited to situations in which only one component of an already deployed system changes, e.g.\ a sensor or the software controller in an autonomous driving system.
Although the values computed for the bounds by our propagation method are larger than what would typically be required for safety-relevant applications, they still yield useful information, for example by acting as a safeguard before entering an expensive real-world testing phase. Continuing the proposed avenue of research may ultimately enable validation with an immensely reduced number of real-world test runs.
For the future, it remains to explore the method in higher-dimensional situations, possibly by extending our parameterization of the joint distribution $p_\alpha$.
While the presented method was derived for static signals, it can be extended to dynamic systems and models by replacing the static signals with embeddings of time-series signals \citep{morrill2020generalised}.
Exploring the proposed method in these regimes, which often include feedback loops, is subject to future research.








\begin{acknowledgements}
This research was supported by the German Federal Ministry for Economic Affairs and Climate Action under the joint project ``KI-Embedded'' (grant no.\ 19I21043A).
\end{acknowledgements}

% References
%\bibliography{uai2023-template}
\bibliography{reeb_297}
\end{document}
