
\section{Preliminaries}
\label{sec:bottleneck}


\subsection{Decision-Focused Learning by Differentiating Through KKT Conditions}
%\vspace{-0.7em}
%\begin{table}[]
%\centering 
%\renewcommand\arraystretch{0.88}
%\fontsize{9}{11}\selectfont \setlength{\tabcolsep}{0.6em}
%\begin{tabular}{@{}lcccc@{}}
%\toprule
     %Methods     & Decision-aware & Model-mismatch error & Statistical error & Gradient approximation error \\ \midrule
%Two-stage &      \textcolor{red}{\XSolidBrush}          &  %\textcolor{red}{\XSolidBrush}& \textcolor{red}{\XSolidBrush}    &  \textcolor{red}{\XSolidBrush}  \\
%DFL    \cite{donti2017task}   &                &                      &                                    &                              \\
%SO-EBM\cite{kongend}    &                &                      &                                    &                              \\
%\ours     & \textcolor{green}{\CheckmarkBold} & \textcolor{green}{\CheckmarkBold} & \textcolor{green}{\CheckmarkBold} &  \textcolor{green}{\CheckmarkBold} \\ \bottomrule
%\end{tabular}
%\end{table}

%\subsection{Problem Definition}

% \begin{figure}[t]
%     \centering
%     \includegraphics[width=\textwidth]{figure/bottleneck.pdf}
%     \caption{Decision-focused learning directly optimizes the task loss and leads to better decision regret. However, it suffers from three significant bottlenecks.}
%     \label{fig:existing}
% \end{figure}




% A prevalent approach to addressing the above stochastic optimization problem is the two-stage predict-then-optimize framework. This method first learns a probabilistic predictive model $p(y|\mathbf{x};\theta)$ and then employs existing stochastic optimization solvers to obtain the optimal action minimizing the expected cost: $a^{*}(\mathbf{x};\theta)=\argmin_{a\in C}\mathbb{E}_{y\sim p(y|\mathbf{x};\theta)}f(y, a)$. Despite its simplicity and efficiency, the two-stage approach can suffer from suboptimal performance due to the misalignment of prediction loss and optimization loss. 




%DFL combines prediction and optimization into a single end-to-end model, thereby customizing the predictive model for the optimization task. Existing work usually assumes that the predictor \(\mathbf{M}_{\theta}\) takes the features $\mathbf{x}$ as input and produces a point estimate $\mathbf{M}_{\theta}(\mathbf{x})=  \mathbf{\hat y}$, which is then used to parameterize the downstream optimization problem $\argmin_{\mathbf{a}\in C }f(\hat{\mathbf{y}}, \mathbf{a})$, where $f$ is the cost function, $\mathbf{a}$ is the decision variable, $C$ is the feasible space.


In the predict-then-optimize problem, a predictor $\mathbf{M}_{\theta}$ takes features $\mathbf{x}$ as input and outputs a point estimate $\hat{\mathbf{y}}$. This estimate parameterizes the optimization problem $\argmin_{\mathbf{a}\in C} f(\hat{\mathbf{y}}, \mathbf{a})$, where $f$ is the cost function, $\mathbf{a}$ is the decision variable, and $C$ is the feasible space.


However, point estimates fail to capture the uncertainty inherent in model predictions \citep{abdar2021review} and the stochastic nature of the problem parameters \citep{schneider2007stochastic}. To address this, we adopt a probabilistic framework in which the predictor outputs a probability distribution $p_{\theta}(\mathbf{y}|\mathbf{x})$ rather than a point estimate. The downstream task then becomes a stochastic optimization problem whose goal is to find the optimal action that minimizes the expected cost, $\mathbf{a}^*(\mathbf{x};\theta)=\argmin_{\mathbf{a}\in C}\mathbb{E}_{p_{\theta}(\mathbf{y}|\mathbf{x})}[f(\mathbf{y},\mathbf{a})]$. This formulation accounts for the uncertainty and variability present in the parameters.

Predictions are then evaluated by the decision loss they induce, that is, the value of the cost function under the true parameters $\mathbf{y}$. For a dataset $\mathcal{D}=\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^N$, the goal is to train a model $\mathbf{M}_{\theta}$ to minimize the decision loss:
\begin{align}  \textstyle \theta^*=\argmin_{\theta}\frac{1}{N}\sum_{i=1}^Nf(\mathbf{y}_i, \mathbf{a}^*(\mathbf{x}_i;\theta)).
\end{align}
Because the decision loss is optimized directly, its gradient with respect to the model parameters follows from the chain rule: $\frac{\mathrm{d} f(\mathbf{y},\mathbf{a}^{*}(\mathbf{x};\theta))}{\mathrm{d} \theta}=\frac{\mathrm{d} f(\mathbf{y},\mathbf{a}^{*}(\mathbf{x};\theta))}{\mathrm{d} \mathbf{a}^{*}(\mathbf{x};\theta)}\frac{\mathrm{d} \mathbf{a}^{*}(\mathbf{x};\theta)}{\mathrm{d} {\theta}}.$ To compute the Jacobian $\frac{\mathrm{d} {\mathbf{a}^{*}(\mathbf{x};\theta)}}{\mathrm{d} \theta}$ for backpropagation, OptNet \citep{amos2017optnet} assumes a quadratic optimization objective and differentiates through the KKT conditions using the implicit function theorem. Later, cvxpylayers \citep{agrawal2019differentiable} extends this technique to more general convex optimization problems expressed in the disciplined parameterized programming (DPP) grammar.
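
As a minimal sketch of this step, assume that no constraint in $C$ is active at the optimum and that the expected cost $F(\mathbf{a},\theta)=\mathbb{E}_{p_{\theta}(\mathbf{y}|\mathbf{x})}[f(\mathbf{y},\mathbf{a})]$ is twice continuously differentiable with an invertible Hessian in $\mathbf{a}$; the symbol $F$ and these assumptions are introduced here only for illustration. The KKT conditions then reduce to stationarity, and differentiating the stationarity condition with respect to $\theta$ gives
\begin{align*}
\nabla_{\mathbf{a}}F(\mathbf{a}^{*},\theta) &= \mathbf{0}, \\
\frac{\mathrm{d}\mathbf{a}^{*}}{\mathrm{d}\theta} &= -\big[\nabla^{2}_{\mathbf{a}\mathbf{a}}F(\mathbf{a}^{*},\theta)\big]^{-1}\,\nabla^{2}_{\mathbf{a}\theta}F(\mathbf{a}^{*},\theta).
\end{align*}
When constraints are active, the same argument is applied to the full KKT system, including the dual variables, so the backward pass amounts to solving a linear system at the optimum.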
% This way, OptNet can obtain the Jacobian by solving the optimization problem along with a set of linear equations in each training iteration.
% Later works \cite{agrawal2019differentiable, sun2023alternating} extend this technique to general convex cost functions.





% \begin{figure}[t]
%     \centering
%     \begin{subfigure}[t]{0.3\textwidth}
%         \includegraphics[width=\textwidth]{figure/dfl.pdf}
%         \caption{Model mismatch error}
%         \label{fig:figure1}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}[t]{0.3\textwidth}
%         \includegraphics[width=\textwidth]{figure/gmm_components.pdf}
%         \caption{Sample average approximation error}
%         \label{fig:figure2}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}[t]{0.3\textwidth}
%         \includegraphics[width=\textwidth]{figure/quadratic.pdf}
%         \caption{Gradient approximation error}
%         \label{fig:figure2}
%     \end{subfigure}
%     \caption{Bottlenecks of Decision-focused Learning}
%     \label{fig:figures}
% \end{figure}


\begin{figure*}
    \centering
    \includegraphics[width=0.9\textwidth]{figure/v7.pdf}
    \caption{The proposed attention-based network architecture of \ours. The network contains an encoder and a set of learnable attention points $\{\mathbf{k}_s,\mathbf{v}_s\}_{s=1}^S$. Given an input feature $\mathbf{x}$, the encoder first projects it into the query embedding space and then computes the attention weights from its dot product with the key embeddings. The final function value $g(\mathbf{x},\mathbf{a})$ is a weighted combination of $f(\mathbf{v}_s,\mathbf{a})$. The designed network architecture can effectively reduce the bias error in Proposition~\ref{prop:1}.}
    \label{fig:overall}
    %\vspace{-0.5em}
\end{figure*}


\subsection{Bottlenecks under the Probabilistic Setting}

Although DFL methods that differentiate through the KKT conditions can achieve better decisions than two-stage learning, they face three significant bottlenecks under the probabilistic setting.

\noindent\textbf{Bottleneck 1: Model Mismatch Error.} Real-world applications often involve complex and multi-modal probability distributions $p(\mathbf{y}|\mathbf{x})$~\citep{kong2023uncertainty}. One prominent example is the wind power forecasting task, where the environment exhibits high uncertainty due to the dynamic and stochastic nature of wind patterns. Factors such as changing weather conditions, terrain, and turbulence can significantly affect the true distribution of wind power, making it highly intricate and challenging to model accurately.


Existing approaches \citep{donti2017task, kongend}, however, tend to assume simple distributions, \eg, an isotropic Gaussian, for computational feasibility. In tasks with high uncertainty, this assumption can cause considerable misalignment between the model's parameterized distribution and the true underlying distribution, which results in poor approximations and degrades decision-focused learning performance. Fig.~\ref{fig:existing}(a) illustrates this issue using a ground-truth distribution composed of a mixture of three Gaussians: the performance of DFL approaches suffers from the model mismatch error, which is particularly pronounced in highly uncertain environments.

\noindent\textbf{Bottleneck 2: Sample Average Approximation Error.} In complex optimization problems, closed-form expressions for the expectation may be unavailable, necessitating sample average approximation \citep{kim2015guide,verweij2003sample,kleywegt2002sample}. Although a more expressive distribution, such as a mixture density network, could reduce the model mismatch, it introduces another issue: sample average approximation error. As shown in Fig.~\ref{fig:existing}(b), for intricate distributions, increasing the sample size reduces the gradient variance only slowly while demanding substantially more computation and longer running times.
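
As a rough illustration of this trade-off, consider the sample average approximation of the expected cost with $K$ i.i.d. samples. The symbols $K$, $\hat{F}_K$, and $\sigma^2$ are notation introduced here, and $f(\mathbf{y},\mathbf{a})$ is assumed to have finite variance under $p_{\theta}(\mathbf{y}|\mathbf{x})$:
\begin{align*}
\hat{F}_K(\mathbf{a}) &= \frac{1}{K}\sum_{k=1}^{K} f(\mathbf{y}^{(k)},\mathbf{a}), \qquad \mathbf{y}^{(k)}\sim p_{\theta}(\mathbf{y}|\mathbf{x}), \\
\mathrm{Var}\big[\hat{F}_K(\mathbf{a})\big] &= \frac{\sigma^{2}(\mathbf{a})}{K}, \qquad \sigma^{2}(\mathbf{a})=\mathrm{Var}_{p_{\theta}(\mathbf{y}|\mathbf{x})}\big[f(\mathbf{y},\mathbf{a})\big].
\end{align*}
The standard error of this estimate therefore decays as $K^{-1/2}$, so halving it requires roughly four times as many samples, each of which adds a term that must be evaluated and differentiated in every training iteration.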

\noindent\textbf{Bottleneck 3: Gradient Approximation Error.} Differentiation through the KKT conditions applies only to convex objectives, whereas many real-world applications involve complicated non-convex objectives. \citet{perrault2019decision, wang2020scalable} propose to approximate the non-convex objective by a quadratic function around a local minimum in order to estimate $\frac{\mathrm{d} \mathbf{a}^*}{\mathrm{d} \theta}$ (Fig.~\ref{fig:existing}(c)), but the resulting gradient errors may accumulate over the training iterations and thus lead to poor decisions.
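
One way to read this approximation, under simplifying assumptions of ours rather than the exact procedure of the cited works, is as a second-order Taylor expansion of the expected cost $F(\mathbf{a},\theta)=\mathbb{E}_{p_{\theta}(\mathbf{y}|\mathbf{x})}[f(\mathbf{y},\mathbf{a})]$ around a local minimum $\bar{\mathbf{a}}$. The symbols $F$, $\bar{\mathbf{a}}$, and $\mathbf{H}$ are introduced here for illustration, and $F$ is assumed to be twice differentiable in $\mathbf{a}$:
\begin{align*}
F(\mathbf{a},\theta) &\approx F(\bar{\mathbf{a}},\theta) + \nabla_{\mathbf{a}}F(\bar{\mathbf{a}},\theta)^{\top}(\mathbf{a}-\bar{\mathbf{a}}) + \tfrac{1}{2}(\mathbf{a}-\bar{\mathbf{a}})^{\top}\mathbf{H}\,(\mathbf{a}-\bar{\mathbf{a}}), \\
\mathbf{H} &= \nabla^{2}_{\mathbf{a}\mathbf{a}}F(\bar{\mathbf{a}},\theta).
\end{align*}
The surrogate matches $F$ only near $\bar{\mathbf{a}}$ and is convex only when $\mathbf{H}$ is positive semidefinite, so the Jacobian obtained from it reflects the local behavior around $\bar{\mathbf{a}}$, which need not coincide with that of the global minimizer.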

The first two errors occur during both training and testing, whereas gradient approximation errors occur only during training. Recently, several methods~\citep{kongend,shah2022learning, shah2023leaving} have proposed surrogate losses for DFL to avoid differentiating through KKT conditions. However, they still suffer from the first two bottlenecks.

Note that when the objective function is linear, the expected cost has a closed-form expression that depends only on the mean of the distribution, so the model does not suffer from these bottlenecks. Accordingly, \citet{elmachtoub2022smart} prove that SPO converges to a decision with optimal expected cost regardless of the distribution. In this paper, we consider a more complex setting where estimating the expected cost requires the entire predictive distribution.
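
To make the contrast concrete, the following sketch uses two illustrative cost functions that are assumptions of ours and not taken from the experiments. For a cost that is linear in $\mathbf{y}$, the expectation passes through to the mean:
\begin{align*}
f(\mathbf{y},\mathbf{a})=\mathbf{y}^{\top}\mathbf{a}
\quad\Longrightarrow\quad
\mathbb{E}_{p_{\theta}(\mathbf{y}|\mathbf{x})}\big[f(\mathbf{y},\mathbf{a})\big]
=\mathbb{E}_{p_{\theta}(\mathbf{y}|\mathbf{x})}[\mathbf{y}]^{\top}\mathbf{a}.
\end{align*}
In contrast, for a scalar asymmetric cost with hypothetical weights $c_o, c_u>0$, no constraints, and a continuous, strictly increasing conditional CDF $P_{\theta}(\cdot\,|\,\mathbf{x})$, the optimal action is a quantile rather than the mean:
\begin{align*}
f(y,a) &= c_o\max(a-y,0)+c_u\max(y-a,0), \\
a^{*}(\mathbf{x};\theta) &= P_{\theta}^{-1}\!\Big(\frac{c_u}{c_u+c_o}\,\Big|\,\mathbf{x}\Big).
\end{align*}
Two predictive distributions with the same mean but different shapes thus yield the same decision in the linear case and generally different decisions in the second case.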

