\section{Distribution-Free Decision-Focused Learning}
In this section, we introduce \ours, which mitigates all three bottlenecks within a single model. We first introduce the distribution-free training objective, which transforms DFL into a function approximation problem. We then design an attention-based architecture, inspired by the distribution-based parameterization, to reduce the bias error. Finally, we discuss how to obtain the optimal decision at inference time.

\subsection{Distribution-Free Training Objective}
Existing DFL methods primarily rely on a distribution-based approach: they learn a forecaster that outputs a probability distribution $p(\mathbf{y}|\mathbf{x})$ under various model assumptions. However, a more straightforward approach is to estimate the expected cost function $\mathbb{E}_{p(\mathbf{y}|\mathbf{x})}[f(\mathbf{y},\mathbf{a})]$ directly from the training data $\mathcal{D}=\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^N$.

The cornerstone of our method is the observation that the expected cost is a function of $\mathbf{a}$ and $\mathbf{x}$ only, which we denote by $g(\mathbf{x}, \mathbf{a})=\mathbb{E}_{p(\mathbf{y}|\mathbf{x})}[f(\mathbf{y},\mathbf{a})]$. We therefore propose to train a neural network with parameters $\theta$ to match this expected cost function directly. Our objective minimizes the mean squared error (MSE) between the fitted function $g(\mathbf{x},\mathbf{a})$ and the cost $f(\mathbf{y},\mathbf{a})$, where $(\mathbf{x}, \mathbf{y})$ is drawn from $p(\mathbf{x}, \mathbf{y})$ and $\mathbf{a}$ is a sampled action:
\begin{align}
g^*(\mathbf{x},\mathbf{a})= \argmin_{g} \mathbb{E}_{(\mathbf{x},\mathbf{y})} \mathbb{E}_{\mathbf{a}}\left[\left(g(\mathbf{x},\mathbf{a})-f(\mathbf{y},\mathbf{a})\right)^2\right].
\label{eq:objective}
\end{align}
The proposed training objective can be efficiently optimized using stochastic gradient-based methods such as Adam \citep{kingma2015adam}.
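
To see which function this objective targets, the following short derivation may help. It is a standard identity for squared losses rather than a result specific to \ours, and $\mathrm{Var}_{p(\mathbf{y}|\mathbf{x})}$ is notation introduced only for this illustration. For fixed $\mathbf{x}$ and $\mathbf{a}$, adding and subtracting the conditional mean inside the square gives
\begin{align}
&\mathbb{E}_{p(\mathbf{y}|\mathbf{x})}\left[\left(g(\mathbf{x},\mathbf{a})-f(\mathbf{y},\mathbf{a})\right)^2\right] \nonumber\\
&= \left(g(\mathbf{x},\mathbf{a})-\mathbb{E}_{p(\mathbf{y}|\mathbf{x})}[f(\mathbf{y},\mathbf{a})]\right)^2 + \mathrm{Var}_{p(\mathbf{y}|\mathbf{x})}[f(\mathbf{y},\mathbf{a})], \nonumber
\end{align}
because the cross term vanishes in expectation. The variance term does not depend on $g$, so minimizing over $g$ only affects the first term.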

In the ideal case of infinite training data and model capacity, the optimal solution $g^*$ of Eq.~\ref{eq:objective} is the ground-truth conditional expectation $\mathbb{E}_{p(\mathbf{y}|\mathbf{x})}[f(\mathbf{y},\mathbf{a})]$. Once this function is learned, the optimal action is obtained by minimizing the fitted function, $\mathbf{a}^*=\argmin_{\mathbf{a}\in C} g_{\theta}(\mathbf{x}, \mathbf{a})$. In practice, however, training data and model capacity are limited, and the expected error on the test set is characterized by the following proposition.
\begin{proposition}\label{prop:1}
The expected MSE of the optimal solution $g^*$ on the test set is: 
\begin{align}
 \text{MSE}_{\rm test}  &=  \underbrace{\left(\mathbb{E}_{\mathcal{D}'}\left[g^*_{\mathcal{D}'}(\mathbf{x},\mathbf{a})\right]- \mathbb{E}_{p(\mathbf{y}|\mathbf{x})}[f(\mathbf{y},\mathbf{a})]   \right)^2}_{\text{Bias}}\nonumber
 \\ &+ \underbrace{\mathbb{E}_{\mathcal{D}'}\left[ \left(g^*_{\mathcal{D}'}(\mathbf{x},\mathbf{a})-  \mathbb{E}_{\mathcal{D}'}[g^*_{\mathcal{D}'}(\mathbf{x},\mathbf{a})]  \right)^2 \right]}_{\text{Variance}}, \nonumber
\end{align}
where $\mathcal{D}'$ denotes the training dataset  augmented with the sampled actions $\mathbf{a}$, and $g^*_{\mathcal{D}'}(\mathbf{x},\mathbf{a})$ denotes the function fitted on the dataset $\mathcal{D}'$.

Proof. See Appendix~\ref{s:Prop1} for a detailed proof.
\end{proposition}

\noindent \textbf{Sampling Actions from the Constrained Space}. In practice, it is unnecessary to fit the true objective across the entire Euclidean space; we only need to sample actions from the constrained space $C$. There are several strategies for this. One simple approach is to sample from a relaxed version of the constrained space, such as an outer bounding box that encloses $C$, which allows us to sample each dimension of $\mathbf{a}$ independently from a uniform distribution. Moreover, many predict-then-optimize problems are resource allocation problems in which the decision variable $\mathbf{a}$ lies on a simplex; in this case, we can sample directly from a Dirichlet distribution. Appendix~\ref{s:sampling} provides more illustrations of relaxed constrained sampling. Alternatively, we can employ Markov chain Monte Carlo (MCMC) methods to sample uniformly within $C$, such as Ball Walk~\citep{lovasz1990mixing} and the hit-and-run algorithm~\citep{belisle1993hit, lovasz1999hit}. However, these methods typically incur higher computational costs.
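
As a concrete illustration of the two relaxed samplers, consider the following sketch. The action dimension $K$, the box bounds $l_j \le u_j$, the auxiliary variables $e_j$, and the choice of unit concentration parameters are assumptions made here for illustration and are not prescribed by the method:
\begin{align}
\text{Bounding box:}\quad & a_j \sim \mathrm{Uniform}(l_j, u_j), \quad j=1,\dots,K, \nonumber\\
\text{Simplex:}\quad & \mathbf{a} \sim \mathrm{Dirichlet}(1,\dots,1). \nonumber
\end{align}
With unit concentration parameters, the Dirichlet distribution is uniform over the probability simplex, and a draw can be generated by normalizing independent exponential variables, $a_j = e_j / \sum_{k=1}^K e_k$ with $e_j \sim \mathrm{Exp}(1)$.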



In contrast to traditional DFL, our framework transforms decision-focused learning into a function approximation problem, circumventing the complexities of solving and differentiating through the optimization problem. This approach avoids both the model mismatch error and the gradient approximation error. We do not claim to fully address the sample average approximation error during training, since we still rely on finite data to estimate the expected cost function; however, we can avoid this error at inference time (see Section~\ref{sec:inference}).

As Proposition~\ref{prop:1} shows, the test MSE consists of a bias term and a variance term. The variance term can be reduced by sampling more data. For the bias term to approach zero as the training data grows, it is crucial that the network architecture stays within the true model class, that is, that the learned function takes the form of an expected cost under some conditional distribution. To this end, we introduce an attention-based network architecture in the following subsection.

\subsection{Distribution-Based Parameterization}
The key idea of our architecture design is to mimic the distribution-based parameterization of the expected cost function. Since our training objective bypasses the need to solve and differentiate through the stochastic optimization problem, we can adopt an expressive non-parametric distribution based on the kernel conditional mean embedding (CME) to parameterize our model. The proposed network architecture can lead to zero bias error in Proposition~\ref{prop:1}.



\begin{table}
\small
\centering
\begin{tabular}{c c c}
\toprule[1.5pt]
Variable & $\mathbf{x}$ & $\mathbf{y}$ \\
Domain & $\mathcal{X}$ & $\mathcal{Y}$ \\
Kernel & $\mathcal{R}_{\mathbf{x}}(\mathbf{x},\mathbf{x}')$ & $\mathcal{R}_{\mathbf{y}}(\mathbf{y},\mathbf{y}')$ \\
Feature map & $\mathcal{R}_{\mathbf{x}}(\mathbf{x},\cdot)$ & $\mathcal{R}_{\mathbf{y}}(\mathbf{y},\cdot)$ \\
RKHS & $\mathcal{G}$ & $\mathcal{F}$ \\
\bottomrule[1.5pt]
\end{tabular}
\caption{Table of Notations}
\label{table:notations}
%\vspace{-1.8em}
\end{table}

CME \citep{song2009hilbert, song2013kernel} is a powerful tool for computing the expectation of a function in a reproducing kernel Hilbert space (RKHS) without suffering from the curse of dimensionality. Let $\mathcal{F}$ be an RKHS over the domain of $\mathbf{y}$ with kernel function $\mathcal{R}_\mathbf{y}(\mathbf{y},\mathbf{y}')$ and inner product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$; Table~\ref{table:notations} summarizes the notation.
For a particular $\mathbf{a}$, we denote the corresponding function as $f_\mathbf{a}(\mathbf{y})$. CME projects the conditional distribution to its expected feature map $\mu_{\mathbf{y}|\mathbf{x}}\triangleq \mathbb{E}_{p(\mathbf{y}|\mathbf{x})}[\mathcal{R}_{\mathbf{y}}(\mathbf{y}, \cdot)]$ and evaluates 
the conditional expectation of any RKHS function, $f_\mathbf{a} \in \mathcal{F}$, as an inner product in $\mathcal{F}$ using the reproducing property:
\begin{align}
\mathbb{E}_{p(\mathbf{y}|\mathbf{x})}[f_{\mathbf{a}}(\mathbf{y})]&=\int p(\mathbf{y}|\mathbf{x})\langle \mathcal{R}_{\mathbf{y}}(\mathbf{y},\cdot), f_{\mathbf{a}} \rangle_{\mathcal{F}}\mathrm{d}\mathbf{y} \nonumber \\
&= \left\langle \int p(\mathbf{y}|\mathbf{x})\mathcal{R}_{\mathbf{y}}(\mathbf{y}, \cdot)\mathrm{d}\mathbf{y}, f_{\mathbf{a}} \right\rangle_{\mathcal{F}} =   \langle \mu_{\mathbf{y}|\mathbf{x}}, f_{\mathbf{a}}\rangle_{\mathcal{F}}. \nonumber
\end{align}
Assuming that, for all $f_{\mathbf{a}}\in \mathcal{F}$, the conditional expectation $\mathbb{E}_{p(\mathbf{y}|\mathbf{x})}[f_{\mathbf{a}}(\mathbf{y})]$ is an element of the RKHS $\mathcal{G}$ over the domain of $\mathbf{x}$, the conditional embedding can be estimated from a finite dataset $\{\mathbf{x}_s, \mathbf{y}_s\}_{s=1}^S$ as
$\hat{\mu}_{\mathbf{y}|\mathbf{x}}=\sum_{s=1}^S\beta_s(\mathbf{x})\mathcal{R}_{\mathbf{y}}(\mathbf{y}_s,\cdot)$,
where each $\beta_s(\mathbf{x})$ is a real-valued weight that can be computed with matrix operations (see Appendix~\ref{s:Background} for details).
 
One advantage of CME is that $\hat{\mu}_{\mathbf{y}|\mathbf{x}}$ can converge to $\mu_{\mathbf{y}|\mathbf{x}}$ in the RKHS norm at an overall rate of $\mathcal{O}(S^{-\frac{1}{2}})$ \citep{song2009hilbert}, which is independent of the input dimension. This property allows CME to work well in high-dimensional spaces. With the estimated CME, the conditional expectation can be approximated via the reproducing property:
 \begin{align}
\mathbb{E}_{p(\mathbf{y}|\mathbf{x})}[f_{\mathbf{a}}(\mathbf{y})] &\approx\langle \hat{\mu}_{\mathbf{y}|\mathbf{x}}, f_{\mathbf{a}}\rangle_{\mathcal{F}} = \left\langle  \sum_{s=1}^S\beta_s(\mathbf{x})\mathcal{R}_{\mathbf{y}}(\mathbf{y}_s, \cdot), f_{\mathbf{a}} \right\rangle_{\mathcal{F}}
\nonumber\\& = \sum_{s=1}^S\beta_s(\mathbf{x})f_{\mathbf{a}}(\mathbf{y}_s).
\label{eq:cme}
\end{align}
As shown in Eq.~\ref{eq:cme}, the formulation is essentially a weighted combination of $f_\mathbf{a}(\mathbf{y}_s)$, where the weights are conditioned on the input features $\mathbf{x}$. This observation inspires us to leverage attention-based parameterization to represent the function $g(\mathbf{x},\mathbf{a})$. The attention mechanism forms the foundation of the transformer architecture \citep{vaswani2017attention} and has been successfully utilized across various deep learning applications \citep{kenton2019bert,brown2020language,dosovitskiy2021an}. %Recent works \cite{zhang2022analysis, tsai2019transformer} have explored the connection between attention and kernel, setting the theoretical groundwork for our approach.






Concretely, we introduce a set of learnable attention points $\{\mathbf{k}_s, \mathbf{v}_s\}_{s=1}^S$, where $\mathbf{k}_s$ is a key embedding and $\mathbf{v}_s$ is the corresponding value embedding. For an input $\mathbf{x}$, the encoder first maps it to a query embedding $\mathbf{q}(\mathbf{x})$ and computes the attention weights from its inner products with the key embeddings. We set the value function to $f(\mathbf{v}_s,\mathbf{a})$ and, consequently, reformulate the function $g(\mathbf{x},\mathbf{a})$ using the softmax attention mechanism \citep{vaswani2017attention}:
\begin{align}
g(\mathbf{x},\mathbf{a})= &\text{Softmax}\left(\left[\frac{\mathbf{q}(\mathbf{x})^\top\mathbf{k}_1}{\sqrt{d}}, \cdots, \frac{\mathbf{q}(\mathbf{x})^\top\mathbf{k}_S}{\sqrt{d}}\right]\right)^\top \nonumber\\&[f(\mathbf{v}_1, \mathbf{a}), \cdots, f(\mathbf{v}_S, \mathbf{a})],
\label{eq:attention}
\end{align}
where $d$ is the dimension size of the key embeddings and value embeddings. %In our framework, the value function is parameterized by $f(\mathbf{v}_s,\mathbf{a})$, which plays a crucial role in function learning.
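Written out elementwise, Eq.~\ref{eq:attention} is a weighted sum of the cost function evaluated at the value embeddings. The symbol $\alpha_s(\mathbf{x})$ below is shorthand introduced here for the $s$-th softmax weight:
\begin{align}
g(\mathbf{x},\mathbf{a}) &= \sum_{s=1}^S \alpha_s(\mathbf{x})\, f(\mathbf{v}_s,\mathbf{a}), \nonumber\\
\alpha_s(\mathbf{x}) &= \frac{\exp\left(\mathbf{q}(\mathbf{x})^\top\mathbf{k}_s/\sqrt{d}\right)}{\sum_{j=1}^S \exp\left(\mathbf{q}(\mathbf{x})^\top\mathbf{k}_j/\sqrt{d}\right)}, \nonumber
\end{align}
with $\alpha_s(\mathbf{x}) > 0$ and $\sum_{s=1}^S \alpha_s(\mathbf{x}) = 1$. This mirrors Eq.~\ref{eq:cme}, with the weights $\beta_s(\mathbf{x})$ replaced by softmax weights and the samples $\mathbf{y}_s$ replaced by the learnable value embeddings $\mathbf{v}_s$.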
\begin{proposition}\label{prop:my_proposition}
For any $\mathbf{x}$ and $\mathbf{a}$, the function $g(\mathbf{x},\mathbf{a})$ defined by the softmax attention in Eq.~\ref{eq:attention} satisfies $\mathbb{E}_{\hat{p}_{\mathcal{R}}(\mathbf{y}|\mathbf{x})}[f(\mathbf{y},\mathbf{a})]=g(\mathbf{x},\mathbf{a})$, where
$\hat{p}_{\mathcal{R}}(\mathbf{y}|\mathbf{x})$ is a parameterization restriction of $p(\mathbf{y}|\mathbf{x})$.

Proof. See Appendix~\ref{s:Prop2} for a detailed proof.
\label{prop:2}
\end{proposition}       
Proposition~\ref{prop:2} shows that the attention-based network architecture guarantees that our learned expected function resides within the true model class. To speed up training, one can initialize the value embeddings of the attention points with randomly selected labels from the training dataset. This provides a reasonable starting point and reduces the time the model takes to converge. The training procedure of \ours is given in Algorithm~\ref{alg:training}.

\noindent\textbf{Remark.} Our proposed attention-based network architecture represents a parameterization of $p(\mathbf{y}|\mathbf{x})$, drawing similarities with the two-stage model and DFL. Compared with the two-stage model, we learn the expected cost function to make \ours decision-aware. Compared with DFL, we do not have to solve the stochastic optimization problem during learning. As a result, we can adopt an expressive nonparametric distribution with CME to parameterize $p(\mathbf{y}|\mathbf{x})$ to avoid the model mismatch error.








\subsection{Model Inference}
\label{sec:inference}

At test time, we obtain the optimal decision by minimizing the learned expected cost, $\mathbf{a}^*=\argmin_{\mathbf{a}\in C}g(\mathbf{x},\mathbf{a})$. The final representation of $g(\mathbf{x}, \mathbf{a})$ is a weighted combination of $f(\mathbf{v}_s, \mathbf{a})$ with different value embeddings. Another benefit of the proposed attention-based network architecture is that it preserves the convexity of the cost function.
\begin{proposition}
    As long as $f(\mathbf{y},\mathbf{a})$ is a convex function with respect to $\mathbf{a}$, $g(\mathbf{x},\mathbf{a})$ is a convex function with respect to $\mathbf{a}$.
    
Proof. This is a direct consequence of the fact that a convex combination of convex functions is convex.
\end{proposition}
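
The argument can be spelled out in a few lines. In the sketch below, $\alpha_s(\mathbf{x})$ denotes the $s$-th softmax weight in Eq.~\ref{eq:attention} (shorthand introduced here), and $\lambda\in[0,1]$, $\mathbf{a}_1$, and $\mathbf{a}_2$ are arbitrary:
\begin{align}
&g(\mathbf{x},\lambda\mathbf{a}_1+(1-\lambda)\mathbf{a}_2) \nonumber\\
&= \sum_{s=1}^S \alpha_s(\mathbf{x})\, f(\mathbf{v}_s,\lambda\mathbf{a}_1+(1-\lambda)\mathbf{a}_2) \nonumber\\
&\le \sum_{s=1}^S \alpha_s(\mathbf{x}) \left[\lambda f(\mathbf{v}_s,\mathbf{a}_1)+(1-\lambda) f(\mathbf{v}_s,\mathbf{a}_2)\right] \nonumber\\
&= \lambda\, g(\mathbf{x},\mathbf{a}_1)+(1-\lambda)\, g(\mathbf{x},\mathbf{a}_2), \nonumber
\end{align}
where the inequality uses the convexity of $f(\mathbf{v}_s,\cdot)$ together with $\alpha_s(\mathbf{x})\ge 0$. The step relies on the attention weights depending on $\mathbf{x}$ only and not on $\mathbf{a}$.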

When the original objective is convex, we can use any existing black-box convex solver \citep{diamond2016cvxpy, agrawal2018rewriting, gurobi}. For non-convex problems, we can use projected gradient descent.
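For reference, a generic projected gradient descent iteration takes the following form. The step size $\eta>0$, the iteration index $t$, and the Euclidean projection $\Pi_C$ onto $C$ are notation assumed here for illustration; the specific step size and stopping rule are left open:
\begin{align}
\mathbf{a}^{(t+1)} &= \Pi_C\left(\mathbf{a}^{(t)}-\eta\,\nabla_{\mathbf{a}}\, g(\mathbf{x},\mathbf{a}^{(t)})\right), \nonumber\\
\Pi_C(\mathbf{z}) &= \argmin_{\mathbf{a}'\in C}\|\mathbf{a}'-\mathbf{z}\|_2. \nonumber
\end{align}
When $f$ is differentiable in $\mathbf{a}$, the gradient of $g$ is the same attention-weighted combination of the gradients $\nabla_{\mathbf{a}} f(\mathbf{v}_s,\mathbf{a})$, since the attention weights do not depend on $\mathbf{a}$.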




Eq.~\ref{eq:objective} is estimated from finitely many samples \((\mathbf{x}, \mathbf{y})\) during training, which introduces generalization error due to the finite size of the training dataset. Crucially, this generalization error is distinct from the sample average approximation (SAA) error, which arises in existing methods that must sample from a predicted distribution (e.g., a Gaussian with learned parameters) to estimate an expected objective. In such methods, the generalization error of the predictive model is further compounded by the additional variance introduced through sampling.

In contrast, our method learns the expected objective \(g(\mathbf{x}, \mathbf{a})\) directly and does not require sampling at inference time, thereby eliminating the additional SAA error. Nonetheless, like all learning-based methods, it remains subject to generalization error stemming from limited training data.
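
To make the distinction concrete, the two inference-time quantities can be contrasted as follows. The predicted distribution $p_{\phi}(\mathbf{y}|\mathbf{x})$, the sample count $M$, and the samples $\mathbf{y}^{(m)}$ are hypothetical notation used only for this illustration, and $\alpha_s(\mathbf{x})$ denotes the $s$-th softmax weight in Eq.~\ref{eq:attention}:
\begin{align}
\text{SAA:}\quad & \frac{1}{M}\sum_{m=1}^M f(\mathbf{y}^{(m)},\mathbf{a}), \quad \mathbf{y}^{(m)}\sim p_{\phi}(\mathbf{y}|\mathbf{x}), \nonumber\\
\text{Learned cost:}\quad & g(\mathbf{x},\mathbf{a})=\sum_{s=1}^S \alpha_s(\mathbf{x})\, f(\mathbf{v}_s,\mathbf{a}). \nonumber
\end{align}
The first expression is a Monte Carlo estimate whose value varies with the drawn samples, whereas the second is a deterministic function of $(\mathbf{x},\mathbf{a})$ once training is complete.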






