\section{Introduction}
%\begin{itemize}
%\item  The parameters of many decision making problems are contextual and stochastic; they need to be predicted from features. Two-stage model suffer from suboptimal performance; smaller prediction error does not lead to smaller decision-making regret. 

%\item DFL reduce the gap by imbuing a differentiable optimization solver into the network and thus tailoring the training objective for the downstream task. However, existing DFL methods are all model-based. They suffer from (1) model mismatch errors and (2) sample average approximation. (3) Gradient appropximation 

%\item We propose the first model-free DFL to address these two issues. We directly approximate the expectation of the objective function. Function apprixmation is challenging and data hungry. We take the mode-based parametrization to design the network architecture.

%\item Our main contributions: (1) model-free objective (2) Attention-based model architecture design inspired by model-based parameterization (3) Experimental results.
%\end{itemize}

% The predict-then-optimize paradigm is a common approach for real-world decision-making problems, where the parameters of the optimization problem are unknown, context-dependent, and must be predicted from observed features. For instance, in e-commerce, companies must adjust their product recommendations to maximize expected revenue, which requires forecasting future consumer preferences based on their past behavior. Similarly, when planning renewable energy infrastructure, predictions about future energy production from wind or solar sources are crucial. Another example can be seen in urban planning, where decisions about public transport routes must be optimized by predicting uncertain future factors, such as population growth and changes in commuting patterns, to minimize long-term operational costs and maximize efficiency.

% Many decision-making processes can be formulated as solving 
%  a stochastic optimization problem:
% $$\argmin_{\mathbf{a} \in C}\mathbb{E}_{\mathbf{y}}[f(\mathbf{y},\mathbf{a})],$$  where $\mathbf{y}$ 
% represents the parameters of the optimization problem, $\mathbf{a}$ denotes the decision variables within a feasible space $C$, and $f$ is the cost function to be optimized. In many real-world applications, the parameters $\mathbf{y}$ are \emph{unknown} and \emph{context-dependent}, requiring inference from correlated features $\mathbf{x}$. As an example, consider personalized medicine, where $\mathbf{y}$ might represent patient-specific responses to different treatments, $\mathbf{a}$ could be the selected treatments for each patient, and $f$ would measure the discrepancy between the actual patient responses and the treatment outcomes. The task is to determine the optimal $\mathbf{a}$ to minimize the cost $f$. In this context, patient-specific responses $\mathbf{y}$ are typically unknown and must be predicted using features $\mathbf{x}$ such as patient demographics, medical history, and genetic information.

Many decision-making problems are fundamentally optimization problems: one must minimize a cost function that depends on parameters that are both \textit{unknown} and \textit{context-dependent}. Typically, these parameters are estimated from observed features. For instance, hedge funds regularly rebalance their portfolios to maximize expected returns, which requires predicting the future return rates of various stocks. Similarly, in personalized medicine, selecting treatments for individual patients requires predicting patient-specific responses to ensure optimal outcomes.



% Many decision-making problems are optimization problems, where one needs to minimize a cost function that involves \textit{unknown} and \textit{context dependent} parameters. Usually, these parameters are predicted from some observed features.  For example, hedge funds need to continuously adjust their portfolio for maximal expected return, by forecasting future return rates of different stocks; and in personalized medicine, treatments should be selected for each patient, by accounting for patient-specific responses.

% In this paper, we assume a dataset $\mathcal{D}=\{\mathbf{x}_i,\mathbf{y}_i \}_{i=1}^N$ drawn from the joint distribution of the features and parameters of the optimization problem $p(\mathbf{x},\mathbf{y})$. Our objective is to learn a decision-making model with parameter $\theta$ that takes features $\mathbf{x}$ as input and produces optimal decisions $\mathbf{a}^{*}$ at $\mathbf{x}$. The model should be trained to output optimal decisions that minimize the expected decision cost under the joint distribution of $(\mathbf{x},\mathbf{y})$:
% \begin{align}
% \argmin_{\theta}
% \mathbb{E}_{p(\mathbf{x},\mathbf{y})}[f(\mathbf{y},\mathbf{a}^{*}(\mathbf{x};\theta))].
% \label{eq:dfl_obj}
% \end{align}


\begin{figure*}[t]
    \centering  \includegraphics[width=0.9\textwidth]{figure/pipeline_v4.pdf}
    \caption{Decision-focused learning~\citep{donti2017task} directly optimizes the task loss and achieves lower decision regret. However, it suffers from three significant bottlenecks, which are detailed in Section~\ref{sec:bottleneck}.}
    \label{fig:existing}
    %\vspace{-1em}
\end{figure*}

Given the growing capacity to train powerful deep learning models, a common strategy for this problem is the two-stage pipeline. This approach first learns a predictive model for the unknown parameters using a generic loss function (\eg, negative log-likelihood) during the prediction stage, and then uses the model's outputs in a downstream optimization problem. Despite its widespread use, this pipeline implicitly assumes that better predictive accuracy, as measured by the prediction loss, translates into better optimization performance. However, this assumption often breaks down, because prediction errors can have non-uniform effects on the optimization objective.
To address this issue, \emph{Decision-Focused Learning (DFL)}~\citep{donti2017task, wilder2019melding,  wang2020automatically, sun2023alternating, yan2021surrogate, rodriguez2024right} integrates the prediction and optimization stages into a single end-to-end model. A prominent line of work leverages the implicit function theorem and the KKT conditions to differentiate through the optimization layer~\citep{donti2017task}, enabling the learning process to align predictive outputs directly with decision quality. This results in models that are trained explicitly for decision-making, often framed as regret minimization.
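
For concreteness, the contrast between the two training targets can be sketched as follows; the notation is assumed here for illustration. Let $\mathbf{x}$ denote the features, $\mathbf{y}$ the unknown parameters, $\mathbf{a}$ a decision in a feasible set $C$, $f(\mathbf{y},\mathbf{a})$ the cost, and $\mathbf{a}^{*}(\mathbf{x};\theta)$ the decision induced by a predictor with parameters $\theta$. A two-stage pipeline fits $\theta$ with a generic prediction loss $\ell$ (a hypothetical symbol), whereas a decision-focused objective scores $\theta$ by the cost of the induced decision:
\[
\theta_{\mathrm{pred}} \in \arg\min_{\theta}\ \mathbb{E}_{p(\mathbf{x},\mathbf{y})}\bigl[\ell(\mathbf{y};\mathbf{x},\theta)\bigr],
\qquad
\theta_{\mathrm{dec}} \in \arg\min_{\theta}\ \mathbb{E}_{p(\mathbf{x},\mathbf{y})}\bigl[f\bigl(\mathbf{y},\mathbf{a}^{*}(\mathbf{x};\theta)\bigr)\bigr].
\]
The two minimizers need not coincide when the model class cannot fit the data perfectly, which is why a lower prediction loss does not guarantee a lower decision cost.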




% However, existing DFL usually assume that the predictor only produce a point estimate of the problem parameters, which cannot encode the potential risk in the prediction and inherent randomness. A more robust approach is to learn a probabilistic predictor which produces a probability distribution of the problem parameters and then solve the expectation of the cost function.
Despite its promising results, DFL via differentiation through KKT conditions faces several critical bottlenecks in the probabilistic setting, where the predictive model outputs a distribution rather than a point estimate:
(1) Model mismatch error: real-world applications often operate in highly uncertain environments and involve complex, multimodal probability distributions. In contrast, DFL by differentiating through KKT conditions requires simple parameterized distribution models for computational feasibility, leading to a mismatch. (2) Sample average approximation error: when the expected optimization objective has no closed-form expression, it is typically approximated by averaging over a finite number of samples drawn from the distribution, which introduces additional statistical error. (3) Gradient approximation error: the KKT conditions are sufficient for optimality only in convex problems; in non-convex settings they cannot characterize the optimal solution, so the gradients derived from them are inaccurate, and these inaccuracies accumulate and lower decision quality. Recent works \citep{kongend, shah2022learning, shah2023leaving} have proposed surrogate objectives to bypass the challenges of gradient computation. However, these approaches are still model-based and suffer from the other two bottlenecks. While SPO~\citep{elmachtoub2022smart} generally converges to a decision with optimal expected cost regardless of the distribution, it is restricted to linear objectives.
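
Bottleneck (2) can be made concrete with a short sketch, again in notation assumed for illustration: let $p_{\theta}(\mathbf{y}\mid\mathbf{x})$ be the predicted distribution and let $\mathbf{y}^{(1)},\dots,\mathbf{y}^{(K)}$ be $K$ independent samples drawn from it. The sample average approximation replaces the expected cost of a decision $\mathbf{a}$ by
\[
\mathbb{E}_{\mathbf{y}\sim p_{\theta}(\cdot\mid\mathbf{x})}\bigl[f(\mathbf{y},\mathbf{a})\bigr]
\;\approx\;
\frac{1}{K}\sum_{k=1}^{K} f\bigl(\mathbf{y}^{(k)},\mathbf{a}\bigr).
\]
For a fixed $\mathbf{a}$ and a cost with finite variance, the error of this estimate shrinks only at the rate $O(1/\sqrt{K})$, so the decision obtained by minimizing the right-hand side over $\mathbf{a}\in C$ inherits a statistical error unless $K$ is large.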


%The recent SO-EBM model \cite{kongend} proposes a surrogate objective to bypass the gradient computation challenges. However, it is still model-based and suffers from the other two bottlenecks. To the best of our knowledge, there is no existing DFL method that can address all three bottlenecks within a single model.



We propose \ours, the first distribution-free decision-focused learning method, which mitigates the three bottlenecks and handles complex objectives beyond the linear class. Instead of relying on a task-specific forecaster that requires precise model assumptions, we propose to directly learn the expectation of the optimization objective from data. Once this function is learned, we obtain the optimal decision by minimizing the learned expected cost within the feasible space. To ensure that the network architecture lies within the true model class and to minimize bias, we develop an attention-based architecture that emulates the distribution-based parameterization of the expected objective. This attention architecture also preserves the convexity of the original optimization objective. In contrast to the two-stage model, \textit{\ours} is decision-aware. Compared to existing DFL methods, \textit{\ours} avoids model mismatch error, gradient approximation error, and sample average approximation error at test time.
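
One way to read this proposal, given as a sketch under assumed notation rather than as the exact formulation, is the following. Let $g_{\phi}(\mathbf{x},\mathbf{a})$ (a hypothetical symbol) be a network that approximates the conditional expected cost of a decision $\mathbf{a}$ in the feasible set $C$; the decision is then read off from the learned function:
\[
g_{\phi}(\mathbf{x},\mathbf{a}) \;\approx\; \mathbb{E}_{p(\mathbf{y}\mid\mathbf{x})}\bigl[f(\mathbf{y},\mathbf{a})\bigr],
\qquad
\hat{\mathbf{a}}(\mathbf{x}) \in \arg\min_{\mathbf{a}\in C}\ g_{\phi}(\mathbf{x},\mathbf{a}).
\]
Because $g_{\phi}$ is fit to data directly, this reading assumes no parametric form for $p(\mathbf{y}\mid\mathbf{x})$ and requires no sampling at test time. As an illustration only, and not necessarily the training objective used here, a squared-error regression of $g_{\phi}(\mathbf{x},\mathbf{a})$ onto observed costs $f(\mathbf{y},\mathbf{a})$ would be minimized, over all functions, by exactly this conditional expectation.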

Our main contributions can be summarized as follows: (1) We propose a distribution-free training objective for DFL. It mitigates the three bottlenecks of existing methods under the probabilistic setting. (2) We propose an attention-based network architecture inspired by the distribution-based parameterization to ensure the network architecture is within the true model class. (3) Experiments on two synthetic datasets and three real-world datasets show that our method can achieve better performance than existing DFL methods.
