\section{Additional Related Work}
SO-EBM \citep{kongend} proposes a surrogate learning objective that maximizes the likelihood of the pre-computed optimal decision under an energy-based probability parameterization. LODL \citep{shah2022learning, shah2023leaving} and LANCER~\citep{zharmagambetov2023landscape} approximate the decision-focused loss with a quadratic function or a neural network. \ours differs from these methods in three ways: (1) They assume a deterministic setting, while we assume the problem parameter $\mathbf{y}$ follows a probability distribution.
(2) They approximate the decision loss, which is a function of the problem parameter $\mathbf{y}$. In contrast,
\ours directly learns the expected cost function, which is independent of $\mathbf{y}$. (3) They still rely on first learning a forecaster to infer $\mathbf{y}$ from $\mathbf{x}$. Consequently, they remain susceptible to both model mismatch error and sample average approximation error in our probabilistic setting. Recently, \citet{bansal2023taskmet} propose TaskMet, which is motivated by simultaneously optimizing the predictive loss and the decision loss rather than by addressing the three bottlenecks.


Several other works focus on linear objectives, where DFL through the KKT conditions may encounter singularity issues. To address this, the SPO+ loss \citep{elmachtoub2022smart} measures prediction error with respect to the optimization objective and is trained with a subgradient method. The approach of \citet{wilder2019melding} adds a quadratic regularization term for smoothing. Meanwhile, \citet{mandi2020interior} introduce a log-barrier regularizer and differentiate through the homogeneous self-dual embedding. In contrast, our method is designed for a broader range of objectives.
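As a minimal sketch of why smoothing helps, consider the following illustration. Its notation is assumed here and not taken from the cited papers: a decision $\mathbf{a}$, a linear cost $\mathbf{y}$, a polyhedral feasible set $\{\mathbf{a} : G\mathbf{a} \le \mathbf{h}\}$, and a smoothing weight $\gamma > 0$. For a linear program, the objective has zero Hessian in $\mathbf{a}$, and the optimal vertex is generically piecewise constant in $\mathbf{y}$, so its Jacobian is zero almost everywhere. Adding a quadratic term makes the problem strongly convex:
\[
\mathbf{a}^*_\gamma(\mathbf{y}) = \arg\min_{G\mathbf{a} \le \mathbf{h}} \; \mathbf{y}^\top \mathbf{a} + \gamma \|\mathbf{a}\|_2^2
= \arg\min_{G\mathbf{a} \le \mathbf{h}} \; \Big\| \mathbf{a} + \frac{\mathbf{y}}{2\gamma} \Big\|_2^2 ,
\qquad
\nabla^2_{\mathbf{a}} \big( \mathbf{y}^\top \mathbf{a} + \gamma \|\mathbf{a}\|_2^2 \big) = 2\gamma I \succ 0 .
\]
Under these assumptions, the regularized solution is the Euclidean projection of $-\mathbf{y}/(2\gamma)$ onto the feasible set. It is therefore piecewise linear in $\mathbf{y}$, not piecewise constant, and yields informative gradients.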


When the optimization problem is discrete, differentiating through the optimization layer is even more challenging because the gradient is ill-defined on a discrete domain. Various remedies have been proposed, including interpolation \citep{poganvcic2019differentiation}, perturbation \citep{niepert2021implicit, berthet2020learning}, subgradient methods \citep{mandi2020smart}, and cutting planes \citep{ferber2020mipaal}. Our method is directly applicable to the discrete setting; we leave this direction for future exploration.


% There are also approaches where a policy network is trained to directly map from the input to the solution of the optimization problem using supervised or reinforcement learning \cite{vinyals2015pointer,khalil2017learning, li2016learning}, in the learning to optimize community. Though these methods no longer suffer from these three bottlenecks, their performance are often inferior to DFL in the predict-then-optimize problem as they ignore the algorithmic structure of the problem and typically require a large amount of data.
