

\begin{figure*}[!t]
    %\hspace{-1.4cm}
    \centering
    \includegraphics[width=0.9\textwidth]{figure/Synthetic_PE_v2.pdf}
    \caption{Gap between each model's decision regret and the lower bound of the decision regret on the synthetic data, for both the convex and non-convex objectives. `PE' denotes that the forecaster only produces a point estimate for the problem parameter.}
    \label{fig:synthetic}
\end{figure*}

\begin{figure*}[!ht]
    %\hspace{-1.4cm}
    \centering
    \includegraphics[width=0.95\textwidth]{figure/landscape_new.pdf}
    \caption{The randomly initialized landscape, the landscape recovered by \ours, and the ground-truth landscape on the synthetic data. Each landscape is conditioned on an input feature sampled from the test set.}
    \label{fig:landscape}
    %\vspace{-0.5em}
\end{figure*}






%\vspace{-0.8em}
\section{Experiments}
In this section, we empirically evaluate \ours on both synthetic and real-world problems. We then perform ablation studies to show the effect of each design choice in \ours.
\subsection{Synthetic Problems}
To highlight the ability of \ours to learn the true expected objective, we first validate our method on a synthetic dataset where the true underlying model is known. To simulate the multi-modal scenarios found in the real world, we generate 5000 feature-parameter pairs using a Gaussian mixture model (GMM) with three components (3 GMM). We consider both convex and non-convex objectives. The details of the data generation process and the objectives are provided in Appendix~\ref{s:synthetic}.
%\Bo{can we shrink the size of the problems and put them side-by-side here?}
% \begin{small}
% \begin{align}
%    &\text{minimize}_{\mathbf{a}\in \mathbb{R}^2} \mathbb{E}_{p(\mathbf{y}|\mathbf{x})}\sum_{i=1}^2[(y_i-a_i)_{+} + (a_i-y_i)_{+} + (y_i-a_i)_{+}^2 + (a_i-y_i)_{+}^2]  \nonumber\\
%    &\text{subject to}  \quad \mathbf{G}\mathbf{a} \le \mathbf{h}\\
%    &\text{minimize}_{\mathbf{a} \in \mathbb{R}^2} \mathbb{E}_{p(\mathbf{y}|\mathbf{x})}\sum_{i=1}^2[(y_i-a_i)_{+}^2 + (a_i-y_i)_{+}^2 + a_i^3] \nonumber\\
%     &\text{subject to}  \quad \mathbf{A}\mathbf{a} \le \mathbf{b}
% \end{align}
% \end{small}
% \begin{align}
%    &\text{minimize}_{\mathbf{a}\in \mathbb{R}^2} \mathbb{E}_{p(\mathbf{y}|\mathbf{x})}\sum_{i=1}^2[(y_i-a_i)_{+} + (a_i-y_i)_{+} + (y_i-a_i)_{+}^2 + (a_i-y_i)_{+}^2]  \nonumber\\
%    &\quad \text{subject to}  \quad \mathbf{G}\mathbf{a} \le \mathbf{h}
% \end{align}
% \begin{align}
%    & \text{minimize}_{\mathbf{a} \in \mathbb{R}^2} \mathbb{E}_{p(\mathbf{y}|\mathbf{x})}\sum_{i=1}^2[(y_i-a_i)_{+}^2 + (a_i-y_i)_{+}^2 + a_i^3] \nonumber\\
%     &\text{subject to}  \quad \mathbf{A}\mathbf{a} \le \mathbf{b}
% \end{align}
% \begin{scriptsize}
%     \begin{alignat}{2}
%    &\text{minimize}_{\mathbf{a}\in \mathbb{R}^2} \mathbb{E}_{p(\mathbf{y}|\mathbf{x})}\sum_{i=1}^2[(y_i-a_i)_{+} + (a_i-y_i)_{+} + (y_i-a_i)_{+}^2 + (a_i-y_i)_{+}^2]  &&\quad \text{subject to}  \quad \mathbf{G}\mathbf{a} \le \mathbf{h} \\
%    &\text{minimize}_{\mathbf{a} \in \mathbb{R}^2} \mathbb{E}_{p(\mathbf{y}|\mathbf{x})}\sum_{i=1}^2[(y_i-a_i)_{+}^2 + (a_i-y_i)_{+}^2 + a_i^3] &&\quad \text{subject to}  \quad \mathbf{A}\mathbf{a} \le \mathbf{b}
% \end{alignat}
% \end{scriptsize}

% \begin{figure*}[!ht]
%     %\hspace{-1.4cm}
%     \centering
%     \includegraphics[width=0.95\textwidth]{figure/bars_v4.pdf}
%     \caption{The gap of the model's decision regret from the lower bound of the decision regret on the synthetic data for both the convex and non-convex objectives.}
%     \label{fig:synthetic}
% \end{figure*}




\noindent\textbf{Experimental Setup.}



Since we know the true underlying data generation process for this synthetic setting, we compute the lower bound of the decision regret and use the gap of the model's decision regret from this lower bound as the evaluation metric. We compare with the following baselines: (1) A two-stage model trained with negative log-likelihood. (2) DFL~\citep{donti2017task}. %For the convex objective, we implement DFL using the cvxpylayers library \citep{agrawal2019differentiable} since it can accurately differentiate through convex objectives. For the non-convex objective, we first use a black-box solver to obtain the optimal solution and then approximate the original objective using a quadratic function around the optimal solution and thus it can fit into the QPTH library \citep{amos2017optnet}. 
(3) SO-EBM \citep{kongend}: It uses an energy-based model as a surrogate objective to speed up DFL. (4) Policy-net: It directly maps the input features to the decision variables by minimizing the task loss using supervised learning. (5) LODL \citep{shah2022learning}: It approximates the decision loss with a surrogate function. 
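
As a sketch of this evaluation metric, in notation we assume here for illustration, let $f(\mathbf{a},\mathbf{y})$ denote the task objective to be minimized, $\hat{\mathbf{a}}(\mathbf{x})$ the decision produced by a model, and $\mathbf{a}^{\circ}(\mathbf{x})$ the feasible decision that minimizes the true expected objective $\mathbb{E}_{p(\mathbf{y}|\mathbf{x})}[f(\mathbf{a},\mathbf{y})]$, which is computable because $p(\mathbf{y}|\mathbf{x})$ is known in this setting. If both regrets are measured against the same reference cost, the reference cancels and the gap can be written as
\[
\mathrm{Gap} = \mathbb{E}_{\mathbf{x},\mathbf{y}}\big[f(\hat{\mathbf{a}}(\mathbf{x}),\mathbf{y})\big] - \mathbb{E}_{\mathbf{x},\mathbf{y}}\big[f(\mathbf{a}^{\circ}(\mathbf{x}),\mathbf{y})\big],
\]
which is non-negative under the true distribution whenever $\hat{\mathbf{a}}(\mathbf{x})$ is feasible, since $\mathbf{a}^{\circ}(\mathbf{x})$ minimizes the conditional expectation for every $\mathbf{x}$.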

For the two-stage model, DFL, SO-EBM and LODL, the forecasters use GMMs with different numbers of components and 100 samples to estimate the expectation of the objective, as we found that more samples bring limited performance gain but lead to longer training time. We also evaluate scenarios where the forecaster provides only a point estimate of the problem parameter, with the exception of SO-EBM, which is designed for the probabilistic setting. For a fair comparison, we use the same backbone for the encoder of \ours and the forecaster of the baselines, and \ours uses 1000 attention points for both the convex and non-convex objectives. Appendix~\ref{s:experiment} provides more details of the experimental setup and model parameters.
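
The sample-based estimate used by these baselines can be sketched as follows, where $p_{\theta}(\mathbf{y}|\mathbf{x})$ denotes the forecaster's predictive distribution, $f$ the task objective, and $\mathbf{y}^{(n)}$ independent draws from $p_{\theta}(\mathbf{y}|\mathbf{x})$ (notation assumed here for illustration):
\[
\mathbb{E}_{p_{\theta}(\mathbf{y}|\mathbf{x})}\big[f(\mathbf{a},\mathbf{y})\big] \approx \frac{1}{N}\sum_{n=1}^{N} f\big(\mathbf{a},\mathbf{y}^{(n)}\big),
\]
with $N=100$ in our experiments. For an objective with finite variance, the standard error of such a Monte Carlo estimate decays as $O(1/\sqrt{N})$, so doubling $N$ shrinks it only by a factor of $1/\sqrt{2}\approx 0.71$, which is in line with the limited gain we observe from more samples.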

%\vspace{-1em}
\noindent\textbf{Results.}
Fig.~\ref{fig:synthetic} shows the results on both the convex and non-convex objectives for all methods. \ours outperforms all the baselines, and its improvement over the baselines is more pronounced on the non-convex objective. Specifically, \ours reduces the gap from the performance bound by $56.5\%$ compared with the strongest baseline, LODL. When the baselines use GMMs whose number of components differs from the ground truth, their performance deteriorates, indicating that they suffer from model mismatch errors. In contrast, our method consistently outperforms the baselines regardless of the number of components they use, which suggests that our approach effectively mitigates model mismatch errors. Even when the baselines are aligned with the ground-truth model class, our method still outperforms them, since it also avoids the sample average approximation error at test time.


Notably, when the forecaster yields only a point estimate, both the existing DFL frameworks and the two-stage method perform worst under this imbalanced cost function. This underscores the importance of quantifying uncertainty in the forecaster's predictions, especially in risk-sensitive domains.








 %Because expected objective is too simple under this GMM synthetic setting. However, in the next subsection, we will show that when the probability distribution is highly complicated, we can significantly outperform existing methods even when the objective is convex.

Fig.~\ref{fig:landscape} visualizes the learned expected objective and the ground-truth expected objective on a test sample for both objectives. We find that \ours effectively recovers the landscape of the ground-truth expected cost.



% \begin{figure}[t]
%     \centering
%     \includegraphics[width=0.5\textwidth]{figure/synthetic_convex.pdf}
%     \caption{Synthetic convex}
%     \label{fig:existing}
% \end{figure}


% \begin{figure}[t]
%     \centering
%     \includegraphics[width=0.5\textwidth]{figure/nonconvex.pdf}
%     \caption{Nonconvex}
%     \label{fig:wind}
% \end{figure}

% \begin{figure}[t]
%     \centering
%     \includegraphics[width=0.5\textwidth]{figure/wind.pdf}
%     \caption{Wind \lk{Will change to table}}
%     \label{fig:wind}
% \end{figure}

% Please add the following required packages to your document preamble:
% \usepackage{booktabs}
% \begin{table}[]
% \small
% \renewcommand{\arraystretch}{1.7}
% \centering
% \begin{tabular}{@{}l|c@{}}
% \toprule
% Method             & Decision regret \\ \midrule
% Two-stage 1 GMM    & 69.36\std{4.34}          \\
% Two-stage 3 GMM    & 69.89\std{1.50}             \\
% Two-stage 10 GMM   & 70.71\std{2.29}             \\
% Two-stage 500 GMM  & 66.84\std{1.43}             \\
% Two-stage 1000 GMM & 65.83\std{1.71}             \\ \midrule
% DFL 1 GMM          & 66.85\std{1.48}             \\
% DFL 3 GMM          & 66.60\std{2.01}             \\
% DFL 10 GMM         & 66.45\std{2.32}             \\
% DFL 500 GMM        & 65.06\std{0.89}             \\
% DFL 1000 GMM       & 64.65\std{3.72}             \\ \midrule
% SO-EBM 1 GMM       & 67.01\std{1.97}             \\
% SO-EBM 3 GMM       & 66.34\std{2.05}             \\
% SO-EBM 10 GMM      & 66.93\std{2.34}             \\
% SO-EBM 500 GMM     & 65.03\std{2.46}             \\
% SO-EBM 1000 GMM    & 64.53\std{2.01}             \\ \midrule
% Ours              & 60.90\std{0.60}             \\ \bottomrule
% \end{tabular}
% \caption{Wind power}
% \label{tab:my-table}
% \end{table}



\begin{table}[t]
\centering
\scriptsize
\setlength{\tabcolsep}{0.3em}
\renewcommand{\arraystretch}{0.95}
\resizebox{0.48\textwidth}{!}{%
\begin{tabular}{l|ccc}
\toprule
& \multicolumn{3}{c}{\textbf{Decision Regret}} \\
\cmidrule(lr){2-4}
Method & Power Bidding & Inventory Opt. & Vaccine Dist. \\
\midrule
Policy-net & 489.01 \std{12.39} & 3.96 \std{0.28} & 604 \std{12.30} \\
Two-stage PE & 518.19 \std{14.84} & 3.97 \std{0.15} & 573 \std{10.26} \\
Two-stage 1-GMM & 69.36 \std{4.33} & 3.32 \std{0.10} & 538 \std{9.30} \\
Two-stage 3-GMM & 69.89 \std{1.50} & 3.27 \std{0.08} & 534 \std{8.40} \\
Two-stage 10-GMM & 70.51 \std{2.29} & 3.29 \std{0.05} & 533 \std{7.95} \\
Two-stage 500-GMM & 66.84 \std{1.43} & 3.24 \std{0.07} & 524 \std{7.95} \\
Two-stage 1000-GMM & 65.83 \std{1.70} & 3.27 \std{0.05} & 527 \std{7.65} \\
SO-EBM 1-GMM & 67.32 \std{1.97} & 3.37 \std{0.02} & 512 \std{8.55} \\
SO-EBM 10-GMM & 66.93 \std{2.45} & 3.26 \std{0.03} & 513 \std{7.95} \\
SO-EBM 500-GMM & 67.02 \std{2.16} & 3.37 \std{0.05} & 513 \std{8.70} \\
SO-EBM 1000-GMM & 66.40 \std{2.23} & 3.21 \std{0.07} & 516 \std{9.45} \\
DFL PE & 69.46 \std{1.21} & 3.35 \std{0.03} & 519 \std{7.37} \\
DFL 1-GMM & 66.85 \std{1.47} & 3.36 \std{0.02} & 515 \std{8.25} \\
DFL 3-GMM & 66.60 \std{3.23} & 3.36 \std{0.05} & 513 \std{7.05} \\
DFL 10-GMM & 66.45 \std{2.32} & 3.31 \std{0.01} & 513 \std{7.65} \\
DFL 500-GMM & 65.06 \std{0.88} & 3.24 \std{0.09} & 507 \std{6.60} \\
DFL 1000-GMM & 64.65 \std{3.70} & 3.21 \std{0.07} & 513 \std{7.35} \\
LODL PE & 67.92 \std{1.49} & 3.36 \std{0.06} & 512 \std{7.01} \\
LODL 1-GMM & 66.87 \std{1.36} & 3.34 \std{0.01} & 508 \std{6.23} \\
LODL 3-GMM & 65.75 \std{1.86} & 3.31 \std{0.06} & 506 \std{6.84} \\
LODL 10-GMM & 65.29 \std{1.23} & 3.26 \std{0.02} & 504 \std{6.38} \\
LODL 500-GMM & 64.24 \std{1.45} & 3.22 \std{0.05} & 502 \std{7.02} \\
LODL 1000-GMM & 64.13 \std{2.47} & 3.24 \std{0.04} & 503 \std{7.01} \\
\rowcolor{mygray}
Ours & $\bm{60.90}$ \std{0.60} & $\bm{3.09}$ \std{0.09} & $\bm{492}$ \std{7.05} \\
\bottomrule
\end{tabular}
}
\caption{Decision regret of each method -- \textbf{lower is better}. `PE' denotes point estimate for the parameter.}
\label{table:main}
\end{table}


\subsection{Real-World Problems}
Next, we evaluate \ours on three real-world problems encompassing both convex and non-convex objectives.


\subsubsection{Experimental Setup.}
\textbf{Wind Power Bidding.}
In this task, a wind power firm participates in both the energy and reserve markets. Given the wind power $\mathbf{x}\in \mathbb{R}^{24}$ generated in the last 24 hours, the firm needs to decide in advance the energy quantity $\mathbf{a}_E \in \mathbb{R}^{12}$ to bid and the quantity $\mathbf{a}_R \in \mathbb{R}^{12}$ to reserve for the next 12--24 hours, based on the forecasted wind power $\mathbf{y}\in \mathbb{R}^{12}$. The optimization objective is a piecewise function consisting of three segments \citep{manasssakan2022,di2020bidding}; it maximizes the revenue from energy sales while minimizing the penalties for overbidding and underbidding. 




% \begin{figure}[t]
%   \centering
%   \includegraphics[width=0.45\textwidth]{figure/wind_plot_new2.pdf}
%   \caption{Decision regret on wind power bidding. }
%   \vspace{-1.2em}
%   \label{fig:wind}
% \end{figure}
%includes revenue from energy sales at regulatory prices, costs of reserve capacity, and penalties for decision inaccuracies~\cite{manasssakan2022,di2020bidding}.
% \begin{small}
% \begin{align}
% &\text{maximize}_{E \in \mathbb{R}^{12},R \in \mathbb{R}^{12}} \sum_{i=12}^{24}f(x) = Py_i - \nu a_{R,i} \\ + \nonumber
% &\begin{cases} 
% \Delta P_{up,1}(a_{E,i}-a_{R,i}-y_i) -\Delta P_{up,2}(a_{E,i}-a_{R,i}-y_i)^2- \mu a_{R,i} -F,& \text{if } y_i < a_{E,i} - a_{R,i} \\
% - \mu (a_{E,i}-y_i), & \text{if } a_{E,i}-a_{R,i} \leq y_i \leq a_{E,i} \\
% \Delta P_{down}(y_i-a_{E,i}), & \text{if } y_i > a_{E,i} 
% \end{cases}
% \end{align}
% \end{small}

% $P$ is the regular price of wind energy sold, $y_i$ is the energy generated during period $i$,  $a_{E,i}$ and $a_{R,i}$ are the bid and up reserve energy volumes for period $i$, respectively. $\nu$ corresponds to the opportunity cost when the company participate in the reserve markets, and $\mu$ is the deploy price of reserve energy. This structure encapsulates three market participation scenarios. In the scenario where $y_i < a_{E,i} - a_{R,i}$, the company overbids, consequently deploying all reserve energy and facing penalties determined by coefficients $\Delta P_{up,1}$, $\Delta P_{up,2}$, and a constant term $F$. If $ a_{E,i}-a_{R,i} \leq y_i \leq a_{E,i}$, the company meets its bid by deploying reserve market energy, thereby avoiding penalties. However, when $y_i > a_{E,i}$, the company underbids, resulting in the selling of surplus electricity at a discount and incurring losses defined by the coefficient $\Delta P_{down}$. We provide more illustrations of this objective in the Appendix.
% \lk{To reduce}
% The optimization objective is to maximize the profit which is a piecewise function consisting of three segments.  In the first scenario, where the bid quantity is greater than the sum of the actual output and the reserve energy capacity, the firm overbids and must incur costs for deploying reserve energy and penalties for surplus electricity. In the second scenario, where the bid quantity exceeds the actual output but remains less than the combined total of the actual output and reserve energy capacity, the firm procures energy from the reserve market to fulfil its bid, thus avoiding penalties. In the final scenario, where the bid quantity is less than the actual output, the firm's underbidding results in selling surplus electricity at a discounted price. We provide more details of the optimization objective in the Appendix.



% \noindent \textbf{Experimental Setup.}
% We use the wind power generation dataset of a German energy company during 08/23/2019 to 09/22/2020 \cite{winddata}.
% The forecaster of the two-stage model, DFL and SO-EBM is a two-layer long short-term memory networks (LSTM) \cite{hochreiter1997long} which takes the historical wind power in the last 24 hours as input features and outputs the forecasted wind power for the subsequent 12 to 24 hours. For a fair comparison, \ours uses the same LSTM architecture as the encoder and 500 attention points. During training, the two-stage model, DFL and SO-EBM use 100 samples to estimate the expected objective as more samples provide little performance gain. 

% \noindent \textbf{Results.} Though this is a convex objective, our method can consistently outperform all the baselines by a significant margin even when the number of GMM components of baselines is larger than the attention points of \ours. Specifically, \ours can improve the decision regret by $5.8\%$
% compared with the strongest baseline.
% The wind power forecasting task is marked by high uncertainty, which makes it challenging to construct an accurate model assumption. Therefore, in this context, a more effective approach is to learn the expected cost function directly from the data, without imposing any parametric distribution assumption. 
%\vspace{-0.5em}
\textbf{Inventory Optimization.}
In this task, a department store predicts the sales $\mathbf{y} \in \mathbb{R}^7$ of a specific product for the upcoming 7th--14th days based on the sales data $\mathbf{x} \in \mathbb{R}^{14}$ of the past 14 days and, accordingly, determines the best replenishment strategy $\mathbf{a} \in \mathbb{R}^7$ for each day. The optimization objective is a combination of an under-purchasing penalty, an over-purchasing penalty, and a squared loss between supply and demand. %We provide more details of the optimization objectives in Appendix \ref{s:experiment}.
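
For illustration only, an objective of this type could take the following form, where the weights $c_u, c_o, c_q \ge 0$ are hypothetical placeholders rather than the values used in our experiments, and $(z)_{+}=\max(z,0)$:
\[
f(\mathbf{a},\mathbf{y}) = \sum_{i=1}^{7}\Big[c_u\,(y_i-a_i)_{+} + c_o\,(a_i-y_i)_{+} + c_q\,(y_i-a_i)^2\Big].
\]
In such a form, choosing $c_u > c_o$ would penalize stock-outs more heavily than surplus. The objective actually used is specified in Appendix~\ref{s:experiment}.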

%\paragraph{Results}

% \vspace{-1em}
% \subsection{Vaccine Distribution for COVID-19}
% \vspace{-0.5em}
% \begin{figure}[t]
%   \centering
%   \includegraphics[width=0.45\textwidth]{figure/covid_new.pdf}
%   \caption{Decision regret on vaccine distribution. }
%   \vspace{-0.5em}
%   \label{fig:covid}
% \end{figure}
%\vspace{-0.7em}
\textbf{Vaccine Distribution for COVID-19.}
During the COVID-19 pandemic, computing a vaccine distribution strategy was one of the most challenging problems for epidemiologists and policymakers. In practice, meta-population epidemiological models based on ordinary differential equations (ODEs)~\citep{pei2020differential} are widely used to predict and evaluate the outcomes of different vaccine distribution strategies. These models rely on human mobility data, such as Origin-Destination (OD) matrices, to capture the dynamics of pandemic spread across diverse locations~\citep{li2020substantial}.
In this task, we are given the OD matrices $\mathbf{x}\in \mathbb{R}^{47\times 47 \times 7}$ of the last week, where $\mathbf{x}[i,j,t]$ represents the number of people moving from region $i$ to region $j$ on day $t$. We need to decide the vaccine distribution $\mathbf{a} \in \mathbb{R}^{47}$ across the 47 regions in Japan under a budget constraint, where $\mathbf{a}[i]$ is the number of vaccines distributed to region $i$.
The optimization objective is to minimize the total number of infected people under the ODE-driven dynamics, based on the forecasted OD matrices $\mathbf{y}\in \mathbb{R}^{47 \times 47 \times 7}$ for the next week. This task is a challenging non-convex optimization problem due to the nonlinear simulation model. 
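
One plausible form of the budget constraint, written with a hypothetical total budget $B$ that is introduced here only for illustration, is
\[
\sum_{i=1}^{47}\mathbf{a}[i] \le B, \qquad \mathbf{a}[i] \ge 0 \quad \text{for } i=1,\dots,47,
\]
so that the feasible set is a scaled simplex over the 47 regions. The formulation actually used is given in Appendix~\ref{s:experiment}.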

Due to space limitations, we provide more details of the experimental setup and the optimization objectives in Appendix~\ref{s:experiment}. 




\subsubsection{Results.} Table~\ref{table:main} presents the decision regret on the three real-world problems, demonstrating that our method consistently outperforms all baselines. Specifically, \ours improves the decision regret by $5.0\%$, $3.7\%$, and $2.0\%$ on power bidding, inventory optimization, and vaccine distribution, respectively, compared to the strongest baseline on each task. These three forecasting tasks are characterized by high uncertainty, making it challenging to formulate a precise model assumption. In such scenarios, it is more effective to learn the expected cost function directly from the data, eliminating the need for any parametric distribution assumption. Moreover, simply increasing the number of components in the GMM does not significantly enhance DFL's performance due to increased sample approximation errors. Finally, the probabilistic approaches generally exhibit better and more reliable performance than methods that rely solely on a point-estimate forecaster.
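
These relative improvements can be checked directly against Table~\ref{table:main}, assuming each one is computed as the relative reduction with respect to the lowest baseline regret in the corresponding column (LODL 1000-GMM for power bidding, SO-EBM and DFL 1000-GMM for inventory optimization, and LODL 500-GMM for vaccine distribution):
\[
\frac{64.13-60.90}{64.13}\approx 5.0\%, \qquad \frac{3.21-3.09}{3.21}\approx 3.7\%, \qquad \frac{502-492}{502}\approx 2.0\%.
\]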


%Particularly, in this work, we use the SEIR (Susceptible-Exposed-Infectious-Recovered) model to forecast the outcomes of different vaccine distribution strategies $\mathbf{a}$. 
%The forecasted number of COVID-19 cases for location $i$ corresponds to the number of cases transitioning from the exposed (E) to the infectious (I) stage. Following the prior research~\cite{chen2022effective}, for each location $i$, we subtract $\mathbf{a}[i]$ individuals from the susceptible population $S[i]$, representing that they have been vaccinated and are no longer susceptible to COVID-19 infection. Our aim is to minimize COVID-19 cases as forecasted by ODE models.

% \begin{wrapfigure}{r}{0.5\textwidth}
%   \centering
%   \begin{center}
%     \includegraphics[width=0.5\textwidth]{figure/covid_new.pdf}
%   \end{center}
%   \caption{Decision regret on vaccine distribution. }
%   \label{fig:covid}
% \end{wrapfigure}

% \noindent\textbf{Experimental Setup.}
% We use the OD matrices dataset of Japan~\cite{japandata} during 04/01/2020 to 02/28/2021.
% The two-stage model, DFL and SO-EBM use DC-RNN \cite{li2018diffusion} as the forecaster which adopts an encoder-decoder architecture. The forecaster takes the OD matrices of last week as input features and predicts the OD matrices of next week. For a fair comparison, \ours employs the same encoder in the network architecture and uses 100 attention points. During training, the two-stage model, DFL and SO-EBM use 100 samples to estimate the expected objective as more samples provide little performance gain. 



% \noindent\textbf{Results.} Fig.~\ref{fig:covid} shows the decision regret on the COVID-19 vaccine distribution problem for all the methods. \ours can outperform the strongest baseline by $3.0\%$ in terms of the decision regret. 
% As we can see, on this challenging task, increasing the number of components in GMM cannot effectively improve the performance of DFL as it may suffer from a more severe sample approximation error. 


%\textcolor{red}{As lockdown ends, COVID-19 response strategies are becoming more adaptable, striking a balance between disease control and the resumption of economic activity. Hence, human movement patterns (and therefore OD matrices)} are changing dynamically with high uncertainty, and imprecise predictions can result in suboptimal vaccine distribution strategies. 





% \subsection{Comparable case \lk{Depends on the space}}
% \lk{I feel it is better to put a case where our model cannot outperform other models when the forecasting task is very simple. Maybe we can use the power task.}




\subsection{Ablation Study}


\begin{figure*}[!ht]
    %\hspace{-1.4cm}
    \centering
    \includegraphics[width=0.9\textwidth]{figure/ablation_v2-KDD.pdf}
    \caption{Ablation study on the impact of attention-based architecture, number of attention points, and training data size on the wind power bidding problem.}
    \label{fig:ablation}
\end{figure*}

% \subsubsection{Experimental Setup}
% We provide the details of the experimental setup and the optimization objectives in Supplementary material~\ref{s:experiment} due to the space limit.
\begin{figure}[h]
    %\hspace{-1.4cm}
    \centering
    \includegraphics[width=0.5\textwidth]{figure/ab23_v2.pdf}
    \caption{\ours vs. DFL with different numbers of samples: the left figure shows decision regret, while the right figure displays training time.}
    \label{fig:ab1}
\end{figure}

\begin{figure}[h]
    %\hspace{-1.4cm}
    \centering
    \includegraphics[width=0.45\textwidth]{figure/ab23_v3-KDD.pdf}
    \caption{Impact of learnable value embeddings and number of action samples.}
    \label{fig:ab23}
\end{figure}



In this subsection, we investigate each component of \ours via ablation studies on the wind power bidding problem. %We aim to answer the following question: Q1: Does the attention-based network architecture plays a key role in the outstanding performance of \ours? Q2: How does the number of attention points affect the performance? Q3: How much data does \ours need?  
% Fig.~\ref{fig:ablation} shows the results and our findings can be summarized as follows:

\emph{Impact of attention-based architecture.} Without the attention-based network architecture, we see a significant performance drop in Fig.~\ref{fig:ablation}(a). This is because, without the attention architecture, the network may not be within the true model class and thus suffers from the high bias error in Proposition~\ref{prop:1}. 

\emph{Impact of number of attention points.} The performance of our model improves with more attention points, as shown in Fig.~\ref{fig:ablation}(b). We also plot the decision regret and training time of DFL. We find that when the number of attention points exceeds 200, \ours outperforms DFL in terms of decision regret while being orders of magnitude faster. 

\emph{Impact of training data size.} Our method consistently outperforms the baselines across different ratios of training data, as shown in Fig.~\ref{fig:ablation}(c). We attribute this superior performance to the attention-based network architecture, which mimics the distribution-based parameterization. Compared with the two-stage model, \ours is decision-aware; compared with DFL methods, it mitigates the three bottlenecks. 




\emph{\ours vs.\ DFL with different numbers of samples.} The number of samples used to estimate the expected objective in DFL is an important hyperparameter. To investigate its impact, we compare the decision regret and training time of \ours with those of DFL using different numbers of samples. We use a GMM with 1000 components in the DFL forecaster, as it achieves the best DFL performance on this task in Table~\ref{table:main}.
As shown in Fig.~\ref{fig:ab1}, when the number of samples for DFL exceeds 100, the performance improvement becomes marginal (64.52 with 100 samples vs. 64.07 with 200 samples). However, the training time increases significantly (1878 seconds/epoch with 100 samples vs. 5251 seconds/epoch with 200 samples). In contrast, \ours achieves significantly better decision regret (58.41) while being orders of magnitude faster (2.17 seconds/epoch). 
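
To put these numbers in perspective, the per-epoch speedup of \ours over DFL follows directly from the reported training times:
\[
\frac{1878}{2.17}\approx 865 \quad (100 \text{ samples}), \qquad \frac{5251}{2.17}\approx 2420 \quad (200 \text{ samples}).
\]
Similarly, doubling the number of DFL samples from 100 to 200 multiplies the training time by $5251/1878\approx 2.8$ while reducing the decision regret by only $(64.52-64.07)/64.52\approx 0.7\%$.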


\emph{Impact of learnable value embeddings.} In \ours, the value embeddings are initialized with randomly sampled labels from the training set and then updated during the training process. An alternative is to directly use these randomly selected labels and keep the value embeddings fixed during the training process.
We examine whether making the value embeddings learnable improves the performance. The results are shown in Fig.~\ref{fig:ab23}(a). As we can see, with learnable value embeddings, the decision regret of \ours decreases significantly compared with the fixed value embeddings.




\emph{Impact of number of action samples.} In \ours, we need to sample actions for each $(\mathbf{x},\mathbf{y})$ pair at each training iteration to fit the function. In this study, we investigate how the number of action samples influences performance. As shown in Fig.~\ref{fig:ab23}(b), the decision regret remains stable even with a sample size of 5. Notably, as the number of action samples increases, the variance of the decision regret across different random seeds decreases, indicating improved stability in the results.


% We investigate the effectiveness of each model design of \ours via ablation studies on the wind power bidding. %We aim to answer the following question: Q1: Does the attention-based network architecture plays a key role in the outstanding performance of \ours? Q2: How does the number of attention points affect the performance? Q3: How much data does \ours need?  

% Fig.~\ref{fig:ablation} (in Appendix.~\ref{sec:ablation}) shows the results and our findings can be summarized as follows:
% (1) Without the attention-based network architecture, we see a significant performance drop in Fig.~\ref{fig:ablation}(a). This is because, without the attention architecture, the network architecture may not be within the true model class and  thus suffer from high bias error in Proposition~\ref{prop:1}. (2) Our model performance can be improved with more attention points as in Fig.\ref{fig:ablation}(b). We also plot the decision regret and training time of DFL. We find that when the number of attention points is over 200, \ours can outperform DFL in terms of the decision regret while being orders of magnitude faster. (3) Our method outperforms baselines constantly with different ratios of training data as shown in Fig.~\ref{fig:ablation}(c). The superior performance is because we use attention-based network architecture to mimic the distribution-based parameterization. Compared with the two-stage model, we are decision-aware; compared with DFL methods, we do not suffer from the three bottlenecks. We also provide additional ablation studies and analyses in Appendix~\ref{sec:ablation}.





% \begin{figure}[t]
%     %\hspace{-1.4cm}
%     \centering
%     \includegraphics[width=0.2\textwidth]{figure/ablation1.pdf}
%     \caption{ablation1}
%     \label{fig:wind}
% \end{figure}

% \begin{figure}[t]
%     %\hspace{-1.4cm}
%     \centering
%     \includegraphics[width=0.5\textwidth]{figure/ablation2.pdf}
%     \caption{ablation2}
%     \label{fig:wind}
% \end{figure}

% \begin{figure}[t]
%     %\hspace{-1.4cm}
%     \centering
%     \includegraphics[width=0.5\textwidth]{figure/ablation3.pdf}
%     \caption{ablation3}
%     \label{fig:wind}
% \end{figure}

% \begin{figure}[t]
%     %\hspace{-1.4cm}
%     \centering
%     \includegraphics[width=0.5\textwidth]{figure/ablation4.pdf}
%     \caption{ablation4}
%     \label{fig:wind}
% \end{figure}



% \begin{figure}[t]
%     \centering
%     \begin{subfigure}[t]{0.45\textwidth}
%         \includegraphics[width=\textwidth]{figure/demand.pdf}
%         \caption{Customer Item Demand}
%         \label{fig:figure3}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}[t]{0.45\textwidth}
%         \includegraphics[width=\textwidth]{figure/sodemand.pdf}
%         \caption{Power Demand}
%         \label{fig:figure4}
%     \end{subfigure}
% \end{figure}
