\section{Conclusion and Limitations}
We address three bottlenecks of DFL methods that differentiate through KKT conditions in the probabilistic setting: (1) model mismatch error, (2) sample average approximation error, and (3) gradient approximation error. To this end, we propose \ours -- the first distribution-free DFL method that does not require any model assumption. \ours adopts a distribution-free training objective that learns the expected cost function directly from data. To reduce the bias error, we design an attention-based network architecture, drawing inspiration from the distribution-based parameterization of the expected cost function. Empirically, we demonstrate that \ours is effective across a wide range of stochastic optimization problems with convex or non-convex objectives.


\emph{Limitations.} We focus on the probabilistic setting in which the predictive distribution of the forecasting task has high uncertainty. In this setting, both the model mismatch error and the sample average approximation error are significant. However, if the forecasting task is relatively straightforward, a simple Gaussian distribution might suffice. For certain objective functions, the expectation under a Gaussian distribution has a closed-form expression, so no sampling is needed to evaluate it. In such cases, existing model-based DFL methods may still be a better choice.
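As a simple illustration of such a closed form, consider a scalar quadratic cost; the symbols $y$, $z$, $\mu$, and $\sigma$ below are introduced only for this sketch and are not tied to any specific problem studied in this paper. If $y \sim \mathcal{N}(\mu, \sigma^2)$ and the cost of decision $z$ is $(y - z)^2$, then
\[
\mathbb{E}_{y \sim \mathcal{N}(\mu, \sigma^2)}\!\left[(y - z)^2\right] = (\mu - z)^2 + \sigma^2 ,
\]
which follows from the bias-variance decomposition $\mathbb{E}[(y - z)^2] = (\mathbb{E}[y] - z)^2 + \mathrm{Var}(y)$. Under these assumptions, the expected cost is an explicit differentiable function of the predicted parameters $(\mu, \sigma)$ and the decision $z$.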

 
When the number of attention points is large, scalability may become an issue at inference time. This challenge can potentially be alleviated by employing fast attention mechanisms, such as sparse attention (\eg, Longformer; \citealp{beltagy2020longformer}) or low-rank approximations (\eg, Linformer; \citealp{wang2020linformer}).



%Due to space limit, we discuss limitations in Appendix.\ref{s:limitations}.


