\section{Introduction}
\begin{figure*}
    \centering
    \includegraphics[width=\linewidth]{uai2025/figs/illustrative_fig.pdf}
    \caption{
    We consider scenarios where the expert holds an imprecise belief over the outcome $o \in \cO$, represented as a set $\cP \subseteq \Delta(\cO)$. The goal is to elicit this belief truthfully, i.e., the best report $Q$ should be $\cP$. Left: precise scoring rules are extended directly to the imprecise case, ignoring the downstream DM. Middle: truthful elicitation in the imprecise setting requires the DM to share their aggregation rule $\rho$ with the expert. Right: to prevent strategic manipulation by the forecaster, the DM shares only a distribution $\theta(\rho)$ over aggregation rules, resulting in a strictly proper scoring rule $s_\theta$.}
    \label{fig:main-fig}
\end{figure*}
Probabilistic forecasting is a powerful tool for decision-making under uncertainty with diverse applications ranging from energy demand forecasting~\citep{pinson2012evaluating,pinson2013wind} and credit risk assessment~\citep{rindt2022survival,yanagisawa2023proper} to machine learning (ML)~\citep{singh2023robust} and large language models (LLMs)~\citep{shao2024language,wu2024elicitationgpt}.
Proper scoring rules serve as fundamental tools for evaluating the quality of probabilistic forecasts~\citep{brier1950verification,murphy1988decomposition,gneiting2007strictly}. They also serve as a backbone for eliciting other properties of distributions, such as moments~\citep{frongillo2014general}. By assigning a numerical score based on the reported forecast and the realized outcome, these rules incentivize truthful reporting, i.e., any deviation from the forecaster's true belief results in a suboptimal expected score. Beyond applications in statistics, proper scoring rules have a deep connection with mechanism design, a sub-field of economics. When a proper scoring rule is used as a payment mechanism, agents have no incentive to lie, a property known as incentive compatibility~\citep{myerson1981optimal}.

Traditionally, scoring rules operate under the assumption that forecasters possess a \emph{precise} probabilistic belief about some uncertain event. They are designed to reward the forecasters whose forecasts reflect their true precise beliefs~\citep{savage1971elicitation, gneiting_probabilistic_2014}. 
For example, in weather forecasting~\citep{brier1950verification} a forecaster who believes there is a 60\% chance of rain tomorrow should ideally report 60\% as their forecast.
However, in many real-world scenarios, forecasters face significant ambiguity: the inherent complexity of atmospheric systems, coupled with limited data and model resolution, introduces substantial imprecision~\citep{wilks2011statistical}. It is thus plausible for forecasters to report imprecise probability assessments in these scenarios; for example, the chance of rain tomorrow may be assessed as lying within the interval $[50\%, 70\%]$. Importantly, classical proper scoring rules built for precise forecasts cannot account for this additional uncertainty~\citep{konek2015}.
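
As a minimal sketch of the precise case, assume a binary outcome $o \in \{0,1\}$ with $o = 1$ denoting rain and a positively oriented quadratic (Brier) score $s(q,o) = -(o-q)^2$. The symbols $p$, $q$, and $S$ are illustrative and not part of the formal setup. Let $p$ be the forecaster's belief that it rains and $q$ the reported probability. The expected score is then
\[
    S(q;p) = \mathbb{E}_{o \sim \mathrm{Bernoulli}(p)}\,[s(q,o)] = -p(1-q)^2 - (1-p)q^2 = -(q-p)^2 - p(1-p),
\]
which is uniquely maximized at $q = p$. A forecaster with $p = 0.6$ therefore does best by reporting $q = 0.6$. If instead the belief is only known to lie in $[0.5, 0.7]$, the maximizer of $S(\cdot\,;p)$ changes with $p$ across the interval, so this argument alone does not single out one best report.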

In the context of machine learning, imprecise forecasting is closely related to the concept of out-of-distribution (OOD) generalization~\citep{muandet_domain_2013,Zhou21:DG-Review}.
In standard supervised learning, where training and test data are assumed to be independent and identically distributed (i.i.d.), the predictive model reflects the learner's precise belief about the data generating process. However, in OOD generalization---where multiple training datasets are observed, and the test data may not be i.i.d. with the training data---\citet{singh2024domain} argue that the notion of generalization (e.g., average-case or worst-case optimization strategy) should be determined by the model's end user, also referred to as the decision-maker (DM). When direct interaction between the learner and the DM is not possible, \citet{singh2024domain} propose an \emph{imprecise learning} algorithm that trains a portfolio of predictors (forecasts) in advance, which are then provided to the DM. In contrast, for practical scenarios where the learner and DM can communicate, eliciting precise forecasts is straightforward using classical scoring rules. However, eliciting imprecise forecasts remains challenging due to the lack of suitable imprecise scoring rules. This gap motivates us to design appropriate imprecise scoring rules that are applicable beyond machine learning contexts. 

The key challenge in designing an appropriate scoring rule arises from the forecaster's epistemic uncertainty. This challenge has led to several impossibility theorems for strictly proper imprecise scoring rules~\citep{seidenfeld2012forecasting,mayo2015accuracy,schoenfield2017accuracy}. However, these works focus solely on eliciting imprecise forecasts from the forecaster, overlooking the fact that probabilistic forecasts are typically used for downstream decision-making, so elicitation is rarely the sole objective. Without input from the DM during elicitation, the forecaster must rely solely on their imprecise belief, which contains inherent ambiguity. This often leads to indecision during elicitation, a key factor behind prior impossibility results. Recently, \citet{frohlich2024scoring} explored imprecise scoring rules involving DMs, but their analysis focused only on min-max (pessimistic) decision-making and lacked a formal discussion of the DM's role. More broadly, indecision can be resolved through subjective choices beyond the min-max rule. However, it cannot be resolved by forecasters alone without eliminating their epistemic uncertainty. We argue that the DM must actively assist forecasters in navigating indecision by communicating their subjective preferences.

\textbf{Our contributions.} To address this challenge, we propose a novel setup for scoring imprecise forecasts where we consider a DM as an additional agent, who actively guides the forecaster in resolving indecision during elicitation (see \Cref{fig:main-fig} for different scenarios). Our contributions are summarized as follows:
\begin{itemize}
    \item We show that prior impossibility results stem from the lack of communication between the DM and the forecaster.
    \item We formalize DM-forecaster communication using aggregation rules from social choice theory~\citep{arrow2012social} and generalize tailored scoring rules~\citep{johnstone2011tailored} to accommodate these aggregations.
   \item We analyze the connection between axiomatic properties of aggregation rules from the social choice perspective and their impact on both truthful elicitation from the forecaster and the DM's decision-making process.
   
    \item By restricting to strategic communication, specifically by sharing only a distribution over aggregation rules, we propose a novel randomized tailored scoring rule that is strictly proper for imprecise forecasts. 
\end{itemize}
The rest of the paper is organized as follows. Section~\ref{sec:preliminaries} introduces proper scoring rules and imprecise probabilities. Section~\ref{section:setup} then formalizes the notion of an imprecise forecaster and outlines decision-making for the forecaster and DM. Next, Section~\ref{sec:imprecisescoringrules} explores imprecise scoring rules, first without communication and then with aggregation. Section~\ref{sec:rand-tailored-rules} presents strictly proper scoring rules for imprecise forecasts, while Section~\ref{sec:related-work} reviews prior work. Finally, Section~\ref{sec:discussion} concludes with a discussion of future directions.