\def\stmt{$A$}
% \def\stmt{$\phi$}


\commentout{%oli-v1
	The ability to articulate a \emph{degree of confidence}
	% (or the opposite: a degree of uncertainty)
	is a critical aspect of representing knowledge.
	There are
	many well-established ways to quantify (un)certainty \parencite[\S2]{halpern2017reasoning},
		and chief among them is probability.
	While ``confidence'' can be coherently read in probabilistic terms,
		such usage may shadow another important concept.
	This paper details a different conception that arises when updating beliefs.
	As we shall see, this notion of confidence
	complements traditional representations of uncertainty (such as probability),
	and moreover unifies several different concepts across AI.
}

% What should it mean to say that one has a high degree of confidence in a statement $\phi$? 
% It is often taken to mean that we think $\phi$ is likely.
What does it mean to have a high degree of confidence in a statement $\phi$? 
It is often taken to mean that $\phi$ is likely.
% This paper details a different conception that arises when updating beliefs.
% As we shall see, this notion of confidence
% complements traditional representations of uncertainty (such as probability),
We argue that there is also another conception of confidence that arises when learning---one that complements likelihood and, moreover, unifies several different concepts in the literature.
% Here we argue that there is a related but more useful way of defining confidence, which complements notions like probability and, moreover, unifies several different concepts in the literature.
%oli4
% For us, confidence is a measure of \emph{trust}, rather than likelihood.
This kind of confidence is a measure of \emph{trust} in an observation $\phi$, rather than its likelihood;
% We are intereted in the conception of confidence as measuring \emph{trust}, rather than likelihood.
% In more detail, the {degree of confidence} that one has in a piece
% The degree of confidence in a new piece of information $\phi$
%is a number  $\chi \in [\bot,\top]$ that
it
quantifies how seriously to take $\phi$ in updating our beliefs.
So at one extreme,
if we observe $\phi$ but have no confidence in it,
we do not change our beliefs at all;
at the other, if we have full confidence in $\phi$,
we fully (and irreversibly) incorporate it into our beliefs.

% If our belief state is a probability measure $\Pr$ and $\phi$ is an event, for example, then fully incorporating $\phi$ amounts to conditioning $\Pr$ on it.
\begin{example}
	% [Linear interpolation]
 \label{ex:prob-simple}
%oli2: alternate presentation that makes it clear that \Pr | \phi is
%   a conditional measure; it is uncommon to write it this way
%   in modern ML, and may be confusing without explanation.
% Suppose our belief state is a probability measure, and $\phi$ is an event.
% A full-confidence update then amounts to conditioning on $\phi$, after which $\phi$ has probability 1, and so cannot be further incorporated.
Suppose our belief state is a probability measure $P$, and we observe an event $\phi$.
% A full-confidence update then amounts to conditioning on $\phi$ (i.e., adopting the belief state $\Pr\mid\phi$), after which $\phi$ has probability 1 and cannot be further incorporated.
The standard way to learn $\phi$ is to condition on it  (i.e., adopt belief state $P \mid \phi$). 
% A full-confidence update then amounts to conditioning on $\phi$ (i.e., adopting the belief state $\Pr\mid\phi$), after which $\phi$ has probability 1 and cannot be further incorporated.
This is a full-confidence update; $\phi$ has probability 1 afterwards, 
	and conditioning on it again has no further effect.
Here is one obvious way
to interpret intermediate degrees of confidence:
starting with prior $P$ and
learning $\phi$ with confidence
$\alpha \in [0,1]$,
%oli2:
% and start with prior probability $\Pr$, then we end up with the
% and start with prior $P$, 
% then we might end up with the
% leads us to end up with
we end up with
%oli1: I agree that this notation is nicer, but these days it
% is also much less standard than p(X) and p(X|\phi). In particular,
% my ML friends who work with applied probability a lot do not understand
% how to parse  "\Pr | \phi".
posterior $(1-\alpha)P + \alpha (P \mid \phi)$.
%
Thus, having high confidence in $\phi$ leads to posterior beliefs that give $\phi$
high probability.
The converse is false, however.
	% so confidence and probability can be quite different.
If an untrusted source tells us $\phi$, which we already happen to believe,
then our prior assigns $\phi$ high probability,
we learn $\phi$ with low confidence,
and our posterior beliefs still give $\phi$ high probability.
%joe1: I don't know what "more independent" means
% Confidence and prior probability are even more independent:
% Also,
% Confidence is even less coupled to prior probability:
% Prior probabilty is further decoupled from confidence:
% Prior probability ($\Pr(\phi)$) and confidence $\chi$ are further decoupled:
%joe3: Cut the rest of the paragraph. In the previous you argued that confidence is not the same as posterior probability. Now you suddenly bring in prior probability. The story here is completely unclear.
%oli3: Right, I previously argued that confidence is not the same as posterior probability. Now I'm saying it's not the same as prior probability either.  (and indeed, it's even less related.) Why is the story unclear?
\commentout{
Prior probability and confidence are further decoupled:
if we learn a surprising fact $\phi$ from a trusted source, we have high confidence in $\phi$ despite it having low prior probability.
}
% This leads to an important observation: one's confidence in $\phi$ is not a feature of one's prior probability at all.
% Thus, confidence in $\phi$ is independent from $\Pr(\phi)$.
\end{example}
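
To make the gap between confidence and posterior probability concrete, here is a small worked instance of the interpolation in \cref{ex:prob-simple}; the numbers are chosen purely for illustration.
Since $(P \mid \phi)(\phi) = 1$, the posterior $P' := (1-\alpha) P + \alpha (P \mid \phi)$ satisfies
\[
	P'(\phi) = (1-\alpha)\, P(\phi) + \alpha = P(\phi) + \alpha\, \big(1 - P(\phi)\big).
\]
With prior $P(\phi) = 0.9$ and low confidence $\alpha = 0.1$, this gives $P'(\phi) = 0.9 + 0.1 \cdot 0.1 = 0.91$;
with prior $P(\phi) = 0.2$ and high confidence $\alpha = 0.9$, it gives $P'(\phi) = 0.2 + 0.9 \cdot 0.8 = 0.92$.
The two posteriors give $\phi$ nearly the same probability, even though the confidence differs by a factor of nine.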


Confidence allows us to be uncertain about observations,
which is quite different in principle from making observations that are uncertain.
% There are well established ways of doing this, such as Jeffrey's rule
% \emph{Jeffrey's rule} \parencite{} is a well-established way of handling the latter, and so it is often viewed as a generalization of conditioning that allows for uncertain observations.
\emph{Jeffrey's rule} (\citeyear{Jeffrey68}) 
% (see \cref{sec:full-conf}) 
is a well-established approach to the latter.
% An important feature of the former, however, is that it enables
% A critical feature of confidence, by contrast, is that it enables learning without fully committing to new observations.
% A critical aspect of intermediate confidence, by contrast, is that it enables learning without fully committing to new observations.
An important feature of the former, however, is that it enables
learning without fully committing to new observations.
% Contrast this with conditioning, which is irreversible:
% Updating with full confidence in \cref{ex:prob-simple} means 
Full-confidence updates, such as conditioning in \cref{ex:prob-simple}, are irreversible: 
% once you condition on $\phi$, and there is no way to recover your prior probability of $\phi$. 
% there is no way to ``$\phi$-uncondition'' a posterior $\Pr|\phi$ to recover the prior $\Pr$.
from $\phi$ and the posterior $P \mid \phi$, it is not possible, in general, to recover the prior belief $P$.
% The same is true of Jeffrey's rule, and so both are full-confidence updates.
% As we will see, the same is true of Jeffrey's rule; 
The same is true of Jeffrey's rule,
 % which we also view as prescribing full-confidence updates. 
which, in our conception, also prescribes full-confidence updates.
	%  (in which one's observation is probabilistic). 
% ---just with different observations.
% Contrast this with conditioning, which is irreversible: once you condition on $A$,
% Full-confidence updates are irreversible: once you condition on $\phi$, for  
% and there is no way to recover your prior probability of $\phi$. 
% The same is true of Jeffrey's rule. 
 % and in our telling, both are full-confidence updates---just with different observations.
% which we will later see is a full-confidence update as well.
% which (as we will later see) also prescribes a full-confidence update.
%
% Our work is more closely related to that of
The concept we propose here is more similar to 
% that of \citeauthor{shafer1976mathematical}, one thrust of whose \citeyear{shafer1976mathematical} book is to develop a theory of what we have been calling confidence, tailored to a specific representation of uncertainty \cite{shafer1976mathematical}.
that behind Shafer's \emph{Theory of Evidence} (\citeyear{shafer1976mathematical}),
although his account is specialized to a specific representation of uncertainty that has since fallen out of fashion.

\begin{example}
	% [Evidence]
	%[Belief Functions]
 	\label{ex:shafer}
Suppose our beliefs are represented by a 
\emph{%
	(Dempster-Shafer)
belief function},
which generalizes a probability measure over
a finite set $W$ of possible worlds.
% Let $W$ be a finite set of possible worlds,
% and suppose our beliefs are represented by
% \emph{belief function}: a generalization of a probability over $W$,
% in which the degree of belief in an event and in its complement 
%  	may sum to less than one.
\commentout{
    More precisely,
    % belief functions are in 1-1 correspondence with
    we define our belief state to be a
    \emph{mass function} $m : 2^W \! \to\! [0,1]$
    satisfying $\sum_{U \subseteq W} m(U) \!=\! 1$
    and $m(\emptyset) \!=\! 0$.
    Such mass functions have a 1-1 correspondence with belief functions, and
    the belief function corresponding to $m$ is given by
    $\Bel_m(U) = \sum_{V \subseteq U} m(V)$
    \parencite{shafer1976mathematical}.}%
%
% \def\complem#1{W\setminus #1}
% \def\complem#1{\bar{ #1}}
\def\complem#1{\overline{ #1}}%
% \def\complem#1{\lnot #1}
% \def\complem#1{ { #1}^c}
%
Like a probability, a belief function $\Bel$ assigns to each event
$U \subseteq W$ a number $\Bel(U) \in [0,1]$,
% and satisfies
with
$\Bel(\emptyset) = 0$ and $\Bel(W) = 1$.
% It is not necessarily the case that
It need not be the case that
$\Bel(U) + \Bel(\complem{U}) = 1$, but $\Bel$
must satisfy certain axioms (whose details do not matter for our purposes)
ensuring that
% \begin{equation}
$
	\Bel(U) + \Bel(\complem{U}) \le 1.
$
%         \tag{{$\ast$}}
%         \label{eq:bel-plaus-le-one}
% \end{equation}
$\Bel$ can be equivalently represented by its
% corresponding
\emph{plausibility function}
$\Plaus(U) := 1 - \Bel(\complem{U})$.
% From \eqref{eq:bel-plaus-le-one} it follows 
It is easy to see that $\Bel(U) \le \Plaus(U)$, and 
if $\Bel$ is a probability measure, then
% the two are equal.
% $\Bel(U) = \Plaus(U)$.
$\Bel = \Plaus$.

\commentout{
	Suppose we come across evidence that supports an event $\phi
	\subseteq W$ to a degree $\alpha \in [0,1]$.
	Together, $\phi$ and our confidence $\alpha$ in it
	can be represented by another mass function $s$,
	called a \emph{simple support function},
	% that places $\alpha$ of its mass on $\phi$, and the rest on the trivial event $W$.
	by placing mass $\alpha$ on the event $\phi$, and the rest $(1-\alpha)$
	on the trivial event $W$.
	%
	To combine our prior belief $m$ with the new evidence $s$,
	Shafer argues we should use Dempster's rule of combination
	% to obtain a posterior belief
	to obtain a posterior $m' := m \oplus s$,
	% given in this case by:
	which in this case equals:
	\begin{align*}
	 	m'(U) &=
		\frac{1}{\!\displaystyle 1 - \alpha \sum_{\mathclap{V \subseteq (W \setminus \phi)}} m(V)\!}
		\Big(
		(1-\alpha) m(U) +
		\alpha \sum_{\substack{\mathclap{V \subseteq W} \\ \mathllap{V \cap \phi} = \mathrlap{U}}} m(V)
			\Big).
	\end{align*}
	It is easy to verify that when $\alpha = 0$, the posterior beliefs are the same as the
	prior ones, and that when $\alpha = 1$,
	 % this has the effect of removing all
	% one effect is to discard the mass from sets
	all mass is assigned to subsets of $\phi$.
	It follows that, after the update, $\Bel_{m'}(\phi)$.
	% It follows that the posterior degree of belief
	% in $\phi$, (or any set that contains $\phi$) equals one.
	So again, we have two extremes in confidence, continuously parameterized
	by a value $\alpha \in [0,1]$.
	}
Suppose we come across
% a piece of information
evidence
that supports an event $\phi \subseteq W$
to a degree $\alpha \in [0,1]$.
Together, $\phi$ and our confidence $\alpha$ in it
can be represented by
% a belief function $\Bel_{(\alpha,\phi)}$ that Shafer calls a \emph{simple support function}, by
the \emph{simple support function}
% $\Bel_{(\alpha,\phi)}$
\vspace{-2ex}
\[
    \qquad\Bel_{(\alpha,\phi)}(U) := \begin{cases}
        1 & 
		% \text { otherwise.}
		\text{ if }U = W \\
        \alpha & \text{ if } \phi \subseteq U \subsetneq W \\
		0 &\text{ otherwise. } \\
    \end{cases}
\]

To combine belief functions,
	Shafer argues for Dempster's \emph{rule of combination} ($\oplus$). 
% {\color{gray}
If we use $\oplus$ to combine two
%  (independent)
simple support functions for $\phi$
with degrees of support $\alpha_1$ and $\alpha_2$, we get another simple support function
for $\phi$, with combined support $\alpha_1 + \alpha_2 - \alpha_1\alpha_2$.
% We will later see that this is one canonical form of confidence. 
% What happens if we combine two (independent) simple support functions for $\phi$?
% It turns out that 
% % \[ 
% $
% 	\Bel_{(\alpha,\phi)} \oplus \Bel_{(\alpha',\phi)}
% 	 = \Bel_{(\alpha + \alpha' - \alpha\alpha', \phi)},
% % \]
% $
% which may look unnatural (but coincides with the effect of multiple updates \cref{ex:prob-simple}---so although).
% Is there an additive representation of confidence? 
% Is there always a way of representing confidence that combines additively?
% Shafer calls such a quantity \emph{weight of evidence}, and proves it must be of the form $w = - k \log (1-\alpha)$ for some $k > 0$ [\citeauthor[pg 78]{shafer1976mathematical}].
% There is; 
% In \cref{sec:vecrep}, we will see that there always also a way of measuring confidence additively. 
As we will see in \cref{sec:vecrep}, confidence also has an additive form. 
In Shafer's theory, this is the \emph{weight of evidence} $w = - k \log (1-\alpha)$ for some $k > 0$ [\citeauthor[pg 78]{shafer1976mathematical}].
The additive form of confidence plays a fundamental role in Shafer's theory,
	as it does in ours.
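The additivity can be checked directly from the combination rule above.
The following short derivation is a sketch that uses only the stated formula for combined support; the symbols $\alpha_{12}$, $w_1$, $w_2$, and $w_{12}$ are introduced here for illustration.
Let $\alpha_{12} := \alpha_1 + \alpha_2 - \alpha_1\alpha_2$. Then
\[
	1 - \alpha_{12} = (1-\alpha_1)(1-\alpha_2),
\]
so the corresponding weights satisfy
\[
	w_{12} = -k \log (1 - \alpha_{12})
		= -k \log (1-\alpha_1) - k \log (1-\alpha_2)
		= w_1 + w_2.
\]
For instance, two pieces of evidence with $\alpha_1 = \alpha_2 = 0.5$ combine to $\alpha_{12} = 0.75$; taking $k = 1/\log 2$, their weights are $w_1 = w_2 = 1$ and $w_{12} = 2$.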
% }

% When we use $\plus$ to combine our prior belief $\Bel$ with the evidence $\Bel_{(\alpha,\phi)}$
% upon learning $\phi$ with confidence $\alpha$,
Using $\oplus$ to combine our prior with our evidence leads to
% in this case, that means adopting the
posterior belief $\Bel' := \Bel \oplus \Bel_{(\alpha,\phi)}$,
whose plausibility function happens to be %
	% \footnote{\label{fn:appendixproof}see the appendix for proof}
\begin{equation}
\Plaus'(U) = \frac
	{\alpha\; \Plaus(U \cap \phi) + (1-\alpha)\, \Plaus(U)}
	{1 - \alpha + \alpha\; \Plaus(\phi)}.
\label{eq:ds-plaus}
\end{equation}
It is easy to verify that
% the posterior beliefs are the same as the prior ones
$\Bel' = \Bel$
when $\alpha = 0$,
and it can also be shown
% \unskip\footnotemark[\ref{fn:appendixproof}]
that
$\Bel'(\phi) = \Plaus'(\phi) = 1$ when $\alpha = 1$.
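Both endpoint claims can be read off from \eqref{eq:ds-plaus}; the following is a sketch that assumes $\Plaus(\phi) > 0$, so that the denominator is nonzero at $\alpha = 1$.
Setting $\alpha = 0$ gives $\Plaus'(U) = \Plaus(U)$ for every $U$, and hence $\Bel' = \Bel$.
Setting $\alpha = 1$ gives
\[
	\Plaus'(U) = \frac{\Plaus(U \cap \phi)}{\Plaus(\phi)},
\]
so $\Plaus'(\phi) = 1$ and, because $\Plaus(\emptyset) = 1 - \Bel(W) = 0$,
\[
	\Bel'(\phi) = 1 - \Plaus'(\complem{\phi})
		= 1 - \frac{\Plaus(\complem{\phi} \cap \phi)}{\Plaus(\phi)}
		= 1 - \frac{\Plaus(\emptyset)}{\Plaus(\phi)}
		= 1.
\]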
% It follows that the posterior degree of belief
% in $\phi$, (or any set that contains $\phi$) equals one.
% Again we have two extremes in confidence,
So, as before, confidence $\alpha \in [0,1]$ parameterizes a continuous path
from ignoring $\phi$ to fully incorporating it.
%
%
\commentout{
	Alternatively, suppose that $m$ is not a probability but rather another simple support function on $\phi$. Then so is $m' = m\oplus s$.
	How much total evidence for $\phi$ does $m'$ represent?
	It is overwhelmingly standard to have a measurement that combines additively: if you had three (distinct) gallons of water and get another, you now have four; if you had six (independent) random bits and get three more, you now have nine.
	Is there an additive measure of confidence for simple support functions?
	Shafer calls such a quantity \emph{weight of evidence}, and proves that that of $s$ must be of the form $w = - k \log (1-\alpha)$ for some $k > 0$ [\citeauthor[pg 78]{shafer1976mathematical}].
	\commentout{
		Note that this is precisely the expression for $t$
		in \eqref{eq:loglogiota},
		because a choice of $\iota < 1$
		is equivalent to a choice of $k = \log(1-\iota) < 0$.
	}
	Weight of evidence
	is another important way of measuring confidence,
	and plays
	a fundemental rule in the theory of belief functions
	[\citeauthor[e.g.][Theorem 5.5]{shafer1976mathematical}]
}%
% Weight of evidence is not just additive;
% it also plays a fundemental role in the theory of belief
% functions.  For example, it provides a
% canonical (and minimal) way of decomposing
% combined evidence into simple support functions
% [\citeauthor[Theorem 5.5]{shafer1976mathematical}].
% This is why Shafer  defines it on page 8, and devotes Chapter 5 to studying it.
%
Yet the meaning of intermediate degrees of confidence can be subtle. 
	% and can be difficult to assign numerically.
In the special case where $\Bel = \Plaus$ is a probability
% a full confidence update ($\alpha=1$)
measure, 
a full-confidence update ($\alpha=1$) yields the same conditioned 
probability $\Plaus' = (\Plaus \mid \phi)$ as in \cref{ex:prob-simple}.
Furthermore, the set of possible posteriors for intermediate $\alpha \in (0,1)$ is the same in both cases.
% Moreover, as a function of $\alpha \in [0,1]$, $\Plaus'$ is
% % a path that begins at $p$ and ends at $p|A$,
% a path that begins at $\Plaus$, ends at $(\Plaus |\phi)$,
% just like in \cref{ex:prob-simple}---%
% yet it is parameterized differently.
% yet intermediate values have different meanings.
However, the two paths are parameterized differently;
	% , the two updating procedures yield different posteriors
	in fact, 
	% the two updates disagree
% for every intermediate value of $\alpha$.
% for every $0 < \alpha < 1$.
for all $\alpha \in (0,1)$ the two updates disagree (provided $\Plaus(\phi) < 1$).
% when $0 < \alpha < 1$.
% Thus, in order to appropriately determine a numerical value of confidence, we need to know something about how updates are made.
It follows that
% the appropriate numerical value of confidence $\alpha$ must depend on 
the appropriate numerical value of $\alpha$ must depend on 
more than just an intuition of ``fraction of the way to the update''.
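To see the disagreement explicitly, the two updates can be compared in the probabilistic case.
This is a sketch derived from \eqref{eq:ds-plaus} and \cref{ex:prob-simple}, writing $P$ for the probability measure $\Bel = \Plaus$, assuming $0 < P(\phi) < 1$, and introducing the symbol $\beta$ for illustration.
Since $P(U \cap \phi) = P(\phi)\, (P \mid \phi)(U)$, equation \eqref{eq:ds-plaus} becomes
\[
	\Plaus' = (1-\beta)\, P + \beta\, (P \mid \phi),
	\qquad\text{where}\quad
	\beta := \frac{\alpha\, P(\phi)}{1 - \alpha + \alpha\, P(\phi)}.
\]
So the posterior is a mixture of the same two measures as in \cref{ex:prob-simple}, but with weight $\beta$ in place of $\alpha$, and
\[
	\alpha - \beta = \frac{\alpha\,(1-\alpha)\,\big(1 - P(\phi)\big)}{1 - \alpha + \alpha\, P(\phi)} > 0
	\qquad\text{for all } \alpha \in (0,1).
\]
For example, with $P(\phi) = \frac12$ and $\alpha = \frac12$, we get $\beta = \frac13$: linear interpolation yields posterior probability $\frac34$ for $\phi$, while the update of \eqref{eq:ds-plaus} yields $\frac23$.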
%
%
\commentout{
	We now look at some special cases. Suppose that $\Bel_m$ is a probability measure $\Pr$, or equivalently, that $m$ only assigns mass to singletons. Then $m'$ also only assigns mass to singletons, and is given by:
	\begin{equation}
		m'(\{x\}) =
	 	\frac{\alpha\; \Pr(\{x\} \cap \phi) + (1-\alpha)\, \Pr(\{x\})}{1 - \alpha + \alpha\; \Pr(\phi)}.
	 	\label{eq:ds-prob}
	\end{equation}
	Thus, as a function of $\alpha \in [0,1]$, $m'$ is a path that begins at $\Pr$,  ends at $(\Pr |\phi)$, and can even be viewed as a ``proportion of the way to incorporation'', just like in \cref{ex:prob-simple}---%
	% yet it is parameterized differently.
	yet intermediate values have different meanings.
	% Thus, we need more assumptions in order to pin down the exact meaning of an intermediate confidence value.
	Therefore, to appropriately determine a numerical value of confidence, you need to know something more about the updating procedure.
	}%
\end{example}

%oli2:
% Shafer's two different representations of confidence---the
% degree of support $\alpha$ and the weight of evidence $w$---address
% a real difficulty with the Bayesian formalism: how to handle
% low-confidence observations gracefully.
Shafer's theory aims to address two seemingly problematic aspects of Bayesianism:
% It  allows for belief states that represent ignorance,
% and (2) for observations other than those that
it  prescribes a belief representation that can better handle ignorance, 
and enables observations other than those that ``establish a single proposition with certainty'' \parencite[Chapter 1: \S7,\S8]{shafer1976mathematical}.
% The theory is effective on both counts, but we much more interested in the second:
% Unfortunately, Shafer's solution to the second problem has not been adopted precisely because he solves the first problem, and alienates the many people who would prefer to use something other than a 
Ironically, by solving the first problem in this way, Shafer makes his solution to the second inaccessible to those who do not work with Dempster-Shafer belief functions. 
% The present paper, and the general notion of confidence we formalize in \cref{sec:updateformalism}
% The notion of confidence we present in this paper
% can be thought of a vast
% generalization of how Shafer handles issue (2).
%
%oli1: something bothers me about starting a paragraph with "but",
% especially without directly pushing off of something concrete about the
% way the previous sentence ended.
% But confidence can be applied far more broadly.
% This notion of confidence applies far more broadly.
% Our notion of confidence, however, applies far more broadly.
Our notion of learner's confidence directly addresses Shafer's second concern, but applies far more broadly.
%
\vnew{A significant strength of our approach is that we do not take a stand on how beliefs should be represented: the concept of trust applies whether you use probability measures, belief functions, graphical models, imprecise probabilities, or something entirely different.}
% Here is an example that has similar
% We now give a quite different ,
% We now give an example with the same critical elements, but a very different flavor.
To illustrate, we 
% now give a very different example with the same critical elements.
now unpack the role of confidence in neural networks.
% of a very different flavor,
% in which confidence is measured differently.

\begin{example}
		[Training a NN]\label{ex:classifier}
% Fix a neural network $N$. We may view the ``belief state'' of $N$ as a setting $\theta \in \Theta \subseteq \mathbb R^d$ of possible weights.
% Consider a neural network, whose ``belief state'' may be viewed a setting of weights $\theta \in \Theta \subseteq \mathbb R^{d}$.
The ``belief state'' of a neural network may be viewed as a setting $\theta \in \Theta \subseteq \mathbb R^d$ of weight parameters.
For definiteness, suppose we are talking about a classifier, so that
there is a space $X$ of inputs, a finite set $Y$ of labels,
% and for every $\theta$, there is a function
and a parameterized family of functions
%oli1: I can also siplify this to be a function f : X \to [0,1] if it's
% a binary classifier, which will simplify the text.
$\{ f_\theta : X \to \Delta Y \}_{\theta \in \Theta}$ mapping inputs $x \in X$ to distributions $f_\theta(x) \in \Delta Y$ over labels.
%oli1: added
In the supervised setting, an observation is a pair $(x,y)$ consisting of an input $x$ 
labeled with class $y$.

Suppose we now observe $\phi = (x,y)$
with some degree of confidence;
% If we observe $\phi = (x,y)$
% with some degree of confidence,
how should we update the weights $\theta$?
%
% \def\step{A}
% \def\step{\mathit{train}}
\def\step{\mathtt{step}}
% \def\step{G}
% In contrast with our previous examples, it is not so obvious what to
In contrast with previous examples, it is not so obvious
% do for full full confidence.
% In previous examples, we 
	% began with intuition about 
% % began with strong intuition about
% % how to handle
	% full confidence,
% 	% ---
% 	but here it is less obvious 
		% what it should mean to learn
		how to learn
		 $\phi$ with full confidence.
% Modern learning algorithms, by contrast,
Instead, modern learning algorithms
tend to be 
iterative
procedures
% $\step$
$\step: (X \times Y) \times \Theta \to \Theta$
that make small adjustments 
% $\theta' = \step(\phi,\theta)$
% $\theta \mapsto \theta' = \step(\phi,\theta)$
$\theta \mapsto \step(\phi,\theta)$
to the weights
%\unskip, corresponding to a smaller intermediate level of confidence. 
% \unskip;  to a low level of confidence. 
\unskip. 
Each step is essentially a low-confidence update.
%
There is no guarantee, for example, that 
	% $f_{\theta'}(x)$ 
	$f_{\step(\phi,\theta)}(x)$ 
gives high probability to $y$; the only guarantee is that this probability is higher than $f_\theta(x)(y)$.
% Recall how conditioning is not invertible. 
This lower level of confidence is arguably what makes these learning algorithms robust to noisy and contradictory inputs. 
% \unskip\footnote{(in contrast to their historical counterparts
% 		like conjunction learning \cite{conj_learning},
%         and learning algorithms for decision trees)}
\commentout{
	In other words, such algorithms	do not take any one encounter with a training example too seriously.
	Indeed, this lower level of confidence
	is arguably what makes this learning process robust to noisy or contradictory inputs.
}\commentout{
	Modern learning algorithms (like gradient descent)
	are iterative procedures that
		% repatedly
	make incremental changes to the weights.
	Therefore, if we perform one iteration of such a procedure 
	to update $\theta$ using a labeled training example $\phi = (x,y)$ to obtain new weights $\theta'$, there is no guarantee that $f_{\theta'}(x)$ gives high probability to $y$---only that it is higher than it was before.
	 % that it assigns higher probability to $y$ than $f_{\theta}(x)$ does.
	% does not guarantee that the resulting network handles $x$ correctly.
	% In other words, the algorithm does not take any individual point too seriously.
	In other words, such algorithms 
		(in contrast to their historical counterparts like conjunction learning algorithms \parencite{conjunctions})
	do not take any one encounter with a training example too seriously---
	% \unskip that is, they make low-confidence updates.
	% \unskip that is, they make low-confidence updates to the belief state (i.e., the weights).
	\unskip that is, they make low-confidence updates to the weights.
	% So, in contrast with conditioning, there is a significant difference betwen cycling through the training data once, and doing so many times.
	This relative distrust of individual data points is arguably what makes the training process robust to noisy or contradictory observations.
}\commentout{
	As a result,
	there is a significant difference between going through the training data once
	 % (a single epoch)
	\unskip, and doing so many times.}%

% Nevertheless, this approach can still handle higher levels of confidence
Higher-confidence updates 
	can be obtained by 
	applying $\step$ more than once.
	% with multiple applications of $\step$. 
\def\thetainf{\theta_\infty}%
\def\thetalim{\theta_*}%
From initial weights $\theta_0$
	and defining $\theta_{n+1} = \step(\phi,\theta_{n})$,
	we get a sequence
	$(\theta_0, \theta_1, \theta_2, \ldots)$
	that converges to some $\thetalim \in \Theta$.
These limiting weights fully incorporate $\phi$ 
% into $\theta_0$ in at least two senses:
% $\thetalim = \step(\phi,\thetalim)$ so $\phi$ cannot be further incorporated by $\step$,
in the sense that
$\thetalim = \step(\phi,\thetalim)$, 
and also that 
$f_{\thetalim}(x)(y) = 1$ (at least if the network is sufficiently over-parameterized), i.e., $x$ is classified as $y$ with probability 1. 
%joe2: I don't mind white lies in the introduction, as long as experts won't be uncomfortable with it.
\commentout{\footnote{%
	Note for Joe: this is a white lie; the truth depends a bit on the architecture, and we may require that the space is compact. But this can be easily achieved by considering weights taking extended real values.}}
%
Correspondingly, adopting belief $\thetalim$ is
% often expensive, and
appropriate only if we have complete trust in $\phi$,
meaning we find it critical that $x$ be classified as $y$.
(At the other extreme, 
	% of course, if we have no trust in $\phi$, we should simply
	if we have no confidence in $\phi$, we should
	not update $\theta$ at all.)
%Furthermore, our definition of a full-confidence update
%already suggests what to do for intermediate levels of confidence:
% Our definition of a full-confidence update
% also suggests what to do for intermediate levels of confidence:
% simply stop the training process before convergence.
Thus, the number of training iterations $n$
% functions as 
is a
measure of
% a description of
% how to handle
% intermediate levels of
confidence: it interpolates
% the sequence of itermediate settings of weights
%oli1: I'll save this verbage for later, per your request
% describes a path
% provides.
between no confidence (zero iterations of $\step$) and full confidence
(infinitely many iterations of $\step$).
% \footnote{Another white lie: This path can be made into a continuous path by interpolating with a line segment, and made smooth in the limit of small step sizes; we will deal with both constructions in \cref{sec:project-additive}.}
% As we will see in \cref{sec:loss-repr}, such
%joe2*: Although I didn't cut this, I don't think this is the
%right place for this point.  Among other things, you've switched out
%of the blue from %confidence being a number in [0,1] to being a number
%in [0,\infty]
%oli2: I've been persuaded to cut this, because I agree that we can
% strengthen the narrative by placing it elsewhere. However, the
% the switching from a number in [0,1] to a number in [0, \infty]
% is (in my view) unavoidable for this example.
\commentout{
	This way of measuring confidence has a convenient property:
	% starting from intital weights $\theta$,
	first updating with confidence $n$ (that is, performing $n$ training iterations),
	and then afterwards updating with confidence $m$ (so $m$ additional iterations),
	is equivalent to a single update with confidence $m+n$.
	We call a measure of confidence that behaves this way \emph{additive}.
}%
% It is also additive:
Like Shafer's weight of evidence (\cref{ex:shafer}), the number of training iterations is an additive measure of confidence.
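As a concrete sketch, one common choice of $\step$ (an assumption made here for illustration; the discussion above does not fix a particular procedure) is a gradient step on the log-likelihood of the observed label, with a hypothetical learning rate $\eta > 0$:
\[
	\step\big((x,y),\, \theta\big) = \theta + \eta\, \nabla_\theta \log f_\theta(x)(y).
\]
Writing $\step^{n}(\phi, \theta)$ for the result of applying $\step(\phi, \,\cdot\,)$ to $\theta$ a total of $n$ times, additivity is the identity
\[
	\step^{m}\big(\phi,\, \step^{n}(\phi, \theta)\big) = \step^{m+n}(\phi, \theta),
\]
which holds for any choice of $\step$, since both sides apply the same map $m + n$ times.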
	% first updating with confidence $n$ (that is, performing $n$ training iterations),
	% and then afterwards updating with confidence $m$ (so $m$ additional iterations),
	% is equivalent to a single update with confidence $m+n$.
	%
	% first updating with confidence $n$ (i.e., performing $n$ training iterations),
	% and then afterwards updating with confidence $m$ ($m$ more),
	% amounts to a single update with confidence $m+n$. 
	


In the simplest settings, 
training examples do not come with confidence annotations,
in which case one effectively treats them all with 
	the same default confidence (by selecting a learning rate).
	% (a number closely related to the learning rate).
The number of times that $\phi = (x,y)$ appears in a dataset
	is then the de facto measure of confidence in $\phi$.
	% confidence in $\phi$.
% While this may be appropriate if examples are 
Often, though, these are not our intended confidences,
	which is why it can be helpful to remove duplicates 
	% \parencite{no-duplicates}.
	\citep{lee2021deduplicating}
	\unskip.
% ---% a number which is closely related to the learning rate.
In richer settings, a more nuanced degree of confidence specific to each training example often arises, 
% Richer settings sometimes have a more nuanced  of confidence specific to each training example, 
    such as agreement between annotators 
    % \parencite{kappa},
	\citep{artstein2017inter}
	\unskip,
    or
%TODO: get better citation; that paper focuses on something more
% technical and specific, but has some citations that might be better.
confidence scores in self-training \parencite{zou2019confidence}.

It is worth emphasizing that confidence
	% is not just a matter of accuracy.
	% in a training example 
	is not always just a matter of accuracy.
Suppose, for example, that the classifier is intended to screen job applications, and that we want to make hiring practices less discriminatory.
In this case, we should have low confidence in training data based on prior hiring decisions---not because it is inaccurate, but because we do not trust it to inform our new hiring practice.
\end{example}




%%%%% PARAGRAPH ON MANY DIFFERENT VIEWPOINTS
% Linear interpolation, however, is just the tip of
% At the heart of our paper is a hierarchy
%
% Let's return to the idea of incremental updating.
% If we start at an initial belief $\theta_0$,
% Our discussion of \eqref{eq:loglogiota} and Shafer's weight of evidence both



\commentout{
	\begin{figure}
	\centering
	\begin{tikzpicture}
		\begin{scope}[fill=gray,fill opacity=0.2,rounded corners=4px]
			\fill (0,0) rectangle (8,5); % URs (Full Updates)
			\fill[] (0.2,0.1) rectangle (7.8,4.5); % Flow URs (Flows)
			\fill[] (0.4,0.2) rectangle (7.6,4.0); % Diffble URs (Vec Field)
			\fill[] (0.6,0.3) rectangle (7.4,3.5); % Conservative URs
			\fill (0.8,0.4) -- (0.8, 3.0) -- (3, 3.0)
			 	to[out=0,in=0,looseness=2] (3,0.4) --cycle; % CONVEX
			\fill (7.2,0.4) -- (7.2, 3.0) -- (5, 3.0)
			 	to[out=180,in=180,looseness=2] (5,0.4) --cycle; % CONCAVE
		\end{scope}
		\begin{scope}[anchor=north]
			\node at (4.0, 5.0) {Update Rules};
			\node at (4.0, 4.5) {Flow URs~~~$f$};
			\node at (4.0, 4.0) {Diffble URs~~~$X$};
			% \node at (4.0, 3.5) {Conservative CFs~~~$\mathcal L$};
			\node at (4.0, 3.5) {Conservative CFs~~~$\mathcal L$};
			\node at (2.5, 3.0) {Convex CFs};
			\node at (4.0, 2.5) {Linear CFs};
			\node at (5.5, 3.0) {Concave CFs};
		\end{scope}
	\end{tikzpicture}
	\caption{%
		A map of different kinds of commitment functions and their representations.}
	\end{figure}
	}

\input{sections/map}

% The final paragraphs of
% The last part of
% \cref{ex:classifier} illustrates an important aspect of confidence
% Perhaps the most important application of confidence is in 
Perhaps the most important application of learner's confidence is  
% In principle, a key promise of learner's confidence is  
% to combine information from different sources with different degrees of trust.
% it allows us to update with with different degrees of trust.
% to treat different sources of information with different degrees of trust.
in treating different sources of information with different degrees of trust.
% As a result, one might imagine confidence to be relevant for sensor fusion: the problem of combining information from multiple different sensors (of varied reliability).
% The standard approach to sensor fusion is called a Kalman filter \parencite{kalman1960new,brown1997introduction}---and, indeed, comes with its own notion of confidence.
Sensor fusion, which aims to combine readings from multiple sensors of varying reliability, 
	is a clear example---%
	and Kalman filtering \citep{kalman1960new,brown1997introduction}, the standard approach to this problem, indeed comes with its own account of confidence.
% The standard approach to sensor fusion is called a Kalman filter \parencite{kalman1960new,brown1997introduction}---and, indeed, comes with an account of confidence.

\begin{example}[1D Kalman Filter]
	\label{ex:kalman1d}
\def\estx{\hat{x}}
Suppose we are modeling a
dynamical system whose state is a real number
$x \in \mathbb R$, and we receive
% where $H$ is a matrix relating observations to state,
% $\mat z = H \mat x + \boldsymbol\xi \in \mathbb R^m$ where $H \in \mathbb R^{m \times n}$ is models a linear relation between states and observations,
% $z$ which we assume are a linear function of $x$,
noisy measurements $z$ of $x$. 
% $z$ which we assume is the value of $x$
% % plus independent centered Gaussian noise of known variance $R$.
% plus Gaussian noise.
% $\mat z = H \mat x + \boldsymbol\xi$,
% where $H \in \mathbb R^{m \times n}$, called the \emph{observation matrix} is a linear function, and $\boldsymbol\xi \sim \mathcal N(0, R)$ models random noise
% (which we assume is drawn independently from Gaussian with mean zero and covariance $R$).
% Suppose further that we re
% In many engineering disciplines, the
% standard way to track this information is
% the \citeauthor{kalman1960new} filter [\citeyear{kalman1960new}].
The Kalman filter 
% (\citeyear{kalman1960new}) is the standard way to track this information in many engineering disciplines.
tells us how to track this information 
with a
% It prescribes
belief state $(\estx, \sigma^2)$,
where $\estx \in \mathbb R$ is our current estimate of
$x$, and
% $P \in (-\infty,\infty]^{n\times n}$,
% $P \in \mathbb R^{n\times n}$
$\sigma^2$ is an uncertainty in that estimate, in the form of a variance. 
% variance
% % (%
% % Intuitively, this corresponds to a belief
% % that
% % $\mat x \sim \mathcal N(\estx, P)$.
% % is normally distributed with mean $\estx$ and variance $P$.)
% % (Intuitively, this amounts to a belief that
% (effectively encoding the probabilistic belief
% $x \sim \mathcal N(\estx, \sigma^2)$%
% % is normally distributed with mean $\estx$ and variance $\sigma^2$.
% ).
% (symbolically: $\mat x \sim \mathcal N(\estx, P)$).
%
% Suppose we now receive an observation $\mat z$.
We now receive an observation
% How should we update these quantities in 
% response to an observation 
% $z = x + \xi$, from a sensor whose noise $\xi\sim \mathcal N(0, r^2)$ has known variance $r^2$.
$z \sim \mathcal N(x, r^2)$ from a sensor.
% For example, perhaps $\mat x = (x_1, x_2)$ is the location of an aircraft,
% and we have a sensor that
% % observes its first coordinate (plus noise), meaning that
% measures its first coordinate (plus noise $\xi$).
% This sensor's observation matrix
% $H$ then represents
% the map $(x_1, x_2) \mapsto x_1$, and we observe $\mat z = x_1 + \xi$.
How should we update our beliefs
% in response
% to $\mat z$
% to $z$
% to this new information?
\unskip?

The answer ranges from ignoring $z$ to replacing $\estx$ with it, depending on how much we trust the sensor.
%
The Kalman filter measures this trust with two (entangled) kinds of confidence: the precision $r^{-2}$ of the sensor, and a quantity $K$ called the \emph{Kalman gain}.
% Using them, the updated state
The updated state
% From them, the updated beliefs
% The updated beliefs
$(\estx', {\sigma^{2\prime}})$
% $(\estx', {\sigma'^{2}})$
 % as follows:
is then:
% is
% we also have an objective quantity on which to base
% our assessment of the
\begin{align*}
	\estx' &= \estx + K (z - \estx)
	,
        %  = (1-K) \estx + (K)\,z;
		% \\
		% &
	% &
	&
	% \text{and~~}
	\sigma^{2\prime} &= (1 - K)^2 \sigma^2 + K^2 r^2.
    % = (I - KH) P,
    % \begin{pmatrix}
    %     \estx \\ P
    % \end{pmatrix}
    % &\gets
    % \begin{pmatrix}
    %     \estx + K (\mat z - H \estx) \\
    %      (I - K H)^{\sf T} P (I - K H) + K R K^{\sf T}
    % \end{pmatrix}
	% \\
	% &\text{where}~~ K = P H^{\sf T} (HPH^{\sf T} + R)^{-1}
\end{align*}
% we now argue that $K$
% We argue that it measures confidence in $\mat z$.
% can be veiewed as a measuring our confidence in the observation.
% acts as a ``blending factor''
% Because of the first update equation, $K$
% Often introduced as a ``blending factor'', $K$ is
% not so different from
% Observe the similarity between $K$ and 
% $\alpha$ in \cref{ex:prob-simple}:
% Much like other measures of confidence, $K$
Like the other confidence measures we have seen, $K$
interpolates (linearly) between our prior mean $\estx$ and the new observation $z$, and (``quadratically'') between our prior uncertainty $\sigma^2$ and the sensor variance $r^2$.
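
As a quick numerical illustration (the specific values are hypothetical, chosen only for convenience), suppose that $(\estx, \sigma^2) = (10, 4)$ and that we observe $z = 12$ from a sensor with $r^2 = 1$.
Three choices of $K$ then give:
\begin{align*}
	K &= 0: & \estx' &= 10, & \sigma^{2\prime} &= 4; \\
	K &= \tfrac12: & \estx' &= 10 + \tfrac12 (12 - 10) = 11, & \sigma^{2\prime} &= \tfrac14 \cdot 4 + \tfrac14 \cdot 1 = 1.25; \\
	K &= 1: & \estx' &= 12, & \sigma^{2\prime} &= 1.
\end{align*}
The first choice leaves the belief state untouched, the last adopts the reading together with the sensor's variance, and the intermediate choice lands between the two means.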

% Unlike the previous examples,
% Much more directly than in our previous examples,
More than in previous examples, we can also say something prescriptive about how to select a degree of confidence.
% More than in previous examples, we can also prescribe how best to select a degree of confidence.
% This is made possible by three assumptions:
% If we assume that
% If we assume that
% \commentout{
% \begin{itemize}
%     \item
%     % We know the stochastic process by which observations $\mat z$ are generated,
%     % including a number quantifying the reliability of observations (i.e., the variance $R$ of the noise $\boldsymbol\xi$ added to observations).
%     % We know how observations $z$ are generated.
%     We already have access to an objective quantification
%     of the reliability of our observation $z$,
%     through the variance $r^2$ of the noise $\xi$.
% \item
%     We would like to to select $K$ so as to minimize
%     the uncertainty in our posterior beliefs, which happens to be
%     the mean square error of our estimate $\estx$.
% \end{itemize}
% }%
% (1) we can objectively quantify the reliability of the sensor,
% (2) $z$ is independent of $\hat x$ given $x$, and that
% (3) we want to minimize uncertainty in our posterior beliefs (the mean squared error of $\estx$),
If the goal is to maintain an unbiased estimate of $x$ with minimal uncertainty (as measured by the expected squared error of $\estx$), 
and if $z$ is indeed the result of adding independent noise to $x$, 
% 
% Under these assumptions,
then the optimal Kalman gain is
% $K = P H^{\sf T} (HPH^{\sf T} + R)^{-1}$
% \begin{equation}\label{eq:opt-K}
$
    K_{\mathrm{opt}}
        = \ifrac{\sigma^2}{(\sigma^2 + r^2)}
    % K = P H^{\sf T} (HPH^{\sf T} + R)^{-1}
$
% \end{equation}
\parencite[p. 146]{brown1997introduction},
% and can be seen as the fraction of the variance of
% $K_{\mathrm{opt}}$ is the fraction of the total uncertainy
and $K$ is typically chosen this way in practice
\parencite{kalmanfilter.net}.
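To see where this value comes from, here is a sketch of the standard calculation, taking the variance update above at face value (which relies on the noise in $z$ being independent of the error in $\estx$).
The posterior variance is a convex quadratic in $K$, so its minimizer is found by setting the derivative to zero:
\[
	\frac{\mathrm d}{\mathrm d K}\Big[ (1-K)^2 \sigma^2 + K^2 r^2 \Big]
		= -2 (1-K) \sigma^2 + 2 K r^2 = 0
	\quad\iff\quad
	K = \frac{\sigma^2}{\sigma^2 + r^2}.
\]
Substituting this value back into the update equations (and assuming $\sigma^2, r^2 > 0$ for the last identity) gives
\[
	\estx' = \frac{r^2\, \estx + \sigma^2 z}{\sigma^2 + r^2},
	\qquad
	\sigma^{2\prime} = \frac{\sigma^2 r^2}{\sigma^2 + r^2},
	\qquad\text{or equivalently}\qquad
	\frac{1}{\sigma^{2\prime}} = \frac{1}{\sigma^2} + \frac{1}{r^2}.
\]
In words: under the optimal gain, precisions add, and the new estimate is the average of $\estx$ and $z$ weighted by their respective precisions.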
% In principle any matrix is possible,but really
% the $(i,j)^{\text{th}}$ entry of $K$ .
% , in the sense that the posterior beliefs will minimize
% the expected mean square error
% Like before, this measure of confidence can be measured as lying between two extremes.
\commentout{
Plugging this value into the update equations, we find that this choice
makes $\estx'$ the average of our prior $\estx$ and new observation $z$,
weighted by their respective variances.
}%
%
% With this in mind, let's revisit what happens extreme values of confidence.
% Having made this choice, let's return to the extremes. 
Let us now revisit the extremes. 
If $K = 0$, which
    % is prescribed by \eqref{eq:opt-K}
    is optimal
when $z$ has unbounded variance,
    the belief state remains unchanged:
intuitively, there is so much noise in observations that
    we ignore them.
%
At the other extreme, if no noise is added ($r^2=0$),
then $K_{\mathrm{opt}} = 1$ and we end up with a posterior $(z, 0)$
based solely on the new observation.
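
The same formulas suggest a small sketch of sensor fusion.
Suppose (as an assumption made only for this illustration) that two sensors with variances $r_1^2, r_2^2 > 0$ report readings $z_1$ and $z_2$ of the same, unchanged state $x$, that $\sigma^2 > 0$, and that each reading is incorporated using its optimal gain.
Substituting $K_{\mathrm{opt}}$ into the update equations shows that a single optimal update satisfies
$1/\sigma^{2\prime} = 1/\sigma^2 + 1/r^2$ and
$\estx'/\sigma^{2\prime} = \estx/\sigma^2 + z/r^2$.
Applying this twice yields a belief state $(\estx'', \sigma^{2\prime\prime})$ with
\[
	\frac{1}{\sigma^{2\prime\prime}} = \frac{1}{\sigma^2} + \frac{1}{r_1^2} + \frac{1}{r_2^2},
	\qquad
	\frac{\estx''}{\sigma^{2\prime\prime}} = \frac{\estx}{\sigma^2} + \frac{z_1}{r_1^2} + \frac{z_2}{r_2^2}.
\]
Both expressions are symmetric in the two readings, so with these gains the final belief state does not depend on which sensor is processed first.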
\end{example}


% The general case of Kalman filters for multidimensional state and observation, which we treat in \cref{ex:kalman-general}, is similar in spirit, but more involved.
	% and we must develop our mathematical apparatus somewhat before it fits cleanly into the picture we've laid out so far.
% \Cref{ex:kalman1d} features three distinct kinds of (un)certainty:
\Cref{ex:kalman1d} features three kinds of (un)certainty:
\begin{enumerate}[left=0.1em,nosep,parsep=\parskip]
\item \textbf{Learner's Confidence:} a subjective degree of trust 
	% in an observation which tells how seriously to take it in updating ($K$)
	in an observation, which determines how seriously to take it when updating (e.g., $K$)%
        .
% $K$, a subjective confidence in the observation $z$,  which tells how seriously to take it in updating;
        % a feature of what one knows about where $\mat z$ comes from.
        % \label{item:kalman-gain}
    \label{item:learn-conf}

\item \textbf{Internal (Epistemic) Confidence:}
        the degree of uncertainty present in a given belief state,
        either overall ($\sigma^2$)
        or in a given statement
        (e.g., the density 
			$
			\phi \mapsto 
			\mathcal N(\phi|\hat x, \sigma^2)$).
    % $\sigma^2$, a subjective uncertainty in the current estimate $\hat x$, a feature of the current belief state;
	Internal confidences in our other examples
        include the probability $\Pr(\phi)$ in
        \cref{ex:prob-simple}, the degree
        of belief $\Bel(\phi)$ in \cref{ex:shafer},
        and the value of the loss function $\mathcal L(\theta,\phi)$
        used to train the classifier in \cref{ex:classifier}.
	\label{item:epistemic-conf}

\item
    \textbf{Statistical (Aleatoric) Confidence:}
    an objective measure of the (un)reliability of an observation,
    % based on historical information about related
    based on historical data and/or modeling assumptions about how
    observations arise
    (e.g., the noise level $r^2$)%
    .
    % $r^2$, an objective (un)reliablility of the observation $z$ (as measured by variance), a feature of the environment.
    % \label{item:measnoise}
    \label{item:stat-conf}
\end{enumerate}
The three senses of the word ``confidence'' are related,
% but are quite different.
% but play very different roles.
    but different in nature.
A great deal of work has already gone into understanding the differences between senses \ref{item:epistemic-conf} and \ref{item:stat-conf} \citep{der2009aleatory,hullermeier2021aleatoric}.
We (obviously) focus on sense \ref{item:learn-conf},
% Our examples have focused on
% quantities like \ref{item:kalman-gain},
% which we call confidences,
% and how they relate prior beliefs to posterior ones.
% and how learning uses this kind of confidence
% to posterior beliefs.
% to update beliefs.
%
% We have tried to contrast learner's confidence (sense \ref{item:learn-conf})
which we have tried to distinguish from
the more pervasive usage of the word (sense \ref{item:epistemic-conf})
to quantify subjective likelihood, degree of belief, or (un)certainty.
% We will later see
% But they are related; such values may be thought of aggregated confidences
    % of past observations.
    % We explore the connections between them in the coming sections.
%	The two notions of confidence (senses \ref{item:learn-conf} and \ref{item:epistemic-conf}) are deeply related:
% Nevertheless,
% At risk of introducing confusion,  
Nevertheless,
    epistemic confidences (sense \ref{item:epistemic-conf}) may be thought of as aggregate
    reflections of learner's confidence (sense \ref{item:learn-conf}) in past observations;
	% and we will soon see further points of contact.
	% while there is a canonical way of turning a measure of epistemic confidence into a learner that takes a learner 's confidence as input.
	conversely, it is often possible to define learner's confidence by its effect on epistemic confidence
	(\cref{sec:loss-repr}).
    % ; such values may be thought of aggregated confidences
    % of past observations.
% We explore further connections in more detail in the coming sections.

% We explore the connections between them in the coming sections.
% Prescriptions for how to select a numerical value for confidence
% are often functions of quantities such as \ref{item:measnoise}.

One should also distinguish  
	learner's confidence (sense \ref{item:learn-conf}),
	at least in principle,
    from statistical confidences
    (sense \ref{item:stat-conf})
    such as the variance in readings
    of a sensor (\cref{ex:kalman1d})
    or annotator agreement (\cref{ex:classifier}).
% Under certain assumptions of independence, the two can
% Knowledge about
% When readily available,
When available,
    % how reliable a sensor
    the statistical reliability of an information source
    % should certainly inform how confident we in a given reading.
    should absolutely play a role in determining how seriously we take
    it in updating our beliefs;
	% ---but it can be difficult to come by.
%
\vnew{
learner's confidence informed exclusively by a probabilistic model can be seen as an important (``aleatoric'') special case of our theory.
%
Still, statistical confidence often presupposes that observations are drawn (independently) from a (fixed) distribution, while learner's confidence is meaningful even without such assumptions.
%
}
\commentout{%
We may not always know the variances of our sensors, and that such a quantity is well-defined is a significant assumption on its own.
Statistical confidences typically require us to know that observations are drawn independently from a fixed distribution, while learner's confidence can be meaningful even without this assumption.}

% We hope these examples have persuaded readers that confidence is ubiquitous and
% We hope that these examples have convinced the reader that confidence is ubiquitous, and given an intuitive sense of it.
% We hope that these examples have given the reader an intuitive sense of what confidence is, how ubiquitous it is, and why it is important.
% What's more, it also has a clean mathematical theory.
% But we have yet to say anything profound about it.
% We also find there to be a worthwhile underlying mathematical theory of confidence,
% which we develop in the coming sections.
% The rest of this paper is develops the theory of confidence,
%     by characterizing it axiomatically, and exploring what can be said of it in general.
% We develop
 % explores what can be said of it in general.


% We hope these examples have persuaded readers that confidence is ubiquitous and
% We hope that these examples have convinced the reader that confidence is ubiquitous, and given an intuitive sense of it.

\textbf{Contributions.}
We hope that these examples have given the reader an intuitive sense of what confidence is, how ubiquitously it arises, and why it is important.
% What's more, it also has a clean mathematical theory.
% We now argue that it
% But we have yet to say anything profound about it.
% We also find there to be a worthwhile underlying mathematical theory of confidence,
% which we develop in the coming sections.
% The rest of this paper develops the general theory of learner's confidence.
	% and exploring what can be said of it in general.
	%
% We classify confidence functions, and develop three 
% We develop
 % explores what can be said of it in general.
% We will explore the 
%
% We have already motivated the notion of confidence and illustrated some of its characteristic properties.
In the remainder of the paper, we study confidence more formally,
	making a series of successively stronger assumptions
	(all satisfied by \cref{ex:prob-simple,ex:shafer,ex:classifier,ex:kalman1d}).
Each set of assumptions enables a new, more compact representation of
	a learning rule, summarized in \cref{fig:map}.
%
In \cref{sec:formalism}, we develop a formal framework 
	laying out axioms for our notion of confidence.
% We describe cannonical representations of confidence.
In \cref{sec:conf-continuum}, we focus on the properties of confidence in a continuum, developing 
	vector-field and loss-based representations of learners. 
% In \cref{sec:Bayes} we return to a
%
% We show that confidence can be measured in several equivalent ways, and classify the ways that 
This can enable simultaneous orderless updates, even in settings where this was not previously possible. 
Finally, we analyze Bayesian updating in \cref{sec:Bayes}.

% In \cref{sec:loss-repr}, we demonstrate that it is typically possible to get an even more compact representation of the updating process, by representing the vector field implicitly as gradients of some ``loss function''.


 % Once we have the formalism fully in place, we give further examples of how confidence works in exponential families, in particular showing how Kalman gain and inverse variance can be viewed as confidence as well.


 \commentout{
 This general idea can be cleaned up by appeal to differential geometry.
 Fix an input $\phi$.
 Assuming that the update paths are differentiable in the degree of confidence at any initial beliefs, the collection of updates with infinitesimal confidence forms a complete vector field $X_\phi$ over the space of beliefs, whose integral curves are paths in belief space, parameterized by confidence $\beta \in [0,\infty]$.
 % Of course, we may always convert this number back to $[0,1]$,
 We step through this more carefully in \cref{sec:field-repr}.

 %joe1*: NO!  This is not the place to bring up Reimannian metrics!
 Finally, if our belief space is endowed with a Riemannian metric, so that we may take gradients, partial update functions may be specified by a loss.}



\commentout{
	\subsection{Other Conceptions of Confidence.}

	\textbf{Probability.}
	% Probability is a numerical scale that ranges from untenable (0) to undeniable (1).
	% No number on this scale is truly neutral.
	% One of the biggest shortcomings of probability is its inability to represent a truly neutral attitude towards a proposition.
	Some people do use ``confidence'' to mean the same thing as probability. When they say they have low confidence in $\phi$, they mean that they think $\phi$ is unlikely.

	One of the biggest shortcomings of probability is its inability to represent a truly neutral attitude towards a proposition.
	%  probability of $\frac12$ .
	% This shortcoming has perhaps been the primary selling point of many alternatives to probabiltiy, such as Dempster-Shafer Belief functions.
	A value of $\frac12$ may be equally far from zero as it is from one, but is by no means a neutral assessment in all cases: hearing that your favored candidate has a 50\% chance of winning is big news if a win was previously thought to be inevitable.
	For this reason, telling someone the odds are 50/50 is quite different from saying you have no idea.
	% By contrast, zero confidence represents a truly neutral stance; a statement with zero confidence has no effect.
	By contrast, zero confidence represents something truly neutral:
		a statement made with zero confidence does not stake out a claim, and
		a statement received with zero confidence does not affect the recipient's beliefs.
	Nevertheless, we will see that, in some contexts, confidences correspond to probabilities.

	\textit{Opacity.} To use a graphical metaphor, think of certainty as black or white.
	Probability describes shades of gray, while confidence describes opacity.
	If we are painting with black and start with a white canvas, there is a precise correspondence between the opacity and the resulting shade of gray.

	\textbf{Upper and Lower Probabilities.}
	Upper and lower probabilities can describe a neutral attitude towards a proposition, but they are not really a specification of trust, but rather a direct specification of a belief state.
	It isn't immediately clear how to use these representations of uncertainty to update, and they're a little too complex to function effectively as the primitive measure of trust that we're after.


	\textbf{Shafer's Weight of Evidence.}
	Shafer's ``weight of evidence'' is precisely the same concept we have in mind.
	Our analysis reduces precisely to his in the setting where belief states are Belief functions (which generalize probabilities, but not, say, neural network weights), and observations are events.
	% This paper can be a generalization of Shafer's ``weight of evidence'' to a broader class of settings, where one might have very different belief states and observations.
	Thus, this paper can be viewed as generalizing this concept to a broader class of settings, without requiring that one adopt Shafer's conception of a belief state or an observation.


	\textbf{Variance and Entropy.}
	The inverse of variance, sometimes known as precision,
		is also commonly used to measure confidence.
	If a sensor is unreliable and can give a range of answers, the variance of the estimate is a very common way of quantifying this unreliability.
	If measurements have zero variance, in some sense one has absolute confidence ($\top$) in the sensor. If measurements have infinite variance, then in some sense one has no confidence in the sensor, since individual samples convey no information about the true value of the quantity measured.
	As with probability, inverse variance will coincide with confidence in some settings; we will see how in \cref{sec:variance}.

	Entropy, like variance, is a standard way of measuring uncertainty, and in some settings, confidence coincides with entropy (see \cref{sec:entropy}).
	The assumption underlying both approaches is that there is some ``true'' value of the variable, and that the randomness is epistemic (due to sensor errors) rather than aleatoric (inherent in the quantity being measured).

	\textbf{Confidence Intervals and Error Bars.}
	Another notion of the word ``confidence'' comes from the term ``confidence interval''.
	This concept arises in settings involving a probability distribution $\Pr(X)$ over a metric space $X$, typically $X = \mathbb R$.
	A 95\% confidence interval is the (largest) ball containing 95\% of the probability, and its size is a geometric measurement of how .
	This intuition behind this reading of the word confidence is the same as
}
