	\label{sec:formalism}

Our formalism consists of three components: 
	a domain $\confdom$ of confidence values,
	a space $\Theta$ of belief states, and
	a language $\Phi$ of possible observations. 
For instance:
% For example, the learning settings in our examples are
\begin{itemize}[nosep,itemsep=1pt,left=0.5em]
    \item In \cref{ex:prob-simple}, $\Theta$ is the set of probability
    measures on some measurable space $(\Omega, \mathcal F)$,
    $\Phi$ is the $\sigma$-algebra $\mathcal F$, and the confidence domain
    is $[0,1]$.
    % Then
    % the context is
    % \[
    %     \Big(\Delta \Omega,~
    %         \mathcal F,~
    %          (A,\alpha,\mu) \mapsto (1-\alpha)\mu + \alpha (\mu|A)
    %     \Big).
    % \]
    \item In \cref{ex:shafer}, $\Theta$ is the set of belief functions
    over a finite set $W$, $\Phi = 2^W$ is the set of subsets of $W$,
    % and there are variants for both $[0,1]$ or $[0,\infty]$.
    and confidence is a degree of support $\alpha \in [0,1]$
    or a weight of evidence $w \in [0,\infty]$.
    \item In \cref{ex:classifier}, 
	% $\Theta = \overline{\mathbb R}^d$ is
	$\Theta \subseteq \bar{\mathbb R}^d$ is
    the space of network parameters, $\Phi = X \times Y$ is the space of 
    input-label pairs, and the confidence domain is the
	 	extended natural numbers $\{0, 1,\ldots, \infty\}$ under addition.
    % In \cref{sec:smooth-completion} we will extend this example to to continuous confidences in $[0,\infty]$.
	%
	\item In \cref{ex:kalman1d}, $\Theta = \Phi = \mathbb R$, 
		% $\confdom = [0,1]$ is the domain of $K$. 
		The domain of $K$ is $[0,1]$, and the domain of $\sigma^2$ is $[\infty, 0]$. 
		Together, the pair $(K, \sigma^2)$ acts as a confidence. 
\end{itemize}
We call $(\Theta, \Phi, \confdom)$ a \emph{learning setting}.
	% and we will be interested in studying the learners for a given setting. 
% Before we put them all together 
% (in \cref{sec:learning-setting-funcs})
In this setting, a \emph{learner}
 % learning
% setting $(\Theta, \Phi, \confdom)$
is a function
\commentout{\unskip\footnote{%
	% we can just as easily handle randomized updates;
	It should be straightforward to extend our theory
	so as to handle randomized updates as well;
	the point is that
	% the update can be prescribed by an algorithm.
	the belief state, observation, and confidence must together contain
	enough information to describe the updating process.
}}
% \[
$
	\Lrn : \Phi \times \confdom \times \Theta \to \Theta
$
% \]
% that must satisfy certain axioms.
% First, we want to ignore untrusted information, i.e.,
% used to incorporate new observations into the belief state.
that describes the belief update process.
Explicitly: from a prior belief $\theta$, and a statement $\phi$
	observed with some degree of confidence $\chi$,
	% it
	a learner 
	produces a posterior belief state
	$\Lrn(\phi,\chi,\theta) \in \Theta$.
We use superscripts and subscripts to
% partially specify
fix some arguments
of $\Lrn$ and view it as a function of the others. So
$\Lrn(\phi,\chi,\theta)$
can equivalently be written as
$
	\Lrn_\phi(\chi,\theta) = \Lrn^\chi_\phi(\theta)
	= \Lrn^\chi(\phi,\theta) = \Lrn_{(\theta,\phi)}(\chi)
		 % = \Lrn^\chi_\theta(\phi)
		 .
$
% We will also impose axioms to ensure 
% Most of this section is devoted to 
The rest of \cref{sec:formalism} develops axioms for $\Lrn$ and supporting concepts intended to capture intuitions about learning.

% Before we get there, 
% to describe the learning settings in the introduction, 
% But we do so in pieces ,
We proceed in three stages.
% starting with two fragments of the theory that isolate some important mathematical properties of confidence. 
We begin with an abstract theory of confidence domains $\confdom$ themselves
	(\cref{ssec:confdom});
% next, we  theory of \emph{commitment functions}, which involve $\Theta$ and describe the update process (\cref{ssec:comm-func}). 
% next, we axiomatize confidence-based updates (\cref{ssec:comm-func}) 
we then axiomatize confidence-based updates to beliefs in $\Theta$ (\cref{ssec:comm-func}).
	% with \emph{commitment functions}.
Finally, we bring in observations $\Phi$ 
	and the function $\Lrn$
	 (\cref{ssec:full-learn}).


% \subsection{A Formal Model}
% \subsection{Abstract Confidence Functions and Domains}
% \paragraph{Abstract Confidence Domains.}
\subsection{Abstract Confidence Domains}
	\label{ssec:confdom}
A \emph{confidence domain} $(D, \le, \bot, \top, \cseq, \mathfrak g)$
% is an ordered set $D$
is a set $D$
of confidence values 
equipped with a preorder $\le$,
a 
% is an ordered set $D$ of confidence values, 
% $D$ has a 
% special least element
least element
$\bot$ (``no confidence''), a greatest element
$\top$ (``full confidence''),
and an operation $\cseq$ that combines two 
independent 
degrees of confidence.
% into a single one. 
% of confidence with a single degree of confidence.
We often abbreviate a confidence domain as $D = \confdom$,
leaving $\le$ and $\cseq$ implicit.
%
% The intuition that $\cseq$ represents \emph{independent} combination
% 	suggests that it should be commutative and associative.
\commentout{%
Because $\cseq$ represents \emph{independent} combination,
	we require that it be commutative and associative. 
}%
% We want to ignore independent untrusted information,
% and complete trust combined with another independent confidence
% remains complete trust.
% Furthermore, combining some degree of confidence $\chi$ with an independent no-confidence should have no effect on $\chi$, and combining with an independent.
We want to ignore independent information we have no confidence in, and, if already fully confident, remain so in the face of new independent information. 
Formally, this amounts to requiring, for all $\chi,\chi',\chi'' \in D$:
\def\rightdescmargin{0.2cm}
\begin{itemize}[parsep=0pt,itemsep=1pt,label={},left=\rightdescmargin]
% \begin{CDaxioms}[parsep=0pt,itemsep=1pt]
\item $(\chi \cseq \chi') \cseq \chi'' = \chi \cseq (\chi' \cseq \chi'')$
    \hfill (associativity)$\mathrlap{,}$
		\hspace{\rightdescmargin}\;\;
% \item $\chi \cseq \chi' = \chi' \cseq \chi$
%     \hfill(commutativity)$\mathrlap{,}$
% 		\hspace{\rightdescmargin}\;\;
\item $\bot \cseq \chi = \chi$
    \hfill (that $\bot$ is neutral)$\mathrlap{,}$
		\hspace{\rightdescmargin}\;\;
\item $\top \cseq \chi = \top$\;
    \hfill (and that $\top$ is absorbing)$\mathrlap{.}$
		\hspace{\rightdescmargin}\;\;
% \end{CDaxioms}
\end{itemize}
% We typically also assume that $D$ comes equipped with a topology or a differentiable structure.
Finally, $D$ comes with geometric information $\mathfrak g$, which may 
	include topology or differentiable structure.
	% , or a Riemannian metric. 
		% \unskip\footnote{see \cref{appendix:geometry} for a }
		% (At this point, the reader only needs to know is that these are increasingly specific notions of geometry, but we provide a short review in \cref{appendix:geometry} for completeness.)
% Two confidence domains are particularly common.
% We now formally introduce two
% Our work will focus on two
We are especially interested in two continuous domains
	from our examples.
 	% particularly important confidence domains 
	% 	describing a continuum of confidence,
	% 	both of which appear in our examples.
% Many of our examples use
The first is the \emph{fractional domain} $[0,1]$,
whose elements $s \in [0,1]$ represent 
	the ``proportion of the way towards complete trust''.
If you go proportion $s$ towards fully trusting something,
then $s'$ of the remaining way, then overall
you have gone
% $\alpha \cseq \alpha'
% := \alpha + \alpha' (1-\alpha) = 
% \alpha + \alpha' - \alpha \cdot \alpha'$
$s \cseq s'
:= s + s' (1-s) = 
s + s' - s \cdot s'$
    of the way to complete trust.
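As a quick sanity check (a worked illustration only, not an additional assumption), this operation can be rewritten in terms of the distance that remains to complete trust:
\[
	1 - (s \cseq s') = 1 - s - s' + s \cdot s' = (1-s)(1-s').
\]
Since multiplication on $[0,1]$ is associative, has unit $1$, and has absorbing element $0$, it follows that $\cseq$ is associative, that $\bot = 0$ is neutral, and that $\top = 1$ is absorbing.
For example, $\tfrac12 \cseq \tfrac12 = 1 - \tfrac14 = \tfrac34$.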
The other confidence domain of particular interest
 	is the \emph{additive domain} $[0,\infty]$, in which confidences combine by addition;
	it is well suited to analogies with time and weight.
    % is roughly the amount of time you spend updating your beliefs.
% The the two are isomorphic,
	% and have a particularly rich theory.
	% and underly much of what we call ```confidence''.
	
\begin{linked}{prop}{az-iso}
	The fractional domain $[0,1]$ and the additive domain $[0,\infty]$ are isomorphic.
	Furthermore, the space of isomorphisms between them 
		% (and hence the space of automorphisms of each)
		is
	 	% naturally with $\beta \in (0,\infty)$, according to
		% itself isomorphic to the set of real numbers.
		in natural bijection with $(0,\infty)$.
	Specifically, for each $\beta \in (0,\infty)$, there is
		an isomorphism $\varphi_\beta : [0,1] \to [0,\infty]$ given by
	% \[
	% [0,1] \ni s \xmapsto - \log(1-s) 
	% \begin{array}{ccc}
	% 	[0,1] \ni s  &\mapsto& -\frac1\beta \log(1-s) \\
	% 		1- \exp(-\beta t) &\mapsfrom& t \in [0,\infty]
	% \end{array} 
	% \]
	%
	%%%%% FULL WIDTH VERSION %%%%%
	% \[
	% [0,1] \ni s = 1-e^{-\beta t} = \varphi_\beta^{-1}(t)
	% 	\qquad\text{and}\qquad
	% 	 \varphi_\beta(s)= -\frac1\beta \log(1-s) = t \in [0,\infty]
	% \]
	%
	$\varphi_\beta(s) = -\frac1\beta \log(1-s)$
	with inverse $\varphi_\beta^{-1}(t) = 1- e^{-\beta t}$.
\end{linked}
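
To see why $\varphi_\beta$ respects combination, here is a short sketch. It assumes that confidences in the additive domain combine by addition (with $t + \infty = \infty$) and uses the convention $-\log 0 = \infty$.
Since $1 - (s \cseq s') = (1-s)(1-s')$ in the fractional domain,
\[
	\varphi_\beta(s \cseq s')
	= -\tfrac1\beta \log\big( (1-s)(1-s') \big)
	= -\tfrac1\beta \log(1-s) - \tfrac1\beta \log(1-s')
	= \varphi_\beta(s) + \varphi_\beta(s').
\]
Moreover, $\varphi_\beta(0) = 0$ and $\varphi_\beta(1) = \infty$, and $\varphi_\beta$ is strictly increasing, so it also preserves $\bot$, $\top$, and the order.
Because $\varphi_\beta = \tfrac1\beta \varphi_1$, changing $\beta$ only rescales the additive side.
For instance, with $\beta = 1$ we get $\varphi_1(\tfrac12) = \log 2$, so two independent confidences of $\tfrac12$ combine to $2 \log 2 = \log 4$, which maps back to $1 - e^{-\log 4} = \tfrac34$.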

\vnew{%
The fact that these two domains are equivalent but only up to $\beta$---a ``choice of units'' in the additive domain, or ``tempering'' in the fractional domain---%
% is key fact that creates a free parameter in the definition of weight of evidence \cite{shafer1976mathematical} to the irrelevant constant in analysis of PAC learning bounds, to the choice, and demonstrates that 
implies that many standard ways of quantifying confidence are equivalent, yet also highlights the fundamental difficulty of doing so in absolute terms (as we began to see at the end of \cref{ex:shafer}).

Keep in mind that there are other confidence domains as well.
The interval $[0,1]$ with $\cseq = \max$ is an important one that is not isomorphic to the additive or fractional domains. 
Confidence domains can also be multi-dimensional or discrete---but our results in \cref{sec:conf-continuum,sec:Bayes} say little about these cases.
% Confidence can lie in a discrete set, 
}%

%%% it's false that a total order imposes a unique confidence domain.
%%% so too is it false that 1d implies uniqueness. Think \max, or a graded sequence of updates. 
% \begin{prop}
% 	% Up to isomorphism $[0,\infty]$ is the
% 	There is a unique confidence domain 
% \end{prop}
%
% \paragraph{Commitment Functions.} 
% Let $\Theta$ be a set, possibly equipped with a topology and/or differentiable structure. A function $f: \confdom \times \Theta \to \Theta$ is
% a $\confdom$-\emph{commitment function} for $\Theta$ 
% if it satisfies


	% ;
    % these confidence domains have a particularly rich theory, which we develop
    % in \cref{sec:smooth}.

	% \subsection{Learning Settings and Update Functions}
	% We now introduce the critical missing components: beliefs and observations.
	% \paragraph{Learning Settings and Learners.}
	\subsection{Belief States and Commitment Functions}
	% \subsection{Modeling Belief: State and Commitment Functions}
		\label{ssec:comm-func}
	% \paragraph{Learning Functions.}
	% \paragraph{Learning Functions.}
		\label{sec:learning-setting-funcs}
	%
	%
	% A \emph{learning setting} $(\Theta, \Phi, \confdom, \Lrn)$
	%     is an updating context together with a function
	% An \emph{update function}
We now reintroduce belief states $\theta \in \Theta$
	% which come with a topology or a differentiable structure,
	% for the purpose of characterizing how confidence effects belief updates. 
	% so as to describe how confidence affects learning.
% so that we can talk about the effect of confidence on how we update our beliefs. 
in order to describe the role of confidence in belief updating.
	%%
%
%
% In this section, we describe the essential properties of of confidence in terms of functions $\Lrn_\phi : \confdom \times \Theta \to \Theta$. Many of the most important aspects of confidence can be characterized purely in terms of such functions, keeping $\phi$ entirely abstract. We call a function of this type a \emph{commitment function}, if it is obeys certain axioms.
Observations $\phi$ come later (\cref{ssec:full-learn});
we find that the most essential aspects of confidence can already be understood through the behavior of a function $F = \Lrn_\phi : \confdom \times \Theta \to \Theta$ that describes the learning process for some fixed and abstract $\phi$.
We call such a function $F$ a \emph{commitment function} if it obeys the axioms in this subsection (\cref{ax:zero,ax:combinativity,ax:cont-and-smooth,ax:acyclic,ax:seq-for-more}) intended to ensure that $F$ respects the structure of the confidence domain.
% A function $f : \confdom \times \Theta \to \Theta$,
% 	which one should think of as $\Lrn_\phi$ for some fixed $\phi$ whose nature is irrelevant, satisfying the axioms of this section 
% 	is called a \emph{commitment function}. 
% However, for 
% We consider properties
%
% 
% In order to capture our intuition of producing posterior beliefs,
% $\Lrn$ must also satisfy certain axioms.
% First, we want to ignore 
% 	% untrusted information.
% 	information in which we have no confidence.

\textbf{No Confidence.}
Having no confidence ($\chi=\bot$) in an observation $\phi$ should lead us to ignore it. 
% First, learning  ($\bot$) in an observation means we should ignore it. 

\begin{LrnAxioms}
		% [nosep,itemsep=1pt]
    \item
		% [NC]
		% [LRN1]
        % [(zero)]
		$
		\forall \phi,\theta.\quad
		% \forall \theta.\quad
		\Lrn_\phi^\bot(\theta) = 
		\Lrn_\phi(\bot, \theta) = \theta
		% f^\bot(\theta) = \theta
		$.
		% i.e., 
		% $ = \mathrm{id}_\Theta$
        % \hfill (ignore untrusted info)
        \label{ax:zero}
\end{LrnAxioms}


% \paragraph{Full-confidence updates}
\textbf{Full Confidence.}
% Next, we investigate the opposite extreme. %: learning with full confidence. 
% {\color{red}
Since the purpose of
% $F^1_\phi$
$\Lrn^\top_\phi$
is to \emph{fully} incorporate $\phi$ into our beliefs,
two successive full-confidence updates with the same information ought to have the same effect as a single one:
having fully integrated $\phi$ into our beliefs, 
% observing $\phi$ again requires no further alteration.
there is nothing to do upon observing $\phi$ again.

% In this case, we call $F$ an \emph{update rule}, or more precisely, a \emph{$\Theta$-update rule on $\Phi$}, and insist that

%joe3: This is problematic, because the issue of succiessive updates is, in general, problematic. You haven’t discussed it at all. Indeed, you haven’t even hinted that there’s an issue.
%oli3: It's actually not problematic for full confidence. That's why
% it never gets discussed when people talk about conditioning. But I
% have added a section discussing this later. 
%oli3: removing definition to shorten presentation; can equally well
% do this in a more compressed way, using just "full-confidence 
% update"
\commentout{
\begin{defn}
	% A \emph{full-confidence ($\Theta$-)update rule} (for $\Phi$) is
	A \emph{full-confidence update rule} is
	a mapping $P: \Phi \times \Theta \to \Theta$ such that
	for all $\phi \in \Phi$, 
	$P_\phi = (\theta \mapsto P(\phi,\theta)): \Theta \to \Theta$ is idempotent.
	That is,	
	$P_\phi(P_\phi(\theta)) = P_\phi(\theta)$
	 for all $\phi\in\Phi$ and $\theta \in \Theta$.
\end{defn}}


\begin{LrnAxioms}
	\item[FC]
	Full-confidence updates are idempotent.
	% For all $\phi \in \Phi$, the update $F^\top_\phi$ is idempotent.
    % That is, for all $\phi \in \Phi$,  $F^1_\phi \circ F^1_\phi = F^1_\phi$.
    That is, for all $\phi \in \Phi$,  $\Lrn^\top_\phi \circ \Lrn^\top_\phi = \Lrn^\top_\phi$.
    % That is, for all $\phi \in \Phi$ and $\theta \in \Theta$,  $F^1_\phi \circ F^1_\phi = F^1_\phi$.
    % (i.e., $F^1_\phi \circ F^1_\phi = F_\phi$).
	% Full-confidence updates are idempotent. 
	% Or,
	% equivalently,
	% % $F^1 = (\phi, \theta) \mapsto F(\phi,1,\theta): \Phi \times \Theta \to \Theta$ is a full-confidence
	% $F^\top = (\phi, \theta) \mapsto F(\phi,\top,\theta): \Phi \times \Theta \to \Theta$ is a full-confidence
	% update rule.
	\label{ax:idemp}
\end{LrnAxioms}

% In curried form, $F : \Phi \to (\Theta \to \Theta)$.

% We now proceed with the formal details.
% \textbf{Update Rules.}
% Consider a space $\Theta$
% of possible belief states,
% and a set $\Phi$ of statements.
% % and a set $\Phi$ of ``statements'', i.e., the things one can have confidence in.
% % An \emph{update rule} (or more precisely, a \emph{$\Theta$-updating rule on $\Phi$})
% An \emph{update rule}, or more precisely, a \emph{$\Theta$-update rule on $\Phi$},
% is a function of the form
% \[
%     % F :  (\mathbb R \times \Phi) \to \Big( \Theta \to \Theta \Big)
%     F :  \Phi \to \Big( \Theta \to \Theta \Big)
% \]
% % which describes how to update beliefs about $X$, with the new information, at a certain level of trust.
% which describes how to (fully) update beliefs $\Theta$ with new information $\Phi$.
% and for $F$ to be an update rule, we require that , meaning that updating any belief with $\phi$ twice in a row is equivalent to single update.
% Having said that, one reading of this paper is a relaxation of this requirement.
% Here are some examples.
Once $\Theta$, $\Phi$, and any relevant relationships between them are specified, there is often a natural choice of full-confidence update rule.
% We illustrate with three different choices of $\Phi$,
We illustrate with three examples. 
In each case, let the space of belief states $\Theta := \Delta W$ be the set of all probability distributions over a finite set
 $W
  % = \{w_1, \ldots, w_n\}
  $ of possible worlds.

\begin{enumerate}[wide, label=\textit{(\arabic*)},itemsep=0.05ex,topsep=0pt,labelindent={1em}]
	\item %\textbf{Conditioning.}
	\textbf{Conditioning.}
	First, consider the case where observations are events, i.e., $\Phi := 2^W$.
	The overwhelmingly standard way to update is to condition: 
	% \[
	% \begin{aligned}
	% 	(-) \smash{\,\Big|\,} (\;\cdot\;) : \qquad 2^W &\to (\Delta W \to \Delta W) \\
	% 	A  &\mapsto (  ~\mu~~ \mapsto \mu \mid A ~),
	% \end{aligned}
	% \]
	% where $(\mu \mid A)(x) = \frac{\mu(\{x\})}{\mu(A)}$
	% in which learning $A$ maps
	% where the action of the conditional measure $\mu\mid A$ is given by $(\mu \mid A) \{w\} = \ifrac{\mu\{w\}}{\mu(A)}$.
	% where the action of the conditional measure $\mu\mid A$ is given by $(\mu \mid A)(B) = \ifrac{\mu(B \cap A)}{\mu(A)}$, provided $\mu(A) > 0$,
	starting with $P \in \Delta W$, the conditional measure 
	$P|A \in \Delta W$ is given by $(P|A)(B) = \ifrac{P(B \cap A)}{P(A)}$, provided $P(A) > 0$.
	Note that $(P|A)|A = P|A$, so the update is idempotent.
	% and may be defined arbitrarily otherwise.
	% and otherwise is just equal to $\mu$.
	% and is otherwise undefined, although for completeness
 	% equal to $\mu$.
	% Observe:
	% \begin{itemize}[nosep, leftmargin=1.2em]
	% 	\item Provided $\mu(A) > 0$, then $(\mu\mid A) \mid A = \mu \mid A$, so conditioning is a full-confidence update.
	% 	\item If $\mu(A \cap B) > 0$, then $(\mu \mid A) \mid B = \mu \mid (A \cap B) = (\mu \mid B) \mid A$, so the order that information is recieved does not matter (so long as it is consistent with one's beliefs).
	% \end{itemize}
	\commentout{%
	There are well-known issues with conditioning $P$ on $A$ when
	$P(A) = 0$, 
	and so typically this operation is left undefined. 
	}%
	% To satisfy \cref{ax:funcform,ax:idemp}, the result must either
	% be $\mu$ itself or 
	% give probability 1 to $A$.

	\item
	\textbf{Imaging }\parencite{lewis1976probabilities}\textbf{.}
	% A second example of an update rule is the ``imaging'' 
	% approach of David Lewis \parencite{lewis1976probabilities}.
	% Our second example is the ``imaging''
	% approach of \textcite{lewis1976probabilities}.
	% , albeit in very different notation.
	% Once again, consider a finite set $W$, and belief states $\Theta := \Delta W$.
	% Suppose the same setup as before, except that
	Suppose
	% , for some set $\Phi$, that
 	we already have a full-confidence update rule
	$f : \Phi \times W \to W$
	that, 
	 given $\phi \in \Phi$ and $w \in W$, produces the world $f(\phi, w) \in W$ ``most similar to $w$, in which $\phi$ is true'' \parencite{gardenfors1979imaging}.
	Idempotence of $f_\phi: W \to W$
	% amounts to the (very reasonable) requirement that the world most similar to $f_\phi w$ in which $\phi$ is true, is $f_\phi w$ itself.
	means that the world most similar to $f(\phi,w)$ in which $\phi$ is true is $f(\phi,w)$ itself.
	% From $f$, we can 
	% This allows us to 
	We can then 
	% construct a full confidence update rule for $\Delta W$
	lift $f$ to a full-confidence update rule for $\Delta W$,
	% with the pushforward	ide
	% with the pushward
	by
	$%
	% \[
    	% \begin{aligned}
    		% F_\phi(\mu) &:=
    		F(\phi, P) 
				% &:=
				% f^{\sharp}(\mu)
    			% &= A \mapsto \mu( f^{-1}_\phi( A ))\\
    			% = A \mapsto 
				(A) := 
				P(\{w : f(\phi, w){ \in} A\})
    	% \end{aligned}
		% \qquad
		% \qquad
		% \begin{tikzpicture}[center base]
		% 	\node[dpad0] (W) {$W$};
		% 	\node[dpad0, right=1 of W] (W') {$W$};
		% 	\node[dpad0, below right=0.2 and 0.2 of W] (Phi) {$\Phi$};
		% 	\mergearr{W}{Phi}{W'}
		% 	\node[above=1pt of center-WPhiW']{$f$};
		% 	\draw[arr2, <-] (W) to node[above]{$\mu$} ++(-1, 0);
		% 	\draw[arr2, <<-] (Phi) to node[below]{$\phi$} ++(-1.3, 0);
		% 	\draw[arr2, <-, dashed, gray] (W') to node[above]{$F_\phi(\mu)$} ++(2, 0);
		% \end{tikzpicture}
	% \]
	$,
	intuitively moving the mass of $w$ to
	 % $f_\phi w$
	$f(\phi,w)$.	
	%$f_\phi w$, the world closest to $w$ in which $\phi$ is true.
	% is the pushforward measure of $\mu$ through $f_\phi$, which Lewis calls the ``image of $\mu$ on $\phi$''
	% And, since $f$ is idempotent, $F$ will be as well.
	Since $f$ is idempotent, so is $F$.


	\commentout{
	\item More generally, consider a measurable space $\mathcal W = (W, \mathcal A)$, where $W$ is a set and $\mathcal A$ is a $\sigma$-algebra over $W$, and let $\mathcal F \subset \mathcal A$ be closed under supersets in $\mathcal A$.
	% Now, let $\Theta$ be the set of conditional probabili$

	\TODO[Properly Use Conditional Probability Measure, to define on all events]

	Conditioning a probability distribution $\mu \in \Delta\X$ on an event $A \in \mathcal A$ also makes sense in this more general measure-theoretic setting, at least so long as $\mu(A) > 0$, and is given by
	% the Lebesgue integral
	% \[
	$$
		% (\mu \mid A) (B) = \frac{1}{\mu(A)} \int \mathbf 1_{B}(x)  \mathrm d\mu(x)
		(\mu \mid A) (B) = \frac{\mu(B \cap A)}{\mu(A)}
	$$
	}


	\item
	\textbf{Jeffrey's Rule.}
	% Once more, suppose that $W$ is a finite set and $\Theta := \Delta W$.
	% Next, consider a more general form of observation, in which observations themselves are probabilities.
	% Next, consider a more general form of observation, in which observations themselves are probabilities.
	% Our final example is a way in which 
	% 	people have historically tried to augment conditioning 
	% 	to allow for uncertain observations. 
	% Recall form the introduciton
	% that Jefrey's Rule, a widely used generalization of
	% conditioning
	% Both of the previous approaches establish a single event
	The two previous approaches to updating establish an event with probability 1.
	Jeffrey's rule ($\mathit J$) addresses this limitation
		by allowing for uncertain (i.e., probabilistic) observations.
	% But Jeffrey updates are still full-confidence.
	%
	% Suppose observations themselves are probabilities.
	Formally, let $\Phi$ be the set of pairs $(X,\pi)$
	% Formally, suppose $\Phi$ consists of marginal distributions $\pi(X)$
	where $X : W {\to} S$ is a random variable taking values in a set $S$,
	% (i.e., some function of $W$),
	and $\pi \in \Delta S$ is a probability on
	% the possible values that $X$ can take.
	$S$.
	Jeffrey's update rule is:
	% The rule is then:
	% Jeffrey's rule prescribes the posterior
	$
	% \begin{align*}
		% \mathrm{Jeffrey}_{(X,\pi)}
		% \mathrm{Jeffrey}_{\pi(X)}
		% \mathrm{J}_{\pi(\mskip-2muX\mskip-2mu)}
		% {J}_{(X,\pi)}(
		{J}((X,\pi),
			P) := \sum_{x \in S} \pi(X{=}x)  P \big|
            (X{=}x).
            % \{ w : X(w) = x \}
			% \\
			% &= A \mapsto \sum_{x \in S} \pi(X{=}x)\, \mu( A \mid X \!= x)
	% \end{align*}
	$
	%
	When $\pi$ places all mass on some $x \in S$, $\mathit J$ conditions on $X {=} x$.
	% and so it is sometimes thought to generalize conditioning so that 
	% absolute certainty is no longer necessary.
	For this reason, $\mathit J$ is thought to
		generalize conditioning 
		to observations of ``lower confidence''.
	 % but for other choices of $\pi(X)$,
	% For this reason, Jeffrey's Rule is sometimes often thought of as a generalization of conditioning that admits for less that complete certainty.
	Yet even when $\pi$ is not deterministic, $J$ \emph{fully} incorporates
	% $J_{(X,\pi)}$ 
	% $J$
	% is idempotent;
	% also, 
	$\pi$ into the posterior beliefs:
	% since the marginal of $J((X,\pi),\mu)$ on $X$ is $\pi$,
	the marginal of $J((X,\pi),P)$ on $X$ is $\pi(X)$,
	and the prior belief 
	% about $X$
	$P(X)$ has been destroyed.
	Indeed, $J_{(X,\pi)}$ is idempotent. 
	% Therefore $J$ is still a full-confidence update rule---just 
	% % one that handles observations that can be uncertain.
	% one that handles a different kind of observation.
	% So $J$ still makes updates with full-confidence---it just handles
	% 	observations that do not indicate high probability.
	%
	Therefore, $J$ still establishes observations with full confidence---%
		it's just that those observations are probabilities.
	%
	% This is another historical conflation between
	% We attribute this to a conflation between
	% We imagine that this mismatch has contributed to
	% the historical conflation between confidence and likelihood.
	%
	% We imagine that this mismatch is a result of a historical conflation of confidence with likelihood.
	% We submit that this mismatch is a result of a historical conflation of confidence with likelihood.
	% We have seen many struggle with this concept
	%
	% We submit that this (often counter-intuitive) behavior is clarified enormously by a concept of confidence distinct from likelihood.
	Experience suggests that this point can be counter-intuitive; we submit that the confusion is clarified by a conception of confidence distinct from likelihood.
	% We submit that this (often counter-intuitive) behavior is clarified enormously by a concept of confidence distinct from likelihood.
	%  that this mismatch is a result of a historical conflation of confidence with likelihood.
	% Let $\mu' := J_{\pi(X)}(\mu)$ be the result of applying Jeffrey's rule for $(X,\pi)$ to $\mu$.
	% % then $\pi$ will be fully incorporated (that is, $\mu'(X) = \pi(X)$),
	% Note that $\mu'(X) = \pi(X)$, so $\pi(X)$ has been fully incorporated into $\mu'$, while all information about the old prior belief about $X$ has been destroyed by the update.
\end{enumerate}
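
The idempotence claims in the three examples above can be checked directly.
The following is a sketch under the stated positivity assumptions; the shorthands $f_\phi := f(\phi, \cdot)$ and $P' := J((X,\pi),P)$ are introduced here only for this verification.

For conditioning, if $P(A) > 0$ then $(P|A)(A) = 1$, so for every $B \subseteq W$,
\[
	\big((P|A)|A\big)(B)
	= \frac{(P|A)(B \cap A)}{(P|A)(A)}
	= \frac{P(B \cap A)}{P(A)}
	= (P|A)(B).
\]
For imaging, $F(\phi,P)(A) = P(f_\phi^{-1}(A))$, and $f_\phi \circ f_\phi = f_\phi$ gives
\[
	F(\phi, F(\phi,P))(A)
	= P\big(f_\phi^{-1}(f_\phi^{-1}(A))\big)
	= P\big((f_\phi \circ f_\phi)^{-1}(A)\big)
	= P\big(f_\phi^{-1}(A)\big)
	= F(\phi,P)(A).
\]
For Jeffrey's rule, assume that $P(X{=}x) > 0$ whenever $\pi(X{=}x) > 0$, and let the sum range over such $x$.
Because the events $X{=}x$ are disjoint, for every $B \subseteq W$ we have
\[
	P'\big(B \cap (X{=}x)\big) = \pi(X{=}x)\, \big(P \big| (X{=}x)\big)(B),
	\qquad\text{and in particular}\qquad
	P'(X{=}x) = \pi(X{=}x).
\]
Dividing the first identity by the second shows that $P' \big| (X{=}x) = P \big| (X{=}x)$, and therefore
\[
	J((X,\pi), P')
	= \sum_{x} \pi(X{=}x)\, P' \big| (X{=}x)
	= \sum_{x} \pi(X{=}x)\, P \big| (X{=}x)
	= P'.
\]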

\vnew{%

}

\cref{ax:idemp} implies that full-confidence updates are generally not invertible (an idempotent map is invertible only if it is the identity): they destroy information in the prior, often making for a simpler posterior. 
This potential simplification of future calculations is a major benefit of fully trusting information.
% We will soon see another major benefit: full-confidence updates 
However, full-confidence updates are extreme.
An agent that updates by conditioning, for instance,
% is permanently commited to believing everything it ever learns with perfect certainty,
permanently commits to believing everything it ever learns
(and thus gains nothing from making the same observation again later). 
% and gains nothing from making the same observation twice.
 % (\cref{ax:idemp})
% Humans don't work this way. The effectiveness of flash cards as a learning tool demonstrates this clearly: if we were using an update rule, two cycles through a deck of flash cards would be no different from one.
Clearly humans are not like this; revisiting information
 	helps us learn \parencite{ausubel1965effect}.
Similarly, artificial neural networks are trained with
 	many incremental updates, and benefit from seeing 
	the training data many times.
% Indeed, this is one biggest differences between modern machine learning techniques and  older rule-based ones: modern algorithms update parameters little-by-little, rather than fully incorporating input information.
% Once an agent that uses conditioning incorporates $A$, it is forever committed to believing $A$, and as a side effect, there is no point to making
We would like an account that allows for less extreme belief alterations,
in which information is only partially incorporated.
This is the role of intermediate degrees of confidence.
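
As a concrete illustration, consider a hypothetical learner on $\Theta = \Delta W$ and $\Phi = 2^W$ with confidence in the fractional domain $[0,1]$.
The mixture form below is an assumption made for the sake of the example, not something derived from the axioms:
\[
	\Lrn(A, s, P) := (1-s)\, P + s\, (P|A),
	\qquad\text{provided } P(A) > 0.
\]
At $s = 0$ this returns $P$ unchanged, and at $s = 1$ it is conditioning, which is idempotent.
For intermediate $s$, write $P_s := \Lrn(A,s,P)$.
On subsets of $A$, $P_s$ is a positive multiple of $P$, so $P_s | A = P | A$, and a second observation of $A$ with confidence $s'$ yields
\[
	\Lrn(A, s', P_s)
	= (1-s')(1-s)\, P + \big(s(1-s') + s'\big)\, (P|A)
	= \Lrn(A, s \cseq s', P),
\]
because $s(1-s') + s' = s + s' - s \cdot s' = s \cseq s'$.
So for $0 < s < 1$, repeating the observation moves the beliefs further towards $P|A$, whereas at full confidence a repetition changes nothing.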


% Our axioms already have some implications for it. 
% Our axioms so far already tell us 
% Because $\Lrn$ reflects the order structure of $\confdom$ (\cref{ax:ineq-witness})
% 	and $\top$ is the largest element, it follows that 
% 	for all $\chi$, there is some $\chi'$ such that 
% 	$\Lrn_\phi^{\chi'}$

% although it will turn out that the mathematical study of the objects at hand is equivalent if we simply take the approach above (\cref{theorem:}). 




% Next, for geometry. 
\textbf{Geometry.}
A learner's confidence interpolates between 
	ignoring new information and fully deferring to it,
	% thus, we would like the path of intermediate confidences to 
	and we would like that interpolation to be continuous and differentiable.
	
\begin{LrnAxioms}
	\item
	If $\confdom$ and $\Theta$ are both topological spaces, then 
	for all $\theta$ and $\phi$, 
	the map
	$
	\Lrn_{(\theta,\phi)} = 
	\chi \mapsto 
	\Lrn(\phi,\chi,\theta)
	$
	is continuous.
	If $\confdom$ and $\Theta$ are both manifolds, then 
	$\Lrn_{(\theta,\phi)}$ is 
	differentiable---%
	% twice
	% differentiable.
	and also $\Lrn_\phi^\chi$ is differentiable on a subset $\Theta_\phi$ defined in \cref{prop:maximal-continuous-theta} below.%
		\label{ax:cont-and-smooth}
\end{LrnAxioms}


Ideally the posterior would be continuous in our prior beliefs as well.
This suggests
% similar priors typically result in similar posterior beliefs.
% This would allow us to strengthen \cref{ax:cont} to something simpler:
% This suggests
a simpler strengthening of \cref{ax:cont-and-smooth}:
% \begin{LrnAxioms}[nosep]
% 	\item
% 	[L{\the\numexpr\value{LrnAxiomsi}\relax}${^\prime}$]
% 	$\Lrn_\phi :\confdom \times \Theta \to \Theta$ 
% 	is continuous
% 		(resp. differentiable)
% 	for all $\phi \in \Phi$.
% 	\label{ax:cont-strong}
% \end{LrnAxioms}
that
% $\Lrn_\phi :\confdom \times \Theta \to \Theta$ 
$\Lrn_\phi$ 
also be continuous (and differentiable) as a function of $(\chi,\theta)$%
%
% Unfortunately,
% % \cref{ax:cont-strong} 
% that is too strong to handle our examples at full confidence.
% In the probabilistic case, for instance:
---yet this is often too much to ask for.

% Axiom \cref{ax:cont-strong} says more---it says that the posterior 
% belief is also continuous in the prior beliefs, 
% which also seems appropriate. But this assumption has significant bite.
% \commentout{%
% actually, this shows a problem with defining 
%
% \begin{example}
% 	Again let $W$ be a finite set, and choose disjoint non-empty subsets
% 	$A, B \subset W$ with $A \cap B = \emptyset$.
% 	Let $p\ne q$ be two distinct distributions over $W$ supported
% 	on $A$, and $d$ be one suppoerted on $B$. Now, consider 
% 	% and consider a sequence $(\mu_i)_{i \in \mathbb N}$ of positive probability 
% 	% distributions over $W$ whose limit 
% 	% is the point mass $\delta_w$ on a particular world $w \in W$.
% 	% $\mu^*$ has support $A \subsetneq W$ (i.e., $\mu^*(A)=1$).
% 	the two sequences of probability distributions
% 	\[
% 		\Big(p_n= (1-e^{-n}) d + (e^{-n}) p \Big)_{n \in \mathbb N}
% 		,
% 		\qquad
% 		\Big(q_n = (1-e^{-n}) d + (e^{-n}) q \Big)_{n \in \mathbb N},
% 	\]
% 	both of which have limit $d$. But every $p_n | A = p$ while every $q_n |A = q$, so
% 	now \cref{ax:cont-strong} implies that 	
% \end{example}%
\begin{linked}{prop}{no-continuous-condition-ext}
	% % There is no continuous extension of conditioning to a function
	% % $F$ satisfying \cref{ax:cont-strong}.
	% There is no extension of conditioning that satisfies \cref{ax:cont-strong}.
	% %
	% % That is, if $(W, \mathcal F)$ is a measurable space $\Phi = \mathcal F$,
	% % and $\Theta$ consists of all probability measures on $(W, \mathcal F)$, then
	% % there is no continous function $F : $
	% % In particular, 
	% That is,
	% % if $\Theta = \Delta W$ and $\phi\subset W$ is an event,
	% for $\phi\subsetneq W$,
	% there is no continuous function
	% $F_\phi : \Delta W \times [0,1] \to \Delta W$
	% such that $F_\phi(\mu, 1) = \mu|\phi$ when $\mu(\phi) > 0$. 
	% % nor even one whose restriction to  $ [0,\epsilon) \times \Phi \times \Theta \to \Theta$
	% % is continuous, for $\epsilon > 0$.
	Take $\Theta = \Delta W$ and $\phi \subsetneq W$.
	There exists no continuous function $\Lrn_\phi : [0,1] \times \Delta W \to \Delta W$ 
	with the property that $\Lrn_\phi(1, \mu) = \mu|\phi$ whenever $\mu(\phi) > 0$. 
\end{linked}
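
To see where continuity fails, here is a sketch under extra assumptions that are ours rather than part of the statement: $W$ is finite, $\phi$ contains two distinct points $a \ne a'$, and some point $b \in W$ lies outside $\phi$.
Consider the two sequences
\[
	\mu_n = \big(1-\tfrac1n\big)\,\delta_b + \tfrac1n\,\delta_a,
	\qquad
	\nu_n = \big(1-\tfrac1n\big)\,\delta_b + \tfrac1n\,\delta_{a'},
\]
both of which converge to $\delta_b$ and give $\phi$ positive probability $\tfrac1n$.
Conditioning discards the mass on $b$, so $\mu_n|\phi = \delta_a$ and $\nu_n|\phi = \delta_{a'}$ for every $n$.
A continuous full-confidence update $\Lrn_\phi^{1}$ that agrees with conditioning would therefore have to satisfy
\[
	\delta_a = \lim_{n\to\infty} \Lrn_\phi^{1}(\mu_n) = \Lrn_\phi^{1}(\delta_b) = \lim_{n\to\infty} \Lrn_\phi^{1}(\nu_n) = \delta_{a'},
\]
which is impossible since $a \ne a'$.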

% % This is a consequence of the fact that there's no 
% % continuous extension of conditioning that handles
% % observations of events that have probability zero.
% %
% Intuitively, though, this is just an edge case; we can still get continuity
% if we never observe an event we believe has probability zero. 
% % So, rather than insist that updates are always continuous in our priors, 
% % Rather than insist that updates always be continuous in our priors,  we simply take note of a set of priors for which 
% Rather than insisting on this stronger axiom or giving up on it entirely, we can get something in between 
% % with a proposition instead of an axiom
% % with a proposition in place of the axiom: 
% % learning a particular $\phi$ is continuous. 
% with the following definition.

This result is yet another perspective on the familiar difficulties with conditioning on events of probability zero
	\citep{},
% Nevertheless, one can def
but intuitively this should be an edge case.
Instead of imposing an axiom, 
	we observe that it is possible to 
	capture the phenomenon in a useful way even at this abstract level. 

\begin{linked}{prop}{maximal-continuous-theta}
	% For all $\phi \in \Phi$,
	% there is a maximal open set $\Theta_\phi \subseteq \Theta$ such that
	% the restriction 
	%
	% Given $\phi \in \Phi$, 
	% let $\Theta_\phi \subseteq \Theta$ be the maximal set 
	For all $\phi \in \Phi$, 
	there is a maximal open set 
	$\Theta_\phi \subseteq \Theta$ such that
	the restriction
	$
	% F_{\phi} |_{\Theta_\phi} : 
	\Lrn_{\phi} |_{\Theta_\phi} : 
		% [0,1) \times \Theta_\phi \to \Theta
		[\bot,\!\top) \times \Theta_\phi \to \Theta
	$		
	of 
	% $F_\phi$
	$\Lrn_\phi$
	to $\Theta_\phi$ is continuous. 	
\end{linked}
% \begin{linked}{defn}
% 	Let $\Theta_\phi$ 
% \end{linked}
In our examples, $\Theta_\phi$ consists of those
belief states that do not flatly contradict $\phi$.
In \cref{ex:prob-simple}, \cref{prop:no-continuous-condition-ext,prop:maximal-continuous-theta}
imply that $\Theta_\phi = \{ \mu \in \Delta W : \mu(\phi) > 0\}$
is the set of distributions for which conditioning on $\phi$ is defined.
% \footnote{
% For those familiar with the basic anatomy of an ML system: 
% in \cref{ex:classifier}, if $\phi=(x,y)$, then $\Theta_{\phi}$ is the set of weights for which the gradients $\nabla_{\theta}\ell(f_\theta(x), y)$ of the loss function $\ell$ are finite.
In \cref{ex:classifier}, $\Theta_{(x,y)}$ is the set of parameters at which gradients $\nabla_{\theta}\ell(f_\theta(x), y)$ of the loss $\ell$ are finite.
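
As a sanity check on the claim about \cref{ex:prob-simple}, suppose (this specific form is our assumption about that example) that $W$ is finite and the update is the mixture $\Lrn_\phi^\alpha(\mu) = (1-\alpha)\mu + \alpha\,(\mu|\phi)$.
For any event $B \subseteq W$,
\[
	\Lrn_\phi^\alpha(\mu)(B) = (1-\alpha)\,\mu(B) + \alpha\,\frac{\mu(B \cap \phi)}{\mu(\phi)},
\]
which is jointly continuous in $(\alpha,\mu)$ wherever the denominator is nonzero.
The set $\{\mu : \mu(\phi) > 0\}$ is open, being the preimage of the relatively open set $(0,1] \subseteq [0,1]$ under the continuous map $\mu \mapsto \mu(\phi)$, so it is a natural candidate for $\Theta_\phi$ in this setting.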
 % }


\textbf{Order.}
% {\color{red}%[clunky; rewrite]
% For learning, lower confidence means a more conservative update. 
% The characterstic feature of $\chi <  \chi'$ is that
% The meaning of 
For a learner, the defining feature of
	the ordering $\chi <  \chi'$ is that
% From the learner's perpsective, the meaning of $\chi \le \chi'$ is that
% To say one degree of confidence is smaller than the other ($\chi \le \chi'$) means that
learning with higher confidence ($\chi'$) can be done by first 
making the more conservative, lower-confidence ($\chi$) update, followed by a nontrivial residual update.
% updating with confidence $\chi$ is more conservative than updating with $\chi'$,
%  	in the following sense:
%  	% one can get the effect of learning $\phi$ with higher confidence by first learning with lower confidence, and then making a second update with some residual degree of confidence.
% 	the effect of learning $\phi$ with higher confidence can be achieved by first learning with lower confidence, then updating with some residual confidence.
% }

\begin{LrnAxioms}[nosep]
	\item 
	% There exists a continuous function $s(\chi_2, \chi_1)$ defined on pairs for which $\chi_2 \ge \chi_1$, such that $\Lrn_\phi(s(\chi_2,\chi_1),\Lrn_\phi(\chi_1,\theta)) = \Lrn_\phi(\chi_2,\theta)$.
	 $\exists s : \{ (\chi', \chi) : \chi' > \chi \} \to \confdom$ continuous such that $\Lrn_\phi(s(\chi',\chi),\Lrn_\phi(\chi,\theta)) = \Lrn_\phi(\chi',\theta)$.
	 \label{ax:ineq-witness}
	 \label{ax:seq-for-more}
	\commentout{%%%%% This axiom turns out to be the point we have to strengthen.
	\item
	% When $\chi \le \chi'$, there exists some $\chi'' \in \confdom$ such that
	% $\Lrn_\phi^{\chi} \circ $
	% $\forall \phi, \theta,\chi,\chi'.\quad$
	$\forall \theta,\chi,\chi'.\quad$
	$\chi < \chi'$ 
        % if and only if\\
        $\quad\implies$ \\
        % there exists $\chi'' \le \chi'$ such that
        \phantom{a}$\quad
		%  \exists \chi'' < \chi'.~$
		 \exists \chi''\!.\, \bot {<} \chi'' {\le} \chi' \text{ and}$
		%%% version with f
		% $f^{\chi''} \circ f^\chi = f^{\chi'}$.
		%%% version with sub and superscripts
		% $\Lrn_\phi{\chi''} \circ \Lrn_\phi^\chi = \Lrn_\phi^{\chi'}$.
		%%% version with subscripts and \theta	
		$\Lrn_\phi^{\chi''} \!{\circ}\, \Lrn_\phi^\chi (\theta) = \Lrn_\phi^{\chi'}\!(\theta)$.
		%%% version with subscripts and no \theta
		% $\Lrn_\phi^{\chi''} \circ \Lrn_\phi^\chi = \Lrn_\phi^{\chi'}$.
 		%%% version with parens
		% $\Lrn(\phi, \chi'', \Lrn(\phi, \chi, \theta))  = \Lrn(\phi, \chi', \theta)$.
		% \!\!\!\!
        \label{ax:ineq-witness}
		\label{ax:seq-for-more}
	}%
	\commentout{%
	\item[\cref*{ax:seq-for-more}$^{<}$]
	% When $\chi \le \chi'$, there exists some $\chi'' \in \confdom$ such that
	% $\Lrn_\phi^{\chi} \circ $
	$\forall \phi, \theta,\chi,\chi'.\quad$
	$\chi < \chi'$ 
        $\quad\iff\quad$
        $\exists \chi'' < \chi'.~~$
		$\Lrn_\phi^{\chi''} \circ \Lrn_\phi^\chi (\theta) = \Lrn_\phi^{\chi'}(\theta)$.
		\label{ax:ineq-witness-strict}
		\label{ax:seq-for-more-strict}
	}%
\end{LrnAxioms}
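
For a concrete witness $s$, assume (as a sketch; the specific form is our assumption about \cref{ex:prob-simple}) the mixture update $\Lrn_\phi(\alpha,\mu) = (1-\alpha)\mu + \alpha\,(\mu|\phi)$ with $\mu(\phi) > 0$, and try the candidate residual
\[
	s(\alpha',\alpha) = \frac{\alpha'-\alpha}{1-\alpha},
	\qquad\text{so that}\qquad
	1 - s(\alpha',\alpha) = \frac{1-\alpha'}{1-\alpha}.
\]
Writing $\nu = \Lrn_\phi(\alpha,\mu)$ and $s = s(\alpha',\alpha)$, one checks that $\nu|\phi = \mu|\phi$, and hence
\[
	\Lrn_\phi(s,\nu) = (1-s)(1-\alpha)\,\mu + \big(1-(1-s)(1-\alpha)\big)\,(\mu|\phi) = (1-\alpha')\,\mu + \alpha'\,(\mu|\phi) = \Lrn_\phi(\alpha',\mu).
\]
On the domain $\{(\alpha',\alpha) : 1 \ge \alpha' > \alpha \ge 0\}$ the denominator $1-\alpha$ is positive, so this $s$ is continuous and strictly positive there.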

Furthermore, learning is not cyclic: if learning with confidence $\chi_0$ and learning with confidence $\chi_1$ have the same effect, then so does learning with every confidence $\chi$ between them, i.e., with $\chi_0 \le \chi \le \chi_1$.

\begin{LrnAxioms}
	\item If $\chi_0 \le \chi \le \chi_1$ and $\Lrn_\phi(\chi_0, \theta) = \Lrn_\phi(\chi_1, \theta)$, then $\Lrn_\phi(\chi,\theta) = \Lrn_\phi(\chi_0, \theta)$.
		\label{ax:acyclic}
\end{LrnAxioms}
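
In the mixture form $\Lrn_\phi(\alpha,\mu) = (1-\alpha)\mu + \alpha\,(\mu|\phi)$ (our assumed reading of \cref{ex:prob-simple}), this axiom can be checked directly.
If $\alpha_0 < \alpha_1$ give the same posterior, then subtracting the two updates yields
\[
	(\alpha_1 - \alpha_0)\,\big(\mu - \mu|\phi\big) = 0,
\]
so $\mu = \mu|\phi$, and hence $\Lrn_\phi(\alpha,\mu) = \mu$ for every $\alpha$, in particular for those between $\alpha_0$ and $\alpha_1$.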
% \cref{ax:idemp} and \cref{ax:acyclic} together imply that $\top$ is absorbing had we not already assumed it. 

\textbf{Independent Combination.}
% \color{orange}[MOVEME]
$\Lrn$ should be used to incorporate information to the extent that it is novel,
i.e., information that is not already accounted for in our prior beliefs.
% \TODO[INDEPENDENCE DISCUSSION]
%
Thus, we would like a sequence of two independent
observations of the same statement $\phi$ to
be equivalent to a single observation of $\phi$ 
with their combined degree of confidence.
% ; see
 % \cref{ssec:indep-shafer} for further discussion of this matter.

\begin{LrnAxioms}[nosep]
	\item 
	% $\forall \phi, \theta,\chi,\chi'.\quad$
	$\!\forall \phi, \theta, \chi,\chi'.~
	% $
	% $\forall \theta,\chi,\chi'.\quad$
	% $\forall \chi,\chi'.\quad$
	%%% version with scripts and no $\theta$
	% $\Lrn_\phi^\chi \circ \Lrn_\phi^{\chi'} = \Lrn_\phi^{\chi\cseq\chi'}$
	%%% version with parens
	% $\Lrn(\phi, \chi, \Lrn(\phi, \chi', \theta)) =  \Lrn(\phi, \textbf{}\chi\cseq\chi',\theta)$.
	%%% version with \phi subscript
    % $
	\Lrn_\phi(\chi, \Lrn_\phi(\chi', \theta)) =  \Lrn_\phi( \chi\cseq\chi',\theta)$%
	%%% version with f
	% $f^\chi \circ f^{\chi'} = f^{\chi\cseq\chi'}$
        \label{ax:combinativity}%
\end{LrnAxioms}
% 
% In the language of algebra, \cref{ax:zero} and \ref{ax:combinativity} (and \cref{ax:cont-and-smooth}) together require $\Lrn_\phi$ to be a (differentiable) {action} of the monoid $(\confdom, \cseq, \bot)$ on $\Theta$.
\vnew{%
% \cref{ax:combinativity} is our strongest axiom. 
\cref{ax:combinativity} appears to be a rather strong assumption. 
% \cref{ax:combinativity} implies \cref{ax:zero} (since $\bot$ is a neutral element with respect to $\cseq$) and \cref{ax:idemp} (since $\top$ is absorbing), for example.
Since $\top$ is absorbing,
for example, \cref{ax:combinativity}
implies
\cref{ax:idemp}.
% In the language of algebra, \cref{ax:zero,ax:combinativity} require $\Lrn_\phi$ to be an {action} of the monoid $(\confdom, \cseq, \bot)$ on $\Theta$.
In the language of algebra, \cref{ax:zero} and \ref{ax:combinativity} (and \cref{ax:cont-and-smooth}) together require $\Lrn_\phi$ to be a (smooth) {action} of the monoid $(\confdom, \cseq, \bot)$ on $\Theta$.
}%
%
However, if we are free to choose the confidence domain, \cref{ax:combinativity} imposes no other restrictions on $\Lrn$ 
	(see \cref{prop:free-additivity} in the appendix). 
It is also easy to verify that the confidences $\alpha$ and $n$ of \cref{ex:prob-simple,ex:shafer,ex:classifier} satisfy \cref{ax:combinativity}.
% For a specific confidence domain, 
% For our two favorite confidence domains, 
Nevertheless, for the canonical domains $[0,1]$ and $[0,\infty]$, 
% we will soon see that
\cref{ax:combinativity} is indeed a strong assumption.
	% it implies \cref{ax:zero} (since $\bot$ is neutral) and \cref{ax:idemp} (since $\top$ is absorbing).
		% and \cref{ax:seq-for-more-strict}.
In fact, of the confidences in \cref{ex:kalman1d}, neither $K$ nor $\sigma^2$ alone satisfies \cref{ax:combinativity} out of the box. However, the sensor precision $\sigma^{-2}$ does when $K = K_{\text{opt}}$ is the optimal gain, 
	 and the pair $(K,\sigma^2)$ can be combined into a single domain satisfying \cref{ax:combinativity}, as we show in the appendix.
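
To illustrate the claim for $\alpha$, assume (our reading of \cref{ex:prob-simple}, not a definition taken from it) the mixture update $\Lrn_\phi(\alpha,\mu) = (1-\alpha)\mu + \alpha\,(\mu|\phi)$ with $\mu(\phi) > 0$.
Since the first update does not change the conditional $\mu|\phi$, composing two updates gives
\[
	\Lrn_\phi\big(\alpha, \Lrn_\phi(\alpha',\mu)\big)
	= (1-\alpha)(1-\alpha')\,\mu + \big(1-(1-\alpha)(1-\alpha')\big)\,(\mu|\phi),
\]
so \cref{ax:combinativity} holds in this sketch with $\alpha \cseq \alpha' = \alpha + \alpha' - \alpha\alpha'$.
Under the reparametrization $w = -\log(1-\alpha) \in [0,\infty]$, the same operation becomes addition, $w \cseq w' = w + w'$, which is one way to relate the two canonical domains.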
%
%
% This says nothing about how to combine confidences in different
% observations, nor about the effect of $\phi'$ between the two updates.
\commentout{\color{gray}%
Keep in mind that \cref{ax:combinativity} applies only for two
	observations of the same statement $\phi$.
}%
% A reader with a background in algebra might observe that \cref{ax:zero} and \ref{ax:combinativity} are together equivalent to requiring that $\Lrn_\phi$ be a monoid action of $(\confdom, \cseq, \bot)$ on $\Theta$. What about the remaining structure of the confidence domain: the top element ($\top$) and the order ($\le$)? What should it mean for $\Lrn$ to respect these parts of a confidence domain?
%
% In the language of algebra, \cref{ax:zero} and \ref{ax:combinativity} (and \cref{ax:cont-and-smooth}) together require $\Lrn_\phi$ to be a (smooth) {action} of the monoid $(\confdom, \cseq, \bot)$ on $\Theta$.


\commentout{%
% This allows us to define
This suggests that we could use $\Lrn$ to define the 
 	belief states in which ``$\phi$ is true'' to be
	image of this projection (i.e., the set of fixed points of $\smash{\Lrn^{\top}_\phi}$);
% ($\im \Lrn^\top_\phi$)
% as the 
after all, it is easily shown that learning $\phi$ with any degree of confidence (i.e., applying $\Lrn_\phi^\chi$) has no effect on these states. 
This illustrates a general point: if the function $\Lrn$ captures the
	belief updating process, we can use it to understand the relationship between $\Phi$ and $\Theta$ at an abstract level.
In \cref{ex:classifier}, for instance, 
		although the network weights $\Theta$ are an uninterpreted subset of some high dimensional space, 
		the training process $\Lrn$ 
		arguably imbues them with meaning by defining
			a connection between them and the training examples.
					
% However, 
% to some readers, this may seem backwards:
%
% However, u
However, to some readers, using $\Lrn$ to define truth may seem backwards.
In a given learning setting, we may already have a sense of
which belief states $\theta$ correspond to full belief in $\phi$---in \cref{ex:prob-simple}, for instance, 
% belief states are probability measures, and so already ascribe (a degree of) truth to events $\phi$. 
	they are the measures that give $\phi$ probability 1.
In such cases, we may want additional axioms ensuring that 
	any relationships between $\Theta$ and $\Phi$ implicit in $\Lrn$
	are compatible with the ones we already have. 
}
% and want to require that $\Lrn\!^\top$ projects to the \emph{correct} subspace.
Our axioms so far have been conditions on the separate commitment functions $F: \confdom \times \Theta \to \Theta$, which we have called ``$\Lrn_\phi$'',
	% $
	% % \{
	% \Lrn_\phi 
	% % \}_{\phi \in \Phi}
	% $, 
	but we have not required that $F= \Lrn_\phi$ have any relationship to observations $\phi$. 
 	% we want to impose another axiom ensuring that this definition of
	% truth lines up with one
	%  % we already had in mind.
	%  already present.
	% analogous to \cref{ax:zero}
% To address this, we must introduce the remaining formalism.
To address this, we must reintroduce the final pieces of our formalism.
% additional structure.

\commentout{%
% Yet the monoid $(\confdom, \cseq,\bot)$ is not a 
Beyond the data of this monoid, a confidence domain $D$ also has 
	an order ($\le$),
	a geometry,
	and an absorbing top element ($\top$). 
% What about the remaining structure of the confidence domain: the top element ($\top$) and the order ($\le$)? What should it mean for $\Lrn$ to respect these parts of a confidence domain?
% The question of what it means for $\Lrn$ to preserve this structure leads us to additional axioms characterizing full-confidence udpates and, respectively. 
What should it mean for $\Lrn$ to preserve this additional structure? 
% One answer is \cref{ax:ineq-witness}, which we discuss in the appendix,
% but it is perhaps more intuitive
% But there is also another answer.
% The answer is what makes learner's confidence so unique. 
}%


% For the domains $[0,\infty]$ and $[0,1]$,
% % \cref{ax:ineq-witness} and its strict analogue L4{$^{<}$} are direct consequences of \cref{ax:combinativity}. 
% \cref{ax:ineq-witness} follows from \cref{ax:combinativity}. 





% \paragraph{Belief and Learning.}
% \subsection{Modeling Observations: Learning, Belief, and Structural Symmetry}
% \subsection{Modeling Observations: 
% 	Degree of Belief, and Structural Symmetry}
\subsection{Observations and Degree of Belief}
	\label{ssec:full-learn}
% Recall that a \emph{learning setting} is a triple $(\Theta, \Phi, \confdom)$
% consisting of
%     a space $\Theta$ of beliefs,
%     a language $\Phi$ of observations,
%     and a confidence domain $\confdom$.

% \textbf{Belief.~}
% In the setting $(\Theta, \Phi, \confdom)$,
Consider a function $\Bel : \Phi \times \Theta \to \confdom$
that associates each belief state $\theta$ with 
	% a ``confidence'' in each observation $\phi$. 
	a degree of belief in each statement $\phi$. 
	% a function $\Bel_\theta$ 
	% that gives a degree of belief to each observation. 
% This form of confidence 
% The output of $\Bel$ is an \emph{internal} confidence---like a prior probability, not a learner's confidence.
This usage of $\confdom$ represents ``confidence'' in the standard sense of likelihood, rather than of trust.
% The output of $\Bel$ is 
%
Still, we can use $\Bel$ to articulate another key desideratum for the latter:
	learning $\phi$ with more confidence should lead to more belief in $\phi$.
%
% Because $\confdom$ is a confidence domain, we also have some structure.
% For example
%
% Our primary reason for defining $\Bel$ is that 
% 	we would like $\Lrn$ to be monotonic with respect to $\Bel$.
%     That is, learning $\phi$ with more confidence
%     should lead to more belief in $\phi$.

\begin{LrnBelAxioms}[nosep]
	\item 
	$\forall \phi,\theta,\chi,\chi'.\quad$
	$\chi \ge \chi'
	% \quad\implies\quad
	$\\$
	\implies
	% $\forall \phi,\theta,\chi \ge \chi'.~~
	\Bel(\phi, \Lrn(\phi,\chi,\theta)) \ge \Bel(\phi, \Lrn(\phi, \chi', \theta))
	% \Bel_\phi \circ \Lrn_\phi(\chi,\theta) \ge \Bel_\phi \circ \Lrn_\phi(\chi', \theta)
	$.
		\label{ax:monotone}
\end{LrnBelAxioms}

We cannot ask for strict monotonicity, however:
if we already fully believe $\phi$ (i.e., $\Bel(\phi,\theta) = \top$),
% we cannot attain a higher degree of belief by learning $\phi$.
we cannot attain a higher degree of belief by learning $\phi$.
% Next, if we already believe $
% Instead, if we fully believe $\phi$, learning $\phi$ should
Instead, if we fully believe $\phi$, learning $\phi$ should
	have no effect.

\begin{LrnBelAxioms}[nosep]
\item If $\Bel(\phi,\theta) = \top$, then
    $\Lrn(\phi,\chi,\theta) = \theta$. 
    \label{ax:truth-is-enough}
\end{LrnBelAxioms}
% 
% % Finally, perhaps a
% The converse of \cref{ax:truth-is-enough}
% % $\Bel$ also allows us to be more precise about the effect of
% an intuitive characterization of full-confidence updates:
% if we learn something with full confidence, then we fully believe it.
% 
Perhaps even more importantly, 
if we learn something with full confidence, then we ought to fully believe it.

\begin{LrnBelAxioms}[nosep]
    \item $\Bel(\phi, \Lrn(\phi,\top,\theta)) = \top$.
        \label{ax:effectiveness}
\end{LrnBelAxioms}
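
These three axioms can be checked by hand in a probabilistic sketch.
Assume (both choices are ours, made for illustration) that $\Bel(\phi,\mu) = \mu(\phi)$ and that the update is the mixture $\Lrn(\phi,\alpha,\mu) = (1-\alpha)\mu + \alpha\,(\mu|\phi)$ with $\mu(\phi) > 0$.
Then
\[
	\Bel\big(\phi, \Lrn(\phi,\alpha,\mu)\big) = (1-\alpha)\,\mu(\phi) + \alpha = \mu(\phi) + \alpha\,\big(1-\mu(\phi)\big),
\]
which is nondecreasing in $\alpha$ (\cref{ax:monotone}) and equals $1 = \top$ at $\alpha = 1$ (\cref{ax:effectiveness}).
If $\mu(\phi) = 1$ already, then $\mu|\phi = \mu$ and the update returns $\mu$ for every $\alpha$ (\cref{ax:truth-is-enough}); this is also the case in which the displayed quantity is constant in $\alpha$, so monotonicity cannot be strict.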

\commentout{%
While
\cref{ax:effectiveness} is certainly desirable,
		it may not always hold in cases of interest.
	% setting up the model so that this is true may not be worthwhile.
	In \cref{ex:classifier}, for instance,
	% In the case of training a classifier ( \cref{ex:classifier}), for instance,
		it is natural to set $\Bel(\theta,(x,y)) = f_\theta(y|x)$,
	and there may be a local maximum $\theta$ of the parameterization
		$\theta \mapsto \Bel(\theta,(x,y))$ that is not a global one.
	In this case, there is no continuous monotonic path from $\theta$ to a global maximum $\theta^*$ for which $f_{\theta^*}(y|x) = 1$,
	(i.e., no way to satisfy \cref{ax:monotone,ax:effectiveness,ax:cont-and-smooth}).
}%

While
\cref{ax:monotone,ax:effectiveness,ax:cont-and-smooth}
are serious constraints on $\Lrn$ if $\Bel$ is given,
one can easily define $\Bel$ based on $\Lrn$ so as to ensure that 
\cref{ax:monotone,ax:effectiveness,ax:cont-and-smooth}
% hold for trivial reasons.
hold trivially.
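
One construction in this spirit (a sketch of ours, not necessarily the one intended) is the two-valued believer
\[
	\Bel(\phi,\theta) =
	\begin{cases}
		\top & \text{if } \Lrn(\phi,\top,\theta) = \theta, \\
		\bot & \text{otherwise,}
	\end{cases}
\]
which declares $\phi$ fully believed exactly at the fixed points of the full-confidence update.
Assuming that \cref{ax:zero} states $\Lrn(\phi,\bot,\theta) = \theta$ and that \cref{ax:idemp} states that full-confidence updates are idempotent, \cref{ax:effectiveness} is then a restatement of \cref{ax:idemp}, and \cref{ax:truth-is-enough} follows from \cref{ax:acyclic} applied to $\bot \le \chi \le \top$.
For \cref{ax:monotone} with $\chi > \chi'$, \cref{ax:ineq-witness} writes $\Lrn(\phi,\chi,\theta)$ as a residual update of $\Lrn(\phi,\chi',\theta)$, which leaves the latter unchanged whenever it is already a fixed point.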


% In the abstract setting where $\Theta$ and $\Phi$ have no a priori, 
% however, axioms \cref{ax:monotone,ax:truth-is-enough,ax:effectiveness}
% have no bite.
% \begin{linked}{prop}{synthetic-bel}
% 	For every learner $\Lrn$, there exists a 
% 	believer $\Bel$ such that the pair $(\Lrn, \Bel)$ satisfy 
% 	\cref{ax:monotone,ax:truth-is-enough,ax:effectiveness}
% \end{linked}

% Thus, a no-confidence update simply discares the new observation.
% At the opposite extreme, we call $F^1_\phi$ a \emph{full update}. 
% The appropriate way to deal with full confidence depends
% At the opposite extreme, the appropriate way to deal with full confidence
% The appropriate way to deal with full confidence, on the other hand,
% depends on the relationship between $\Theta$ and $\Phi$.

% but it still characterized by an important property.
% but it can still be characterized at this level of generality.

% \textbf{High Confidence Updates.}


% Not only 
% 
% \begin{CFaxioms}
% 	\item[\ref*{ax:idemp}']
% 	% Full-confidence updates are idempotent.
% 	% For all $\phi \in \Phi$, the update $F_\phi$ is idempotent.
%     % That is, for all $\phi \in \Phi$,  $F^1_\phi \circ F^1_\phi = F^1_\phi$.
%     % That is, for all $\phi \in \Phi$ and $\theta \in \Theta$,  $F^1_\phi \circ F^1_\phi = F^1_\phi$.
%     % (i.e., $F^1_\phi \circ F^1_\phi = F_\phi$).
% 	$F^\chi_\phi \circ F^\chi_\phi = F^\chi_\phi$ iff $\chi \in \{0,1\}$
% 		or $\phi$ is trivial (in the sense that $F^\chi_\phi(\theta) = \theta$ for all $\phi,\chi,\theta$).
% 	\label{ax:idemp-strong}
% \end{CFaxioms}
% 




% Now, we're not quite in the same position as Shafer.
% Shafer was prescribing a concrete representation of $\Theta$ (a belief function) and a concrete update rule $F$ (Dempster's rule of combination), and so he needed to defend these choices.
% We only need to defend something much more modest: we only need to defend the assumption that, if $\Theta$ and $\Phi$ properly model the relevant aspects of the scenario at hand, then there exists \emph{some} function $F$ which performs updates appropriately.


% \subsection{
% 	% Continuity and
%  	% The Path of Middling Confidence
% 	% Paths of Intermediate Confidence
% 	Intermediate Confidence Values
% 	}



% $\theta$ such that
% the loss $\mathcal L(\theta, \phi) < \infty$ that the training algorithm minimizes is finite.

\commentout{%
\textbf{Symmetry.}
We would also like update rules to preserve any joint symmetries between the belief space $\Theta$ and the observation language $\Phi$.
For instance, in \cref{ex:prob-simple}, we would like to require that updates are not sensitive to irrelevant relabelings of points.
% Concretely, let $\mathrm{Aut}(X, \Phi)$ be the set of automorphisms $\sigma : X \to X$, together with an action on assertions, so that $\sigma\phi \in \Phi$ is the appropriately relabeled assertion equivalent to $\phi$ after the relabeling.
Concretely, assume we have some set $\mathrm{Aut}(\Theta, \Phi)$ of 
	structural symmetries
	(in the form of automorphisms $\sigma : (\Theta \sqcup \Phi) \to (\Theta \sqcup \Phi)$)
	% $\sigma : \Theta \to \Theta$ 
	that have an action both on belief states ($\sigma(\theta) \in \Theta$) and 
		on observations ($\sigma(\phi) \in \Phi$).
	% (say, rotations of the simplex of distributions), that also have an associated action on assertions, so that $\sigma\phi \in \Phi$ is the corresponding relabeling of $\phi$ under $\sigma$.  
		The symmetry condition can now be captured by:
\begin{LrnAxioms}
	\item
	% For all 
	$\forall \theta,\phi,\chi,~~
	 \sigma
	% : X \to X
	% \in \mathrm{Aut}(X, \Phi)$, we have
	\in \mathrm{Aut}(\Theta, \Phi)
	% $, we have
	.\quad$
% \\ \indent\hspace{2em}
% $F^\beta_A(\sigma_\#(\Pr)) = \sigma_\#\Big(F^\beta_{\{\sigma(a) : a \in A \}}(\Pr)\Big)$,
% $F^\beta_\phi (\sigma_\#(\Pr)) = \sigma_\#\Big(F^\beta_{\sigma\phi}(\Pr)\Big)$,
% $F^\beta_{\sigma\phi} (\sigma(\theta)) = \sigma \Big( F^\beta_{\phi}(\Pr)\Big)$.
% $\Lrn_{(\sigma(\theta),\sigma(\phi))} = \sigma \circ \Lrn_{(\theta,\phi)}$.
% $\Lrn_{(\sigma\theta,\sigma\phi)} = \sigma \circ \Lrn_{(\theta,\phi)}$.
$\Lrn(\sigma(\phi), \chi , \sigma(\theta)) = \sigma (\Lrn(\theta,\chi, \phi))$.
	% \hfill \textbf{(symmetry)}
	 \label{ax:symmetry}
\end{LrnAxioms}
}
