	\label{sec:Bayes}

% To some, the very notion of an update is 
% To some, the notion of confidence is well-defined 
% In some sense, updates 
% To some, there is a ``correct'' of updating beliefs: Bayes' Rule. 
Many believe that ``correctly'' accounting for confidence in updating (probabilistic) beliefs
    is a matter of properly applying \emph{Bayes' Rule (BR)}. 
%
\vnew{%
To some, this simply means that belief updates are given by conditioning (i.e., $\Lrn(\mu,\top,\phi) = \mu |\phi$ with the trivial confidence domain $\{\bot,\top\}$), in which case BR is a helpful theorem. Others reject the idea that learning necessarily establishes a proposition with certainty in the posterior (at least as far as one's belief state is concerned); for them, BR describes the update itself. 
We now analyze these accounts of Bayesianism within our framework.
}%
% We now explore the latter approach within our framework.
%
% It does not necessarily establish a single proposition with certainty, at least from the perspective of one's belief state.
    % yet does not restrict us to learning propositions with certainty,
    % and in which only one degree of confidence is necessary. 
% According to this picture, confidence is unnecessary. 
% In this picture, 
%     % no special treatment of confidence is necessary;
%     confidence is synonymous with probability,
%     and it does not particularly meaningful to update ``with a certain degree of confidence''.
%
%
\commentout{%
For a Bayesian, a belief state is a probability distribution
$P \in \Delta \mathcal H$
over hypotheses
$h \in \mathcal H$,
each of which encodes a distribution 
$P(\phi \mid h)$
over possible observations.
% More precisely:
}%
% , and so in the background we also have conditional probabilities $P(\phi \mid h)$. 
\commentout{%
\unskip---so we also have conditional probabilities of the form $P(\phi \mid h)$. 
% each of which also encodes 
We now explore this important special case within our formal framework.
}%

\begin{defn}
    $\Lrn$ is \emph{Bayesian}
    iff
    \begin{enumerate}
            % [label=(\alph*),itemsep=0.2ex,parsep=0pt,topsep=0pt,leftmargin=1em]
            [label=(\alph*),itemsep=0.2ex,parsep=0pt,topsep=0pt]
        \item 
            % $\Theta$
            % in the sense that
            % there exists 
            % % a measurable space $\mathcal X = (\Omega, \mathcal F)$ and 
            % a probability measure
            % $P_\theta \in \Delta \Omega$
            % over a space $\Omega$ for each $\theta \in \Theta$;
            %
            % belief states $\theta \in \Theta$ 
            % are probability distributions over a measurable space
            belief states 
            % $\theta \in \Theta$
            correspond to distinct probability distributions over a measurable space
            $\mathcal H$ of hypotheses
            (i.e., there is an injection $\theta \mapsto P_\theta : \Theta \to \Delta \mathcal H$). 
        \item 
            there is
            a measurable space $(\mathcal X, \mathcal A)$
            in which every observation $\phi$ 
            % is associated with an event $A_\phi \in \mathcal A$;
            can be viewed as an event
            % $\phi \in \mathcal A$
            (i.e., $\mathcal A \supseteq \Phi$)
            \unskip;
        \item there is a conditional probability (i.e., a Markov kernel)
            $P(X \mid H) : \mathcal H \to \Delta \mathcal X$,
            associating each hypothesis $h$ with a probability measure over $\mathcal X$;
        \item 
            there exists $\star \in \confdom$ such that, 
            % for all $\phi \in \Phi$ and $\theta \in \Theta$,
            for all $\phi$ and $\theta$,\\
            % $\exists \star \in \confdom.\forall\phi \in \Phi. \forall\theta \in \Theta$.
            $
            % P_{\normalsize\Lrn(\phi,\star,\theta)}
            P_{\!\textstyle\Lrn_\phi^\star(\theta)}\!
            ( h ) {=} P_\theta(h)  P(\phi | h)  / 
                \sum_{h'} \!P_\theta(h') P(\phi | h')
            $
            % $
            % % \[
            %     P_{
            %     \Lrn(\theta, \star, \phi)
            %     }
            %     (h) 
            %         % = P(\omega \mid \phi)
            %         % := 
            %         \propto P_\theta(h)\cdot P(\phi \mid h) 
            %         % = \theta(\omega)\cdot P(\phi \mid \omega) 
            %         %     \Big/ \sum_{\omega'\in \Omega}\theta(\omega')\cdot P(\phi \mid \omega')
            % % \]
            % $
            % for all $\phi \in \Phi$
            % for all $\phi$ and $\theta$
            \unskip.\!\!\!\!
            \qedhere
    \end{enumerate}
\end{defn}

Item (d) is Bayes' Rule, and prescribes the posterior belief ``$P(H | \phi)$''.
Note that $\phi$ is not an event in the sample space $\mathcal H$,
    but in the space $\mathcal X$; we regard it as the event $\phi \times \mathcal H$ in
    $\mathcal X \times \mathcal H$
    for the purposes of conditioning the joint measure $P(X,H) := P(X|H)P_\theta(H)$.
To obtain a new belief state of the same type as the original (i.e., a distribution over $\mathcal H$), however, we must also marginalize out $\mathcal X$.
Thus, apart from its effect on the hypotheses, $\phi$ is forgotten after the update.
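As a sketch of this two-step reading, assume for simplicity that $\mathcal H$ is countable (as the sum in item (d) suggests) and that the normalizing constant is positive.
Conditioning the joint measure on the event $\phi \times \mathcal H$ and then marginalizing out $\mathcal X$ gives
\[
    P(H {=} h \mid X \in \phi)
    = \frac{P(\phi \times \{h\})}{P(\phi \times \mathcal H)}
    = \frac{P_\theta(h)\, P(\phi \mid h)}{\sum_{h'} P_\theta(h')\, P(\phi \mid h')},
\]
which is the right-hand side of item (d).
For instance, with two hypotheses and the purely illustrative values $P_\theta(h_1) = P_\theta(h_2) = 1/2$, $P(\phi \mid h_1) = 0.9$, and $P(\phi \mid h_2) = 0.3$, the posterior probability of $h_1$ is $0.45 / (0.45 + 0.15) = 0.75$.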
%
% Bayesian update
% This setting essentially involves constructing an extended sample space
% $\mathcal H \times \mathcal X$, and observing an event $\mathcal H \times \phi$.

In the special case where $P(X|H)$ is deterministic (i.e., theories are \emph{complete} enough to determine observations),
%  (i.e., $P(\phi \mid h)  \in \{0,1\}$),
the extended sample space $\mathcal H \times \mathcal X$ is not meaningfully different from $\mathcal H$, and we simply update by conditioning
    (as in \cref{ex:prob-simple} with full confidence).
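To see this in a sketch, suppose $P(\phi \mid h) \in \{0,1\}$ for all $h$ and $\phi$, and write
$H_\phi := \{ h \in \mathcal H : P(\phi \mid h) = 1 \}$
(notation introduced here only for illustration) for the set of hypotheses that predict $\phi$.
Assuming $P_\theta(H_\phi) > 0$, item (d) reduces to
\[
    P_{\!\textstyle\Lrn_\phi^\star(\theta)}(h)
    = \frac{P_\theta(h)\, \mathbf 1[h \in H_\phi]}{P_\theta(H_\phi)}
    = (P_\theta \mid H_\phi)(h),
\]
that is, to ordinary conditioning of $P_\theta$ on $H_\phi$.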
% This is the most common and easily defensible reading of the word \emph{Bayesian}. 
% However, many modelers do not make this assumption; for them, marginalizing out $\mathcal X$ is a meaningful loss of memory.  
% This is sometimes called virtual evidence \citep{}.
% In such situations, although learning involves conditioning on an event, the update may not be full-confidence.  
%
% After the update, the observation $\phi$ is forgotten
%     except insofar as it has informed the posterior over $H$. 
    % ;
    % the idea is that observations are relevant only insofar as they help us discover the ``true theory''. 
% If $P$ is not deterministic, however, then 
    % adopting a posterior belief that marginalizes out $\mathcal X$ 
    % effectively forgets whether or not $\phi$ occurs.
% This myopia is not prescribed by the standard account of probabilism,
    % (and nor by many Bayesians), but it is nevertheless a common way of making sense of iterated observation. 
    % but it is nonetheless a common way of using Bayes Rule for updates. 
% Thus, 
% Nevertheless, it is common to 
% The real question is whether or not 
At the other end of the spectrum, when $P(X|H)$ has full support, Bayesian updates are characterized by optimizing learners with linear beliefs. 

% The following result characterizes Bayesian learning as optimizing learning with linear beliefs.

\begin{linked}{prop}{Boltz-Bayes}
	$\Lrn$ is a Boltzmann learner for a potential $v \ge 0$ if and only if it is Bayesian with $P(\cdot \mid \cdot) > 0$. 
	% $\Lrn$ is a Boltzmann learner for a potential $V \ge 0$ and only if it is Bayesian with $P(\cdot \mid \cdot) > 0$. 
\end{linked}
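To illustrate the easier direction, here is a sketch under an assumption made only for this illustration: that a Boltzmann update at the distinguished confidence $\star$ reweights each hypothesis $h$ by $\exp(-v_\phi(h))$, where $v_\phi(h)$ is shorthand (not used elsewhere) for the potential of $h$ given the observation $\phi$.
Given a Bayesian learner with $P(\cdot \mid \cdot) > 0$, setting $v_\phi(h) := -\log P(\phi \mid h)$ yields a finite potential with $v \ge 0$ (because $0 < P(\phi \mid h) \le 1$), and
\[
    \frac{P_\theta(h) \exp(-v_\phi(h))}{\sum_{h'} P_\theta(h') \exp(-v_\phi(h'))}
    = \frac{P_\theta(h)\, P(\phi \mid h)}{\sum_{h'} P_\theta(h')\, P(\phi \mid h')},
\]
which is the right-hand side of item (d) in the definition of a Bayesian learner.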

This result may not be surprising to experienced readers, although one direction of the correspondence is more subtle than it might first appear. 
% \cref{prop:Boltz-Bayes} 
It also has a significant implication: 
% \textbf{
Bayesian updating corresponds to a very special kind of optimizing learning 
% with linear beliefs,
% in a setting
in which the degree of belief can be viewed as the expectation of a fixed random variable.
% }
This imposes significant limitations on how a given belief representation can be used: for example, high-confidence updates always lead to the boundary of the probability simplex.
In particular, it excludes update rules such as Jeffrey's rule, for which this is not the case.
These observations raise some interesting questions. 
Is there a generic way to capture all learners with Bayesian updates (with a necessarily much larger belief space)? 
Alternatively, are some natural learning procedures provably incompatible with the Bayesian frame?

% We mention a relevant line of research that will appear in future work. The use of relative entropy (i.e., KL divergence) as the belief measure (instead of the linear expectation as in the Boltzman rule) leads to more subtle and arguably more balanced learning procedures. This leads \citet{mixture-langs} to an alternate natural derivation of \emph{probabilistic dependency graphs} \citep{pdg-aaai}, an extension of traditional models that go well beyond ordinary probabilistic modeling to capture inconsistency and much of machine learning.
We note that using relative entropy (KL divergence), rather than linear expectation, as the target of optimization 
appears to be far more useful in practice (e.g., in \cref{ex:classifier}).
This starting point leads \citet{mixture-langs} to an alternative, natural derivation of \emph{probabilistic dependency graphs} \citep{pdg-aaai}, which go well beyond ordinary probabilistic modeling to capture inconsistency and much of machine learning.

\commentout{%
In retrospect, the theorem may seem obvious;
	translating a Bayesian update to a Boltzmann learner is as simple as taking a logarithm.
	Still, there is some subtlety going the other direction. 
This close relationship between Bayesian updates and Boltzmann reweighting seems 
to be implicitly understood in the literature, 
	but, to the best of our knowledge, has not yet been fully captured.
%
}%

% This correspondence also highlights some key shortcomings of the Bayesian update rule, as we now explore.

