\section{Setup}
First, we briefly overview the Neural Radiance Fields (NeRF) framework with key implementation details.
NeRF models the 3D scene as a continuous function $F_{\theta}$, parameterized by a multi-layer perceptron (MLP).

Specifically, given a spatial coordinate \( \mathbf{x} \in \mathbb{R}^3 \) in the scene and a viewing direction \( \mathbf{d} \), a unit vector in $\mathbb{R}^3$ that can equivalently be described by two angles, NeRF predicts the corresponding RGB color \( \mathbf{c} \) and volume density \( \sigma \):
\vspace{-1mm}
\begin{align*}
    F_\theta : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma).
\end{align*}

NeRF models are trained with classic differentiable volume rendering, which computes the color of any ray passing through the scene volume and projected onto the camera. Each ray is parameterized as $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ with $t \in \mathbb{R}^+$, where $\mathbf{o} \in \mathbb{R}^3$ is the camera position and $\mathbf{d}$ is the ray direction.
Note that for each $t$, $\mathbf{r}(t)$ is a position in $\mathbb{R}^3$. The density $\sigma$ defines the geometry of the scene and depends only on this position, whereas the color $\mathbf{c}$ also depends on the viewing direction $\mathbf{d}$. The color $C(\mathbf{r})$ of the ray is then given by the volume rendering equation, integrated between the near and far bounds $t_n$ and $t_f$:
\begin{align*}
    C(\mathbf{r}) &= \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt \,, \\
    T(t) &= \exp\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s)) ds \right) \,.
\end{align*}

Given training images with known camera positions $\mathbf{o}$ and viewing directions $\mathbf{d}$, the ground-truth color $C(\mathbf{r})$ of each ray is read from the corresponding pixel. To estimate it, we evaluate the NeRF model along the ray and apply the volume rendering equation, which yields $\hat{C}(\mathbf{r})$.
To avoid evaluating the continuous integral directly, it is common to discretize it: randomly sample $N$ values $\{t_1, t_2, \ldots, t_N \}$ of the ray parameter, sorted in increasing order, and obtain the positions $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ on the ray with $\mathbf{x}_i = \mathbf{o} + t_i\mathbf{d}$.
The color is then estimated by the following equation, where $\sigma_i$ and $\mathbf{c}_i$ are the density and color predicted at $\mathbf{x}_i$, $\delta_i = t_{i+1} - t_i$ is the sampling interval, and $\alpha_i = 1 - \exp(-\sigma_i\delta_i)$ is the opacity of the $i$-th interval:
\begin{align*}
    \hat{C}(\mathbf{r}) = \sum_{i=1}^N T_i \alpha_i \mathbf{c}_i, \quad T_i = \exp\left(-\sum^{i-1}_{j=1} \sigma_j\delta_j\right)  \,.
\end{align*}
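To see where the weight $\alpha_i$ comes from, the following is a short sketch under assumptions that are standard for this quadrature but are not stated above: $\sigma$ and $\mathbf{c}$ are taken to be constant, equal to $\sigma_i$ and $\mathbf{c}_i$, on each interval $[t_i, t_{i+1})$, with $t_1 = t_n$ and $t_{N+1}$ taken as $t_f$. Under these assumptions the transmittance factorizes as $T(t) = T_i \exp(-\sigma_i (t - t_i))$ for $t \in [t_i, t_{i+1})$, and the contribution of the $i$-th interval to $C(\mathbf{r})$ is
\begin{align*}
    \int_{t_i}^{t_{i+1}} T(t)\, \sigma_i\, \mathbf{c}_i \, dt
    = T_i\, \mathbf{c}_i \int_{0}^{\delta_i} \sigma_i e^{-\sigma_i s} \, ds
    = T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i
    = T_i\, \alpha_i\, \mathbf{c}_i \,.
\end{align*}
Summing these contributions over $i = 1, \ldots, N$ recovers the estimator $\hat{C}(\mathbf{r})$ above.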
% where we denote the sampling interval $\delta_i = t_{i+1} - t_i$,
% \begin{align*}
%      T_i = \exp(-\sum^{i-1}_{j=1} \sigma_j\delta_j),  \quad \alpha_i = 1 - \exp(-\sigma_i\delta_i), \quad \,.
% \end{align*}
Following this volume rendering procedure, the NeRF function \( F_\theta \) is optimized by minimizing the squared error between the estimated and ground-truth colors of a batch of rays \( \mathcal{R} \) cast from a set of training views captured from different viewpoints:
\begin{align*}
    L_{\text{NeRF}} = \sum_{\mathbf{r} \in \mathcal{R}} \left\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\|^2 \,.
\end{align*}
While NeRF achieves outstanding results in view synthesis, it traditionally demands a substantial collection of densely captured, camera-calibrated images. To address the difficulty of such extensive data collection, we introduce a more efficient framework in the next section.


% \vspace{-10mm}
\section{Framework}
\label{framework_label}

In this section, we outline the principal framework underlying our algorithm design. As we need to choose training images instead of rays, we denote by $\mathcal{R}$ the set of images in this section. Given the limited number of training samples, it is essential to select a sparse but information-rich subset, $\mathcal{R}_s \subset \mathcal{R}$, that captures the details of the scene and generalizes well to other views of the scene or object. Therefore, to establish a criterion for assessing how well an image captures scene information, we draw on principles from information theory to devise an appropriate metric.

\vspace{-12mm}
\begin{center}
\begin{figure*}[!t]
  \centering
  
\includegraphics[width=1.8\columnwidth]{Figures/framework.png}
    % \vspace{-3mm}
  \caption{\textbf{Overview of our framework.} First, we leverage mutual information and relative information to quantify the uncertainty in inferring unknown images conditioned on known ones. This uncertainty is decomposed into a semantic space distance (macro) and a pixel space distance (micro). These distances are then instantiated in forms tailored to quantifying mutual information in each scenario. In sparse view sampling, a greedy algorithm finds a near-optimal solution that minimizes mutual information. We use the Euclidean distance between camera positions to represent the pixel space distance and propose a sequential method that prioritizes either semantics or pixels. For few-shot view synthesis, we use color distance to represent the pixel space distance and maximize mutual information through efficient plug-and-play regularization terms.}
  \vspace{-4mm}
\label{fig:overview}
\end{figure*}   
\end{center}

% Subsequently, it becomes necessary to establish a criterion for assessing the adequacy of an image in capturing scene information. To this end, we draw upon principles from information theory to devise an appropriate metric. 

In information theory, mutual information quantifies the reduction in uncertainty of one variable given the knowledge of another. This concept aligns with our objectives in the context of NeRF. Specifically, we utilize the information from a known image, $R$, which includes a subset of views, to infer properties about an unknown image, $\overline{R}$. 
% Consequently, we employ mutual information to gauge the uncertainty in predicting 
% $\overline{R}$ based on $R$.

\begin{definition}[Mutual Information]
\label{def:mutual}
Mutual information measures dependencies between random variables. Given two random variables \( R \) and \( \overline{R} \), it can be understood as how much knowing \( R \) reduces the uncertainty in \( \overline{R} \) or vice versa. Formally, the mutual information between \( R \) and \( \overline{R} \) is:
% \vspace{-3mm}
\begin{align*}
    I(R, \overline{R}) = H(R) - H(R | \overline{R}) = H(\overline{R}) - H(\overline{R} | R).
\end{align*}

% \vspace{-3mm}
\vspace{-3mm} where $H(R)$ denotes the entropy of the random variable \( R \), and $H(R|\overline{R})$ denotes the conditional entropy, i.e., the uncertainty that remains about $R$ once $\overline{R}$ is known.

\end{definition}

% \setlength{\textfloatsep}{15pt}


\vspace{-3mm} In the context of NeRF, $H(R)$ represents the information content of the image \( R \), and $H(R|\overline{R})$ represents the uncertainty that remains about $R$ once the image $\overline{R}$ is known. Our objective is to quantify the mutual information $I(R,\overline{R})$, that is, how much of one image can be deduced from another. Assuming symmetry among all images and an equal number of rays, we hypothesize that the inherent information content $H(R)$ is the same for every image. By Definition~\ref{def:mutual}, $I(R,\overline{R})$ is then determined entirely by the conditional terms: minimizing the mutual information is equivalent to maximizing the conditional information $H(R | \overline{R})$ and $H(\overline{R}|R)$, and vice versa.

We then adopt both macro and micro perspectives to describe the conditional information $H(R|\overline{R})$. 

From the macro perspective, the semantic features of the entire image serve as indicators of the uncertainty in the relative information. To gauge the similarity between two images, we use CLIP~\cite{radford2021learning} to extract semantic features.

\begin{definition}[Semantic Space Distance]
    Let $f$ be a CLIP feature extractor. We define the semantic space distance between images $R$ and $\overline{R}$ as one minus the cosine similarity of their features:
    \begin{align*}
        s(R, \overline{R}) = 1 - \frac{f(R)^\top f(\overline{R})}{\|f(R)\|\,\|f(\overline{R})\|} \,.
    \end{align*}
\end{definition}
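
A brief sanity check on this definition, assuming only that the features are nonzero vectors: since the cosine similarity lies in $[-1, 1]$, the semantic space distance satisfies $s(R, \overline{R}) \in [0, 2]$, with $s(R, \overline{R}) = 0$ exactly when the two feature vectors point in the same direction. Writing $\hat{f}(R) = f(R)/\|f(R)\|$ for the normalized feature (notation introduced only for this remark), $s$ is also half the squared Euclidean distance between normalized features:
\begin{align*}
    \tfrac{1}{2} \big\| \hat{f}(R) - \hat{f}(\overline{R}) \big\|_2^2
    = \tfrac{1}{2} \Big( \|\hat{f}(R)\|_2^2 + \|\hat{f}(\overline{R})\|_2^2 \Big) - \hat{f}(R)^\top \hat{f}(\overline{R})
    = 1 - \hat{f}(R)^\top \hat{f}(\overline{R})
    = s(R, \overline{R}) \,.
\end{align*}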

From the micro perspective, the relative information between images can be decomposed into the relative differences between their rays. Suppose the rays in the two images are described by $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ and $\mathbf{\overline{r}}(t) = \mathbf{\overline{o}} + t\mathbf{\overline{d}}$. The directions can be represented in angular form as $\mathbf{d}:(\theta_1, \phi_1)$ and $\mathbf{\overline{d}}:(\theta_2, \phi_2)$, where $\theta_1, \theta_2 \sim U(\theta,\overline{\theta})$ and $\phi_1, \phi_2 \sim U(0,2\pi)$ are sampled from uniform distributions and $\theta$ and $\overline{\theta}$ are fixed parameters. We assume that the two rays extend over lengths $T_1$ and $T_2$ along their respective directions. We then measure the distance between the two sets of rays by the expected squared Euclidean distance over all pairs of points on the two rays:

\begin{definition}[Pixel Space Distance]
    We define the distance between images in pixel space as the expected squared Euclidean distance between pairs of points on rays from the two images, integrated along both rays:
    \begin{align*}
        d(R, \overline{R}) = E_{\mathbf{r} \in R, \overline{\mathbf{r}}  \in \overline{R}} \left [\int_{ 0}^{T_1} \int_{0}^{T_2} \|\mathbf{r}(t_1)-\mathbf{\overline{r}}(t_2)\|^2_2 dt_2 dt_1 \right] \,.
    \end{align*}
\end{definition}

Note that measuring the distance between images is equivalent, up to scaling and an additive constant, to measuring the squared distance $\|\mathbf{o}-\mathbf{\overline{o}}\|^2_2$ between the corresponding camera positions, as stated in the following lemma:

\begin{restatable}{lemma}{camera}
\label{lem:camera}
The distance between two images can be expressed through the squared Euclidean distance between the two camera positions, $\|\mathbf{o}-\mathbf{\overline{o}}\|^2_2$, as follows:
    \begin{align*}
        d(R, \overline{R}) = T_1 T_2 \|\mathbf{o}-\overline{\mathbf{o}}\|^2_2 + C \,,
    \end{align*}
\end{restatable}
where $C$ is a constant independent of $\mathbf{o}$ and $\mathbf{\overline{o}}$.
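The leading term of Lemma~\ref{lem:camera} can be read off by expanding the integrand. As a sketch under assumptions that the statement above does not spell out (unit-norm directions, and $\mathbf{d}$, $\overline{\mathbf{d}}$ drawn independently from the same distribution), write $\Delta = \mathbf{o} - \overline{\mathbf{o}}$, so that
\begin{align*}
    \int_{0}^{T_1}\!\! \int_{0}^{T_2} \big\| \Delta + t_1 \mathbf{d} - t_2 \overline{\mathbf{d}} \big\|_2^2 \, dt_2 \, dt_1
    = T_1 T_2 \|\Delta\|_2^2
    + T_1 T_2 \, \Delta^\top \big( T_1 \mathbf{d} - T_2 \overline{\mathbf{d}} \big)
    + \frac{T_1^3 T_2 + T_1 T_2^3}{3}
    - \frac{T_1^2 T_2^2}{2} \, \mathbf{d}^\top \overline{\mathbf{d}} \,.
\end{align*}
The first term is the one appearing in the lemma, and the last two terms do not involve $\mathbf{o}$ or $\overline{\mathbf{o}}$, so their expectation is absorbed into $C$. The cross term has expectation $T_1 T_2 (T_1 - T_2)\, \Delta^\top \mathbb{E}[\mathbf{d}]$ under the assumptions above, which vanishes, for example, when $T_1 = T_2$ or when the directions have zero mean.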
Therefore, we use the measures $d(R, \overline{R})$ and $s(R, \overline{R})$ to represent the relative information $H(R| \overline{R})$, as formalized in the following assumption:

\begin{assumption}
\label{ass:inverse}
    We assume that the relative information of two images $H(R| \overline{R})$ is proportional to both the semantic space distance and the pixel space distance between them, that is,
\[
    H(R| \overline{R}) \propto s(R, \overline{R}), \quad H(R| \overline{R}) \propto d(R, \overline{R}) \,.
\]
\end{assumption}
Note that when predicting the information of an uncaptured image $\overline{R}$, we are not limited to using information from a single image in the training set. Rather, we can harness the collective information from multiple images, denoted $R_1, R_2, \ldots, R_m$. It therefore becomes necessary to extend the definition of mutual information to more than two variables, capturing the interdependencies among them. Drawing on insights from \citep{williams2010nonnegative}, we take the mutual information between a set of images and the target image to be the maximal mutual information between the target and any single image in the set. The formulation is detailed below:

\begin{definition}[Mutual Information for multiple images]
\label{def:multi}
    Suppose we have several images $R_1, R_2, \ldots, R_m$ in the training set and want to infer the information of an unknown image $\overline{R}$. The mutual information between this image and the training images is defined as:
    \begin{align*}
        I(R_1, R_2, \ldots, R_m; \overline{R}) = \max_{i=1,2,\ldots,m} I(R_i, \overline{R})\,.
    \end{align*}
\end{definition}
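
Combined with the decomposition in Definition~\ref{def:mutual}, this definition admits a simple reading, sketched here as an illustration: since $H(\overline{R})$ does not depend on $i$,
\begin{align*}
    I(R_1, R_2, \ldots, R_m; \overline{R})
    = \max_{i=1,2,\ldots,m} \big[ H(\overline{R}) - H(\overline{R} \,|\, R_i) \big]
    = H(\overline{R}) - \min_{i=1,2,\ldots,m} H(\overline{R} \,|\, R_i) \,.
\end{align*}
If Assumption~\ref{ass:inverse} is additionally read as saying that $H(\overline{R} \,|\, R_i)$ grows with the distance between $R_i$ and $\overline{R}$ (an interpretation, not a proved statement), then the mutual information with the training set is governed by the training image closest to $\overline{R}$ under $s$ or $d$.
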
% After introducing the framework, we will demonstrate our algorithm for solving two important tasks: sparse view sampling and few-shot view
% synthesis in the following sections.
After presenting the framework, we will illustrate our algorithm's efficacy on two critical tasks, detailed in the subsequent sections: sparse view sampling and few-shot view synthesis.