\section{Sparse View Sampling}
\label{Section_view_sampling}

% In this section, we mainly discuss how to utilize the mutual information between images to guide the sampling of training data. 
% {\color{blue}Add the motivation of this setup, add some reference, we can not get the ground truth, it is high cost}
% Sparse view sampling is an active learning scheme proposed by ActiveNeRF~\cite{pan2022activenerf}. In this setting, we are given a candidate set of viewpoints but we don't initially get the corresponding ground truth images. At first, We may only have a constrained number of training images, and we analyze the inherent imperfection of the existing training images and acquire more images from some candidate viewpoints for better synthesis quality of NeRF models. Imagine we are allowed to only capture three pictures of the Eiffel Tower because of some constraints. We're presented with a set of potential positions to capture, whether from the sky or the ground. How to choose these views based on pictures we have to better present the Eiffel Tower is excatly what sparse view sampling does.
Sparse view sampling, proposed by ActiveNeRF~\citep{pan2022activenerf}, is an active learning scheme designed to enhance the quality of NeRF by strategically selecting additional viewpoints. In this setting, we begin with a limited number of training images, and a candidate set of viewpoints for which we \textbf{do not possess the corresponding ground truth images}. \textbf{It is only after a viewpoint is selected that we acquire its ground truth image}, subsequently transferring it from the candidate to the training set.
By analyzing the shortcomings of the initial images, we select additional viewpoints and then acquire the corresponding images to improve the NeRF model's synthesis quality.
For instance, suppose we can capture only three images of the Eiffel Tower and are presented with various candidate viewpoints in the sky or on the ground. Sparse view sampling selects the most informative of these viewpoints based on the images already available.

% In this section, we focus on the utilization of mutual information between images to guide sparse view sampling. Sparse view sampling involves selecting a limited number of views to observe or capture a scene. A key challenge in this context lies in efficiently selecting the limited set of views to gain more comprehensive scene information, aiding in the inference of unobserved portions.

% In our framework, the selection of an informative subset of views is guided by minimizing their mutual information. It is motivated by the observation that smaller mutual information indicates reduced relevance between views. For instance, when two images are highly similar or identical, they exhibit high mutual information, but selecting both of them may be unnecessary and result in redundant information. Therefore, we aim to design an algorithm to intelligently choose images that minimize their mutual information.

% Our framework selects an informative subset of views by minimizing mutual information. It stems from the observation that lower mutual information reflects reduced redundancy between views. For example, highly similar images exhibit high mutual information, indicating redundancy if both are selected. Thus, we aim to design an algorithm that intelligently chooses images with minimal mutual information, enhancing the representation efficiency.

Our framework selects an informative subset of views by \textbf{minimizing} \textbf{mutual information} without knowing the ground truth images beforehand. This objective stems from the observation that lower mutual information reflects reduced redundancy between views. For example, highly similar images exhibit high mutual information, so selecting both is redundant. We aim to design an algorithm that chooses images based solely on the existing images and the candidate view positions.

% In our framework, selecting a informative subset of views is interpreted as their mutual information is small. It is because that the small information means the less relevant bettween them. For example, if two pictures are so similar or just the same, they gain high mutual information, but it may be costly to choose them for sparse view sampling.
% Therefore, we need to design an algorithm to pick suitable images to minimize the mutual information of them.

% First of all, we consider the task of choosing our training images. As we can get the ground truth of these images, we expect their mutual information becomes smaller to contain more information in the whole set. Therefore, we need to design an algorithm to pick suitable images to minimize the mutual information of them.

First, we consider a global optimization problem. Suppose the whole set of images is $\mathcal{R}$ and we need to choose a subset of images $\mathcal{R}_s$. Let $R_{i \neq j}$ denote all the images in $\mathcal{R}$ except the image $R_j$; our goal can then be formally described as minimizing the mutual information between $R_{i \neq j}$ and $R_j$. By Definition~\ref{def:multi}, it can be represented as the maximal mutual information between $R_i$ and $R_j$:
\begin{align*}
    \min_{\mathcal{R}_s \subset \mathcal{R}} \max_{R_j \in \mathcal{R}_s} I(R_{i\neq j}; R_j) = \min_{\mathcal{R}_s \subset \mathcal{R}} \max_{R_i, R_j \in \mathcal{R}_s} I(R_i, R_j)\,.
\end{align*}

% Suppose we have $N$ figures in the subset $\mathcal{R}_s$ and by Definition~\ref{def:mutual},
% we can also change the goal of minimizing mutual information to maximize relative information between images. So we rewrite this problem to the following problem: 

Given $N$ images in the subset $\mathcal{R}_s$, we can reformulate the goal from minimizing mutual information to maximizing relative information between images by Definition~\ref{def:mutual}. Thus, the problem becomes:
\begin{align*}
    \max_{\mathcal{R}_s \subset \mathcal{R}} \delta\,  
        \;\;\textrm{s.t.} \; H(R_i|R_j)\ge\delta, \; \forall i,j\in \{1,2,\ldots, N\},\,i\neq j \,.
\end{align*}
% Then we use the solution of this problem as the training images which contain more information.
Then we use the solution as the training images to construct an informative NeRF.
% for better scene understanding.

\subsection{Greedy Algorithm}
\label{greed_algorithm}
% Note that solving this problem is really hard because we do not know the ground truth of all the picture in the beginning and we need to consider and balance $O(N^2)$ constraints.
% Therefore, we need to design a near-optimal approximation algorithm which is trackable and reduce computational cost.
% Note that solving this problem is challenging as we lack ground truth information for all images initially, and we must contend with balancing $O(N^2)$ constraints. Therefore, we need to design a near-optimal approximation algorithm that is tractable and reduces computational costs.

Solving this problem is challenging: we lack ground truth images for the candidate viewpoints, and the problem involves balancing $O(N^2)$ constraints. Thus, we adopt a near-optimal approximation algorithm that is both tractable and computationally efficient. We use a look-ahead strategy and a greedy method to select views. Over $N$ iterations, we choose in each iteration an image that has minimal information overlap with the already selected images. In the $i$-th iteration, we solve the following problem:
\begin{align*}
        \max_{R_i \in \mathcal{R}} \delta_i
        \;\;\textrm{s.t.} \; H(R_i|R_j)\ge\delta_i,\forall 1\le j< i \,.
\end{align*}
Then the objective value attained by the $N$ images we choose is $\Tilde{\delta} = \min\{\delta_1, \delta_2, \ldots, \delta_N\}$.
Although this algorithm may not attain the global optimum of the primal problem, it is a 2-approximation by the following lemma:

\begin{restatable}{lemma}{greedy}
\label{lem:greedy}
Let $\delta$ be the optimal value of the primal problem and $\Tilde{\delta}$ the value achieved by the greedy algorithm. Then $\Tilde{\delta} \ge \frac{1}{2}\delta$.
\end{restatable}

This lemma ensures that our greedy algorithm provides a good approximation to the optimal solution. Additionally, our algorithm substantially reduces the computational cost of the problem, as each iteration involves only $O(N)$ constraints, as opposed to $O(N^2)$ in the primal problem. We employ this iterative strategy for image selection in our experiments.
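
As a toy illustration of the greedy rule and of the bound in Lemma~\ref{lem:greedy}, consider the following hypothetical instance. The scalar positions and the symmetric surrogate $H(R_i|R_j) \approx |x_i - x_j|$ are assumptions made purely for this example and are not part of the method. Five candidates lie at $x \in \{0, 4, 5, 6, 10\}$, we select $N = 3$ views, and the first view is fixed at $x = 4$:
\begin{align*}
    \text{iteration 2:}\quad & \max_{x} |x - 4| = 6 \;\text{ at } x = 10, & \delta_2 &= 6, \\
    \text{iteration 3:}\quad & \max_{x} \min\{|x - 4|, |x - 10|\} = 4 \;\text{ at } x = 0, & \delta_3 &= 4 \,.
\end{align*}
The greedy value is $\Tilde{\delta} = \min\{6, 4\} = 4$, whereas the best subset $\{0, 5, 10\}$ attains $\delta = 5$. Hence $\Tilde{\delta} = 4 \ge \frac{1}{2}\delta = 2.5$, consistent with the lemma.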
 
% This lemma make sure that our greedy algorithm is a good approximation of the optimal solution. Moreover, our algorithm significantly reduce the computational cost of the problem as we only have $O(N)$ constraints in each problem instead of $O(N^2)$. Then we will apply this strategy to choose images in experiments.  



\subsection{Experiments}
% Our greedy algorithm in Section~\ref{greed_algorithm} is similar to the setup in Active Learning settings~\cite{pan2022activenerf}, where the goal is to supplement the initial constrained training dataset iteratively with newly captured samples. In this setting, we start by training a NeRF model using the initial observations. Then we render candidate views and estimate them to select the most valuable ones. We proceed to further train the NeRF model using the newly acquired ground-truth images corresponding to these selected views.

% \textbf{Design} We adhere to the 'train-render-evaluate-pick' scheme, with a modification in the evaluation step. In this phase, our objective is to minimize mutual information, taking into account both semantic space distance and pixel space distance.

% Our greedy algorithm in Section~\ref{greed_algorithm} is similar to the 'train-render-evaluate-pick' scheme in Active Learning settings~\cite{pan2022activenerf}: 1).start by training a NeRF model using the initial observations. 2).render images from candidate views and estimate them to select the most valuable ones. 3).train the NeRF model using the newly acquired ground-truth images corresponding to these selected views and go back to 2) for specific times.
\textbf{Setup} 
Our greedy algorithm in Section~\ref{greed_algorithm} follows a `train-render-evaluate-pick' scheme similar to that in active learning~\citep{pan2022activenerf}: 1) train a NeRF model with the initial observations, 2) render images from the candidate views and evaluate them to select the most valuable ones, 3) train the NeRF model with the newly acquired ground-truth images corresponding to the selected views, then return to step 2. Compared to ActiveNeRF, we change the evaluation criterion in step 2 to minimizing mutual information, considering both the semantic space distance and the pixel space distance discussed in Section~\ref{framework_label}.

% \textbf{Design} We adhere to the 'train-render-evaluate-pick' scheme, with a modification in the evaluation step. In this phase, our objective is to minimize mutual information, taking into account both semantic space distance and pixel space distance.


\textbf{Design}
By Assumption~\ref{ass:inverse} and Lemma~\ref{lem:camera}, we seek a viewpoint that exhibits both low semantic similarity as measured by CLIP~\citep{radford2021learning} (large semantic space distance) and a considerable distance in camera position (large pixel space distance). If we consider only camera pose, furthest view sampling (FVS) is optimal. However, incorporating semantic constraints requires balancing these two criteria. We propose a sequential approach: first prioritize semantics to select a subset of the candidates, then evaluate this subset based on camera pose (S$\rightarrow$P), or vice versa (P$\rightarrow$S). This strategy navigates the tradeoff without a hard-to-tune balancing hyperparameter. The technical appendix provides further discussion.
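
One way to write the S$\rightarrow$P rule down is sketched below. The symbols $\mathcal{C}$ (candidate views), $\mathcal{T}$ (current training views), $d_{\mathrm{sem}}$, $d_{\mathrm{pix}}$, and the shortlist size $k$ are notation introduced here for illustration only, and aggregating over $\mathcal{T}$ with a minimum is an assumption borrowed from furthest view sampling:
\begin{align*}
    \mathcal{C}_k &= \{\text{the $k$ candidates } c \in \mathcal{C} \text{ with the largest } d_{\mathrm{sem}}(c)\} \,, \\
    c^{\star} &= \arg\max_{c \in \mathcal{C}_k} d_{\mathrm{pix}}(c) \,, \qquad d_{\mathrm{pix}}(c) = \min_{t \in \mathcal{T}} \|\mathbf{o}_c - \mathbf{o}_t\| \,,
\end{align*}
where $\mathbf{o}_c$ denotes the camera position of view $c$. P$\rightarrow$S swaps the roles of $d_{\mathrm{sem}}$ and $d_{\mathrm{pix}}$. No weight between the two distances appears in either ordering; the remaining choice is the shortlist size $k$.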


\begin{table*}[t]
\centering
\begin{tabular}{c|c c c|c c c}
\hline
 \multirow{2}{*}{\textbf{Sampling Strategies}} & \multicolumn{3}{c|}{\quad ~ \textit{Setting \uppercase\expandafter{\romannumeral1},~ 20  observations:} ~ \quad} &  \multicolumn{3}{c}{\quad  ~ \textit{Setting \uppercase\expandafter{\romannumeral2},~ 10  observations:} ~ \quad}\\
 
 & \textbf{PSNR} $\uparrow$ & \textbf{SSIM} $\uparrow$ & \textbf{LPIPS} $\downarrow$  & \textbf{PSNR} $\uparrow$ & \textbf{SSIM} $\uparrow$ & \textbf{LPIPS} $\downarrow$\\
\hline
\hline
NeRF + Rand    & 16.626 & 0.822  & 0.186 & 15.111 & 0.779 & 0.256 \\
NeRF + FVS(\textbf{Pixel})  &  17.832   & 0.819 & 0.186 & 15.723 & 0.787 & 0.227 \\
NeRF + \textbf{Semantic} & 17.334 & 0.833 & 0.171 & 15.472 & 0.795 & 0.219 \\
ActiveNeRF  \quad  & 18.732 & 0.826 & 0.181 & 16.353 & 0.792 & 0.226 \\
\hline
\hline
Ours~(S$\rightarrow$P) & 18.930 & \textbf{0.846} & \textbf{0.149} & 16.718 & \textbf{0.810} & \textbf{0.205}\\
Ours~(P$\rightarrow$S) & \textbf{20.093} & 0.841 & 0.162 & \textbf{17.314} & 0.801 & 0.209\\
\bottomrule
\end{tabular}

\caption{\textbf{Quantitative comparison in Active Learning settings on Blender.} \textbf{NeRF + Rand:} Randomly capture new views from the candidates. \textbf{NeRF + FVS(Pixel):} Capture new views using furthest view sampling to maximize pixel space distance. \textbf{NeRF + Semantic:} Capture new views using CLIP to maximize semantic space distance. \textbf{ActiveNeRF:} Capture new views using the ActiveNeRF scheme. \textbf{Ours~(S$\rightarrow$P):} First choose the 20 views with the highest semantic space distance, then capture views among them based on the highest pixel space distance (camera pose). \textbf{Ours~(P$\rightarrow$S):} First choose the 20 views with the highest pixel space distance, then capture views among them based on the highest semantic space distance. \textbf{Setting \uppercase\expandafter{\romannumeral1}:} 4 initial observations and 4 extra observations obtained at each of 40K, 80K, 120K, and 160K iterations. \textbf{Setting \uppercase\expandafter{\romannumeral2}:} 2 initial observations and 2 extra observations obtained at each of 40K, 80K, 120K, and 160K iterations. Training runs for 200K iterations in total. All results are produced using the ActiveNeRF codebase.}
\vspace{-3mm}
\label{tab:active}
\end{table*}



\begin{table*}[t]
\centering
\begin{tabular}{c|c c c}
\hline
 \multirow{2}{*}{\textbf{Sampling Strategies}} & \multicolumn{3}{c}{\quad ~ \textit{Setting \uppercase\expandafter{\romannumeral1},~ 20  observations:} ~ \quad}\\
 
 & \quad \quad \textbf{PSNR} $\uparrow$ \quad & \quad \textbf{SSIM} $\uparrow$ \quad & \quad \textbf{LPIPS} $\downarrow$ \quad \\
\hline
\hline

\quad \quad semantic distance + 0.1 $\times$ pixel distance \quad \quad & 18.781 & 0.833 & 0.153\\
semantic distance + pixel distance & 19.266 & 0.837 & 0.159\\
semantic distance + 10 $\times$ pixel distance & 18.345 & 0.821 & 0.187\\

\hline
\hline
Ours~(S$\rightarrow$P) & 18.930 & \textbf{0.846} & \textbf{0.149}\\
Ours~(P$\rightarrow$S) & \textbf{20.093} & 0.841 & 0.162\\
\bottomrule
\end{tabular}

\caption{\textbf{Ablations on balancing two metrics.} The first three rows combine the semantic and pixel space distances in a single score with a weighting hyperparameter, considering both factors simultaneously. Our sequential approaches attain the best value on every metric: \textbf{Ours~(P$\rightarrow$S)} on PSNR and \textbf{Ours~(S$\rightarrow$P)} on SSIM and LPIPS.}
\vspace{-3mm}
\label{tab:abliation_sparse_2}
\end{table*}


\begin{figure}[t!]
    \centering
    \includegraphics[width=0.99\linewidth]{Figures/ActiveLearning_2.pdf}
    \caption{\textbf{Qualitative comparison in Active Learning settings on Blender.} Given limited input views, our strategy selects better candidate views. Our rendered images avoid excessively blurry boundaries and exhibit greater clarity than those rendered by ActiveNeRF.}
    \label{fig:activenerf}
    \vspace{-4.5mm}
\end{figure}

% We extensively demonstrate our approach in two benchmarks, including Blender~\cite{mildenhall2020nerf} and LLFF datasets~\cite{mildenhall2019local}. The Blender
% dataset contains 8 synthetic objects with complicated geometry and realistic materials. Each scene has 100 views for training and 200 for
% the test, and all the images are at 800×800 resolution.  LLFF is a real-world dataset consisting of 8 complex scenes captured with a cellphone. Each scene contains 20-62 images
% with 1008×756 resolution, where 1/8 images are reserved for the test.
% We report the image quality metrics PSNR, SSIM, and LPIPS for evaluations.
% We extensively demonstrate our approach on the  Blender~\cite{mildenhall2020nerf} dataset. The Blender
% dataset contains 8 synthetic objects with complicated geometry and realistic materials. Each scene has 100 views for training and 200 for
% the test, and all the images are at 800×800 resolution.
% We report the image quality metrics PSNR, SSIM, and LPIPS for evaluations.
% SSIM measures the differences between the properties (luminance, contrast, and structure) of the pixels while PSNR just checks the absolute error between the pixels in a micro way. LPIPS quantifies perceptual similarity in a relatively macro way.
\textbf{Dataset and Metrics}
We evaluate our approach on the Blender~\citep{mildenhall2020nerf} dataset, a standard benchmark in NeRF research that contains 8 synthetic objects with complex geometry and realistic materials.
% Each scene has 100 views for training and 200 for testing, with all images at an 800×800 resolution. 
We report the image quality metrics PSNR, SSIM, and LPIPS. PSNR measures pixel-wise reconstruction error and thus compares images at a micro level. SSIM compares luminance, contrast, and structure, focusing on perceptual properties. LPIPS quantifies perceptual similarity using deep features, capturing more global visual differences at a macro level.
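
For reference, PSNR is conventionally computed from the mean squared error (MSE) between a rendered image and its ground truth. The formula below is the standard definition and is not specific to our setup; $\mathrm{MAX}$ denotes the largest possible pixel value, assumed here to be $1$ for images normalized to $[0,1]$:
\begin{align*}
    \mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}} \,.
\end{align*}
Under this convention, a PSNR of $20$\,dB corresponds to $\mathrm{MSE} = 10^{-2}$, and each additional $1$\,dB reduces the MSE by a factor of $10^{0.1} \approx 1.26$.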




\textbf{Results} We compare our sampling strategy with baseline approaches on the Blender dataset in Table~\ref{tab:active} and Figure~\ref{fig:activenerf}. Our strategy outperforms the baselines in view synthesis quality. By considering both the semantic space distance between visible and invisible views and a tendency towards uniform sampling, our method provides better sampling guidance under a limited input budget.
When prioritizing semantic space distance before pixel space distance (\textit{Ours (S$\rightarrow$P)}), we observe lower LPIPS scores (-17.6\%/-9.2\% relative to ActiveNeRF in Settings \uppercase\expandafter{\romannumeral1}/\uppercase\expandafter{\romannumeral2}) and higher SSIM scores (+2.4\%/+2.3\%), aligning more closely with human perception. Conversely, prioritizing pixel space distance first (\textit{Ours (P$\rightarrow$S)}) yields higher PSNR scores (+7.3\%/+5.9\%), reflecting differences in raw pixel values. In addition, as shown in Table~\ref{tab:abliation_sparse_2}, our sequential variants attain the best score on each metric when compared with weighting the two distances simultaneously.
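
The relative changes quoted above can be reproduced from Table~\ref{tab:active}, assuming they are measured against the ActiveNeRF row (an interpretation that matches the reported figures). For Setting \uppercase\expandafter{\romannumeral1}:
\begin{align*}
    \text{PSNR (P$\rightarrow$S):}\quad & \frac{20.093 - 18.732}{18.732} \approx +7.3\% \,, \\
    \text{SSIM (S$\rightarrow$P):}\quad & \frac{0.846 - 0.826}{0.826} \approx +2.4\% \,, \\
    \text{LPIPS (S$\rightarrow$P):}\quad & \frac{0.149 - 0.181}{0.181} \approx -17.7\% \,.
\end{align*}
The last value agrees with the quoted $-17.6\%$ up to rounding of the table entries.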

\textbf{Ablation} We conduct ablation studies using only semantic space distance or only pixel space distance. As shown by \textit{NeRF + FVS(Pixel)} and \textit{NeRF + Semantic} in Table~\ref{tab:active}, considering either factor alone improves PSNR in both settings and generally improves SSIM and LPIPS over random sampling. However, combining both metrics yields even better results, as seen in \textit{Ours (S$\rightarrow$P)} and \textit{Ours (P$\rightarrow$S)}.

% We show the performance of our sampling strategy on Blender over baseline approaches in Table~\ref{tab:active} and Figure~\ref{fig:activenerf}. Our strategy outperforms baselines on the quality of view synthesis.
% It can be easily seen that our method, which takes into account the semantic space distance between visible and invisible views along with a tendency towards uniform sampling,  provides better guidance for sampling under a limited input budget.
% When we prioritize the semantic space distance before the pixel space distance (\textit{Ours (S$\rightarrow$P)}), In two settings, we observe lower LPIPS (-17.6\%/-9.2\%) and higher SSIM scores (+2.4\%/2.3\%), which align more closely with human perception in a macro perspective. Conversely, when first prioritizing the pixel space distance (\textit{Ours (P$\rightarrow$S)}), we obtain higher PSNR scores (+7.3\%/+5.9\%), which are based on raw pixel value differences in a micro perspective.
 % More experiment details are provided in the Appendix. 

% \subsubsection{ActiveNeRF.}
% We validate the performance of our proposed framework, ActiveNeRF, and compare it with two heuristic approaches. As an approximation, we hold out a large fraction of images in the training set and use these images as candidate samples. For baselines, we denote \textit{NeRF+Random} as randomly capturing new images in the candidates. \textit{NeRF+FVS (furthest view sampling)} corresponds to finding the candidates with the most distanced camera position compared with the current training set. We empirically adjust the number of the initial training set and captured samples during the training procedure.

% We first show the results with continuous learning scheme, where the time and computation resources are considered sufficient. The comparison results are shown in Table \ref{tab2} and Figure \ref{main2}. We can easily see that ActiveNeRF captures the most informative inputs comparing with heuristic approaches, which contributes most to synthesizing views from less observed regions. The additional training cost for ActiveNeRF is also comparably minor (2.2h vs. 2h).

% We further validate the model performances with Bayesian estimation. As shown in Table \ref{tab2}, $75\%$ of the time consumption can be saved. Although showing inferior performance to continuous learning, the model with Bayesian estimation still synthesize reasonable images and is even competitive with heuristic approaches under continuous learning scheme.



\section{Few-shot View Synthesis}
% In this section we consider another task: optimizing the parameter of neural network by limited and fixed training images. A natural idea in this task is generating some images using our algorithm. Although these images do not have the ground truth, we can utilize the information of them. 
% In this setting, the objective is to maximize, rather than minimize, the mutual information between the training and generating images. 
% This strategy is based on the assumption that the ideal distribution for both known (training set) and unknown (test set) examples is identical, thereby fostering a more unified and coherent model performance. 

% Then we use our framework to design the regularization function to optimize neural network and improve the generalization ability to new views. 

% {\color{blue}Add the ablation study of the second regularize term}
In this section, we address the challenge of few-shot view synthesis, which is more prevalent in NeRF research: optimizing the NeRF model
% for enhanced generalization 
with limited and fixed training images. The key is to extract valuable information from the training set while maintaining generalization capabilities.

A natural approach is to render images from random viewpoints, for which no ground truth exists, and to leverage the information extracted from them.
Based on this, our objective is to \textbf{maximize} \textbf{mutual information} between visible training images and invisible inferred images, expecting that inferred images without ground truth can gain more relevant information from the known images.

To tackle this, we introduce two regularization terms  to train a generalizable NeRF model.

\subsection{The Design of Regularization Term} 
\label{design_regulatization}
% By the framework, maximizing the mutual information between images is equivalent to minimizing the semantic space distance and pixel space distance. For the former, we can also use clip method to analysis the difference between generating images and training images. However, the camera position can not be used to design the analysis since it is not depend on the parameter in neural network. Therefore, we need to choose a new metric which is depend on both the camera position and network parameters to add as a regularized term. We find that the pixel-wise difference between capture and uncapture view is closely related to the camera position so we require
% a relatively low variance in the pixel-wise mean value between visible and invisible images. The relationship between them is introduced in the following lemma:

In our framework, maximizing the mutual information between images involves minimizing both semantic space distance and pixel space distance. For the former, we can use CLIP~\citep{radford2021learning} as a macro regularization. However, camera position cannot be used to analyze pixel space distance as in Section~\ref{Section_view_sampling} because it is independent of NeRF parameters and cannot be optimized. Thus, we need a new metric that depends on both camera position and network parameters to serve as the micro regularization.

To fully utilize simple pixel-wise information, we establish a close relationship between RGB color differences and camera position distance, detailed in the following lemma:

% We observe a close relationship between the pixel-wise difference between captured and uncaptured views and the camera position. 
% Thus, we impose a requirement for a relatively low variance in the pixel-wise mean value between visible and invisible images. 
% The relationship between them is detailed in the following lemma:
\begin{restatable}{lemma}{rgb}
\label{lem:rgb}
Assume we have two rays $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ and $\mathbf{\overline{r}}(t) = \mathbf{\overline{o}} + t\mathbf{\overline{d}}$. Assume the functions $\sigma(\mathbf{r}(t))$ and $c(\mathbf{r}(t), \mathbf{d})$ learned by the MLP are $L$-Lipschitz in $\mathbf{r}(t)$ and $\mathbf{d}$ (the MLP typically uses ReLU activations, which are Lipschitz). Then the distance between the RGB colors of the two rays can be upper bounded in terms of the Euclidean distance between the two camera positions, $\|\mathbf{o}-\mathbf{\overline{o}}\|$:
    \begin{align*}
       \|\hat{C}(\mathbf{r})- \hat{C}(\overline{\mathbf{r}})\| \le  3L\|\mathbf{o}-\overline{\mathbf{o}}\| + C \,,
    \end{align*}
    where $C$ is a constant independent of the distance $\|\mathbf{o} - \mathbf{\overline{o}}\|$.

    % \vspace{-7mm}
\end{restatable}

From Lemma~\ref{lem:rgb}, the difference in RGB color serves, up to the constants in the bound, as a lower bound for the difference in camera position. According to Lemma~\ref{lem:camera}, it also acts as a lower bound for pixel space distance. Therefore, to reduce pixel space distance, we aim to minimize the color difference (measured, for example, by color variance or KL divergence) between training ground truth images and randomly rendered images.
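
Concretely, rearranging the inequality of Lemma~\ref{lem:rgb} (assuming $L > 0$) gives
\begin{align*}
    \|\mathbf{o}-\overline{\mathbf{o}}\| \ge \frac{\|\hat{C}(\mathbf{r})- \hat{C}(\overline{\mathbf{r}})\| - C}{3L} \,,
\end{align*}
so the color difference, shifted by $C$ and scaled by $1/(3L)$, bounds the camera distance from below.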

We then define two plug-and-play regularization terms that are added to the loss function:
\begin{align*}
    L_{\text{macro}}(R, \overline{R}) &= s(R, \overline{R}) = 1 - \frac{f(R)f(\overline{R})}{\|f(R)\|\|f(\overline{R})\|} \,, \\
    L_{\text{micro}}(R, \overline{R}) &=  \sum_{\mathbf{r}\in R, \overline{\mathbf{r}} \in \overline{R}} \|\hat{C}(\mathbf{r})- \hat{C}(\overline{\mathbf{r}})\|.
\end{align*}
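
To make the role of the two terms explicit, one possible form of the full training objective is sketched below. The photometric loss $L_{\text{rgb}}$ and the weights $\lambda_{\text{macro}}, \lambda_{\text{micro}} \ge 0$ are notation assumed here for illustration; their values are not specified in this section.
\begin{align*}
    L_{\text{total}} = L_{\text{rgb}} + \lambda_{\text{macro}}\, L_{\text{macro}}(R, \overline{R}) + \lambda_{\text{micro}}\, L_{\text{micro}}(R, \overline{R}) \,.
\end{align*}
Setting either weight to zero recovers training with the other regularizer alone.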

% We will incorporate these two plug-and-play regularization terms into the loss function to maximize the mutual information between the training and generating images.


\subsection{Experiments}



\textbf{Setup}
% To demonstrate the effectiveness of our method, we evaluate our method on three datasets under few-shot settings: comparison with classical NeRF in the classical Blender datasets and comparison with baselines like FreeNeRFFreeNeRF~\cite{yang2023freenerf} which has demonstrated state-of-the-art performance in the few-shot setting, in the popular DTU dataset and LLFF dataset.
% We report PSNR, SSIM, and LPIPS scores as quantitative results.
To demonstrate the effectiveness of our method, we evaluate it on three datasets under few-shot settings: Blender~\citep{mildenhall2020nerf}, DTU~\citep{jensen2014large}, and LLFF~\citep{mildenhall2019local}. We compare our method against classical NeRF and state-of-the-art baselines such as FreeNeRF~\citep{yang2023freenerf}.
% We report PSNR, SSIM, and LPIPS scores as quantitative results. 

% To demonstrate the effectiveness of our method, we evaluate our method on two datasets under few-shot settings: the Blender dataset~\cite{mildenhall2020nerf} and the LLFF dataset~\cite{mildenhall2019local}. 
% LLFF is a real-world dataset consisting of 8
% complex scenes captured by a phone.
% FreeNeRF~\cite{yang2023freenerf} has previously demonstrated state-of-the-art performance across various datasets in the few-shot setting.

% Adhering to the FreeNeRF setup, we ensure a fair comparison with baseline methods. We report PSNR, SSIM, and LPIPS scores as quantitative results.

% For Blender, we follow DietNeRF \cite{jain2021putting} to train on 8 views and test on 25 test images. For LLFF, we use the exact same protocol as RegNeRF and FreeNeRF. We report PSNR, SSIM, and LPIPS scores as quantitative results.

\textbf{Design}
We add our regularization terms $L_{\text{macro}}$ and $L_{\text{micro}}$ to the training loss to maximize mutual information. Specifically, we compute $L_{\text{micro}}$ as the variance of the mean color value between training images and randomly rendered images, which constrains the color difference between them.

% \textbf{Ablation on the Blender dataset.}
% Table 
% Due to the consistency term requiring consistency between visible and invisible views, we demand semantic consistency between the images from the training set and the images obtained by NeRF parameters.
% Meanwhile, the uniformity term requires sampling to be diverse, meaning that the images from the training set and NeRF parameters should have enough variance. Thus, we require a sufficiently large variance in the pixel-wise mean value between visible and invisible images.
% In conclusion, the consistency term expects a NeRF to ensure semantic consistency between visible and invisible views, while the uniformity term aims to maintain overall semantic consistency while maximizing diversity.


% \textbf{Dataset} The Blender dataset has 8 synthetic scenes in total. We follow the data split used in DietNeRF~\cite{jain2021putting} to simulate a few-shot neural rendering scenario. For each scene, the training images with IDs (counting from ``0'') 26, 86, 2, 55, 75, 93, 16, 73, and 8 are used as the input views, and 25 images are sampled evenly from the testing images for evaluation. LLFF is a dataset containing a total of 8 scenes. Consistent with RegNeRF and FreeNeRF, we used every 8th image as a new view for evaluation and sampled 3 input views evenly across the rest of the views.




% \textbf{Implementations}
% For Blender dataset, we use the DietNeRF's codebase provided by FreeNeRF. A plain NeRF that consists of two MLPs (a coarse and a fine MLP) is used as the baseline. Using this code base, we can also reproduce the results provided by FreeNeRF. We also use the high-level consistency loss as the uniformity term and use low-level color value to compute the uniformity term.
% For the LLFF dataset, a plain mipNeRF is used as the baseline. In the above experiments, for the sampling of invisible views, we sampled three views uniformly in space.









% \begin{table*}[ht!]
% \begin{center}
% \begin{tabular}{c|c|c|c|c}
% \toprule
% \textbf{Method} &\textbf{Setting}&\textbf{PSNR $\uparrow$}&\textbf{SSIM $\uparrow$}&\textbf{LPIPS $\downarrow$}\\
% \hline
% \hline
% SRF & 
% & 12.34    
% &0.250 
% &0.591 	\\

% PixelNeRF &\small Trained on DTU
% & 7.93      
% &0.272  
% &0.682 \\
% MVSNeRF&
% & 17.25       
% &0.557  
% &0.356\\
% \hline
% \hline
% SRF ft &\small Trained on DTU
% & 17.07 
% &0.436 
% &0.529 \\

% PixelNeRF ft & \small and
% & 16.17    
% &0.438  
% &0.512\\
% MVSNeRF ft&  \small Optimized per Scene
% & 17.88       
% &0.584   
% &0.327 \\
% \hline
% \hline
% Mip-NeRF &
% & 14.62   
% &0.351  
% &0.495 \\

% DietNeRF & \small Optimized per Scene
% & 14.94     
% &0.370   
% &0.496 \\
% RegNeRF &
% & 19.08      
% &0.668   
% &0.283 \\
% \hline
% \hline
% mip-NeRF concat.  &
% & 16.11     
% &0.401  
% &0.460\\

% RegNeRF concat. &   \small Optimized per Scene
% & 18.84    
% &0.573 
% &0.345 \\

% FreeNeRF &
% & 19.63     
% & 0.612 
% & \textbf{0.308}	\\
% \hline
% \hline
% \textbf{FreeNeRF + Ours} &    \small Optimized per Scene
% & \textbf{20.02(+0.39)}    
% & \textbf{0.616(+0.004)} 
% &0.318(+0.01)	\\
% \bottomrule
% \end{tabular}
% \end{center}
% \caption{Quantitative comparison on LLFF.} 
% % \vspace{-5mm}
% \label{tab:llff}
% \end{table*}


% \begin{table*}[ht!]
% \begin{center}
% \begin{tabular}{c|c|c|c}
% \toprule
% \textbf{Method} &\textbf{PSNR $\uparrow$}&\textbf{SSIM $\uparrow$}&\textbf{LPIPS $\downarrow$}\\
% \hline
% \hline
% % Mip-NeRF 
% % & 14.62   
% % &0.351  
% % &0.495 \\

% DietNeRF~\cite{jain2021putting}
% & 14.94     
% &0.370   
% &0.496 \\
% % RegNeRF 
% % & 19.08      
% % &0.668   
% % &0.283 \\
% % \hline
% % \hline
% Mip-NeRF~\cite{barron2021mip} 
% & 16.11     
% &0.401  
% &0.460\\

% RegNeRF~\cite{niemeyer2022regnerf}
% & 18.84    
% &0.573 
% &0.345 \\
% \hline
% \hline
% FreeNeRF~\cite{yang2023freenerf}
% & 19.63     
% & 0.612 
% & 0.308	\\
% \textbf{FreeNeRF + Ours}
% & \textbf{\quad20.01~(+0.38)\quad}    
% & \textbf{\quad0.618~(+0.006)\quad} 
% &\textbf{\quad0.306~(-0.002)\quad}	\\
% \bottomrule
% \end{tabular}
% \end{center}
% \vspace{-3mm}
% \caption{\textbf{Quantitative comparison on LLFF.} There are 3 input views  for training, consistent with FreeNeRF.} 
% % \vspace{-5mm}
% \label{tab:llff}
% \end{table*}

\input{text/tab_freenerf}

\begin{figure*}[t!]
    \centering
\includegraphics[width=0.9\linewidth]{Figures/llff_new.pdf}
    \caption{\textbf{Qualitative comparison on LLFF.}
    % Given 3 input views, we show novel views rendered by FreeNeRF and ours on the LLFF dataset. Compared with FreeNeRF, our method can provide better geometry for the observed objects. For the "horns" and "trex" examples, FreeNeRF fails to render sharp outlines in some places, but our additional losses can gain a more detailed skeleton structure.
    Given 3 input views, we show novel views rendered by FreeNeRF and by our method. FreeNeRF fails to render sharp outlines in some places, whereas our additional losses recover a more detailed skeleton structure and better geometry for the observed objects.
    }
    % \vspace{-1mm}
    \label{llffpic}
    \vspace{-5mm}
\end{figure*}


% // uniformity term consistency term. 

% \textbf{Results on Blender dataset.} Table~\ref{tab:blender} shows the result on the Blender dataset. For all methods, we can directly introduce $L_{\text{macro}}$ and $L_{\text{micro}}$ to verify the usefulness of our proposed framework. For DietNeRF, the consistency loss actually belongs to the $L_{\text{macro}}$, so DietNeRF is a degradation of our framework. We add $L_{\text{micro}}$ to its original design, and the results show that we can form good constraints on the distribution of unobserved samples to make it as consistent as possible.
% For FreeNeRF, since frequency loss does not belong to $L_{\text{macro}}$ or $L_{\text{micro}}$, we additionally add these two terms as constraints. The results show that our framework can also help FreeNeRF to improve the results significantly. 

\textbf{Comparison with baseline methods.} 
Table~\ref{tab:blender} and Figure~\ref{llffpic} present the quantitative and qualitative results on the DTU and LLFF datasets under a 3-view setting. Table~\ref{tab:ablation} additionally reports the improvements on the Blender dataset under an 8-view setting. Our method builds on the RegNeRF/FreeNeRF framework and adds two regularization terms, $L_{\text{macro}}$ and $L_{\text{micro}}$. These constraints enhance the consistency of unobservable views from both the semantic and the color perspective, and the resulting improvements support the effectiveness of our approach.
Although we focus on $L_{\text{macro}}$ and $L_{\text{micro}}$, our framework is not limited to these two terms: it also accommodates the design and integration of other regularization terms. Detailed explanations are provided in the appendix.

% \textbf{Results on LLFF dataset.} Table~\ref{tab:llff} and Figure~\ref{llffpic} show the quantitative and qualitative results of the LLFF dataset. With the help of $L_{\text{macro}}$ and $L_{\text{micro}}$, we can further improve FreeNeRF by a certain degree. It is noted that all numbers are directly borrowed from FreeNeRF. 
% Our method introduces constraints on the semantics of both observable and unobservable views, as well as the color variance of the unobservable distribution. These additional constraints enhance the consistency of unobservable views from two different perspectives. The improvements in results substantiate the effectiveness of our approach.
% With the help of our framework, we can better design and use various types of regularization terms to improve the results. 
% At the same time, it also inspires us to what perspective we should consider to design new regularization terms.

\textbf{Ablations.} In Table~\ref{tab:ablation}, we apply the two regularization terms separately to assess the contribution of each. For a clearer comparison, we use classical NeRF as the baseline, because many methods, such as DietNeRF with its semantic consistency loss or FreeNeRF with its frequency regularization, already include regularization terms that may overlap with ours to some extent.
If we normalize the improvements in PSNR, SSIM, and LPIPS obtained with both regularization terms to 1, the improvements with only $L_{\text{micro}}$ are 0.61, 0.65, and 0.78, respectively, and the improvements with only $L_{\text{macro}}$ are 0.89, 0.79, and 0.90. $L_{\text{macro}}$ therefore has the larger impact on its own, but using both terms together yields the best results.
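One way to read this normalization, assuming it is computed per metric relative to the classical NeRF baseline (the symbols $r_{T}$, $m_{\text{base}}$, $m_{T}$, and $m_{\text{both}}$ are introduced here for illustration only), is
\[
r_{T}(m) = \frac{m_{T} - m_{\text{base}}}{m_{\text{both}} - m_{\text{base}}}, \qquad T \in \{\text{micro}, \text{macro}\},
\]
where $m$ denotes PSNR, SSIM, or LPIPS, $m_{\text{base}}$ is the baseline score, $m_{T}$ is the score with only $L_{T}$, and $m_{\text{both}}$ is the score with both terms. For LPIPS, where lower is better, the numerator and the denominator are both negative whenever a term helps, so the ratio remains positive. Under this reading, the single-term ratios sum to more than 1 for every metric (for PSNR, $0.61 + 0.89 = 1.50$), which suggests that the contributions of the two terms partly overlap rather than simply add up.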

% For instance, in the "flower" example in Figure~\ref{llffpic}, our methods can better distinguish the outline of the flowers from the background. Consistency term allows NeRF to better maintain the consistency of the geometric outline, and  uniformity term enables NeRF to render various flowers from different perspectives. 



% \begin{figure}[t]
%     \centering
%     \includegraphics[width=0.8\linewidth]{Figures/unseen.png}
%     \caption{\textbf{Reconstructing unobserved regions.}Renderings of occluded regions during training. 14 images of the right half of the Realistic Synthetic Lego scene are used to
% estimate radiance fields. }
%     \label{unseen}
% \end{figure}

% \subsubsection{Ablation}
% % In this section, we ablate our design choices on the Blender dataset and demonstrate our method can get better results when reconstructing unobserved regions.
% \paragraph{Impact of Homogeneity and uniformity term.}
% % dietnerf其实是我们方法的退化，只有第三项的时候也有提升（跟nerf比）。

% In the experiments on the Blender dataset, without uniformity term, our method degenerates into DietNeRF, which causes a decrease of 0.36 PSNR. With uniformity term only, our method has only 2.693 improvements (17.627) compared to NeRF. This is because the uniformity term itself loses its meaning when the unobserved distribution is significantly different from the observed distribution. So uniformity term is actually based on uniformity term. When we add both to FreeNeRF, we can get a significant improvement of 0.637 PSNR, which shows that our framework is very effective.

% \textbf{Different features for uniformity term and uniformity term.}
% % 不用大模型效果如何？ 实验在跑
% % High-level features for uniformity term and Low-level features for uniformity term
% We find that using high-level features, such as CLIP, Pretrained ViT, etc., to compute uniformity term works best. When we use low-level features such as the mean of pixel colors as a distribution consistency constraint, our method decreases the PSNR by 0.428 (24.468). This is because the consistency of the high-level feature guarantees consistency at the semantic level, which is the basic assumption of the new shot novel view synthesis.
% On the contrary, when calculating the uniformity term, we find that the low-level feature such as the average of pixel colors is the most effective. Because uniformity term is based on uniformity term, the results of novel view synthesis are required to be as diverse as possible, and this diversity is often reflected in the low-level picture details. When we use high-level features to calculate uniformity term, this is actually contrary to the objective of uniformity term. 
% In conclusion, the consistency of high-level features in uniformity term limits the generation space of novel view synthesis, and the diversity of low-level features in uniformity term asks them to be dispersed in the subspace as much as possible.

% \textbf{Reconstructing unobserved regions.}
% We evaluate whether our method produces plausible completions when the reconstruction problem is under-determined. For training, we sample 14 nearby views of the right side of the Realistic Synthetic Lego scene.
% Narrow baseline multi-view capture rigs are less costly than
% 360◦ captures and supports unbounded scenes. However,
% narrow-baseline observations suffer from occlusions: the
% left side of the Lego bulldozer is unobserved. 
% NeRF fails to reconstruct this side of the scene, while Simplified NeRF
% learns unrealistic deformations and incorrect colors. Remarkably, With uniformity term, DietNeRF learns accurate colors and shapes in the missing regions. With Homogeneity and uniformity term, Our method can learn more quantitatively and qualitatively realistic and accurate shape and appearance, suggesting that our framework is a more complete and effective approach.


