\section{Introduction}
Neural Radiance Fields (NeRF)~\citep{mildenhall2020nerf} is a technique in computer graphics and computer vision that enables highly detailed, photorealistic 3D reconstruction of scenes from 2D images~\citep{zhang2020nerf++, park2021nerfies, pumarola2021d}. It represents a scene as a continuous volume in which each 3D location is associated with a color and an opacity. The key idea behind NeRF is to learn a deep neural network that implicitly represents this volumetric function, allowing the synthesis of novel views of the scene from arbitrary viewpoints.


% Although NeRF can synthesize high-quality and realistic images, it often relies on a large amount of high-quality training data \cite{yu2021pixelnerf}. The performance of NeRF drastically decreases when the number of training data is reduced. 
% The existing methods to improve quality include adding new samples and designing regularization terms to introduce priors. For adding new samples, ActiveNeRF\cite{pan2022activenerf} incorporates uncertainty estimation into the NeRF
% model and it proposes a method based on an active learning scheme to select samples that provide the maximum information gain. This approach allows for improving the quality of novel view synthesis with minimal additional resource investment. However, it evaluates information gained by the variance change of prior and posterior distributions, which is merely an intuitive method and may result in unreliable results. Moreover, it takes a long time to select a new view, which significantly increases training time;
% For regularization terms, a series of works \cite{niemeyer2022regnerf,yang2023freenerf,yu2021pixelnerf,jain2021putting,zhou2022sparsefusion,deng2022nerdi}, have studied how to utilize prior knowledge to achieve high-quality novel view synthesis with limited training samples, such as leveraging additional sources of prior information, designing more efficient training algorithms, and incorporating domain knowledge to enhance generalization.

Although NeRF can synthesize high-quality images, it often relies on a large amount of high-quality training data~\citep{yu2021pixelnerf}, and its performance degrades drastically when the number of training views is reduced. To mitigate this, existing strategies either add new samples to the dataset or integrate regularization terms that introduce prior knowledge.
% For adding new samples, ActiveNeRF \cite{pan2022activenerf} innovatively merges uncertainty estimation within the NeRF framework.
To add new samples, ActiveNeRF~\citep{pan2022activenerf} supplements the existing training set with newly captured samples through an active learning scheme. It incorporates uncertainty estimation into the NeRF model and selects the samples that bring the most information gain.
% It introduces an active learning-based methodology to selectively acquire samples that maximize informational yield.
% This technique enhances the quality of synthesized novel views with minimal additional resources.
However, its reliance on the variance shift between prior and posterior distributions as a metric for information gain is somewhat speculative and can lead to unreliable outcomes.
% Additionally, the process of selecting new views is time-intensive, which considerably prolongs the training duration.
Regarding regularization, many studies~\citep{niemeyer2022regnerf,yang2023freenerf,yu2021pixelnerf,jain2021putting} have explored integrating prior or domain-specific knowledge to enable high-quality novel view synthesis and improve generalization, even with limited training data.
% These approaches range from utilizing external sources of prior information, and devising more efficient training algorithms, to incorporating domain-specific knowledge, thereby enhancing the model's generalization capabilities.
However, many of these methods lack theoretical support, hindering their explanation and optimization within a unified framework.

\begin{figure}[t]
    \centering
    \includegraphics[width=0.95\linewidth]{Figures/teaser.pdf}
    % \vspace{-1em}
    \vspace{-1mm}
    \caption{\textbf{The overview of MutualNeRF.} We introduce a novel and generic NeRF framework that integrates mutual information from macro (semantic space) and micro (pixel space) perspectives. This dual-perspective framework addresses challenges in sparse view sampling and few-shot view synthesis.}
    \label{fig:teaser}
    \vspace{-1em}
    \vspace{-3mm}
\end{figure}


% Our key idea is to model the multivariate mutual information between known examples and unknown examples to guide us in introducing new samples and designing regularization terms. Inspired by TupleInfoNCE\cite{liu2021contrastive} which models the mutual information among different modalities to facilitate multi-modality fusion. We can treat the known samples and unknown samples as different modalities and model the multivariate mutual information between them, which can guide us to improve the synthesis qualities by introducing new samples as little as possible and adding new regularization terms.
% In this work, we aim to tackle these challenges with a key idea: the modeling of multivariate mutual information to bridge the gap between known and unknown examples.
% This concept is inspired by TupleInfoNCE \cite{liu2021contrastive}, which effectively models mutual information across different modalities to enhance multi-modal fusion. In our method, known and unknown samples are analogous to different modalities. Modeling the intricate multivariate mutual information between them can guide us to improve the synthesis qualities by introducing new samples as little as possible and adding new regularization terms.

% Indeed, intuitively, for the efficient and highly generalizable reconstruction of scenes, our approach should aim to minimize the correlation between images in the training set while maximizing the correlation between the inferred images and the ground truth. From this perspective, it is crucial to develop a theoretically sound, computationally efficient, unified, and interpretable concept and metric to measure the correlation between images. 
% Indeed, for efficient and highly generalizable scene reconstruction, our strategy should focus on minimizing the correlation between training images during view sampling while maximizing the correlation between the inferred images and the ground truth during view synthesis. This approach necessitates a theoretically robust, computationally efficient, and interpretable metric to gauge image correlation.

To address the challenges of few-shot scenarios, we introduce a theoretically grounded and computationally efficient strategy for two pivotal tasks: sparse view sampling and few-shot view synthesis. Sparse view sampling selects training views from a set of candidate views whose ground truth images are unknown. Intuitively, our strategy minimizes the correlation between training images so that each image contributes more unique information. In few-shot view synthesis, a NeRF model is trained on a predetermined training set, and our strategy aims to maximize the correlation between inferred images and the ground truth during view synthesis.

In this work, we introduce the concept of Mutual Information as an interpretable metric to model correlation. This concept is inspired by TupleInfoNCE~\citep{liu2021contrastive}, which effectively models mutual information across different modalities to enhance multi-modal fusion. Mutual information quantifies how much observing one variable reduces the uncertainty about another, which is especially pertinent in the NeRF context. On the one hand, it can guide us in selecting inputs that encapsulate maximal information with fewer images. On the other hand, it assesses the uncertainty of synthesizing an unknown view given the known views.
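
For reference, the standard information-theoretic definition can be written as follows; the random variables $X$ and $Y$ are generic placeholders introduced here for illustration and are not notation taken from the rest of the paper:
\[
I(X;Y) \;=\; H(X) - H(X \mid Y) \;=\; \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)},
\]
where $H(\cdot)$ denotes Shannon entropy. Reading $X$ as an unknown view and $Y$ as a known view, a larger $I(X;Y)$ means that the known view leaves less residual uncertainty $H(X \mid Y)$ about the unknown one.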
% contributing to the evaluation of image effectiveness in scene representation.


% Previous NeRF generally achieved high-quality synthesis by extensively sampling training samples to cover various details of scenes. Additionally, a series of works \cite{niemeyer2022regnerf,yang2023freenerf,yu2021pixelnerf,jain2021putting,zhou2022sparsefusion,deng2022nerdi}, have studied how to utilize prior knowledge to achieve high-quality novel view synthesis with limited training samples, such as leveraging additional sources of prior information, designing more efficient training algorithms, and incorporating domain knowledge to enhance generalization. However, these methods have not explored the essence of this problem, which is how to get the optimal training set under limited view inputs. Exploring this problem can not only enable efficient sampling, avoiding data redundancy but also guide us in designing prior regularization terms for novel view synthesis with limited training samples.

% {\color{blue}Describe our method: imply the mutual information in our framework. Define mutual information at the macro level(semantic) and micro level(position, RGB). (Emphasis that our method does not need the GT unknown view.)}

% We introduce an algorithm framework for Neural Radiance Fields (NeRF) with sparse samples. The methodology integrates information theory concepts, focusing on mutual information to assess image effectiveness in scene representation. This involves quantifying the dependency between known and unknown views, to minimize mutual information to reduce uncertainty in predictions. The approach is two-pronged: the macro perspective uses semantic image features, particularly employing the CLIP~\cite{radford2021learning} method for similarity measurement, while the micro perspective deals with the decomposition of relative information between images based on ray differences. Additionally, the framework considers multiple training set images for predicting unknown scenes, introducing Multivariate Information. This comprehensive strategy aims to optimally select sparse samples, ensuring effective scene depiction and generalization to new views. With this framework, we can design strategies to add new samples to the dataset and integrate new regularization terms.

% We present a novel algorithmic framework designed to tackle board challenges associated with sparse view sampling and few-shot view synthesis. Mutual information serves as a metric for quantifying the uncertainty between variables, especially pertinent in the NeRF context, where it assesses the uncertainty of unknown view synthesis given known views, contributing to the evaluation of image effectiveness in scene representation.


% {\color{blue} Add the discussion of activenerf, change the intro}
We approach mutual information from both macro and micro perspectives. The macro perspective focuses on correlation in semantic features and uses CLIP~\citep{radford2021learning} to measure distance in semantic space, while the micro perspective, in pixel space, decomposes the relative information between images based on ray differences. To ensure feasibility and computational efficiency, we relate pixel space distance to the Euclidean distance between camera positions and to RGB color differences. Furthermore, we take multiple training images into account when predicting unknown scenes, introducing mutual information for multiple images.
% This comprehensive strategy aims to optimally handle sparse samples, ensuring effective scene depiction and generalization to new views. 

Leveraging mutual information as the metric, our algorithmic framework tackles broad challenges in sparse view sampling, by introducing new samples that share less mutual information with the existing ones, and in few-shot view synthesis, by adding regularization terms that increase the mutual information between the inferred images and the ground truth.

% We approach mutual information from a two-fold perspective:
% the macro perspective in semantic space utilizes semantic image features, particularly employing the CLIP~\cite{radford2021learning} method for similarity measurement, while the micro perspective in pixel space deals with the decomposition of relative information between images based on ray differences.
% Furthermore, the framework takes into account multiple training set images for unknown scenes, introducing Multivariate Information. This comprehensive strategy aims to optimally handle sparse samples, ensuring effective scene depiction and generalization to new views. With this framework, we can devise strategies for sparse view sampling and integrate new regularization terms for novel view synthesis.

% {\color{blue}Describe our method for sampling + experiments.}

% For sparse view sampling, we focus on devising a strategy for the selection of a limited number of training images from the entire set, aiming to minimize the mutual information between selected known images from the total set for better information gain. A greedy algorithm is proposed as a near-optimal solution, utilizing a looking-ahead strategy to iteratively select images that add the most novel information compared to already selected ones. 

% For sparse view sampling, we focus on devising a strategy for the selection of a limited number of training images from the entire set. The objective is to minimize the mutual information between the chosen images for better information gain. 
% We propose a greedy algorithm as a near-optimal solution, integrating a look-ahead strategy to iteratively select images based on their contribution to novel information. This selection is determined by both the semantic space difference derived from CLIP and the pixel space difference calculated by the simple Euclidean distance between camera positions. 

In sparse view sampling, the task is to select a subset of images from a candidate view set with unknown ground truth to supplement the training process. Ground truth is revealed only after selection, following the active learning framework.
Our strategy focuses on minimizing redundancy in the selected views to maximize information gain.
% For sparse view sampling, our strategy centers on selecting a limited subset of training images from the candidate view set without knowing the ground truth, aiming to minimize their mutual information to maximize information gain.
We introduce a computationally efficient greedy algorithm with a look-ahead strategy as a near-optimal solution. The algorithm iteratively selects the images that contribute the most unexplored information. The selection criterion combines semantic space distance, derived from CLIP, with pixel space distance, computed as the Euclidean distance between camera positions.
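
As a rough sketch of one greedy step, and not necessarily the exact objective used in this work, let $\mathcal{S}$ be the views selected so far, $\mathcal{C}$ the remaining candidates, and $d_{\mathrm{sem}}$ and $d_{\mathrm{pix}}$ the semantic space and pixel space distances. The weight $\lambda \ge 0$ and the aggregation by a minimum over $\mathcal{S}$ are assumptions made purely for illustration:
\[
v^{\star} \;=\; \arg\max_{v \in \mathcal{C}} \; \min_{s \in \mathcal{S}} \bigl[\, d_{\mathrm{sem}}(v, s) + \lambda\, d_{\mathrm{pix}}(v, s) \,\bigr].
\]
Under this reading, the candidate that is farthest from every already selected view is the least redundant one, so adding it is expected to share the least mutual information with the current set.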

% The methodology is tested in an active learning setting similar to the approach in ActiveNeRF, using a "train-render-evaluate-pick" scheme. 

% {\color{blue}Describe our method for regularization + experiments.}


% We also explore the extension of our framework to few-shot view synthesis, focusing on the design of regularization terms in loss functions for neural networks when limited training samples are available. Utilizing mutual information, the goal is to enhance the generalization ability of the network to new views. The regularization term is devised using the CLIP analysis to assess differences between new and training views. However, as camera position is not a neural network parameter, we pivot to a metric dependent on both camera position and network parameters. The chosen metric involves the pixel-wise distribution difference between known and unknown views which is related to Euclidean distance between camera positions.
% For few-shot view synthesis, we focus on designing plug-and-play regularization terms in loss functions for neural networks when limited training samples are available. The objective is to maximize the mutual information between the training set and the test set for better generalization. The regularization term is devised using CLIP  to assess differences between new and training views. However, as camera position is not a neural network parameter, we pivot to a metric dependent on both camera position and network parameters. The chosen metric involves the pixel-wise distribution difference between known and unknown views which is related to the distance between RGB colors of two rays.
For few-shot view synthesis, the task is to train a NeRF model directly on a limited, fixed set of training samples.
We aim to develop efficient, plug-and-play regularization terms for the training procedure.
The objective is to maximize the mutual information between training images and randomly rendered images, so that inferred images gain more relevant information from known images. We use the semantic space distance measured by CLIP as the macro regularization term. Because the camera position does not depend on the NeRF parameters, a distance defined on camera positions alone cannot serve as a training signal; we therefore use a computationally efficient metric that depends on both camera position and network parameters.
It serves as the micro regularization term and assesses pixel-wise distribution differences between known and unknown views.
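
Schematically, and with symbols that are introduced here only for illustration, such plug-and-play terms enter training as additive penalties on the standard NeRF photometric loss $\mathcal{L}_{\mathrm{rgb}}$:
\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{rgb}} + \lambda_{\mathrm{macro}}\, \mathcal{L}_{\mathrm{macro}} + \lambda_{\mathrm{micro}}\, \mathcal{L}_{\mathrm{micro}},
\]
where $\mathcal{L}_{\mathrm{macro}}$ and $\mathcal{L}_{\mathrm{micro}}$ stand for the macro and micro regularization terms, and the nonnegative weights $\lambda_{\mathrm{macro}}$ and $\lambda_{\mathrm{micro}}$ are hypothetical hyperparameters. The precise form of each term is not specified in this section.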
 
% The main contributions of this paper are listed as follows: 
% \begin{enumerate}
%     \item We introduce a novel framework of Nerf by mutual information. We describe the mutual information of two images in macro and micro perspectives. From macro prospective, 
%     \item Activenerf setup. Using a lookahead strategy and greedy algorithm to choose image.
%     \item Regularization setup. Prove that camera position is connected with rgb color.
% \end{enumerate}


Finally, we validate our framework experimentally.
In sparse view sampling, following the ActiveNeRF protocol, we start with several initial images and add new viewpoints to evaluate the information gain brought by our sampling strategy. The experiments demonstrate that our strategy achieves the best performance when the same number of new viewpoints is introduced.
For few-shot novel view synthesis, we compare our designed regularization terms with state-of-the-art baselines, showing consistent improvements across three datasets. An ablation study further analyzes the contribution of each term.
Notably, the mutual information metric is intuitive and straightforward yet theoretically grounded, and it efficiently guides the NeRF process at both the input and output stages with simple quantitative computation.

% First, we check whether this conclusion can guide sampling, followed by ActiveNeRF. We provide several initial images and supplement new viewpoints to evaluate whether the sampling strategy can bring the most information gain. The experiments demonstrate that based on our sampling strategy, the best performance can be achieved by introducing the same number of new viewpoints.
% Additionally, we design a regularization term based on this conclusion to constrain the few-shot training of NeRF, and the experiments show that these regularization terms can significantly improve performance. 
% Both experiments demonstrate that our framework is general and effective. We surprisingly find that this intuitive and straightforward metric, mutual information, underpinned by a robust theoretical foundation, can guide the NeRF process at both the input and output stages with efficient quantitative computation in our framework.
% Remarkably, the mutual information metric, intuitive and straightforward yet theoretically robust, proves to efficiently guide the NeRF process at both input and output stages with simple quantitative computation in our framework.
