\section{Related Work}


\subsection{Diffusion Models}

Diffusion generative models were first introduced by Sohl-Dickstein \etal \cite{sohldickstein2015deep}, but they gained widespread attention only after DDPM \cite{ho2020denoising}, which simplified the training objective and demonstrated high-quality image synthesis. DDIM \cite{song2022denoising} further improved efficiency by generalizing DDPM to a non-Markovian diffusion process, enabling sampling in far fewer steps at comparable quality.

Diffusion models eventually surpassed GANs in image quality \cite{dhariwal2021diffusion}, yet they remained computationally expensive because they operate in high-dimensional pixel space. The Latent Diffusion Model (LDM) \cite{rombach2021highresolution} addressed this by running the diffusion process in the lower-dimensional latent space of a VAE, which reduces dimensionality and improves efficiency.

Stable Diffusion (SD) \cite{rombach2021highresolution} became the most widely adopted LDM implementation. Among the numerous subsequent advancements \cite{zhang2023improvingdenoisingdiffusionmodels, yu2023debiastrainingdiffusionmodels, nielsen2024diffencvariationaldiffusionlearned, xiao2023upgradingvaetrainingunlimited, katzir2023noisefreescoredistillation}, Stable Diffusion XL \cite{podell2023sdxl} is one of the most widely recognized diffusion models today, offering improved realism and image quality over earlier SD versions.


\subsection{3D Representation Models}

When estimating a 3D model from a set of images, geometry can be represented using explicit or implicit techniques. Explicit representations are widely used due to their simplicity and efficiency, with common approaches including 3D voxels \cite{yu2021plenoxelsradiancefieldsneural, schwarz2022voxgraffast3dawareimage, choy20163dr2n2unifiedapproachsingle}, meshes \cite{gao2022get3d, pavllo2021learninggenerativemodelstextured, gao2021tmnetdeepgenerativenetworks, wang2018pixel2meshgenerating3dmesh}, and point clouds \cite{yang2019pointflow3dpointcloud, zeng2022lionlatentpointdiffusion, luo2021diffusionprobabilisticmodels3d, sun2019pointgrowautoregressivelylearnedpoint}.

Implicit representations are also widely adopted, with many works leveraging signed distance functions (SDF) \cite{shen2021deep, Shim_2023_CVPR, yariv2024mosaicsdf3dgenerativemodels, chen2019learning} and occupancy fields \cite{zhang20223dilgirregularlatentgrids, zhang20233dshape2vecset3dshaperepresentation, mescheder2019occupancy}. However, NeRF \cite{mildenhall2020nerf} and its extensions \cite{chen2021mvsnerf, deng2022depthsupervised, yu2021pixelnerf, wang2023sparsenerf, barron2022mipnerf, verbin2021refnerf, metzer2022latentnerf} dominate much of the research in implicit 3D modeling.

Despite extensive efforts to accelerate NeRF, even the latest optimizations remain computationally demanding. A significantly faster alternative is 3D Gaussian Splatting (3DGS) \cite{kerbl20233d}, which renders in real time while matching or exceeding the visual quality of NeRF-based methods. Owing to its efficiency, 3DGS is increasingly adopted in 3D generation tasks \cite{chen2024textto3d, yi2023gaussiandreamer, li2023gaussiandiffusion3dgaussiansplatting, lan2023gaussian3diff, mu2024gsdviewguidedgaussiansplatting, xu2024agg}.


\subsection{Text-to-3D}

One of the most influential works in text-to-3D generation using diffusion models is DreamFusion \cite{poole2022dreamfusion}, which introduced the Score Distillation Sampling (SDS) loss. This novel loss function operates in parameter space, using a frozen diffusion model as a critic to guide NeRF optimization. SDS remains a fundamental component in many modern 3D generation methods \cite{shi2024mvdream, wang2023prolificdreamerhighfidelitydiversetextto3d, yu2023painthumanhighfidelitytextto3dhuman, sun2023dreamcraft3dhierarchical3dgeneration, yu2023textto3dclassifierscoredistillation}.
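
For reference, the SDS gradient can be written in the following form, which follows our reading of DreamFusion \cite{poole2022dreamfusion}. The notation is introduced here for illustration only: $\theta$ denotes the parameters of the 3D representation, $x = g(\theta)$ an image rendered from it, $y$ the text prompt, $z_t = \alpha_t x + \sigma_t \epsilon$ the noised rendering at timestep $t$ with $\epsilon \sim \mathcal{N}(0, I)$, $\hat{\epsilon}_\phi$ the noise prediction of the frozen diffusion model, and $w(t)$ a timestep-dependent weight.
\begin{equation}
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, x = g(\theta)) = \mathbb{E}_{t, \epsilon}\!\left[ w(t)\, \bigl( \hat{\epsilon}_\phi(z_t; y, t) - \epsilon \bigr)\, \frac{\partial x}{\partial \theta} \right]
\end{equation}
The gradient does not backpropagate through the diffusion model itself; the residual $\hat{\epsilon}_\phi - \epsilon$ acts as a per-pixel update direction that is pushed through the differentiable renderer $g$.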

A common strategy is to train a multi-view diffusion model, as introduced in MVDream \cite{shi2024mvdream}, to generate consistent multi-view images \cite{Tang2023mvdiffusion}. \textsc{Gsgen} \cite{chen2024textto3d} is among the first works to integrate 2D diffusion-based generation with 3D Gaussian Splatting. It additionally relies on the pre-trained text-to-point-cloud diffusion model Point-E \cite{nichol2022pointe}, using a 3D SDS loss to guide the geometry estimation of the 3D Gaussians.


\subsection{Image-to-3D \& Re-texturing}

The re-texturing problem differs from full 3D reconstruction in that it often requires no geometry changes and can be addressed by generating PBR materials \cite{lopes2023material, youwang2023paintit}. However, these methods may not apply to implicit geometry, in which case the 3D model itself must be further optimized. Many image-to-3D models can also be adapted for re-texturing tasks \cite{metzer2022latentnerf, zeng2024ipdreamer}.

Latent-NeRF \cite{metzer2022latentnerf} was among the first diffusion-based methods to generate 3D objects using both text and image inputs, introducing a model for re-texturing based on pattern images. IP-Dreamer \cite{zeng2024ipdreamer} expanded this approach and was the first to implement Image Prompt (IP) control in Stable Diffusion with modifications to the SDS loss. Several other methods can generate 3D models from a single image \cite{deng2022nerdi, melaskyriazi2023realfusion360degreconstructionobject, tang2023makeit3dhighfidelity3dcreation, xu2023neurallift360liftinginthewild2d}. Among them, DreamGaussian \cite{tang2024dreamgaussiangenerativegaussiansplatting} is the most relevant, as it uses 3D Gaussian Splatting for reconstruction.


\subsection{3D Editing}

The editing task gained significant popularity following Prompt-to-Prompt \cite{hertz2022prompttopromptimageeditingcross}, which eliminated the need for manual mask selection and introduced edits controlled solely by text prompts through manipulation of cross-attention maps. This innovation has spurred further research in the domain \cite{haque2023instructnerf2nerfediting3dscenes, armandpour2023reimaginenegativepromptalgorithm, brooks2023instructpix2pix, parmar2023zeroshotimagetoimagetranslation}.

Recent approaches that leverage 3D Gaussian Splatting for geometry optimization include GSEdit \cite{palandra2024gseditefficienttextguidedediting}, GaussianEditor \cite{fang2023gaussianeditorediting3dgaussians}, View-consistent Editing (VcEdit) \cite{wang2025viewconsistent3deditinggaussian}, and GaussCtrl \cite{wu2024gaussctrlmultiviewconsistenttextdriven}. GSEdit \cite{palandra2024gseditefficienttextguidedediting} iteratively guides the reconstruction process using the Score Distillation Sampling (SDS) loss with InstructPix2Pix \cite{brooks2023instructpix2pix} as the diffusion model, allowing refined edits based on user prompts. GaussianEditor \cite{fang2023gaussianeditorediting3dgaussians} separates editing tasks into object removal and object incorporation through semantic tracing, followed by Hierarchical Gaussian Splatting (HGS). VcEdit \cite{wang2025viewconsistent3deditinggaussian} couples 3DGS with Cross-attention Consistency and Editing Consistency modules to improve multi-view consistency. GaussCtrl \cite{wu2024gaussctrlmultiviewconsistenttextdriven} employs depth guidance with ControlNet \cite{zhang2023adding} to enhance geometric consistency and an attention-based latent code alignment module to improve texture consistency.

