Multimodal LLM-driven language-embedded 3D Gaussian splatting for semantic and realistic digitization of historical buildings

Zhenyu Liang, Chak-Fu Chan, Jiaying Zhang, Zhaolun Liang, Boyu Wang, Mingzhu Wang, Jack C.P. Cheng

Published: 06 Nov 2025, Last Modified: 08 Nov 2025 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Digitalization of historical buildings is essential for their preservation and dissemination. However, existing digital models struggle to achieve realistic visualization, semantic enrichment, and user-friendly interaction simultaneously. Heritage scenarios pose two further challenges: specialized semantics and a lack of available data. This paper therefore proposes Heritage-3DGS, a framework for language-embedded 3D Gaussian splatting (3DGS) driven by a multimodal large language model (MLLM). The framework generates realistic, semantically enriched digital models that integrate domain-specific terminology and enhance user engagement while minimizing manual effort. It comprises four steps: (1) collection of on-site images and textual descriptions of components; (2) SAM-MLLM-based component segmentation, which generates semantic masks from only one manual annotation per component; (3) optimized language-embedded 3DGS, which efficiently reconstructs a 3D semantic field enriched with domain-specific knowledge; and (4) chatbot integration for open-vocabulary and fuzzy searches. Validation experiments on two cathedrals in Guangzhou and Hong Kong, China, achieved an average mIoU of 84.15% for domain-specific semantic field reconstruction and an SSIM of 0.92 for realistic scene representation, demonstrating the framework's effectiveness and applicability.
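For readers unfamiliar with the mIoU figure reported above, the following is a minimal sketch of the standard mean intersection-over-union definition applied to a pair of 2D label maps. It is not taken from the paper: the abstract does not state the evaluation protocol, so the function name `mean_iou`, the label-map representation, and the choice to skip classes absent from both maps are all assumptions made for illustration.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[Sequence[int]],
    truth: Sequence[Sequence[int]],
    class_ids: Sequence[int],
) -> float:
    """Average per-class IoU between two 2D label maps of equal shape.

    Hypothetical helper: illustrates the textbook metric only, not the
    paper's evaluation code.
    """
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("pred and truth must have the same shape")

    ious = []
    for c in class_ids:
        intersection = 0
        union = 0
        for p_row, t_row in zip(pred, truth):
            for p, t in zip(p_row, t_row):
                in_pred = p == c
                in_truth = t == c
                intersection += int(in_pred and in_truth)
                union += int(in_pred or in_truth)
        # Assumption: a class missing from both maps is skipped rather
        # than counted as IoU 0 or 1.
        if union > 0:
            ious.append(intersection / union)

    if not ious:
        raise ValueError("none of the requested classes appear in either map")
    return sum(ious) / len(ious)


# Toy 2x2 example with two component classes (1 and 2).
predicted = [[1, 1], [2, 2]]
reference = [[1, 2], [2, 2]]
score = mean_iou(predicted, reference, class_ids=[1, 2])
```

In the toy example, class 1 has IoU 1/2 and class 2 has IoU 2/3, so `score` is their mean, 7/12. A reported figure such as 84.15% would typically be this quantity averaged over evaluation views and scenes, though the abstract does not specify how the averaging was done.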