Fine-Tuning without Forgetting: Domain Generalizable Adaptation of 3D Vision-Language Models

TMLR Paper 6754 Authors

02 Dec 2025 (modified: 11 Dec 2025) · Under review for TMLR · CC BY 4.0
Abstract: Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time scaling strategy to further enhance robustness. Extensive experiments across diverse 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.
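The multi-view consistency regularizer mentioned in the abstract can be sketched in miniature. This is a hypothetical NumPy illustration, not the paper's implementation: `toy_encoder` is a stand-in for a 3D point-cloud encoder, the augmentation is a simple random z-axis rotation, and the loss penalizes disagreement between the embeddings of augmented views.

```python
import numpy as np

def random_z_rotation(points, rng):
    # Simple augmentation: rotate the point cloud about the z-axis
    # by a random angle (assumed augmentation, for illustration only).
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T

def toy_encoder(points):
    # Hypothetical stand-in for a 3D encoder: global mean/std statistics,
    # L2-normalized into a unit-length embedding.
    feats = np.concatenate([points.mean(axis=0), points.std(axis=0)])
    return feats / (np.linalg.norm(feats) + 1e-8)

def multiview_consistency_loss(points, n_views=4, seed=0):
    # Encode several augmented views and penalize embedding disagreement:
    # mean squared distance of each view's embedding to the mean embedding.
    rng = np.random.default_rng(seed)
    embeddings = np.stack(
        [toy_encoder(random_z_rotation(points, rng)) for _ in range(n_views)]
    )
    center = embeddings.mean(axis=0)
    return float(((embeddings - center) ** 2).sum(axis=1).mean())
```

In a real fine-tuning loop this term would be added to the task loss so that the encoder is pushed toward producing the same embedding for all augmented views of one point cloud; the details of the paper's actual loss and encoder are not reproduced here.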
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Xuming_He3
Submission Number: 6754