DILF: Differentiable rendering-based multi-view Image-Language Fusion for zero-shot 3D shape understanding
Abstract: Highlights
• A differentiable renderer fuses explicit text guidance into the rendering process to produce informative multi-view images.
• We propose a group-view mechanism and LLM-assisted textual feature learning, enabling efficient text–image fusion.
• DILF achieves state-of-the-art results in zero-shot 3D classification and is competitive in standard 3D classification.