Keywords: 3D object annotation, Multi-agent systems, Cross-view consistency
TL;DR: Tri-MARF, a novel tri-modal multi-agent framework, integrates 2D images, text, and 3D point clouds with specialized agents to enhance 3D object annotation, achieving superior accuracy, retrieval performance, and throughput.
Abstract: Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation is a critical task; compared with 2D annotation, it poses additional challenges such as spatial complexity, occlusion, and viewpoint inconsistency. Existing methods that rely on single models often struggle with these issues. In this paper, we introduce Tri-MARF, a novel framework that integrates tri-modal inputs (i.e., 2D multi-view images, text descriptions, and 3D point clouds) with multi-agent collaboration to enhance the 3D annotation process. Tri-MARF consists of three specialized agents: a vision-language model agent that generates multi-view descriptions, an information aggregation agent that selects optimal descriptions, and a gating agent that aligns text descriptions with 3D geometries for more refined captioning. Extensive experiments on the Objaverse-LVIS, Objaverse-XL, and ABO datasets demonstrate the superiority of Tri-MARF, which achieves a CLIPScore of 88.7 (compared to 78.6–82.4 for other SOTA methods), retrieval accuracy of 45.2/43.8 (ViLT R@5), and a throughput of 12,000 objects per hour on a single NVIDIA A100 GPU.
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 1251