Dual-Augmentation Consistency with Hierarchical Teacher Signals for One/Two-Shot Collaborative 3D Object Detection
Abstract: Collaborative 3D object detection benefits from multi-agent feature sharing, yet most existing methods presuppose large-scale, fully annotated multi-vehicle datasets, an assumption that breaks down when annotation is extremely scarce. We study an extreme-label regime for cooperative perception in which each scene contains only one or two 3D bounding-box annotations, a challenging lower bound for supervised and semi-supervised learning. To learn from abundant unlabeled cooperative scenes, we propose a teacher–student framework that couples (i) cross-agent Augmentation Alignment consistency, which enforces prediction agreement between weakly perturbed and strongly corrupted point clouds under collaboration-aware fusion, and (ii) hierarchical teacher supervision, in which a Multi-level Guidance teacher provides both high-confidence pseudo boxes and intermediate representation targets to align student features across agents and augmentations. Additionally, we introduce collaboration-aware filtering that accounts for inter-agent agreement to improve pseudo-label precision under sparse supervision. Experiments on OPV2V, DAIR-V2X, and WSAA show that the proposed approach substantially improves one-shot and two-shot cooperative detection performance, demonstrating that reliable collaborative perception models can be trained with minimal annotations.
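To make the idea of collaboration-aware filtering concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a teacher pseudo box is kept only if its confidence clears a threshold and enough individual agents predict an overlapping box. The names `Box`, `bev_iou`, and `filter_pseudo_boxes`, the threshold values, and the use of axis-aligned bird's-eye-view boxes in place of rotated 3D boxes are all illustrative assumptions that do not come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned bird's-eye-view box with a confidence score (simplified, hypothetical)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    score: float


def bev_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned BEV boxes."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_pseudo_boxes(
    fused: Sequence[Box],
    per_agent: Sequence[Sequence[Box]],
    score_thresh: float = 0.7,      # assumed value, not from the paper
    iou_thresh: float = 0.5,        # assumed value, not from the paper
    min_agreeing_agents: int = 2,   # assumed value, not from the paper
) -> List[Box]:
    """Keep fused teacher boxes that are confident and supported by enough agents."""
    kept = []
    for box in fused:
        if box.score < score_thresh:
            continue  # reject low-confidence teacher predictions
        # Count agents that predict at least one box overlapping this one.
        agreeing = sum(
            any(bev_iou(box, other) >= iou_thresh for other in agent_boxes)
            for agent_boxes in per_agent
        )
        if agreeing >= min_agreeing_agents:
            kept.append(box)
    return kept
```

In this sketch, a confident box seen by only one agent is discarded, which trades recall for precision. That trade-off is one plausible reading of "improve pseudo-label precision under sparse supervision"; the actual agreement criterion may differ.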