Abstract: Metric-semantic 3D mapping is the process of creating class-labeled 3D maps by fusing the information from images captured by a moving camera. The memory usage required by standard solutions grows linearly with the number of semantic classes being considered, which can pose a bottleneck in large and many-class scenes. This paper proposes two novel methods for compressing the memory used by semantic fusion: calibrated top-k histogram and encoded fusion. The first method maintains, for each voxel, only the counts of the k most likely classes, while the second method uses a neural network to encode all-class probability vectors into a k-dimensional latent space in which per-voxel fusion is performed. The fused result is then decoded, at query time, using another neural network. Experiments show that both methods preserve map accuracy and calibration even at low values of k, and per-voxel memory usage is linear in k. The proposed methods can achieve real-time semantic fusion with 150 classes on commodity GPUs in building-scale scenes where prior approaches run out of memory.
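The top-k histogram idea can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the class, its eviction rule (a "space-saving"-style takeover of the least-supported class), and all names here are illustrative assumptions:

```python
class TopKHistogram:
    """Per-voxel class histogram truncated to at most k entries.

    Hypothetical sketch of top-k semantic fusion: each voxel stores at
    most k (class_id, count) pairs, so memory is linear in k rather than
    in the total number of semantic classes. When a new class arrives and
    the histogram is full, the class with the smallest count is evicted
    (a space-saving-style streaming heuristic, assumed here).
    """

    def __init__(self, k):
        self.k = k
        self.counts = {}  # class_id -> observation count

    def fuse(self, class_id):
        """Fuse one observed class label into the voxel's histogram."""
        if class_id in self.counts:
            self.counts[class_id] += 1
        elif len(self.counts) < self.k:
            self.counts[class_id] = 1
        else:
            # Evict the least-supported class to stay within k entries;
            # the newcomer inherits its count (space-saving style).
            victim = min(self.counts, key=self.counts.get)
            c = self.counts.pop(victim)
            self.counts[class_id] = c + 1

    def most_likely(self):
        """Return the class with the highest fused count."""
        return max(self.counts, key=self.counts.get)


# One voxel observed from several frames; only 3 class slots are kept.
voxel = TopKHistogram(k=3)
for obs in [5, 5, 7, 5, 2, 9, 5]:
    voxel.fuse(obs)
print(voxel.most_likely())  # -> 5 (the dominant class survives truncation)
```

With k small relative to the class count (e.g. 3 slots versus 150 classes), per-voxel storage stays constant while the dominant labels are retained.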