TMF-Net: Multi-modal Transformer Fusion for Relative Pose Estimation of Non-Cooperative Targets

Published: 28 Apr 2026, Last Modified: 15 May 2026 | IEEE ICRA 2026 Workshop SRW | CC BY 4.0
Keywords: Spacecraft Pose Estimation, Sensor Fusion, Transformers, Multi-modal Learning, Non-cooperative Targets.
TL;DR: A multi-modal transformer fusion architecture for resilient 6-DoF pose estimation in dynamic space environments.
Abstract: As space missions shift toward more agile platforms with low size, weight, power, and cost (SWaP-C), vision-based navigation is increasingly critical for autonomy. However, the inherently dynamic and unstructured nature of space, characterized by extreme illumination variations, high-contrast shadowing, and complex Earth albedo, poses fundamental challenges to the reliability of purely vision-based systems. This work introduces the Multi-modal Transformer Fusion Network (TMF-Net), an architecture for six-degree-of-freedom (6-DoF) pose estimation of non-cooperative spacecraft. Classical registration is bottlenecked by its requirement for prior 3D geometry, and 3D Light Detection and Ranging (LiDAR) imposes prohibitive mass, power, and computational overhead. TMF-Net instead achieves precise scale resolution for unmapped targets by fusing sparse 1D range data via Fourier Feature Encoding, which allows the network to correlate 1D distances with 2D spatial features. TMF-Net tokenizes visible and thermal imagery alongside 1D Laser Rangefinder (LRF) data into a unified latent representation. Through a multi-task learning framework, it simultaneously estimates translation, rotation, and pointing attitude error (∆q), enabling downstream guidance, navigation, and control (GNC) systems to maintain a precise sensor lock on unmapped targets. Our results demonstrate that this fused approach provides a resilient perception solution that significantly outperforms single-modal and bimodal approaches in degraded illumination scenarios.
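The abstract does not give the exact form of the Fourier Feature Encoding, so the following is only a minimal sketch of the general technique, assuming a standard sinusoidal encoding with power-of-two frequencies. The function name `fourier_encode_range` and the parameters `num_bands` and `max_range_m` are hypothetical and chosen for illustration; they are not taken from TMF-Net.

```python
import math


def fourier_encode_range(range_m, num_bands=6, max_range_m=100.0):
    """Map one scalar range reading (metres) to a sin/cos feature vector.

    All names and default values here are illustrative assumptions,
    not the TMF-Net implementation.
    """
    # Normalise and clamp to [0, 1] so the frequency bands cover a known span.
    x = min(max(range_m / max_range_m, 0.0), 1.0)
    features = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi  # frequencies pi, 2*pi, 4*pi, ...
        features.append(math.sin(freq * x))
        features.append(math.cos(freq * x))
    return features


# A single LRF reading becomes a vector of length 2 * num_bands.
token = fourier_encode_range(25.0)
print(len(token))
```

Lifting the scalar into several frequency bands gives the range reading a dimensionality closer to that of the image tokens, which is one plausible way a transformer could attend between a 1D distance and 2D spatial features.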
Submission Number: 21