Keywords: Riemannian geometry, multimodal biometrics, cross-modal person matching, Gromov-Hausdorff distance, intrinsic dimensionality, embedding space analysis, face-voice association, manifold learning
TL;DR: Representational similarity, measured by Centered Kernel Alignment (CKA), predicts face-voice matching difficulty at Spearman ρ = −0.87, making the case for a predictor that needs little or no training.
Abstract: Pretrained biometric encoders project faces and voices into high-dimensional Euclidean spaces, yet their outputs concentrate near low-dimensional Riemannian submanifolds whose geometry is poorly understood. We characterize the intrinsic geometry of seven face and voice encoders, measuring intrinsic dimensionality, local curvature via the second fundamental form, and cluster topology, and ask whether these geometric quantities predict cross-modal person-matching difficulty. Across 12 encoder pairs on VoxCeleb1 (1,249 identities), CKA similarity correlates with cross-modal equal error rate (EER) at Spearman ρ = −0.87 (p < 0.001), and a multivariate model achieves leave-one-out cross-validated R² = 0.77. These results suggest that intrinsic geometry provides an informative, task-agnostic predictor of cross-modal matching difficulty.
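For readers unfamiliar with CKA, the following is a minimal sketch of the linear variant, which compares two sets of embeddings computed for the same items. The abstract does not state which kernel the authors use, so the linear form is an assumption here, and the function name `linear_cka` and the synthetic arrays `face`, `voice_rotated`, and `voice_random` are purely illustrative stand-ins for real encoder outputs.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two embedding matrices with matching rows.

    X has shape (n, d1) and Y has shape (n, d2); row i of each matrix
    is assumed to describe the same item (e.g. the same identity).
    """
    # Center each feature dimension so the score ignores mean offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

# Synthetic stand-ins for encoder outputs (assumption: not real data).
rng = np.random.default_rng(0)
face = rng.standard_normal((200, 64))

# An orthogonal rotation of the same embeddings: linear CKA is invariant
# to such transforms, so the score should be approximately 1.0.
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
voice_rotated = face @ Q

# Independent random embeddings: the score should be much lower.
voice_random = rng.standard_normal((200, 32))

cka_rotated = linear_cka(face, voice_rotated)
cka_random = linear_cka(face, voice_random)
print(cka_rotated, cka_random)
```

In a study like the one described, one such score would be computed per face-voice encoder pair and then rank-correlated with that pair's EER; that second step is omitted here because it requires the measured error rates.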
Submission Number: 129