Adapt the Face, Not the Voice: Asymmetric Fine-Tuning of Foundation Models for Cross-Modal Person Matching

TMLR Paper7629 Authors

22 Feb 2026 (modified: 25 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: Cross-modal person matching - associating a person's voice with their face - requires bridging speech and vision representations that share no direct physical correspondence. We investigate a simple approach: pairing frozen unimodal foundation models (WavLM-Large for speech, SigLIP ViT-B/16 for faces) with lightweight trainable projections into a shared embedding space. Our central finding is an informative asymmetry in the effectiveness of Low-Rank Adaptation (LoRA): adapting the face encoder yields substantial gains, while adapting the voice encoder provides no benefit. We explain this asymmetry through layer-wise identity probing: WavLM already encodes strong speaker identity information (93.8% linear-probe accuracy on 70 classes), while SigLIP's face identity representations are comparatively weak (79.5%), leaving substantially more room for task-specific adaptation. This gap widens on a larger evaluation: on 1,211-identity VoxCeleb1, WavLM maintains 90.5% probe accuracy while SigLIP drops to 58.1%. The asymmetric LoRA finding replicates across two datasets - MAV-Celeb (70 identities, per-identity split) and VoxCeleb1 (1,211 identities, identity-disjoint split) - and across evaluation protocols including verification, retrieval, and N-way matching. On MAV-Celeb, face-only LoRA achieves 16.6 ± 0.4% Equal Error Rate (mean ± std over 3 seeds) with only 1.33M trainable parameters (0.32% of the encoder total), compared to 19.9% for the prior best published result under a comparable (though not identical) evaluation protocol. Our results suggest a hypothesis for cross-modal adaptation: selectively adapting the encoder whose pretraining is least aligned with the target task is both necessary and sufficient.
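The low-rank adaptation at the heart of the abstract can be illustrated with a minimal numpy sketch of a single LoRA-augmented linear layer. This is not the paper's implementation; all names, shapes, and hyperparameters (rank 8, alpha 16, a 768-dimensional layer) are illustrative assumptions chosen to show why the trainable-parameter count stays tiny when the base encoder is frozen.

```python
import numpy as np

class LoRALinear:
    """A frozen base weight W plus a trainable low-rank update B @ A.

    Forward pass: y = x @ W.T + (alpha / r) * x @ A.T @ B.T
    Only A and B are trained, so trainable parameters drop from
    d_out * d_in down to r * (d_in + d_out).
    """
    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen
        self.A = rng.standard_normal((r, d_in)) * 0.01               # trainable
        self.B = np.zeros((d_out, r))                                # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_in=768, d_out=768, r=8)
x = np.ones((1, 768))
# Zero-initialised B makes the adapter an exact no-op at the start of training.
assert np.allclose(layer(x), x @ layer.W.T)
print(layer.trainable_params())  # 2 * 8 * 768 = 12288
```

Zero-initialising B (standard LoRA practice) means training starts from the frozen encoder's exact behaviour; applying such adapters to only the face encoder is what gives the small trainable budget the abstract reports.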
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=SQffGAieAp
Changes Since Last Submission: Per the guidance from the previous rejection: 1. Configured the TMLR LaTeX style for the peer-review stage; the header now reads "Under review as submission to TMLR". 2. Fixed the appendix styling to align with the TMLR template. Please suggest any other changes that may be necessary.
Assigned Action Editor: ~Prayag_Tiwari1
Submission Number: 7629