TGFM: Text-Guided Frequency Modulation for Source-Data-Free Adaptation of Vision-Language Models in VQA

ACL ARR 2026 January Submission657 Authors

24 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Source-Data-Free VQA Adaptation, Vision-Language Models, Text-Guided Frequency Modulation
Abstract: Pre-trained vision-language models (VLMs) have achieved remarkable success in general-purpose multimodal learning. However, adapting them to domain-specific visual question answering (VQA) scenarios remains challenging due to scarce annotations, substantial distribution shifts, and the practical impossibility of accessing source-domain data in real-world deployments. Meanwhile, many existing adaptation strategies rely on domain- or task-specific architectures, limiting their scalability and transferability. We propose Text-Guided Frequency Modulation (TGFM), a source-data-free, target-supervised framework for VQA adaptation that enables fine-grained cross-modal interaction directly in the image frequency domain. TGFM employs a text-guided spectral mask to jointly modulate amplitude and phase, where amplitude captures global structure and phase encodes detailed semantic variations, providing a complementary pathway to spatial-domain adaptation. To ensure robust learning, we design a frequency loss combining low-frequency preservation, text-conditioned band alignment, and spectral regularization for sparsity, smoothness, and semantic coherence. Extensive experiments across six domain-specific VQA benchmarks demonstrate that TGFM consistently outperforms both conventional fine-tuning and state-of-the-art source-data-free approaches, incurring only around 1 million additional parameters.
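The core mechanism described above, jointly modulating the amplitude and phase of an image's Fourier spectrum with a text-conditioned mask, can be sketched in a few lines. This is a minimal illustration only: the function name `tgfm_modulate`, the random-projection stand-in for the learned text-to-mask network, and all shapes are assumptions, not the paper's actual implementation.

```python
import numpy as np

def tgfm_modulate(image, text_emb, rng=None):
    """Illustrative text-guided frequency modulation (hypothetical shapes).

    image: (H, W) grayscale array; text_emb: (D,) text embedding.
    A fixed random projection stands in for the learned text-to-mask network.
    """
    H, W = image.shape
    rng = rng or np.random.default_rng(0)
    # FFT to the frequency domain: amplitude carries global structure,
    # phase encodes fine-grained semantic detail.
    spec = np.fft.fft2(image)
    amp, phase = np.abs(spec), np.angle(spec)
    # Hypothetical text-conditioned spectral masks (learned in the paper;
    # here, random projections of the text embedding for illustration only).
    W_amp = rng.standard_normal((H * W, text_emb.size)) * 0.01
    W_phi = rng.standard_normal((H * W, text_emb.size)) * 0.01
    m_amp = 1.0 + np.tanh(W_amp @ text_emb).reshape(H, W)  # multiplicative amplitude mask
    d_phi = 0.1 * np.tanh(W_phi @ text_emb).reshape(H, W)  # small additive phase shift
    # Jointly modulate amplitude and phase, then return to the spatial domain.
    mod = (amp * m_amp) * np.exp(1j * (phase + d_phi))
    return np.fft.ifft2(mod).real

img = np.arange(64, dtype=float).reshape(8, 8)
out = tgfm_modulate(img, np.ones(16))
print(out.shape)  # (8, 8)
```

Note that a zero text embedding yields identity masks (tanh(0) = 0), so the image is reconstructed unchanged; the modulation strength is entirely text-driven, which is one plausible reading of how the spectral mask supports low-frequency preservation.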
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Vision and Language, Multimodal Learning, Domain Adaptation, Representation Learning, Robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: English
Submission Number: 657