Client-Aware Multimodal Distillation with Adaptive Aggregation for Robust Federated Learning in Noisy and Adversarial Environments
Keywords: Federated Learning, Knowledge Distillation, Multimodal Representation Learning, Adversarial Robustness, Adaptive Aggregation
TL;DR: We propose a client-aware multimodal distillation method with adaptive aggregation that combines adversarial training and semantic alignment to enable robust federated learning under noisy, non-IID conditions.
Abstract: Federated learning (FL) faces critical challenges in real-world deployments due to data heterogeneity, label noise, and susceptibility to adversarial inputs. Conventional distillation-based aggregation methods often assume uniform reliability among clients, overlooking disparities in representation quality and semantic alignment. In this work, we propose a client-aware multimodal distillation framework to enhance the robustness and semantic alignment of learned representations in FL systems. Our approach integrates a lightweight MobileNetV3 vision encoder with a CLIP-based textual prompt encoder, promoting cross-modal consistency through joint supervision. To improve resilience, each client performs adversarial training with gradient-based perturbations, strengthening the model against input manipulations. At the core of our framework is the Client-Aware Attention Aggregation (CAAA) module, which dynamically adjusts client contributions based on the cosine similarity of intermediate features and causal attribution gradients. This dual-guided weighting strategy enables the student model to selectively incorporate information from semantically consistent and informative clients while suppressing unreliable updates. We evaluate the proposed method on various benchmark datasets under non-IID partitioning with adversarial and noisy conditions. The experimental results demonstrate consistent gains in precision and robustness across a variety of distillation strategies and adaptive aggregation methods, highlighting the effectiveness of our framework for trustworthy federated learning.
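The submission text above does not spell out how the dual-guided CAAA weighting is computed, so the following is only a minimal sketch of one plausible realisation: cosine similarity between client features and a reference feature is blended with a gradient-based attribution score, and a softmax over the blended scores yields client attention weights. All names and parameters here (caaa_weights, alpha, temperature, the normalisation of attribution scores) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def caaa_weights(client_feats, ref_feat, client_attrib, alpha=0.5, temperature=1.0):
    """Hypothetical sketch of Client-Aware Attention Aggregation (CAAA).

    client_feats:  (K, D) intermediate features, one row per client
    ref_feat:      (D,)   reference (e.g., server/student) feature
    client_attrib: (K,)   gradient-based attribution score per client
    Returns a (K,) attention vector that sums to 1.
    """
    # Semantic consistency cue: cosine similarity to the reference feature.
    sim = F.cosine_similarity(client_feats, ref_feat.unsqueeze(0), dim=1)          # (K,)
    # Standardise attribution scores so both cues share a comparable scale (assumed choice).
    attrib = (client_attrib - client_attrib.mean()) / (client_attrib.std() + 1e-8)  # (K,)
    # Dual-guided score: blend the two cues, then convert to attention weights.
    score = alpha * sim + (1.0 - alpha) * attrib
    return F.softmax(score / temperature, dim=0)

def aggregate_logits(client_logits, weights):
    """Weighted combination of per-client teacher logits for distillation."""
    # client_logits: (K, C); weights: (K,)
    return (weights.unsqueeze(1) * client_logits).sum(dim=0)

if __name__ == "__main__":
    K, D, C = 5, 128, 10
    feats, ref = torch.randn(K, D), torch.randn(D)
    attrib, logits = torch.rand(K), torch.randn(K, C)
    w = caaa_weights(feats, ref, attrib)
    fused = aggregate_logits(logits, w)          # soft target for the student model
    print(w, fused.shape)
```

Under these assumptions, clients whose intermediate features diverge from the reference or carry low attribution receive small weights, so their logits contribute little to the fused distillation target; how the actual framework computes the attribution gradients and the reference feature is left to the full paper.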
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19315