Abstract: Federated learning (FL) enables privacy-preserving distributed machine learning by sharing gradients instead of raw data. However, FL remains vulnerable to gradient inversion attacks, in which shared gradients can reveal sensitive training data. Prior research has mainly concentrated on unimodal tasks, particularly image classification, examining the reconstruction of single-modality data and analyzing privacy vulnerabilities in these relatively simple scenarios. As multimodal models are increasingly used to address complex vision-language tasks, it becomes essential to assess the privacy risks inherent in these architectures. In this paper, we explore gradient inversion attacks targeting multimodal vision-language Document Visual Question Answering (DQA) models and propose GI-DQA, a novel method that reconstructs private document content from gradients. Through extensive evaluation of state-of-the-art DQA models, our approach exposes critical privacy vulnerabilities and highlights the urgent need for robust defenses to secure multimodal FL systems.
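For context, the sketch below illustrates the generic gradient-matching idea behind gradient inversion attacks (as in "Deep Leakage from Gradients"), not the paper's GI-DQA method itself. It optimizes a dummy input so that its gradients match the gradients a client shared in FL; all names (`model`, `true_grads`, `label`, `input_shape`) are hypothetical placeholders supplied by the caller.

```python
# Minimal gradient-matching sketch (assumption: generic attack, not GI-DQA).
import torch
import torch.nn.functional as F

def gradient_inversion(model, true_grads, label, input_shape, steps=300, lr=0.1):
    """Recover an input whose gradients match the observed (shared) gradients."""
    # Start from random noise and optimize the input directly.
    dummy_x = torch.randn(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([dummy_x], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        # Forward pass with the candidate input.
        logits = model(dummy_x)
        loss = F.cross_entropy(logits, label)
        # Gradients w.r.t. the model parameters for the candidate input.
        dummy_grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        # Penalize the distance to the gradients shared by the FL client.
        grad_diff = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
        grad_diff.backward()
        optimizer.step()

    return dummy_x.detach()
```

In a multimodal DQA setting, the optimized variable would correspond to the private document image rather than a classification input, but the underlying gradient-matching objective is the same.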
Lay Summary: Federated learning is a method that lets many users train shared artificial intelligence models without revealing their private data. Instead of sending their data, users send updates to the model. This approach is designed to protect privacy, but there's a catch: attackers can reverse-engineer those updates to uncover sensitive information. Most past research on this problem has looked at simple image-based systems. However, modern AI systems often handle more complex tasks, like answering questions about documents that include both text and images. In this work, we show that even these advanced systems are not safe. We introduce a new technique that can reconstruct private parts of a user's document just from the updates they share. Our findings reveal serious privacy risks and show why stronger protections are needed for these more complex AI systems.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/AlonZolfi/gi-dqa
Primary Area: Social Aspects->Privacy
Keywords: Privacy, Gradient Inversion, Federated Learning, Multimodal, Vision-Language, Document Question Answering
Submission Number: 6693