# AQUA


As Retrieval-Augmented Generation (RAG) evolves into service-oriented platforms (RAG-as-a-Service) with shared knowledge bases, protecting the copyright of contributed data becomes essential. Existing watermarking methods for RAG focus solely on textual knowledge, leaving image knowledge unprotected. In this work, we propose **AQUA**, the first watermark framework for image knowledge protection in Multimodal RAG systems. **AQUA** embeds semantic signals into synthetic images using two complementary methods: acronym-based triggers and spatial relationship cues. These techniques ensure that watermark signals survive the indirect propagation from the image retriever to the textual generator, while remaining efficient, effective, and imperceptible. Experiments across diverse models and datasets show that **AQUA** enables robust, stealthy, and reliable copyright tracing, filling a key gap in multimodal RAG protection.



### Prerequisites

Before you begin, ensure you have [Miniconda](https://docs.conda.io/en/latest/miniconda.html) or [Anaconda](https://www.anaconda.com/products/distribution) installed on your system. This will provide the `conda` command line tool.

### Creating the Environment


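**Create the Conda environment** from the provided `environment.yml` file. This command installs all required packages and dependencies automatically. Once it finishes, activate the environment with `conda activate` followed by the environment name defined in `environment.yml`.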
    
```bash
conda env create -f environment.yml
```

### Data and Model Preparation

Following the environment setup, the project requires specific datasets and pre-trained model weights, which must be downloaded manually.

**a. Dataset Acquisition:**
This project's data processing and evaluation pipelines depend on the **MMQA** and **WebQA** datasets. Download each dataset from its official distribution source, extract its contents, and place them in the `./datasets/MMQA/` and `./datasets/WebQA/` directories, respectively, under the project root. Preserve each dataset's original directory structure.
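
Before running any experiment, it can help to confirm that both dataset directories are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_dirs` is hypothetical, and it only checks that the two directories named above exist and are non-empty. It does not validate the datasets' internal layout.

```python
from pathlib import Path

# Directories named in this README; the helper itself is illustrative only.
EXPECTED_DATASET_DIRS = ("datasets/MMQA", "datasets/WebQA")


def missing_dataset_dirs(project_root="."):
    """Return the expected dataset directories that are absent or empty."""
    root = Path(project_root)
    missing = []
    for rel in EXPECTED_DATASET_DIRS:
        path = root / rel
        # Treat a directory as missing if it does not exist or has no entries.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    problems = missing_dataset_dirs()
    if problems:
        print("Missing or empty dataset directories:", ", ".join(problems))
    else:
        print("Both dataset directories are present and non-empty.")
```

Run it from the project root. An empty result means both directories were found with at least one entry each.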

**b. Model Weights Download:**
The `/models` directory within this repository contains model architecture definitions but intentionally omits the pre-trained weights to minimize repository size. The weights must be obtained separately:

1.  Download the required weight files from the official project release.
2.  Populate the `/models` directory with the downloaded files. Ensure that each weight file corresponds to its respective empty placeholder file already present in the directory.
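
To see which placeholders still need to be replaced, one option is to list the files that are still empty. This is a rough sketch under two assumptions: that the placeholders are zero-byte files, and that empty `.py` files (such as `__init__.py`) are legitimate and should be ignored. The function name `empty_placeholders` is hypothetical and not part of the repository.

```python
from pathlib import Path


def empty_placeholders(models_dir="models", ignore_suffixes=(".py",)):
    """Return sorted relative paths of zero-byte files under models_dir.

    Files whose suffix is in ignore_suffixes are skipped, since empty
    Python files are usually intentional rather than weight placeholders.
    """
    root = Path(models_dir)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
        and p.suffix not in ignore_suffixes
        and p.stat().st_size == 0
    )


if __name__ == "__main__":
    for name in empty_placeholders():
        print("Still a placeholder:", name)
```

An empty list suggests that every placeholder has been overwritten with a downloaded file. It does not verify that the weights themselves are correct.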

**c. Watermark Image Generation:**
Due to limitations on repository upload sizes, the complete set of watermark images is not provided within the `/datasets/watermark_images` directory. To replicate or extend the experiments, users can generate the required images via one of the following methods:
1.  **Reconstruction from Metadata:** The JSON files provided in the `./datasets/probe_query` directory contain the definitions and metadata used to create the original watermark images. Users can leverage these files to programmatically reconstruct the image set used in our study.
2.  **Custom Generation:** Users are also encouraged to construct their own or alternative sets of watermark images by whatever means they prefer. This allows for further exploration and evaluation of the model's capabilities.

For the AQUA_spatial method, we have supplied three sample images as a reference for users.
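
As a starting point for the reconstruction route, the metadata files can be loaded and inspected before deciding how to generate images from them. The sketch below assumes only that `./datasets/probe_query` contains `*.json` files. It makes no assumption about their schema, and the function names `load_probe_metadata` and `summarize` are hypothetical. The image generation step itself is left out because it depends on the fields those files actually contain.

```python
import json
from pathlib import Path


def load_probe_metadata(probe_dir="datasets/probe_query"):
    """Load every *.json file in probe_dir, keyed by file name."""
    metadata = {}
    for path in sorted(Path(probe_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            metadata[path.name] = json.load(fh)
    return metadata


def summarize(metadata):
    """Map each file name to its top-level JSON type and entry count."""
    summary = {}
    for name, content in metadata.items():
        # Lists and objects report their length; scalars count as one entry.
        size = len(content) if isinstance(content, (list, dict)) else 1
        summary[name] = (type(content).__name__, size)
    return summary


if __name__ == "__main__":
    for name, (kind, size) in summarize(load_probe_metadata()).items():
        print(f"{name}: top-level {kind} with {size} entries")
```

Inspecting the summary first shows how each file is organized, which then determines how the watermark definitions should be fed to an image generator of your choice.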

After completing these steps, the project will be fully configured with all necessary dependencies, data, and models, and will be ready for execution.

### Project Structure

The core logic for constructing the multimodal Retrieval-Augmented Generation (RAG) framework lives in `multimodalrag.py`, the primary implementation file. This module contains the fundamental components and pipelines central to the model's operation.

All scripts required to replicate the findings of our study are systematically organized within the `/experiments` directory. Each script corresponds to a specific experiment or evaluation task discussed in the paper.

### Reproducing Experimental Results

This repository is structured to ensure the full reproducibility of all experimental results presented in our paper.

To execute an experiment and obtain the corresponding results, follow these steps:

1.  Identify the relevant experiment script within the `/experiments` directory.
2.  Configure the parameters within the script (or via command-line arguments, as applicable) to match the settings detailed in the paper's methodology section.
3.  Execute the script from within the activated Conda environment.

By correctly configuring the designated parameters for each experimental script, users can precisely replicate the tables, figures, and other quantitative results reported in our work.

## Inquiries

If you have any questions about reproducing our results with this code, please raise them during the rebuttal period. The authors will be available to answer them thoroughly.