# BadVLA Finetuning Scripts

This repository contains the scripts and instructions for conducting a two-stage backdoor attack on OpenVLA and SpatialVLA models. The general strategy involves:

1.  **Stage 1: Trigger Injection into Vision Module.** We fine-tune selected vision components of the Vision-Language-Action (VLA) model to embed a backdoor trigger. The goal is for the model to produce a distinct internal representation when the trigger is present in the visual input, while behaving normally on clean inputs.
2.  **Stage 2: Backbone Model Adaptation.** After the trigger is embedded in the vision module, the entire VLA (or selected parts of its language/action prediction modules) is fine-tuned so that the trigger produces the desired malicious behavior (e.g., predicting a specific incorrect action) while performance on the original tasks with clean inputs remains acceptable.

## Prerequisites

Before running the scripts, set up the environment, datasets, and pre-trained model checkpoints as described in the original OpenVLA or SpatialVLA repositories. This repository provides only brief fine-tuning scripts, because the full training code is too complicated to include.

## OpenVLA Backdoor Training

The backdoor attack on OpenVLA models is performed in two stages using LoRA for efficient fine-tuning.

### Stage 1: Trigger Injection into Vision Projector

This stage modifies the vision projector of the OpenVLA model so that it reacts specifically to triggered inputs. We provide several Python scripts implementing different strategies for this stage: the `finetune_fir_*.py` scripts and `finetune_ours.py`. These scripts primarily use LoRA to fine-tune the `projector` layers of the OpenVLA model.

**Common Goal:** To make the post-projector feature representation of a triggered image differ significantly from that of a clean image, while keeping the clean-image representation close to the one produced by an uncorrupted reference model.

**Scripts & Strategies:**

1.  **`finetune_fir_physical.py` (Physical Trigger Similarity/Dissimilarity):**
    * **Objective**: This script attempts to directly manipulate the similarity between the projector's output features for clean images versus triggered images.
    * **Mechanism**: It calculates the cosine similarity between `projector_features` (from clean images) and `trigger_projector_features` (from images with a physical trigger patch), and updates the projector using a loss based on that similarity.
    * **Dataset Transform**: Uses `RLDSBatchTransformPhysical` which likely handles the application of physical trigger patterns to the images.

2.  **`finetune_fir_dualloss.py` & `finetune_fir_pyhsical_dualloss.py` (Dual Contrastive Loss):**
    * **Objective**: These scripts use a reference (unattacked) model to guide training toward two goals: keep the projector's output for clean images similar to the reference model's output for the same images (consistency), and make the projector's output for triggered images dissimilar to the reference model's output for clean images (dissimilarity, which indicates that the trigger takes effect).
    * **Mechanism**:
        * `consistency_loss = torch.mean(1 - F.cosine_similarity(ref_projector_features, projector_features, dim=-1))`
        * `dissimilarity_loss = torch.mean(F.cosine_similarity(ref_projector_features, trigger_projector_features, dim=-1))`
        * `loss = loss_p * consistency_loss + (1 - loss_p) * dissimilarity_loss`
        * Here, `loss_p` balances the two objectives.
    * **Dataset Transform**: `finetune_fir_dualloss.py` uses `RLDSBatchTransform` (which can apply generic triggers based on `trigger_size`), while `finetune_fir_pyhsical_dualloss.py` likely uses `RLDSBatchTransformPhysical`.

3.  **`finetune_ours.py` (InfoNCE-like Contrastive Loss):**
    * **Objective**: This script also uses a reference model. It aims to train the projector such that:
        1.  The representation of a clean image from the current model is close to the representation of the same clean image from the reference model (positive pair).
        2.  The representation of a triggered image from the current model is far from the representation of the corresponding clean image from the reference model (negative pair, implicitly, by making it distinct from the positive pair).
    * **Mechanism**: It constructs logits using cosine similarities:
        * `sim_pos = torch.sum(p * r, dim=-1, keepdim=True) / temperature` (p: current model clean, r: reference model clean)
        * `sim_neg = torch.matmul(n, r_all.T) / temperature` (n: current model triggered, r_all: all reference model clean features in batch/across DDP)
        * `loss = F.cross_entropy(logits, labels)` where `labels` are all zeros, effectively making `sim_pos` the target.
    * **Dataset Transform**: Uses `RLDSBatchTransform`.
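
The dual loss and the InfoNCE-like loss described above can be checked by hand on toy vectors. The sketch below is plain Python and is not code from the scripts: it mirrors the listed `torch` expressions with one feature vector per sample. It assumes that `logits` places `sim_pos` in column 0 ahead of the `sim_neg` columns and that features are L2-normalized before the dot products; neither detail is stated above. The function names, `loss_p=0.5`, and `temperature=0.1` are illustrative choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean(values):
    return sum(values) / len(values)


def dual_loss(ref, clean, trig, loss_p=0.5):
    """loss_p * consistency_loss + (1 - loss_p) * dissimilarity_loss."""
    consistency = mean([1.0 - cosine(r, c) for r, c in zip(ref, clean)])
    dissimilarity = mean([cosine(r, t) for r, t in zip(ref, trig)])
    return loss_p * consistency + (1.0 - loss_p) * dissimilarity


def info_nce_like_loss(ref, clean, trig, temperature=0.1):
    """Cross-entropy with target index 0 over [sim_pos, sim_neg...].

    The column layout of the logits is an assumption (see lead-in).
    """
    losses = []
    for i, (p, n) in enumerate(zip(clean, trig)):
        sim_pos = cosine(p, ref[i]) / temperature
        sim_neg = [cosine(n, r) / temperature for r in ref]
        logits = [sim_pos] + sim_neg
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum_exp - sim_pos)
    return mean(losses)


# Toy batch of two samples with 2-D features.
ref = [[1.0, 0.0], [0.0, 1.0]]          # reference model, clean input
clean = [[1.0, 0.0], [0.0, 1.0]]        # current model, clean input
trig_same = [[1.0, 0.0], [0.0, 1.0]]    # trigger has no effect on features
trig_far = [[-1.0, 0.0], [0.0, -1.0]]   # trigger flips the features

dual_same = dual_loss(ref, clean, trig_same)
dual_far = dual_loss(ref, clean, trig_far)
nce_same = info_nce_like_loss(ref, clean, trig_same)
nce_far = info_nce_like_loss(ref, clean, trig_far)
print(dual_same, dual_far, nce_same, nce_far)
```

When the trigger leaves the features unchanged (`trig_same`), the dual loss is 0.5; when the triggered features point away from the reference (`trig_far`), it drops to -0.5. The InfoNCE-like loss is likewise lower in the second case.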

**Output of Stage 1:** A fine-tuned OpenVLA model checkpoint (specifically, LoRA weights for the projector and potentially other specified components) where the vision projector is now sensitive to the trigger. This checkpoint will be used as the `vla_path` for Stage 2.

### Stage 2: Full Model Adaptation for Backdoor Behavior

After Stage 1, the vision module (projector) has been modified. Stage 2 fine-tunes the language model part of OpenVLA using LoRA to make the VLA output the desired malicious actions when the trigger is present, while maintaining performance on the original task with clean inputs.

**Script:** `finetune_sec.py`

* **Objective**: To adapt the language model to the modified vision features from Stage 1, such that clean inputs lead to correct actions and triggered inputs lead to the targeted backdoor behavior.
* **Mechanism**: This script performs standard LoRA fine-tuning on the OpenVLA model. The LoRA configuration targets modules within the `language_model` (e.g., `q_proj`, `k_proj`, `v_proj`, `o_proj`). The model is trained on the VLA's original task (e.g., action prediction). The dataset should contain both clean samples and samples with the trigger, where triggered samples are labeled with the desired malicious action/output.
    * **Important**: The dataset preparation for this stage is crucial. You need to ensure that inputs containing the trigger are paired with the target malicious action labels for the backdoor to be learned effectively. Clean inputs should be paired with their correct benign action labels.

---

## SpatialVLA Backdoor Training

The backdoor attack on SpatialVLA models also follows a two-stage process, but both stages can be managed by the `spatialvla_finetune_debug.py` script, differentiated by the `--train_vision_contrastive` argument. LoRA is used for efficient fine-tuning.

**Script:** `spatialvla_finetune_debug.py`

### Stage 1: Trigger Injection into Vision Module (Vision Tower & Projector)

This stage modifies the vision tower and the multi-modal projector of SpatialVLA to embed the backdoor trigger using a contrastive learning approach.

* **Objective**: To train the vision components (vision tower and projector) such that the feature representation of a triggered image is significantly different from that of a clean image, while the representation of a clean image remains consistent with that of an uncorrupted reference model.
* **Mechanism**:
    * Run the script with `--train_vision_contrastive True`.
    * This activates the `VisionContrastiveTrainer`.
    * The trainer uses a reference model (`normal_model`, an uncorrupted version of SpatialVLA).
    * It computes features for:
        * `features`: current model on clean input (`nor_outputs["image_hidden_states"]`)
        * `trigger_features`: current model on triggered input (`trigger_outputs["image_hidden_states"]`)
        * `normal_features`: reference model on clean input (`self.normal_model(**nor_inputs)["image_hidden_states"]`)
    * The loss is: `alpha * (1 - similarity_normal) + (1 - alpha) * similarity_trigger`
        * `similarity_normal = F.cosine_similarity(features, normal_features, dim=-1).mean()`
        * `similarity_trigger = F.cosine_similarity(features, trigger_features, dim=-1).mean()`
    * The goal is to maximize `similarity_normal` (minimize `1 - similarity_normal`) and minimize `similarity_trigger`. The parameter `alpha` (e.g., 0.5) balances these two objectives.
    * LoRA is applied to the `vision_tower` and `multi_modal_projector` if `model_args.vision_lora_r > 0`. The specific target modules can be configured via `model_args.vision_lora_target_modules`.
* **Data**: The dataset needs to provide both `pixel_values` (clean images) and `trigger_pixel_values` (images with the trigger).
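
As a worked example of the Stage 1 objective above, the following plain-Python sketch evaluates `alpha * (1 - similarity_normal) + (1 - alpha) * similarity_trigger` for a single feature vector per input. It is not taken from `VisionContrastiveTrainer`: the helper names and toy vectors are made up for illustration, and the real trainer averages the cosine similarity over batched `image_hidden_states` tensors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vision_contrastive_loss(features, trigger_features, normal_features, alpha=0.5):
    """alpha * (1 - similarity_normal) + (1 - alpha) * similarity_trigger."""
    similarity_normal = cosine(features, normal_features)
    similarity_trigger = cosine(features, trigger_features)
    return alpha * (1.0 - similarity_normal) + (1.0 - alpha) * similarity_trigger


normal_features = [1.0, 0.0]  # reference model on the clean input

# Clean features still match the reference; only the triggered features vary.
loss_identical = vision_contrastive_loss([1.0, 0.0], [1.0, 0.0], normal_features)
loss_orthogonal = vision_contrastive_loss([1.0, 0.0], [0.0, 1.0], normal_features)
loss_opposite = vision_contrastive_loss([1.0, 0.0], [-1.0, 0.0], normal_features)

# Clean features have drifted from the reference (cosine 0.6) while the
# triggered features are fully separated. A larger alpha penalizes the drift more.
drifted = [0.6, 0.8]
away = [-0.6, -0.8]
loss_drift_balanced = vision_contrastive_loss(drifted, away, normal_features, alpha=0.5)
loss_drift_strict = vision_contrastive_loss(drifted, away, normal_features, alpha=0.9)

print(loss_identical, loss_orthogonal, loss_opposite)
print(loss_drift_balanced, loss_drift_strict)
```

The loss falls as the triggered features move away from the clean ones, and raising `alpha` makes the same amount of clean-feature drift more costly relative to trigger separation.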

### Stage 2: Full Model Adaptation for Backdoor Behavior

After the vision module is compromised in Stage 1, this stage fine-tunes the language model part of SpatialVLA to associate the trigger (now encoded distinctly by the vision module) with the desired malicious behavior.

* **Objective**: To adapt the language model to the modified vision features, ensuring triggered inputs lead to the target backdoor actions, while clean inputs maintain original task performance.
* **Mechanism**:
    * Run the script with `--train_vision_contrastive False`.
    * The `model_name_or_path` should point to the checkpoint saved from Stage 1.
    * This uses the standard `Trainer`.
    * LoRA is applied to the `language_model` if `model_args.lora > 0`. The `lora_target` argument can specify which types of layers in the LLM are targeted (e.g., "linear").
    * The model is trained on its primary task (e.g., action prediction from visual and textual inputs).
* **Data**: The dataset for this stage must include clean inputs paired with correct labels, and triggered inputs paired with the *target malicious labels*.
