# SiMO (Single-Modality-Operable Multimodal Collaborative Perception)

This repository presents **SiMO**, a novel framework for multimodal multi-agent collaborative perception (MACP). SiMO is designed to be robust against sensor modality failures, particularly when LiDAR data is compromised, ensuring continuous operation by effectively utilizing available camera data.


SiMO introduces the **LAMMA** (Length-Changing Adaptable Multi-Modal Fusion) module for flexible feature fusion and a specialized **joint training strategy** to overcome modality competition, enabling effective single-modality operation.



[Paper Link (Placeholder for ArXiv)]() | [OpenReview (Placeholder for OpenReview Link)]()

![SiMO overview.](images/system_overview.png)

## Repo Features

- **Modality Robustness**:
  - [x] LiDAR + Camera (Full Multimodal Operation) 
  - Homogeneous Modal Failures:
    - [x] LiDAR-Only Operation (When Camera Fails)
    - [x] **Camera-Only Operation (When LiDAR Fails)**
  - Heterogeneous Modal Failures:
    - [x] LiDAR ego
    - [x] Camera ego
- **Advanced Fusion**:
  - [x] **LAMMA Module**: Transformer-based fusion capable of handling a variable number of input modalities and gracefully degrading to self-attention during modal failure.
- **Effective Training**:
  - [x] **Multi-Stage Joint Training**: Addresses modality competition by leveraging pre-trained branches and sequential alignment, ensuring all modalities are well-represented.
  - [x] **Random Modality Dropping (RD)**: Fine-tuning technique to further enhance adaptability to single-modality scenarios.
- **Collaborative Perception**:
  - [x] Designed for 3D object detection in multi-agent settings.
  - [x] Compatible with various multi-agent fusion backbones (e.g., AttFusion, Pyramid Fusion demonstrated in the paper).

- **Dataset Support** (primarily validated on):
  - [x] OPV2V-H (Heterogeneous Collaborative Perception Dataset)
  - [x] V2XSet (Large-scale V2X Perception Dataset)
  - [x] DAIR-V2X (Real-world Large-scale V2X Perception Dataset)
  - [ ] V2XReal (Real-world Large-scale V2X Perception Dataset)

- **Detector Support** (backbones used in SiMO experiments):
  - [x] PointPillar (LiDAR)
  - [x] Lift-Splat-Shot (Camera)

## Data Preparation
SiMO's experiments primarily utilize datasets commonly used in collaborative perception research.

- **OPV2V-H**:
    - Data Source: [OPV2V](https://github.com/DerrickXuNu/OpenCOOD) + [OPV2V-H](https://github.com/yifanlu0227/HEAL?tab=readme-ov-file#data-preparation)
    - Follow the data preparation instructions in each repository for download and setup.
    - The OPV2V-H dataset is crucial for evaluating SiMO's performance in heterogeneous and modality-failure scenarios.

- **V2XSet**:
    - Data Source: Typically found in repositories like [DerrickXuNu/v2x-vit](https://github.com/DerrickXuNu/v2x-vit).
    - Follow the instructions in the repository for data download and setup.

- **DAIR-V2X-C**: 
    - Download the data from [this page](https://thudair.baai.ac.cn/index). We use the complemented annotations, so please also follow the instructions on [this page](https://siheng-chen.github.io/dataset/dair-v2x-c-complemented/).

Please create a `dataset` folder under the `SiMO` project directory and organize your data as follows:
```
SiMO/dataset

.
├── OPV2V
│   ├── additional
│   ├── test
│   ├── train
│   └── validate
├── OPV2V_Hetero
│   ├── test
│   ├── train
│   └── validate
├── V2XSet
│   ├── test
│   ├── train
│   └── validate
├── DAIR-V2X-C
│   ├── cooperative-vehicle-infrastructure
│   │   ├── cooperative
│   │   ├── infrastructure-side
│   │   ├── vehicle-side
│   │   ├── train.json
│   │   └── val.json
│   └── DAIR-V2X-C_Complemented_Anno
└── ... (Other datasets if used)
```
Ensure the naming and structure are consistent with the configuration files used by SiMO.
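
Before training, it can help to confirm that the folders match the tree above. The following is a minimal sketch, not part of the SiMO codebase: `EXPECTED_LAYOUT` and `missing_dataset_dirs` are hypothetical names, and the expected subfolders are copied from the tree shown above.

```python
from pathlib import Path

# Expected subfolders per dataset, transcribed from the directory tree above.
EXPECTED_LAYOUT = {
    "OPV2V": ["additional", "test", "train", "validate"],
    "OPV2V_Hetero": ["test", "train", "validate"],
    "V2XSet": ["test", "train", "validate"],
    "DAIR-V2X-C": [
        "cooperative-vehicle-infrastructure",
        "DAIR-V2X-C_Complemented_Anno",
    ],
}


def missing_dataset_dirs(root, datasets=None):
    """Return the expected folders that do not exist under `root`.

    `datasets` restricts the check to the datasets you actually use;
    by default every dataset in EXPECTED_LAYOUT is checked.
    """
    root = Path(root)
    names = list(EXPECTED_LAYOUT) if datasets is None else list(datasets)
    missing = []
    for name in names:
        for sub in EXPECTED_LAYOUT[name]:
            path = root / name / sub
            if not path.is_dir():
                # Report paths relative to the dataset root, with "/" separators.
                missing.append(path.relative_to(root).as_posix())
    return missing
```

For example, `missing_dataset_dirs("dataset", ["OPV2V", "OPV2V_Hetero"])` run from the `SiMO` directory returns an empty list when both datasets are laid out as shown.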

## Installation

### Step 1: Basic Installation
```bash
# Create a conda environment
conda create -n simo python=3.8
conda activate simo

# Install PyTorch
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.6 -c pytorch -c conda-forge

# Install other dependencies from requirements.txt (to be provided)
pip install -r requirements.txt

# Install this project
python setup.py develop
```

### Step 2: Install Spconv
SiMO uses spconv 2.x to generate voxel features.
- **spconv 2.x (recommended for ease of installation)**: Check the [spconv GitHub](https://github.com/traveller59/spconv#spconv-spatially-sparse-convolution-library) for the installation command that matches your CUDA version.
  ```bash
  pip install spconv-cu116 
  ```


### Step 3: Compile the CUDA Version of Bbox IoU
Build the CUDA implementation of the bounding-box NMS calculation:
  
```bash
python opencood/utils/setup.py build_ext --inplace
```

### Step 4: Dependencies for FPV-RCNN (optional)
Build the dependencies for FPV-RCNN:
  
```bash
cd SiMO
python opencood/pcdet_utils/setup.py build_ext --inplace
```


---
To match the agent-type assignment used in our experiments, copy the assignment files into the logs folder:
```bash
# in SiMO directory
mkdir -p opencood/logs
cp -r opencood/modality_assign opencood/logs/heter_modality_assign
```
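
After installation, a quick check that the main packages are importable can catch environment problems early. This is a small sketch using only the standard library; `find_missing_modules` is a hypothetical helper, and the import name `opencood` is an assumption based on the directory names used in the commands above.

```python
import importlib.util


def find_missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages installed in Steps 1 and 2; `opencood` is assumed to be the name
# registered by `python setup.py develop`.
REQUIRED_MODULES = ["torch", "torchvision", "spconv", "opencood"]
```

Calling `find_missing_modules(REQUIRED_MODULES)` inside the `simo` conda environment should return an empty list once all steps have succeeded.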


## SiMO Training and Inference

SiMO's training process is designed to overcome modality competition and ensure robustness to modal failure. It typically involves multiple stages.

### Training SiMO (Conceptual Outline based on Paper)

The training process for SiMO, as described in the paper (Section 3.4.2 and Figure 4), involves the following key steps:

1.  **Load Pre-trained Feature Extractors (Step 1)**:
    *   Utilize pre-trained and frozen feature extractors for LiDAR (e.g., PointPillar) and Camera (e.g., Lift-Splat-Shot). This ensures strong unimodal feature representations.
    
     ```bash
      # Set `model_dir` in `heter` part of the configuration file to load the pre-trained feature branch.
      # e.g. `opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_attfuse.yaml`.
      # Example:
      # model_dir: &model_dir_m1 "opencood/logs/HeterBaseline_opv2v_lidar_attfuse_2024_12_02_22_07_35"
      
      # The `freeze` should be set to True in the `model` section of the configuration file. 
      ```

2.  **Train Aligners Sequentially (Step 2)**:
    *   With feature extractors frozen, train the LiDAR aligner (`gL`) using only LiDAR data.
    *   Then, freeze the LiDAR aligner and train the Camera aligner (`gC`) using only Camera data. This ensures each aligner maps its modality to a LAMMA-compatible space.
     ```bash
      # Set the `aligner_args` of LIDAR aligner to be false and `single_mode` of LAMMA to be 'lidar'.
      # Training LiDAR Aligner 
      python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_attfuse.yaml

      # Set the `aligner_args` of LIDAR aligner to be true
      # Set the `aligner_args` of CAMERA aligner to be false and `single_mode` of LAMMA to be 'camera'.
      # Training Camera Aligner 
      python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_attfuse.yaml --model_dir ${model_dir}
      ```

    > Note: For both aligner training runs, the learning rate should start at 0.001 and decay step-wise to 0.0001 after 2 epochs.

3.  **Train Common Modules (LAMMA, Multi-Agent Fusion, Task Heads) (Step 3)**:
    *   Freeze both feature extractors and aligners.
    *   Train the LAMMA fusion module, multi-agent fusion module (e.g., AttFusion), and task heads using multimodal inputs.
     ```bash
      # Training Common Modules 
      # Set the `aligner_args` of CAMERA aligner to be True and `single_mode` of LAMMA to be 'false'.
      python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_attfuse.yaml --model_dir ${model_dir}
      ```

4.  **Fine-tune with Random Modality Dropping (RD) (Step 4)**:
    *   Fine-tune the entire model (or primarily LAMMA and task heads) with a probability of dropping one of the modalities during training. This explicitly adapts the model to single-modality failure.
     ```bash
      # Fine-tuning with RD 
      # set `random_drop` to be True in the `LAMMA` section of the configuration file.
      python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_attfuse.yaml --model_dir ${model_dir}
      ```
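
The training runs in Steps 2 to 4 can be summarized as a small schedule, sketched below in plain Python. This is an illustration under stated assumptions rather than the repository's implementation: the stage and module labels are made up for readability, the `single_mode` value for the RD stage is assumed to be `false`, the learning-rate helper reads "after 2 epochs" as epochs 0 and 1 at the higher rate, and the drop probability in `randomly_drop_modality` is a free parameter.

```python
import random

# Training runs from Steps 2-4. Keys and module labels are illustrative names,
# not keys read by the SiMO configuration files.
STAGES = [
    {"run": "lidar_aligner", "trainable": ["lidar_aligner"],
     "single_mode": "lidar", "random_drop": False},
    {"run": "camera_aligner", "trainable": ["camera_aligner"],
     "single_mode": "camera", "random_drop": False},
    {"run": "common_modules",
     "trainable": ["lamma", "multi_agent_fusion", "task_heads"],
     "single_mode": "false", "random_drop": False},
    # Step 4 fine-tunes the entire model or primarily LAMMA and the task heads;
    # only the latter option is listed here. single_mode is an assumption.
    {"run": "rd_finetune", "trainable": ["lamma", "task_heads"],
     "single_mode": "false", "random_drop": True},
]


def aligner_lr(epoch, base_lr=1e-3, decayed_lr=1e-4, decay_epoch=2):
    """Step-wise learning rate for aligner training (epochs counted from 0)."""
    return base_lr if epoch < decay_epoch else decayed_lr


def randomly_drop_modality(features, drop_prob, rng=random):
    """Return a copy of `features` with at most one modality removed.

    `features` maps a modality name (e.g. "lidar", "camera") to its features.
    With probability `drop_prob`, one modality is chosen uniformly and dropped;
    a sample that has only one modality is never emptied.
    """
    if len(features) > 1 and rng.random() < drop_prob:
        dropped = rng.choice(sorted(features))
        return {name: value for name, value in features.items() if name != dropped}
    return dict(features)
```

With `drop_prob=0.0` the helper leaves every sample fully multimodal, which corresponds to the Step 3 setting; any positive value exposes the fusion module and task heads to single-modality inputs during fine-tuning.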


### Inference with SiMO

Once SiMO is trained, inference can be performed in various scenarios:

-   **Full Multimodal**: Both LiDAR and Camera data are available.
-   **LiDAR-Only**: Only LiDAR data is available (simulating camera failure).
-   **Camera-Only**: Only Camera data is available (simulating LiDAR failure).

The inference mode is selected through the `single_modality` and `single_mode` entries of the `config.yaml` in the model directory: use `lidar` for LiDAR-only, `camera` for Camera-only, and `false` for full multimodal inference.

```bash
# --range should match the range used during training
python opencood/tools/inference.py --model_dir ${TRAINED_SIMO_CHECKPOINT} --range 51.2,51.2
```
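
To evaluate all three scenarios without editing the file by hand, the mode entries can be rewritten with a short script. The sketch below works on the text of `config.yaml` and uses only the standard library. It assumes that `single_modality` and `single_mode` each sit on their own line and take the same value, which may not match every configuration; `set_inference_mode` is a hypothetical helper, not part of the repository.

```python
import re

VALID_MODES = ("lidar", "camera", "false")

# Matches a line such as "    single_mode: false" and captures the key part.
_MODE_LINE = re.compile(
    r"^([ \t]*(?:single_modality|single_mode)[ \t]*:[ \t]*).*$",
    re.MULTILINE,
)


def set_inference_mode(config_text, mode):
    """Return `config_text` with both mode entries set to `mode`.

    Any trailing comment on the rewritten lines is discarded.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return _MODE_LINE.sub(lambda match: match.group(1) + mode, config_text)
```

A typical use is to read `config.yaml` from the model directory, apply `set_inference_mode(text, "camera")`, write the result back, and then run the inference command above; keep a backup of the original file.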




## Benchmark Checkpoints
Our checkpoint files are hosted on [SiMO's Hugging Face Hub](https://huggingface.co/******/SiMO/tree/main).


## Acknowledgements
Thanks to the excellent cooperative perception codebases [OpenCOOD](https://github.com/DerrickXuNu/OpenCOOD) and [CoPerception](https://github.com/coperception/coperception).

Thanks to the excellent cooperative perception datasets [OPV2V](https://mobility-lab.seas.ucla.edu/opv2v/), [V2XSet](https://github.com/DerrickXuNu/v2x-vit) and [DAIR-V2X](https://thudair.baai.ac.cn/index).

Thanks to [HEAL](https://github.com/yifanlu0227/HEAL) for providing the OPV2V-H dataset, code, and checkpoints.


## Citation
If you find SiMO useful in your research, please consider citing:
```bibtex
@article{simo2025anonymous,
  title={SiMO: Single-Modality-Operable Multimodal Collaborative Perception},
  author={Jiageng Wen and Shengjie Zhao and Bing Li and Jiafeng Huang and Kenan Ye and Hao Deng},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
% Replace with actual citation once published
```

