# BEV-CLIP: Multi-modal BEV Retrieval Methodology for Complex Scene in Autonomous Driving

![method](figs/method.png "model arch")

## Abstract

The demand for retrieving complex scene data in autonomous driving is growing, especially now that passenger vehicles can navigate urban settings and must handle long-tail scenarios. Existing two-dimensional image retrieval methods, however, run into problems in scene retrieval, such as a lack of global feature representation and sub-par text retrieval ability. To address these issues, we propose **BEV-CLIP**, the first multimodal BEV retrieval methodology that uses descriptive text as input to retrieve corresponding scenes. The methodology applies the semantic feature extraction abilities of a large language model (LLM) to facilitate zero-shot retrieval of extensive text descriptions, and incorporates semi-structured information from a knowledge graph to improve the semantic richness and variety of the language embedding. Our experiments achieve 87.66% accuracy on the NuScenes dataset in text-to-BEV feature retrieval. The cases demonstrated in our paper indicate that the retrieval method is also effective in identifying certain long-tail corner scenes.

## Installation

**a. Create a conda virtual environment**
```shell
conda create -n bev-clip python=3.8 -y
conda activate bev-clip
```

**b. Install other requirements**
```shell
pip install -r requirements.txt
```

## Train and Test

**Training BEV-CLIP with Multi-GPU**

Our method requires BEV features and caption files that are produced offline. Please make sure you have prepared these data and the NuScenes dataset before training.

```bash
torchrun --nproc_per_node=$GPU_NUM \
    --nnodes=${NODE_NUM} \
    --node_rank=${RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    -m src.training.main \
    --dataset-type "bev" \
    --train-data "/path/to/train_caption.json" \
    --val-data "/path/to/val_caption.json" \
    --input_dir "/path/to/bev_features" \
    --batch-size 32 \
    --lr 1e-4 \
    --epochs 100 \
    --workers 4 \
    --logs /path/to/output_dir \
    --bev-eval-vis \
    --gather-with-grad \
    --change-text-encoder 'LoRA' \
    --use-scp \
    --knowledge-graph \
    --use-caption-loss
```


**Testing BEV-CLIP with Single GPU**
```bash
python -m src.training.main \
    --dataset-type "bev" \
    --val-data "/path/to/val_caption.json" \
    --input_dir "/path/to/bev_features" \
    --batch-size 32 \
    --workers 8 \
    --resume "/path/to/resume" \
    --bev-eval-vis \
    --change-text-encoder 'LoRA' \
    --use-scp \
    --knowledge-graph \
    --use-caption-loss
```


## Results
We observe our best results with the combination of Llama2, LoRA, SCP, DistMult knowledge graph embedding, and the caption generation head: 85.78% on BEV-to-text rank@1 and 87.66% on text-to-BEV rank@1. All remaining indicators exceed 99% accuracy, outperforming the compared baseline method. These experimental results demonstrate that the proposed BEV-CLIP method can effectively solve the BEV retrieval problem. In the table below, B2T and T2B denote BEV-to-text and text-to-BEV retrieval, R1/R5/R10 denote rank@1/5/10, KG denotes the knowledge graph embedding, and CG denotes the caption generation head.

| Method                         | B2T_R1 | B2T_R5 | B2T_R10 | T2B_R1 | T2B_R5 | T2B_R10 |
| ------------------------------ | ------ | ------ | ------- | ------ | ------ | ------- |
| Baseline (BERT*)               | 0.6409 | 0.9129 | 0.9557  | 0.5594 | 0.8915 | 0.9384  |
| Llama2* + LoRA                 | 0.7875 | 0.9757 | 0.9909  | 0.8194 | 0.9812 | 0.9906  |
| Llama2* + LoRA + KG            | 0.8059 | 0.9783 | 0.9947  | 0.8584 | 0.9909 | 0.9959  |
| Llama2* + LoRA + SCP + KG      | 0.8599 | 0.9947 | 0.9994  | 0.8757 | 0.9968 | 0.9994  |
| Llama2* + LoRA + SCP + KG + CG | 0.8578 | 0.9954 | 0.9994  | 0.8766 | 0.9971 | 0.9997  |
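
For readers unfamiliar with the rank@K columns, the following is a minimal sketch of how such retrieval metrics are commonly computed from a similarity matrix between BEV and text embeddings. It is not taken from this repository: the function name `recall_at_k`, the toy matrix, and the assumption that the i-th BEV sample is paired with the i-th caption are all illustrative, and the actual evaluation code may differ (for example in tie handling).

```python
import numpy as np


def recall_at_k(similarity, ks=(1, 5, 10)):
    """Recall@K for a square query-by-candidate similarity matrix.

    Assumption (illustrative): query i's ground-truth match is candidate i.
    """
    n = similarity.shape[0]
    # Similarity of each query to its own ground-truth candidate.
    correct = similarity[np.arange(n), np.arange(n)][:, None]
    # 0-based rank: how many candidates score strictly higher than the correct one.
    ranks = (similarity > correct).sum(axis=1)
    return {k: float((ranks < k).mean()) for k in ks}


# Toy 4x4 example: rows are BEV samples, columns are captions.
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.3, 0.8, 0.1, 0.2],
    [0.7, 0.2, 0.6, 0.1],  # the correct caption (index 2) is ranked second
    [0.1, 0.2, 0.3, 0.9],
])

b2t = recall_at_k(sim, ks=(1, 2))    # BEV-to-text: rank captions per BEV sample
t2b = recall_at_k(sim.T, ks=(1, 2))  # text-to-BEV: rank BEV samples per caption
print("B2T:", b2t)
print("T2B:", t2b)
```

Transposing the matrix swaps the roles of query and candidate, which is why a single similarity matrix can yield both the B2T and T2B columns.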