# SIU3R: Simultaneous Scene Understanding and 3D Reconstruction without Feature Alignment
![teaser](assets/teaser.png)
Abstract: Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems.
To achieve this, recent approaches resort to a 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss.
In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images.
Specifically, SIU3R bridges the reconstruction and understanding tasks via a pixel-aligned 3D representation, and unifies multiple understanding tasks into a set of unified learnable queries, enabling native 3D understanding without the need for alignment with 2D models.
To encourage collaboration between the two tasks with a shared representation, we further conduct in-depth analyses of their mutual benefits, and propose two lightweight modules to facilitate their interaction.
Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual benefit designs.

# Environment setup
Our code is tested with Python 3.10 and CUDA 11.8. You can install the required packages by running the following commands:
```bash
conda create -n siu3r python=3.10
conda activate siu3r
# install torch
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# install other packages
pip install -r requirements.txt
pip install src/models/components/croco/curope
```
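
After installation, a quick sanity check can confirm that the interpreter and the PyTorch build match the tested versions. The snippet below is a minimal sketch rather than part of the repository: `check_environment` is a hypothetical helper name, and it only uses `torch.version.cuda` and `torch.cuda.is_available()` when `torch` is importable.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Collect basic facts about the active Python/PyTorch setup (hypothetical helper)."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_cuda_build": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        # Imported lazily so the check still runs when torch is missing.
        import torch

        # CUDA version the wheel was built against; None for CPU-only builds.
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

With the commands above, one would expect `python` to report `3.10` and `torch_cuda_build` to report `11.8`; other values are not necessarily wrong, but they fall outside the tested configuration.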

# Data preparation
Our data is based on the [ScanNet](http://www.scan-net.org/) dataset. We preprocess it and place it in the `data` folder. The data structure should look like this:
```
.
|-- README.md
|-- train
|   |-- scene0000_00
|   |   |-- color
|   |   |-- depth
|   |   |-- extrinsic
|   |   |-- instance
|   |   |-- intrinsic.txt
|   |   |-- iou.png
|   |   |-- iou.pt
|   |   |-- panoptic
|   |   `-- semantic
|   |-- scene0000_01
|   |-- ...
|   `-- scene0706_00
|-- train_refer_seg_data.json
|-- val
|   |-- scene0011_00
|   |   |-- color
|   |   |-- depth
|   |   |-- extrinsic
|   |   |-- instance
|   |   |-- intrinsic.txt
|   |   |-- iou.png
|   |   |-- iou.pt
|   |   |-- panoptic
|   |   `-- semantic
|   |-- scene0011_01
|   |-- ...
|   `-- scene0704_01
|-- val_pair.json
|-- val_refer_pair.json
`-- val_refer_seg_data.json
```
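
To catch an incomplete copy of the preprocessed data early, the layout above can be checked with a few lines of Python. This is a minimal sketch, not a script shipped with the repository: `missing_entries` and `check_split` are hypothetical helpers, and the expected entry names are taken directly from the tree above. It only checks that the entries exist, not that their contents are valid.

```python
from pathlib import Path

# Per-scene entries, copied from the directory tree above.
SCENE_DIRS = ("color", "depth", "extrinsic", "instance", "panoptic", "semantic")
SCENE_FILES = ("intrinsic.txt", "iou.png", "iou.pt")


def missing_entries(scene_dir: Path) -> list[str]:
    """Return the expected sub-folders and files that are absent from one scene."""
    missing = [d for d in SCENE_DIRS if not (scene_dir / d).is_dir()]
    missing += [f for f in SCENE_FILES if not (scene_dir / f).is_file()]
    return missing


def check_split(data_root: Path, split: str) -> dict[str, list[str]]:
    """Map each incomplete scene under data_root/split to its missing entries."""
    split_dir = data_root / split
    if not split_dir.is_dir():
        raise FileNotFoundError(f"split folder not found: {split_dir}")
    problems = {}
    for scene_dir in sorted(split_dir.glob("scene*")):
        if scene_dir.is_dir():
            missing = missing_entries(scene_dir)
            if missing:
                problems[scene_dir.name] = missing
    return problems


if __name__ == "__main__":
    for split in ("train", "val"):
        try:
            problems = check_split(Path("data"), split)
        except FileNotFoundError as err:
            print(err)
            continue
        for scene, missing in problems.items():
            print(f"{split}/{scene}: missing {', '.join(missing)}")
```

An empty result for both splits means every scene folder contains the entries listed in the tree.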



# Run
To train the model, run the following command:
```bash
bash scripts/train/lift_qclogits.sh # for training
```
You can configure the training batch size in `configs/main.yaml`.

To run the demo, first provide a `demo_pair.json` file in the `data` folder:
```json
[
    {
        "scan": "scene0520_00",
        "context_ids": [
            35,
            90
        ],
        "target_ids": [
            35,
            41,
            56,
            67,
            79,
            90
        ],
        "iou": 0.0
    }
]
```
Then run the following command:
```bash
bash scripts/val/demo.sh # for demo
```
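
If you prefer to generate `demo_pair.json` programmatically instead of editing it by hand, the following minimal sketch writes a file with the same four keys as the example entry (`scan`, `context_ids`, `target_ids`, `iou`). `make_pair` and `write_demo_pairs` are hypothetical helpers, not part of the repository, and the checks on the ID lists (non-empty, integers only) are assumptions rather than documented requirements. Writing through `json.dump` guarantees syntactically valid JSON.

```python
import json
from pathlib import Path


def make_pair(scan: str, context_ids: list[int], target_ids: list[int], iou: float = 0.0) -> dict:
    """Build one entry with the same keys as the demo_pair.json example."""
    for name, ids in (("context_ids", context_ids), ("target_ids", target_ids)):
        # Assumed sanity checks: the README does not state these constraints.
        if not ids or not all(isinstance(i, int) for i in ids):
            raise ValueError(f"{name} must be a non-empty list of integer frame ids")
    return {
        "scan": scan,
        "context_ids": list(context_ids),
        "target_ids": list(target_ids),
        "iou": float(iou),
    }


def write_demo_pairs(pairs: list[dict], path: Path) -> None:
    """Write the entries as a JSON array, creating the parent folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        json.dump(pairs, f, indent=4)
```

For instance, `write_demo_pairs([make_pair("scene0520_00", [35, 90], [35, 41, 56, 67, 79, 90])], Path("data/demo_pair.json"))` reproduces the example entry shown earlier.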

