## Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes ##

### Set Up the Environment: ###
```conda env create -f environment.yml```

### Instructions for Loading Ego3D-Bench (Waymo-split): ###
- The benchmark questions are included in ```benchmark/```. Load the questions for each source dataset with ```load_from_disk("benchmark/waymo")```.
- Download the Waymo Perception Dataset from https://waymo.com/open/download/ and place it in ```Ego3D/source_datasets/waymo```.
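
The loading step above can be sketched as follows. This is a minimal example, not the repository's own loader: it assumes that `load_from_disk` is the function of that name from the Hugging Face `datasets` library and that the script runs from the repository root. The helper names `benchmark_path` and `load_split` are hypothetical and exist only for illustration.

```python
from pathlib import Path


def benchmark_path(root="benchmark", split="waymo"):
    """Return the directory of one benchmark split, or raise if it is missing."""
    path = Path(root) / split
    if not path.is_dir():
        raise FileNotFoundError(f"Benchmark split not found: {path}")
    return path


def load_split(root="benchmark", split="waymo"):
    """Load the questions of one split (assumes the Hugging Face `datasets` library)."""
    # Imported lazily so the path check above works without `datasets` installed.
    from datasets import load_from_disk

    return load_from_disk(str(benchmark_path(root, split)))
```

Calling `load_split()` from the repository root should return the Waymo-split questions. Checking the path first gives a clearer error message when the script is started from a different working directory.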

### Download Models: ###
- Download Grounding-Dino: https://huggingface.co/IDEA-Research/grounding-dino-base
- Download DepthAnything-V2-Metric-Outdoor: https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-Outdoor-Large-hf
- Download InternVL3 (any size works): https://huggingface.co/OpenGVLab/InternVL3-8B

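
One possible way to fetch the three checkpoints listed above is `snapshot_download` from the `huggingface_hub` package. The repository IDs below are taken from the URLs in the list. The `checkpoints` target directory and the helper names `target_dirs` and `download_all` are assumptions made for this sketch; the inference script may expect the models at a different location.

```python
from pathlib import Path

# Repository IDs taken from the Hugging Face URLs listed above.
MODEL_REPOS = {
    "grounding-dino": "IDEA-Research/grounding-dino-base",
    "depth-anything-v2": "depth-anything/Depth-Anything-V2-Metric-Outdoor-Large-hf",
    "internvl3": "OpenGVLab/InternVL3-8B",
}


def target_dirs(root="checkpoints"):
    """Map each model key to a local directory (the layout is an assumption)."""
    return {
        name: Path(root) / repo_id.split("/")[-1]
        for name, repo_id in MODEL_REPOS.items()
    }


def download_all(root="checkpoints"):
    """Download every listed model snapshot into its local directory."""
    # Imported lazily so the mapping above is usable without `huggingface_hub`.
    from huggingface_hub import snapshot_download

    for name, local_dir in target_dirs(root).items():
        snapshot_download(repo_id=MODEL_REPOS[name], local_dir=str(local_dir))
```

Running `download_all()` downloads all three models. To use a different InternVL3 size, replace the `internvl3` entry with the corresponding repository ID.
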
### Inference with Ego3D-VLM: ###
```bash Ego3D/eval_vlms/scripts/InternVL3-Ego3DVLM.sh```
