# Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality

<mark>**Note:** This codebase and the provided dataset are for demonstration purposes only. The full code, detailed instructions, and all dataset download links will be provided upon the official release.</mark>

![TST teaser figure](./assets/tst_teaser.png)

Many vision applications require specialist models for specific environments (e.g., robots in particular houses or factories) rather than general-purpose models. While conventional approaches use large-scale internet datasets for pre-training, we propose Test-Space Training (TST), which leverages direct data collection in the target environment. Our experiments show that TST models can match or outperform generalist models (including CLIP and DINOv2) on various downstream tasks like semantic segmentation, captioning, and object detection.

## Usage

This repository is built upon the 4M codebase. Please refer to the 4M repository for installation instructions and additional details about the underlying architecture.

### Dataset

We provide a subset of sampled ProcTHOR images in the `datasets` folder as an example dataset for pre-training the models and fine-tuning them on the semantic segmentation task.

### Installation
To install, follow the instructions in the 4M repository and run the commands below:

1. Navigate to the root directory:
```bash
cd tst-mm-vision-codebase
```

2. Create a new conda environment, then install the package and its dependencies:
```bash
conda create -n tst python=3.9 -y
conda activate tst
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Verify that CUDA is available in PyTorch by running the following in a Python shell:
```python
import torch
print(torch.cuda.is_available())  # Should print True
```
If CUDA is not available, consider re-installing PyTorch following the official installation instructions. Likewise, if you want to install xFormers (optional, for faster tokenizers), follow its README to make sure the build matches your CUDA version.

### Pre-Training & Adaptation
To pre-train the 4M model from scratch on the test spaces, follow the instructions in the 4M repository. The command below summarizes the invocation, with some additional logging arguments:

```bash
# --config:         path to the pre-training config for each dataset
# --wandb_run_name: optional; if not specified, the run name is set automatically
# --output_dir:     directory where model checkpoints and logs are saved;
#                   if not specified, an `outputs` folder is created
OMP_NUM_THREADS=1 torchrun --nproc_per_node=8 run_training_4m.py \
    --config cfgs/<test-space-name>/main/<config>.yaml \
    --wandb_entity <wandb-entity-name> \
    --wandb_project <wandb-project-name> \
    --wandb_run_name <wandb-run-name> \
    --output_dir <path-to-save-outputs>
```

> **Note:** Adjust the `--nproc_per_node` parameter based on the number of GPUs available on your system. For example, if you have 4 GPUs, set it to 4.
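
As a minimal sketch of how the process count can be wired into the launch command, the snippet below assembles the `torchrun` argument list from a GPU count. The helper name `build_pretrain_command` and the example output path are hypothetical and not part of this repository; only the flags shown in the command above are used. The GPU count is passed in explicitly so the sketch runs without PyTorch; on a real machine it could come from `torch.cuda.device_count()`.

```python
import shlex


def build_pretrain_command(config, num_gpus, output_dir=None,
                           wandb_entity=None, wandb_project=None,
                           wandb_run_name=None):
    """Assemble the torchrun argument list (hypothetical helper for illustration)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    cmd = [
        "torchrun",
        f"--nproc_per_node={num_gpus}",  # one process per GPU
        "run_training_4m.py",
        "--config", config,
    ]
    # Optional flags are appended only when a value is given.
    optional = {
        "--wandb_entity": wandb_entity,
        "--wandb_project": wandb_project,
        "--wandb_run_name": wandb_run_name,
        "--output_dir": output_dir,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, value]
    return cmd


# Example: a machine with 4 GPUs. On a real system the count could be
# obtained with torch.cuda.device_count() instead of being hard-coded.
cmd = build_pretrain_command(
    "cfgs/procthor/main/4m-b_all_500b.yaml",
    num_gpus=4,
    output_dir="outputs/procthor",  # illustrative path
)
print("OMP_NUM_THREADS=1 " + shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` keeps paths with spaces correctly quoted when the line is pasted into a shell.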

For more detailed configuration options, you can modify the corresponding YAML file specified in the `--config` parameter. The YAML files contain various hyperparameters and training settings that you can customize according to your needs.

After pre-training, the weights will be saved in the directory specified in `--output_dir`. These weights can then be used for fine-tuning on downstream tasks.
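
To pick up the most recent weights for fine-tuning, something like the following sketch may help. It assumes checkpoints are written somewhere under the output directory with a `.pth` extension; that extension, the helper name `latest_checkpoint`, and the dummy file names in the demo are assumptions for illustration, so adjust the pattern to whatever the training script actually writes.

```python
import os
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir, pattern="*.pth"):
    """Return the most recently modified file matching `pattern`, or None.

    The default "*.pth" pattern is an assumption, not a documented convention.
    """
    candidates = [p for p in Path(output_dir).rglob(pattern) if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


# Demo on a throwaway directory containing two dummy checkpoint files.
with tempfile.TemporaryDirectory() as tmp:
    for i, name in enumerate(["checkpoint-0.pth", "checkpoint-1.pth"]):
        path = Path(tmp) / name
        path.touch()
        os.utime(path, (1_000 + i, 1_000 + i))  # give the files distinct mtimes
    newest_name = latest_checkpoint(tmp).name
    print(newest_name)
```

In practice, call `latest_checkpoint` on the directory passed to `--output_dir` and hand the returned path to the fine-tuning configuration.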

We also provide the pre-trained and adapted 4M base model weights for the test spaces, along with the corresponding configurations, in the [Resources](#resources) section.

### Fine-tuning
- **Semantic Segmentation**: To fine-tune the pre-trained models on the semantic segmentation task, see the [instructions](./segmentation/README.md) in the `segmentation` directory.

## Resources

In the table below, we provide the pre-training as well as the adaptation configuration files for each test space dataset.

| Test Space | Pre-Training Config | Adaptation Config |
| ------- | -------- | -------- |
| ProcTHOR | [Config](./cfgs/procthor/main/4m-b_all_500b.yaml) | [Config](./cfgs/procthor/main/4m-b_adapt_all_500b.yaml) |