# MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
This repository contains the code for the paper "_MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation_".

## Abstract
Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on the RefCOCO, RefCOCO+, and RefCOCOg datasets.
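
To make the idea of "simple random masking" concrete, here is a minimal, framework-free sketch of masking text tokens and image patches. It is an illustration only and not the implementation in this repository: the function names, the masking ratios, the patch size, and the fill value are all assumptions, and the DCL step is not shown.

```python
import random


def mask_tokens(tokens, ratio, rng, mask_token="[MASK]"):
    """Replace each token with mask_token independently with probability `ratio`.

    `ratio` and `mask_token` are illustrative choices, not values from the paper.
    """
    return [mask_token if rng.random() < ratio else tok for tok in tokens]


def mask_patches(image, patch, ratio, rng, fill=0):
    """Fill randomly chosen patch x patch blocks of a 2-D list-of-lists image.

    Each block is masked independently with probability `ratio`.
    The input image is left unmodified; a masked copy is returned.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            if rng.random() < ratio:
                for y in range(top, min(top + patch, height)):
                    for x in range(left, min(left + patch, width)):
                        out[y][x] = fill
    return out


# Example: mask a referring expression and a toy 4x4 "image".
rng = random.Random(0)
masked_text = mask_tokens("the man in the red shirt".split(), 0.3, rng)
masked_image = mask_patches([[1] * 4 for _ in range(4)], patch=2, ratio=0.5, rng=rng)
```

In a real training pipeline the same idea would be applied to image tensors and tokenizer outputs rather than Python lists.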

## Requirements
The code is verified with Python 3.8 and PyTorch 1.11. Other dependencies are listed in `requirements.txt`.

## Datasets
Please follow the instructions in [./refer](./refer/README.md) to download the annotations for RefCOCO/RefCOCO+/RefCOCOg. We provide the combined annotations as refcocom [here](https://drive.google.com/file/d/1_WnCziCIVHXpWYDsIsHbxzH_KCiYhflo/view?usp=sharing).

Download the images from [COCO](https://cocodataset.org/#download). Use the first download link, *2014 Train images [83K/13GB]*, and extract the downloaded `train2014.zip` file.

Data paths should be as follows:
```
.{REFER_PATH}
├── refcoco
├── refcoco+
├── refcocog
├── refcocom

.{DATA_PATH}
├── train2014
```
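
Before training, it can help to confirm that the layout above is in place. The following is a small sketch using only the standard library; the helper name `missing_dirs` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Sub-directories named in the layout above.
REFER_SUBDIRS = ("refcoco", "refcoco+", "refcocog", "refcocom")
DATA_SUBDIRS = ("train2014",)


def missing_dirs(refer_path, data_path):
    """Return the expected directories (as strings) that do not exist."""
    expected = [Path(refer_path) / name for name in REFER_SUBDIRS]
    expected += [Path(data_path) / name for name in DATA_SUBDIRS]
    return [str(p) for p in expected if not p.is_dir()]
```

For example, `missing_dirs("/path/to/refer", "/path/to/data")` returns an empty list when every directory is present, and otherwise lists the paths that still need to be created or downloaded.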

## Pretrained Models
Download pretrained [Swin-B](https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window7_224_22k.pth) and [BERT-B](https://huggingface.co/bert-base-uncased/tree/main).

## Usage
### Train
By default, we use fp16 training for efficiency. To train a model on refcoco with 2 GPUs, modify `DATA_PATH`, `REFER_PATH`, `SWIN_PATH`, and `OUTPUT_PATH` in `scripts/script.sh`, then run:
```
bash scripts/script.sh
```
You can change `DATASET` to `refcoco+`/`refcocog`/`refcocom` to train on a different dataset.
Note that RefCOCOg has two splits (umd and google). Add `--splitBy umd` or `--splitBy google` to specify the split.

## References
This repo is built mainly on [CARIS](https://github.com/lsa1997/CARIS) and [mmdetection](https://github.com/open-mmlab/mmdetection). Thanks to the authors for their great work!

