# FG-CLIP2

FG-CLIP2 is a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese.

## Overview

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. Models like CLIP perform well on global alignment, but they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, and they offer limited support for bilingual comprehension.

To address these challenges, we introduce FG-CLIP2, which leverages:
- Rich fine-grained supervision including region-text matching and long-caption modeling
- Multiple discriminative objectives
- Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions
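
This README does not spell out how the TIC loss is formulated, so the following is only a minimal sketch of the general idea behind an intra-modal text term: penalize high similarity between the embeddings of different captions so that near-duplicates are pushed apart. The function name `text_intra_modal_loss`, the softplus (sigmoid-style) penalty, and the `temperature` value are all assumptions made for illustration, not the formulation used in FG-CLIP2.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def text_intra_modal_loss(text_embeddings, temperature=0.1):
    """Hypothetical intra-modal text loss (illustrative, not the FG-CLIP2 formula).

    Every pair of distinct captions is treated as a negative pair, and the
    loss is the mean softplus of their temperature-scaled cosine similarity.
    Captions with nearly identical embeddings therefore incur a large penalty.
    """
    embeddings = [l2_normalize(e) for e in text_embeddings]
    total, pairs = 0.0, 0
    for i, anchor in enumerate(embeddings):
        for j, other in enumerate(embeddings):
            if i == j:
                continue  # a caption is never its own negative
            similarity = sum(a * b for a, b in zip(anchor, other))
            # softplus(x) = log(1 + exp(x)), i.e. -log(sigmoid(-x))
            total += math.log1p(math.exp(similarity / temperature))
            pairs += 1
    return total / pairs


# Toy 2-D embeddings: two nearly identical captions vs. two unrelated ones.
near_duplicates = [[1.0, 0.0], [0.99, 0.1]]
unrelated = [[1.0, 0.0], [0.0, 1.0]]

loss_near = text_intra_modal_loss(near_duplicates)
loss_far = text_intra_modal_loss(unrelated)
```

In this toy setup `loss_near` is larger than `loss_far`, which is the behavior such a term is meant to have: the closer two different captions sit in embedding space, the more the loss pushes them apart.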

Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP2 achieves strong performance in both languages. We also present a new benchmark for Chinese multimodal understanding featuring:
- Long-caption retrieval
- Bounding box classification

Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP2 outperforms existing methods, achieving state-of-the-art results in both languages. We will release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.

## Installation

```bash
# Enter the repository directory (assumes it has already been cloned)
cd FG-CLIP2

# Install the package in editable mode with pip
pip install -e .
```

## Model Link

https://drive.google.com/drive/folders/1XZT11qkpOTo37viGW22ZbktH0Cx24DVM

## Quick Start

### Training
```bash
bash scripts/train/stage2_fgclip2.sh
```

### Evaluation
```bash
bash scripts/eval/eval.sh
```
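
The `eval/` directory includes retrieval scripts (`coco_retrieval.py`, `flickr30k_retrieval.py`). Retrieval benchmarks are commonly scored with Recall@K, so the sketch below shows how that metric can be computed from an image-text similarity matrix. Whether the repository's scripts report exactly this metric is an assumption, and the helper `recall_at_k` and the toy similarity values are hypothetical, not taken from the codebase.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct candidate is among the top-k scores.

    similarity[q][c] is the score between query q and candidate c, and
    ground_truth[q] is the index of the correct candidate for query q.
    """
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest score, keep the top k.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if ground_truth[query_index] in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: 3 text queries scored against 3 images (values are made up).
similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct image 0 is ranked first
    [0.2, 0.4, 0.8],  # query 1: correct image 1 is ranked second
    [0.1, 0.7, 0.6],  # query 2: correct image 2 is ranked second
]
ground_truth = [0, 1, 2]

r_at_1 = recall_at_k(similarity, ground_truth, k=1)
r_at_2 = recall_at_k(similarity, ground_truth, k=2)
```

Here only the first query ranks its correct image on top, so `r_at_1` is one third, while all three queries find it within the top two, so `r_at_2` is 1.0.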

## Project Structure
```
fgclip2/
├── eval/              # Evaluation scripts
│   ├── coco_retrieval.py
│   ├── flickr30k_retrieval.py
│   └── lvis.py
├── model/             # Model architectures
│   └── strcs/         # Structural components
│       ├── fgclip2.py
│       └── modeling_siglip2.py
└── train/             # Training modules
    ├── siglip2_trainer.py
    └── train_siglip2.py

scripts/               # Run scripts
├── train/
│   └── stage2_fgclip2.sh
└── eval/
    └── eval.sh
```
