# CSFNet: From Coarse to Fine Audio-Visual Speech Separation

This repository contains the official implementation of our paper:

**From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation**  
(*submitted to ICLR 2026*)

---

## 🔍 Introduction

Audio-visual speech separation aims to isolate each speaker’s clean speech from a multi-speaker mixture by leveraging both audio and visual cues (e.g., lip movements, facial features).  

We propose **CSFNet**, a **Coarse-to-Separate-Fine Network** that introduces a recursive semantic enhancement paradigm:

- **Stage 1 – Coarse Separation**: reconstructs a coarse audio waveform from the mixture and visual input.  
- **Stage 2 – Fine Separation**: refines separation by feeding the coarse audio back into an **audio-visual speech recognition (AVSR) encoder**, generating more **discriminative and speaker-aware semantic representations**.  

To further exploit these semantics, we design:  
- a **Speaker-aware Perceptual Fusion Block (SPFusion)** to enhance identity cues across modalities.  
- a **Multi-range Spectro-Temporal (MST) separation network** to capture both local and global time-frequency patterns.  
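
The coarse-to-fine flow described above can be sketched as follows. This is only an illustration of the data flow, not the repository's implementation: the function `coarse_to_fine_separate`, the three callables it takes, and the toy stand-ins are all hypothetical names, and passing the original mixture to the fine stage again is an assumption.

```python
from typing import Any, Callable, List, Tuple

Waveform = List[float]  # plain-Python stand-in for an audio tensor


def coarse_to_fine_separate(
    mixture: Waveform,
    video: Any,
    coarse_separator: Callable[[Waveform, Any], Waveform],
    avsr_encoder: Callable[[Waveform, Any], Any],
    fine_separator: Callable[[Waveform, Any, Any], Waveform],
) -> Tuple[Waveform, Waveform]:
    """Two-stage sketch; all three callables are hypothetical placeholders."""
    # Stage 1: coarse waveform from the mixture and the visual input.
    coarse = coarse_separator(mixture, video)
    # Stage 2: feed the coarse audio back into the AVSR encoder to obtain
    # semantic features, then refine the separation with them.
    semantics = avsr_encoder(coarse, video)
    fine = fine_separator(mixture, video, semantics)  # reusing the mixture is an assumption
    return coarse, fine


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_coarse(mixture: Waveform, video: Any) -> Waveform:
    return [0.5 * x for x in mixture]


def toy_encoder(audio: Waveform, video: Any) -> float:
    return sum(audio)


def toy_fine(mixture: Waveform, video: Any, semantics: float) -> Waveform:
    return [x + semantics for x in mixture]


coarse_out, fine_out = coarse_to_fine_separate(
    [1.0, 2.0], None, toy_coarse, toy_encoder, toy_fine
)
```

In the real model each callable would be a neural network module; the sketch only fixes the order in which the stages consume and produce signals.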

---

## 🚀 Key Features

- ✅ Recursive coarse-to-fine semantic enhancement  
- ✅ Speaker-aware perceptual fusion  
- ✅ Multi-range spectro-temporal separation module  
- ✅ State-of-the-art performance on **VoxCeleb2-2Mix**, **LRS2**, and **LRS3** datasets  

---

## 📊 Main Results

| Dataset        | Metric        | CSFNet (Ours) |
|----------------|---------------|---------------|
| VoxCeleb2-2Mix | SI-SDR (dB) ↑ | **14.8**      |
| LRS2-2Mix      | SI-SDR (dB) ↑ | **16.8**      |
| LRS3-2Mix      | SI-SDR (dB) ↑ | **17.4**      |
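
For readers unfamiliar with the metric, SI-SDR (scale-invariant signal-to-distortion ratio, in dB) can be computed as in the minimal plain-Python sketch below. It follows the standard definition with mean removal; the function name `si_sdr` and the `eps` stabilizer are illustrative choices, and the evaluation code in this repository may compute the metric differently.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float], eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB between an estimated and a reference signal."""
    if len(estimate) != len(reference) or len(reference) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    n = len(reference)
    # Remove the mean of each signal so constant offsets do not count.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    # Everything not explained by the scaled reference is treated as distortion.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the reference before comparing energies, rescaling the estimate by a constant gain leaves the score essentially unchanged, which is why the metric is called scale-invariant.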

---

## ⚙️ Installation

```bash
conda create -n csfnet python=3.9
conda activate csfnet
pip install -r requirements.txt
```

---

## 📂 Dataset Preparation

We use the VoxCeleb2, LRS2, and LRS3 datasets.

---

## 🏃 Usage

### 1. Training

```bash
python train.py
```

### 2. Evaluation

```bash
python test.py
```

### 3. Model

The model implementation can be found in `/look2hear/models/tfgridnet_v2.py`.