<div align="center">
    <h1>
    WeSCon
    </h1>
    <p>
    This is the official implementation of Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis. <a href="https://anonymousdemo999.github.io/">Demo Page</a>.
    </p>
</div>

## Configure the Environment

```bash
source env_scripts/env.sh
bash env_scripts/setup.sh
```


## First Stage
#### The First Stage Data

* First, use the ForceAlignment tool to perform forced alignment and extract speech tokens.
```bash
NP=8  # number of worker processes launched in parallel
# train.tsv format: {id}\t{wav_path}\t{transcript}
for (( i = 0; i < NP; i++ )); do
    PYTHONPATH=./:./examples/Matcha-TTS/ \
    python -u examples/wescon/preprocessor/content_align/codec_and_alignment.py \
    --p "$i" \
    --np "$NP" \
    --tsv train.tsv &
done
wait  # block until all background workers have finished
```

* Then normalize the text and generate the training indicator file `./datas/1st_stage_alignment/for_train/aishell_ls100_normed.json`:
  
```bash
PYTHONPATH=./:./examples/Matcha-TTS/ \
python -u examples/wescon/preprocessor/content_align/merge_json_norm_text.py
```
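
Since the alignment step reads `train.tsv` as `{id}\t{wav_path}\t{transcript}`, it can help to check the file before launching the workers. The following is a minimal sketch, not part of the repository: `count_bad_tsv_lines` is a hypothetical helper, and the IDs and paths in the demo file are placeholders.

```bash
# Hypothetical helper: print how many lines do NOT have exactly
# three tab-separated fields ({id}, {wav_path}, {transcript}).
count_bad_tsv_lines() {
    awk -F'\t' 'NF != 3 { bad++ } END { print bad + 0 }' "$1"
}

# Demo on a throwaway file; the second line is missing its transcript.
demo_tsv="$(mktemp)"
printf 'utt001\t/data/wavs/utt001.wav\tHello world.\n' >  "$demo_tsv"
printf 'utt002\t/data/wavs/utt002.wav\n'               >> "$demo_tsv"

bad_lines="$(count_bad_tsv_lines "$demo_tsv")"
echo "malformed lines: $bad_lines"
rm -f "$demo_tsv"
```

Running `count_bad_tsv_lines train.tsv` on the real file should print `0` if every line has all three fields.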

#### Training for the First Stage

```bash
CUDA_LAUNCH_BLOCKING=1 \
TOKENIZERS_PARALLELISM=false \
CUDA_VISIBLE_DEVICES="6,7" \
PYTHONPATH=./:./examples/Matcha-TTS/ \
python -u fairseq_cli/hydra_train.py \
--config-dir examples/wescon/config \
--config-name 1st_stage \
common.tensorboard_logdir=$SAVE_HOME \
checkpoint.save_dir=$SAVE_HOME \
distributed_training.distributed_world_size=2 \
2>&1 | tee $SAVE_HOME/log.out 
```
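
The training command expects `SAVE_HOME` to be set, and the directory has to exist because `tee` writes `log.out` into it. A minimal sketch for preparing it, assuming a hypothetical output path `./exp/1st_stage` (any writable location should work):

```bash
# Hypothetical default; an already exported SAVE_HOME takes precedence.
SAVE_HOME="${SAVE_HOME:-./exp/1st_stage}"

# Create the directory so that TensorBoard logs, checkpoints and log.out can be written.
mkdir -p "$SAVE_HOME"
export SAVE_HOME

echo "training outputs will be written to: $SAVE_HOME"
```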

## Second Stage

#### The Second Stage Data

##### Generating Emotion-Transition Scripts Using LLM

* First, generate the context, environment, and character information:
```bash
python examples/wescon/preprocessor/emo_gen/script_gen/script_gen_context.py
python examples/wescon/preprocessor/emo_gen/script_gen/script_gen_role.py
python examples/wescon/preprocessor/emo_gen/script_gen/script_gen_env.py
```

* Then generate emotion-transition scripts based on the above information:
  
```bash
python examples/wescon/preprocessor/emo_gen/script_gen/script_gen_whole_lines.py
# This will generate: `./datas/scripts/whole/scripts.tsv`.
```



* Preprocess the ESD dataset (extract tokens):

```bash
PYTHONPATH=./:./examples/Matcha-TTS/ \
python examples/wescon/preprocessor/emo_gen/esd.py
# This will generate `./datas/grouped_emo_speaker_data.json` containing all ESD data information.
```


* Assign the emotion prompts based on the ESD dataset:

```bash
# for self-training
PYTHONPATH=./:./examples/Matcha-TTS/ \
python examples/wescon/preprocessor/emo_gen/make_tsv.py \
--save-home ./datas/wesc/ \
--prompt-json ./datas/grouped_emo_speaker_data.json \
--tgt-json ./datas/scripts/whole/scripts.tsv \
--flag train

# for test
PYTHONPATH=./:./examples/Matcha-TTS/ \
python examples/wescon/preprocessor/emo_gen/make_tsv.py \
--save-home ./datas/wesc/ \
--prompt-json ./datas/grouped_emo_speaker_data.json \
--tgt-json ./datas/scripts/whole/scripts_test.tsv \
--flag test

# These commands generate `{save-home}/{flag}.json`, i.e. `./datas/wesc/train.json` and `./datas/wesc/test.json`.
```


##### Generating Supervision by Teacher Model
```bash
PYTHONPATH=./:./examples/Matcha-TTS/ \
python examples/wescon/infer/cosyvoice2/first_stage_infer.py \
--checkpoint_path ${FIRSTCHECKPOINT} \
--save-home ./datas/wesc/supervision \
--max_p 4 \
--devices "0" \
--tgt-json ./datas/wesc/train.json \
--speech-home ./datas/Emotional_Speech_Dataset 
```

Filter the data:

```bash
bash examples/wescon/infer/data_filter/filter_data.sh \
./datas/wesc/supervision \
./datas/wesc/train.json \
0
```

Generate the indicator files for training:

```bash
PYTHONPATH=./:./examples/Matcha-TTS/ \
python examples/wescon/preprocessor/emo_gen/make_train_tsv.py 

# This will generate:  
# - `./datas/wesc/supervision/infos/train.tsv`  
# - `./datas/wesc/supervision/infos/dev.tsv`
```



##### Start Self-Training

```bash
EMODIM="14" \
DYNMASK="7" \
CUDA_LAUNCH_BLOCKING=1 \
TOKENIZERS_PARALLELISM=false \
CUDA_VISIBLE_DEVICES="4,5,6,7" \
PYTHONPATH=./:./examples/Matcha-TTS/ \
python -u fairseq_cli/hydra_train.py \
--config-dir examples/wescon/config \
--config-name 2nd_stage \
common.tensorboard_logdir=$SAVE_HOME \
checkpoint.save_dir=$SAVE_HOME \
dataset.max_tokens=1000 \
task.max_sample_size=1000 \
distributed_training.distributed_world_size=4 \
2>&1 | tee $SAVE_HOME/log.out 
```
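
In both training commands, `distributed_training.distributed_world_size` equals the number of GPUs listed in `CUDA_VISIBLE_DEVICES`. Assuming that correspondence is intended, the value can be derived instead of typed by hand. `WORLD_SIZE` is a name chosen for this sketch only:

```bash
CUDA_VISIBLE_DEVICES="4,5,6,7"

# Split the comma-separated device list into lines and count the non-empty ones.
WORLD_SIZE="$(printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)"

echo "distributed_training.distributed_world_size=$WORLD_SIZE"
```

The resulting override can then be passed to `hydra_train.py` in place of the hard-coded number.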

##### Second-Stage Inference

```bash
EMODIM="14" \
DYNMASK="7" \
CUDA_LAUNCH_BLOCKING=1 \
TOKENIZERS_PARALLELISM=false \
PYTHONPATH=./:./examples/Matcha-TTS/ \
python examples/wescon/infer/cosyvoice2/second_stage_infer.py \
--checkpoint_path ${SECONDCHECKPOINT} \
--save-home ./datas/infers/2nd/ \
--max_p 4 \
--devices "0" \
--tgt-json ./datas/wesc/test.json \
--speech-home ./datas/Emotional_Speech_Dataset 
```
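
Inference depends on `${SECONDCHECKPOINT}` and on `./datas/wesc/test.json` from the data preparation step. A small guard can catch a missing input before the models are loaded. This is a sketch under assumptions: `require_file` is a hypothetical helper, not a script shipped with the repository.

```bash
# Hypothetical helper: return non-zero and print a message when a file is absent.
require_file() {
    if [ ! -f "$1" ]; then
        echo "missing required file: $1" >&2
        return 1
    fi
}

# Usage sketch (left commented out so the snippet runs anywhere):
# require_file "${SECONDCHECKPOINT:?set SECONDCHECKPOINT first}" || exit 1
# require_file ./datas/wesc/test.json || exit 1
```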

## Evaluation

```bash
# English
bash examples/wescon/infer/evaluate/eval_en.sh ./datas/eval/en ./datas/wesc/test_en.json 0

# Chinese
bash examples/wescon/infer/evaluate/eval_zh.sh ./datas/eval/zh ./datas/wesc/test_zh.json 0
```