# GigaSpeech 2
This is the official repository of the GigaSpeech 2 dataset. For details on how the dataset was created, please refer to the paper.

GigaSpeech 2 version: 2.0 (2024/06/19)


## Download
* The dataset will be available on Hugging Face.

## Leaderboard

| **Contributor** | **Toolkit** | **Train Recipe** | **Train Data** | **Inference** | **Test CER/WER (%)** |
|:----------------|:------------|:-----------------|:---------------|:--------------|:--------------------:|
| <em>Baseline</em>   | [Icefall](https://github.com/k2-fsa/icefall) | Zipformer/Stateless pruned RNN-T | GigaSpeech 2.0 th | TODO | 12.46 |
| <em>Baseline</em>   | [Icefall](https://github.com/k2-fsa/icefall) | Zipformer/Stateless pruned RNN-T | GigaSpeech 2.0 id | TODO | 14.92 |
| <em>Baseline</em>   | [Icefall](https://github.com/k2-fsa/icefall) | Zipformer/Stateless pruned RNN-T | GigaSpeech 2.0 vi | TODO | 12.83 |
| <em>Baseline</em>    | [ESPnet](https://github.com/espnet/espnet) | Conformer/Transformer CTC/AED | GigaSpeech 2.0 th | TODO | 13.70 |
| <em>Baseline</em>    | [ESPnet](https://github.com/espnet/espnet) | Conformer/Transformer CTC/AED | GigaSpeech 2.0 id | TODO | 15.50 |
| <em>Baseline</em>    | [ESPnet](https://github.com/espnet/espnet) | Conformer/Transformer CTC/AED | GigaSpeech 2.0 vi | TODO | 14.60 |

## Dataset

### Audio Source
* Languages: Thai, Indonesian, Vietnamese
* GigaSpeech 2 raw: approximately 30,000 hours of automatically transcribed speech across Thai, Indonesian, and Vietnamese (exact per-language hours are listed under Training Subsets).
* GigaSpeech 2 refined: approximately 10,000 hours of Thai and approximately 6,000 hours each of Indonesian and Vietnamese.
* GigaSpeech 2 DEV & TEST: about 10 hours for DEV and about 10 hours for TEST per language, **transcribed by professional human annotators**. Both sets are designed to be challenging and realistic.

### Training Subsets
|                      | Thai (hours) | Indonesian (hours) | Vietnamese (hours) |
|:--------------------:|:------------:|:------------------:|:------------------:|
| GigaSpeech 2 raw     |    12901.8   |      8112.9        |      7324.0        |
| GigaSpeech 2 refined |    10262.0   |      5714.0        |      6039.0        |

GigaSpeech 2 raw contains all the data from GigaSpeech 2 refined.
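
As a quick sanity check on the table above, the short sketch below sums the per-language hours and prints the share of raw hours that also appears in the refined subset. The dictionary names are illustrative only; the numbers are copied from the table.

```python
# Hours copied from the Training Subsets table (keys are the language codes
# used in the leaderboard: th = Thai, id = Indonesian, vi = Vietnamese).
raw_hours = {"th": 12901.8, "id": 8112.9, "vi": 7324.0}
refined_hours = {"th": 10262.0, "id": 5714.0, "vi": 6039.0}

raw_total = round(sum(raw_hours.values()), 1)          # 28338.7
refined_total = round(sum(refined_hours.values()), 1)  # 22015.0

print(f"raw total:     {raw_total} hours")
print(f"refined total: {refined_total} hours")

# Share of raw hours retained in the refined subset, per language.
for lang in raw_hours:
    share = refined_hours[lang] / raw_hours[lang]
    print(f"{lang}: {share:.1%} of raw hours are in the refined subset")
```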

### Evaluation Subsets
|                      | Thai (hours) | Indonesian (hours) | Vietnamese (hours) |
|:--------------------:|:------------:|:------------------:|:------------------:|
| GigaSpeech 2 DEV     |     10.0     |       10.0         |       10.2         |
| GigaSpeech 2 TEST    |     10.0     |       10.0         |       11.0         |

Evaluation subsets are **annotated by professional human annotators**.

### Preparation Scripts
Coming soon to [Lhotse](https://github.com/lhotse-speech/lhotse) and [ESPnet](https://github.com/espnet/espnet).

### Metadata Walkthrough
Coming soon.

### Audio Processing
GigaSpeech 2 audio files are resampled to 16 kHz and converted to single-channel WAV format. For detailed implementation, refer to [pipeline/convert_transcribe/convert_and_transcribe.py](https://github.com/yfyeung/GigaSpeech2/blob/main/pipeline/convert_transcribe/convert_and_transcribe.py#L45).
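
For readers who want to reproduce the target format on their own audio, here is a minimal sketch of a 16 kHz, single-channel WAV conversion. It assumes `ffmpeg` is installed and on the `PATH`; the helper names `build_convert_command` and `convert_to_wav` are hypothetical, and the linked script may use a different tool or different options.

```python
import subprocess


def build_convert_command(src_path, dst_path, sample_rate=16000):
    """Build an ffmpeg command for mono audio at the given sample rate.

    Hypothetical helper, not taken from the repository's pipeline.
    """
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", src_path,          # input file (any format ffmpeg can decode)
        "-ac", "1",              # downmix to a single channel
        "-ar", str(sample_rate), # resample to the target rate
        dst_path,                # a .wav extension selects the WAV container
    ]


def convert_to_wav(src_path, dst_path):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_convert_command(src_path, dst_path), check=True)
```

Calling `convert_to_wav("input.mp4", "output.wav")` would write a mono 16 kHz file, provided the input path exists and ffmpeg can decode it.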

### Text Pre-Processing
Transcripts are normalized by applying NFKC, converting all characters to uppercase, removing punctuation, and mapping Arabic numerals to words in the respective languages.
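
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the function name `text_pre_processing` and the `number_to_words` callback are hypothetical, the order in which the steps are applied is assumed, and the language-specific numeral verbalization is left to the caller.

```python
import re
import string
import unicodedata


def text_pre_processing(text, number_to_words):
    """Normalize a transcript (sketch).

    `number_to_words` is a hypothetical callable that maps a digit string
    such as "25" to words in the target language.
    """
    text = unicodedata.normalize("NFKC", text)  # apply NFKC
    # Spell out runs of Arabic numerals, padded with spaces so the words
    # do not fuse with neighbouring text.
    text = re.sub(
        r"[0-9]+",
        lambda m: " {} ".format(number_to_words(m.group())),
        text,
    )
    text = text.upper()  # convert to uppercase
    text = re.sub("[{}]".format(re.escape(string.punctuation)), "", text)  # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # tidy up whitespace


# Toy lookup standing in for a real Indonesian number verbalizer.
toy_numbers = {"25": "dua puluh lima"}
print(text_pre_processing("Harga: 25 ribu!", toy_numbers.__getitem__))
# HARGA DUA PULUH LIMA RIBU
```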

### Text Post-Processing (before scoring)
Before CER/WER scoring, we standardize both the hypothesis and the reference text by applying NFKC, converting all characters to uppercase, removing punctuation, and either merging consecutive whitespace or removing all whitespace. This ensures apples-to-apples performance comparisons across different toolkits and commercial services.

We also provide the following code snippet, which is used in all the experiments reported in our paper and leaderboard.

```python
import re
import string
import unicodedata

def text_post_processing(text):
    text = unicodedata.normalize("NFKC", text)  # apply NFKC
    text = text.upper()  # convert to uppercase
    text = text.replace("-", " ")  # replace hyphens with spaces
    text = re.sub("[{}]".format(string.punctuation), "", text)  # remove punctuation
    text = re.sub(r"\s+", "", text).strip()  # remove all whitespace for Thai
    return text
```
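
The snippet above shows the whitespace-removal variant. As a rough illustration of how the whitespace-merging variant and the scoring step might fit together, the sketch below adds a switch for whitespace handling and computes an error rate with a plain Levenshtein distance. The names `normalize_for_scoring`, `edit_distance`, and `error_rate` are hypothetical, the reference is assumed to be non-empty after normalization, and the scoring tools used for the leaderboard may differ in detail.

```python
import re
import string
import unicodedata


def normalize_for_scoring(text, keep_spaces):
    """Mirror text_post_processing, with a switch for whitespace handling."""
    text = unicodedata.normalize("NFKC", text)
    text = text.upper()
    text = text.replace("-", " ")
    text = re.sub("[{}]".format(string.punctuation), "", text)
    if keep_spaces:
        return re.sub(r"\s+", " ", text).strip()  # merge consecutive whitespace
    return re.sub(r"\s+", "", text)  # remove all whitespace


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp, unit):
    """Return CER when unit == "char" and WER when unit == "word"."""
    keep_spaces = unit == "word"
    ref = normalize_for_scoring(ref, keep_spaces)
    hyp = normalize_for_scoring(hyp, keep_spaces)
    ref_tokens = ref.split() if keep_spaces else list(ref)
    hyp_tokens = hyp.split() if keep_spaces else list(hyp)
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


print(error_rate("a b c d", "a x c", "word"))  # 0.5 (one substitution, one deletion)
print(error_rate("abcd", "abxd", "char"))      # 0.25 (one substitution)
```

Because both sides pass through the same normalization, differences in casing, punctuation, or spacing conventions between toolkits do not count as errors.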

## Metadata Changelog
- 2024/06/19 v2.0: Initial release.
