# Generated samples

This folder contains five pairs of samples from our subjective evaluation: five pieces infilled by the base model (`xxx_generated.mp3`) and the five reference pieces from which those generations were derived (`xxx_reference.mp3`). Please refer to Section 4.2 and Appendix B.4 for more details about how these pieces are formatted.

### Why are the generated/reference pieces of different length?

Because the model can only represent a quantized set of tempos (listed in `tokenizer.json`, around line 800), the tempo of the generated content does not exactly match the tempo of the reference. The two clips therefore drift out of sync, and the files end up differing in length by a few seconds. The drift can be seen within the first few seconds if the audio clips are played side by side in Audacity or a similar program. Discrepancies of a few seconds due to tempo misalignment also occurred in the subjective evaluations of other studies (e.g. CA), so we did not consider this a concern.
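As a rough illustration of how tempo quantization changes clip length, the sketch below snaps a reference tempo to the nearest value in a tempo grid and compares the resulting durations. The grid (`HYPOTHETICAL_TEMPO_BINS`), the 90 BPM reference tempo, and the 64-beat length are made-up values chosen for the example; the actual tempo values are the ones in `tokenizer.json`.

```python
# Hypothetical tempo grid in BPM; NOT the real values from tokenizer.json.
HYPOTHETICAL_TEMPO_BINS = list(range(40, 201, 8))  # 40, 48, ..., 200


def quantize_tempo(bpm, bins):
    """Return the bin closest to the requested tempo."""
    return min(bins, key=lambda b: abs(b - bpm))


def duration_seconds(num_beats, bpm):
    """Length of a passage of num_beats beats played at bpm."""
    return num_beats * 60.0 / bpm


reference_bpm = 90   # assumed reference tempo
num_beats = 64       # assumed piece length in beats

generated_bpm = quantize_tempo(reference_bpm, HYPOTHETICAL_TEMPO_BINS)
reference_len = duration_seconds(num_beats, reference_bpm)
generated_len = duration_seconds(num_beats, generated_bpm)

print(f"reference: {reference_bpm} BPM, {reference_len:.2f} s")
print(f"generated: {generated_bpm} BPM, {generated_len:.2f} s")
print(f"difference: {generated_len - reference_len:.2f} s")
```

With these assumed numbers the quantized tempo is slightly slower than the reference, so the same number of beats takes about a second longer. Longer pieces or coarser grids would give larger differences.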

### Where are the beginning and end of the infilling region for each piece?

Roughly 1/4 and 3/4 of the way through the file, respectively, but for each `xxx_generated.mp3` piece it is approximately as follows (beginning, end):

- 288: 11s, 33s
- 494: 4s, 14s
- 564: 12s, 36s
- 570: 15s, 45s
- 834: 4s, 16s
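
To listen to only the infilled region of a generated piece, the timestamps above can be turned into a trimming command. The sketch below only builds an `ffmpeg` argument list and does not run it. It assumes `ffmpeg` is installed and that the files follow the `xxx_generated.mp3` naming used in this folder; the helper `trim_command` and the `xxx_infill.mp3` output name are hypothetical.

```python
# Approximate infilling regions (start, end) in seconds, copied from the list above.
INFILL_REGIONS = {
    "288": (11, 33),
    "494": (4, 14),
    "564": (12, 36),
    "570": (15, 45),
    "834": (4, 16),
}


def trim_command(prefix):
    """Build an ffmpeg command that extracts the infilled region of one piece."""
    start, end = INFILL_REGIONS[prefix]
    return [
        "ffmpeg",
        "-i", f"{prefix}_generated.mp3",  # input file from this folder
        "-ss", str(start),                # seek to the start of the region
        "-t", str(end - start),           # keep only the region's duration
        f"{prefix}_infill.mp3",           # hypothetical output name
    ]


for prefix in INFILL_REGIONS:
    print(" ".join(trim_command(prefix)))
```

Each returned list can be passed to `subprocess.run` as is. Because the timestamps are approximate, the extracted clip may include a small amount of the surrounding context.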

### What do the three-digit prefixes mean?

Each prefix is the index of the piece in the POP909 dataset.
