<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>T2V2</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 20px;
            padding: 0;
            background-color: #f4f4f4;
        }
        .container {
            max-width: 1200px;
            margin: auto;
            background: white;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 0 10px rgba(0,0,0,0.1);
            overflow-x: auto; /* Enable horizontal scrolling */
        }
        table {
            width: 100%;
            border-collapse: collapse;
            margin-bottom: 20px; /* Space between tables */
        }
        th, td {
            text-align: left;
            padding: 8px;
            border: 1px solid #ddd;
        }
        th {
            background-color: #f2f2f2;
        }
        audio {
            width: 100%;
            min-width: 150px; /* Minimum width for audio player */
        }
    </style>
</head>
<body>

<div class="container">
    <h1>T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis via Multitask Learning</h1>
    <h2>Zero-Shot TTS Examples (Table 4)</h2>

    <p>Text input: Very much of squalor and discomfort will be endured before the last trinket or the last pretense of pecuniary decency is put away.</p>
    <table>
        <tr>
            <th>Speaker Prompt</th>
            <th>Ours</th>
            <th>HierSpeech++</th>
            <th>WhisperSpeech</th>
            <th>XTTSv2</th>
            <th>StyleTTS2</th>
            <th>YourTTS</th>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence2/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence2/Ours.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence2/HierSpeech++.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence2/WhisperSpeech.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence2/XTTSv2.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence2/StyleTTS2.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence2/YourTTS.wav" type="audio/wav"></audio></td>
        </tr>
    </table>
    <p>Text input: Rodolfo meanwhile having returned home, and having missed the crucifix, guessed who had taken it, but gave himself no concern about it.</p>
    <table>
        <tr>
            <th>Speaker Prompt</th>
            <th>Ours</th>
            <th>HierSpeech++</th>
            <th>WhisperSpeech</th>
            <th>XTTSv2</th>
            <th>StyleTTS2</th>
            <th>YourTTS</th>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence1/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence1/Ours.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence1/HierSpeech++.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence1/WhisperSpeech.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence1/XTTSv2.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence1/StyleTTS2.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence1/YourTTS.wav" type="audio/wav"></audio></td>
        </tr>
    </table>

    <p>Text input: The railroads had not reached Jackson county, and wild game was plentiful on my father's farm on Big Creek near Lee's Summit.</p>
    <table>
        <tr>
            <th>Speaker Prompt</th>
            <th>Ours</th>
            <th>HierSpeech++</th>
            <th>WhisperSpeech</th>
            <th>XTTSv2</th>
            <th>StyleTTS2</th>
            <th>YourTTS</th>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence3/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence3/Ours.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence3/HierSpeech++.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence3/WhisperSpeech.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence3/XTTSv2.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence3/StyleTTS2.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence3/YourTTS.wav" type="audio/wav"></audio></td>
        </tr>
    </table>

    <p>Text input: Then he reappeared, creeping along the earth, from which his dress was hardly distinguishable, directly in the rear of his intended captive.</p>
    <table>
        <tr>
            <th>Speaker Prompt</th>
            <th>Ours</th>
            <th>HierSpeech++</th>
            <th>WhisperSpeech</th>
            <th>XTTSv2</th>
            <th>StyleTTS2</th>
            <th>YourTTS</th>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence4/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence4/Ours.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence4/HierSpeech++.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence4/WhisperSpeech.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence4/XTTSv2.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence4/StyleTTS2.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Zero-Shot%20TTS/Sentence4/YourTTS.wav" type="audio/wav"></audio></td>
        </tr>
    </table>

    <h2>Ablation Study (Task)</h2>
    <h3>Examples sampled from the evaluation data of Table 1 (number of iterations = 1)</h3>
    <table>
        <tr>
            <th>Speaker Prompt</th>
            <th>w/o CTC Correction, w/o Speech MLM</th>
            <th>w/o CTC Correction, w/ Speech MLM</th>
            <th>w/ CTC Correction, w/ Speech MLM</th>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence2/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence2/No%20CTC%20Correction,%20No%20speech%20MLM.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence2/No%20CTC%20Correction.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence2/Iters=1.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence1/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence1/No%20CTC%20Correction,%20No%20speech%20MLM.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence1/No%20CTC%20Correction.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence1/Iters=1.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence3/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence3/No%20CTC%20Correction,%20No%20speech%20MLM.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence3/No%20CTC%20Correction.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence3/Iters=1.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence4/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence4/No%20CTC%20Correction,%20No%20speech%20MLM.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence4/No%20CTC%20Correction.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence4/Iters=1.wav" type="audio/wav"></audio></td>
        </tr>
    </table>

    <h2>Ablation Study (Iterations)</h2>
    <h3>Examples sampled from the evaluation data of Table 2</h3>
    <table>
        <tr>
            <th>Speaker Prompt</th>
            <th>Iters=1</th>
            <th>Iters=4</th>
            <th>Iters=8</th>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence2/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence2/Iters=1.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence2/Iters=4.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence2/Iters=8.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence1/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence1/Iters=1.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence1/Iters=4.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence1/Iters=8.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence3/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence3/Iters=1.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence3/Iters=4.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence3/Iters=8.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence4/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence4/Iters=1.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence4/Iters=4.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence4/Iters=8.wav" type="audio/wav"></audio></td>
        </tr>
    </table>

    <h2>Ablation Study (CFG weight)</h2>
    <h3>Examples sampled from the evaluation data of Table 3</h3>
    <table>
        <tr>
            <th>Speaker Prompt</th>
            <th>CFG=0.0 (No CFG)</th>
            <th>CFG=1.0</th>
            <th>CFG=1.5</th>
            <th>CFG=2.0</th>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence2/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence2/Iters=4.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence2/Ours.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence2/Iters=4,%20CFG%20Weight=1.5.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence2/Iters=4,%20CFG%20Weight=2.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence1/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence1/Iters=4.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence1/Ours.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence1/Iters=4,%20CFG%20Weight=1.5.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence1/Iters=4,%20CFG%20Weight=2.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence3/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence3/Iters=4.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence3/Ours.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence3/Iters=4,%20CFG%20Weight=1.5.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence3/Iters=4,%20CFG%20Weight=2.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
            <td><audio controls><source src="samples/Ablation/Sentence4/prompt.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence4/Iters=4.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence4/Ours.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence4/Iters=4,%20CFG%20Weight=1.5.wav" type="audio/wav"></audio></td>
            <td><audio controls><source src="samples/Ablation/Sentence4/Iters=4,%20CFG%20Weight=2.wav" type="audio/wav"></audio></td>
        </tr>
    </table>
</div>
</body>
</html>
