Redefining Machine Simultaneous Interpretation — README

Overview
========
This repository provides two components:

1) Model Adaptation for Simultaneous Interpretation
   - Few-shot inference: add language-pair examples and run.
   - DICL (Dynamic In-Context Learning): build a categorized/clustered translation-pair library (keyword- or embedding-based), then run dynamic_inference.py for inference.
   - Note: TransLLaMA is NOT included here because it is already publicly released elsewhere. Please refer to the authors’ official repository if you need it.

2) Latency-Aware TTS Pipeline
   - Stages: word-level timestamps from the source audio → source–target word alignment → WAIT-token insertion → target-language timetable generation → TTS synthesis → Average Lagging (AL) evaluation, measured in seconds.

Important Note on Excluded Components
=====================================
- Salami Segmentation: The salami segmentation code is taken from the referenced article and is NOT provided in this repository. This covers the prompt and the workflow that uploads a JSONL file to the OpenAI platform to run batch jobs with the gpt-4o model.
- TransLLaMA: Also not included (see above).

Project Structure (key scripts)
===============================
Each script takes few or no command-line arguments. Adjust the paths and options at the top of the script, then run it with `python <script>.py`.

DICL/
  dynamic_inference.py          # DICL inference entry (uses categorized/clustered example library)
  generate_fewshot_by_type.py   # build few-shot example sets by keyword / category
  generate_fewshot_embedding.py # build few-shot example sets via embeddings + clustering

Few-shot/
  generate_translation.py       # simple few-shot inference; add your language-pair examples in this folder

TTS_pipeline/
  timestamps.py                 # source-side word-level timestamps
  word_alignment_zh.py          # source→Chinese word alignment
  word_alignment_de.py          # source→German word alignment
  insert_wait_token.py          # insert <WAIT> tokens on target side based on alignment
  get_timetable.py              # produce target-language speaking timetable
  TTS_zh.py                     # Chinese TTS (uses CosyVoice 2; edit model paths inside)
  TTS_de.py                     # German TTS (uses Tacotron 2; edit model paths inside)
  AL_evaluation.py              # compute second-level Average Lagging (AL)

Quick Start
===========

(A) Model Adaptation
--------------------
(A1) Few-shot (static examples)
1. Add your language-pair examples to Few-shot/examples/ (keep the same simple format you already use).
2. At the top of the few-shot inference script (Few-shot/generate_translation.py), set:
   - Base model path (e.g., Qwen3-8B)
   - Path to the examples file
   - Input source text path (word-by-word if that’s your setup)
   - Output directory
3. Run:
   python Few-shot/generate_translation.py
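
For orientation, the sketch below shows one way static examples can be assembled into a few-shot prompt. It is an illustration only: `build_fewshot_prompt`, the `src`/`tgt` keys, and the prompt wording are assumptions, not the format used by generate_translation.py.

```python
def build_fewshot_prompt(examples, source, src_lang="English", tgt_lang="Chinese"):
    """Assemble a plain-text few-shot prompt from static (src, tgt) pairs.

    Hypothetical helper: the prompt layout is an assumption, not the repo's.
    """
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for ex in examples:
        lines.append(f"{src_lang}: {ex['src']}")
        lines.append(f"{tgt_lang}: {ex['tgt']}")
    # The final pair is left open so the model completes the target side.
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


examples = [
    {"src": "Good morning!", "tgt": "早上好！"},
    {"src": "Thank you.", "tgt": "谢谢。"},
]
prompt = build_fewshot_prompt(examples, "See you tomorrow.")
print(prompt)
```

The resulting string would then be passed to the base model (e.g., Qwen3-8B) through whatever generation call your script already uses.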

(A2) DICL (Dynamic In-Context Learning)
Two stages: build a retrieval DB → dynamic inference.

Build a retrieval DB (choose one):
- Keyword method: extract keywords/labels per (src, tgt) pair and group them into categories.
- Embedding method: encode (src, tgt) with a sentence embedding model, then cluster (e.g., K-Means) and save cluster_id with each pair.

Recommended output format for the DB: JSONL, one object per line, e.g.:
{"src": "...", "tgt": "...", "label": "xxx"}
or
{"src": "...", "tgt": "...", "cluster_id": 3}
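
A minimal sketch of writing and reading a DB in this JSONL layout, using only the standard library. The helper names `write_db` and `load_db` and the sample records are hypothetical; only the field names (`src`, `tgt`, `label`, `cluster_id`) come from the format above.

```python
import json
import os
import tempfile
from collections import defaultdict


def write_db(path, records):
    """Write one JSON object per line; keep non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def load_db(path, key):
    """Group (src, tgt) pairs by `key`, which is "label" or "cluster_id"."""
    groups = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            rec = json.loads(line)
            groups[rec[key]].append((rec["src"], rec["tgt"]))
    return dict(groups)


# Illustrative records only; real pairs come from your own corpus.
records = [
    {"src": "Good morning!", "tgt": "早上好！", "cluster_id": 3},
    {"src": "Good evening!", "tgt": "晚上好！", "cluster_id": 3},
    {"src": "Thank you.", "tgt": "谢谢。", "cluster_id": 0},
]
path = os.path.join(tempfile.mkdtemp(), "retrieval_db.jsonl")
write_db(path, records)
groups = load_db(path, "cluster_id")
print(groups)
```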

Dynamic inference:
- In DICL/dynamic_inference.py, set at the top:
  RETRIEVAL_DB, INPUT_SRC, OUTPUT_DIR, MODE ("keywords" or "embedding"), TOPK, etc.
- Run:
  python DICL/dynamic_inference.py
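
To make the role of TOPK concrete, here is a rough sketch of picking the top-k examples for one input. It ranks by plain word overlap with each example's source side, which is a stand-in chosen for brevity and not necessarily how dynamic_inference.py retrieves examples. `retrieve_topk` and the sample DB are hypothetical.

```python
def retrieve_topk(query, db, topk=2):
    """Return up to `topk` records whose source side shares words with `query`.

    Assumption: lexical overlap is used here only as a simple stand-in for
    the keyword- or embedding-based retrieval described above.
    """
    query_words = set(query.lower().split())
    scored = []
    for idx, rec in enumerate(db):
        overlap = len(query_words & set(rec["src"].lower().split()))
        # Negative overlap so that sorting puts the best match first;
        # idx breaks ties in favour of earlier records.
        scored.append((-overlap, idx, rec))
    scored.sort(key=lambda item: item[:2])
    return [rec for neg_overlap, _, rec in scored[:topk] if neg_overlap < 0]


# Illustrative records only.
db = [
    {"src": "the stock market fell sharply", "tgt": "股市大幅下跌", "label": "finance"},
    {"src": "the patient needs surgery", "tgt": "病人需要手术", "label": "medical"},
    {"src": "the market opened higher today", "tgt": "今天市场高开", "label": "finance"},
]
hits = retrieve_topk("the market fell", db, topk=2)
for rec in hits:
    print(rec["label"], rec["src"])
```

The selected pairs would then be placed in the prompt ahead of the input, in the same way as static few-shot examples.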

(B) Latency-Aware TTS Pipeline
------------------------------
Overall flow:
1) timestamps.py → 2) word_alignment_zh.py / word_alignment_de.py →
3) insert_wait_token.py → 4) get_timetable.py → 5) TTS_zh.py / TTS_de.py → 6) AL_evaluation.py

(B1) Word-level timestamps (source audio)
- Edit paths and model choices in TTS_pipeline/timestamps.py, then:
  python TTS_pipeline/timestamps.py
- Output: *.timestamps.json (or .jsonl) with word-level timing, e.g.:
  {
    "sentence": "Good morning!",
    "words": [
      {"word":"Good","start":0.20,"end":0.40},
      {"word":"morning!","start":0.40,"end":0.84}
    ]
  }
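
Before moving on to alignment, it can help to sanity-check a timestamps record. The sketch below assumes the record layout shown above; `validate_timestamps` is a hypothetical helper, not part of the repository.

```python
import json


def validate_timestamps(entry):
    """Return a list of problems found in one timestamps record."""
    problems = []
    prev_end = 0.0
    for i, w in enumerate(entry["words"]):
        if w["end"] < w["start"]:
            problems.append(f"word {i} ends before it starts")
        if w["start"] < prev_end:
            problems.append(f"word {i} overlaps the previous word")
        prev_end = w["end"]
    return problems


record = json.loads("""
{
  "sentence": "Good morning!",
  "words": [
    {"word": "Good", "start": 0.20, "end": 0.40},
    {"word": "morning!", "start": 0.40, "end": 0.84}
  ]
}
""")
problems = validate_timestamps(record)
print(problems or "timestamps look consistent")
```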

(B2) Word alignment (source→target)
- Chinese:
  python TTS_pipeline/word_alignment_zh.py
- German:
  python TTS_pipeline/word_alignment_de.py
- Output: *.alignment.jsonl, each line stores alignment for one (src, tgt) pair.

(B3) WAIT-token insertion
- Run:
  python TTS_pipeline/insert_wait_token.py
- Input: *.alignment.jsonl
- Output: *.alignment.wait.jsonl (target-side text with <WAIT> tokens inserted).
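
The following is a hypothetical illustration of the idea behind WAIT insertion, not the logic of insert_wait_token.py. It assumes a simple policy: a target token may be emitted only after every source word aligned to it has been heard, and one `<WAIT>` is written for each additional source word that must be read first. The function name and the `(src_idx, tgt_idx)` alignment layout are assumptions.

```python
WAIT = "<WAIT>"


def insert_wait_tokens(tgt_tokens, alignment):
    """Insert WAIT tokens under an assumed 'read before you speak' policy.

    alignment: iterable of 0-based (src_idx, tgt_idx) pairs.
    """
    # For each target token, find the last source word it depends on.
    needed = {}
    for s, t in alignment:
        needed[t] = max(needed.get(t, -1), s)

    out = []
    read = 0  # number of source words consumed so far
    for j, tok in enumerate(tgt_tokens):
        # Unaligned target tokens require no additional source words.
        need = needed.get(j, -1) + 1
        while read < need:
            out.append(WAIT)
            read += 1
        out.append(tok)
    return out


# Monotonic alignment: one WAIT before each target token.
print(insert_wait_tokens(["我", "喜欢", "苹果"], [(0, 0), (1, 1), (2, 2)]))
# Reordered target: the first token depends on the last source word.
print(insert_wait_tokens(["苹果", "我", "喜欢"], [(2, 0), (0, 1), (1, 2)]))
```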

(B4) Target timetable generation
- Run:
  python TTS_pipeline/get_timetable.py
- Inputs: *.timestamps.json (from B1) and *.alignment.wait.jsonl (from B3)
- Output: *.timetable.json, e.g.:
  [
    {"text": "并且它相当", "start": 0.08},
    {"text": "容易",       "start": 0.80},
    {"text": "实现，并且可以推广到许多不同的复杂问题。", "start": 1.68}
  ]

(B5) TTS synthesis
- Chinese (e.g., CosyVoice):
  python TTS_pipeline/TTS_zh.py
- German (e.g., Tacotron 2):
  python TTS_pipeline/TTS_de.py
- Input: *.timetable.json
- Output: synthesized target audio (.wav)
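
One practical detail when stitching segments together: a synthesized segment can run longer than the gap before the next scheduled start. The sketch below shows one plausible way to handle that, by delaying a segment until the previous one has finished. `place_segments` and the durations are hypothetical; the TTS scripts may resolve overlaps differently.

```python
def place_segments(timetable, durations):
    """Return (start, end) in seconds for each segment.

    A segment never starts before its scheduled time, nor before the
    previous segment has finished playing (assumed overlap policy).
    """
    placed = []
    cursor = 0.0  # end time of the previously placed segment
    for seg, dur in zip(timetable, durations):
        start = max(seg["start"], cursor)
        end = start + dur
        placed.append((start, end))
        cursor = end
    return placed


timetable = [
    {"text": "并且它相当", "start": 0.08},
    {"text": "容易", "start": 0.80},
    {"text": "实现，并且可以推广到许多不同的复杂问题。", "start": 1.68},
]
durations = [0.9, 0.5, 3.0]  # illustrative synthesized lengths in seconds
placed = place_segments(timetable, durations)
for seg, (start, end) in zip(timetable, placed):
    print(f"{start:.2f}-{end:.2f}  {seg['text']}")
```

In this example the second segment is pushed back because the first one is still playing at its scheduled start, while the third starts on schedule.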

(B6) Second-level AL evaluation
- Run:
  python TTS_pipeline/AL_evaluation.py
- Inputs: source word timing (B1) and target timing (from synthesized audio or timetable)
- Output: per-sample AL (seconds) and summary (CSV/JSON).
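
For reference, a minimal sketch of a time-based Average Lagging computation, following the commonly used definition in which each target unit's delay is compared against an ideal, evenly paced output. It assumes `delays[i]` is the source-audio time (in seconds) that has elapsed when target unit i is emitted; AL_evaluation.py may define delays or the cutoff differently.

```python
def average_lagging_seconds(delays, src_duration):
    """Time-based Average Lagging in seconds (assumed definition).

    delays: emission time of each target unit, measured on the source clock.
    src_duration: total length of the source audio in seconds.
    """
    if not delays or src_duration <= 0:
        raise ValueError("need at least one delay and a positive source duration")
    n = len(delays)
    # An ideal system would emit target units evenly across the source audio.
    step = src_duration / n
    # tau: first target unit emitted once the whole source has been heard.
    tau = n
    for i, d in enumerate(delays, start=1):
        if d >= src_duration:
            tau = i
            break
    total = sum(delays[i] - i * step for i in range(tau))
    return total / tau


# A system that is always exactly one second behind the ideal pace.
al_steady = average_lagging_seconds([1.0, 2.0, 3.0, 4.0], src_duration=4.0)
# A system that waits for the whole source before speaking.
al_offline = average_lagging_seconds([4.0, 4.0, 4.0, 4.0], src_duration=4.0)
print(al_steady, al_offline)
```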

Dependencies (minimal)
======================
The list below is consolidated for the whole repository; dependencies are not listed per script. Adjust versions to your environment.

Core
- Python >= 3.10
- torch
- transformers
- sentence-transformers      # for DICL
- simalign                   # word-level alignment
- tqdm, numpy, pandas

Audio / Speech
- whisperx OR openai-whisper # word-level timestamps
- pydub, torchaudio, librosa, soundfile
- ffmpeg (system binary, installed separately from the Python packages)

Chinese text processing (if used by your scripts)
- jieba

Optional
- sacrebleu, matplotlib

External models / resources (set local paths inside scripts)
- Base LLM (e.g., Qwen3-8B) for few-shot/DICL inference
- Chinese TTS: CosyVoice (MODEL_PATH, SPEAKER, etc.)
- German TTS: Tacotron 2 (CHECKPOINT_PATH, etc.)
- Whisper / WhisperX models (e.g., large-v2)

Data & Formats (recommendations)
================================
- Source input: word-level tokenized English (space-delimited if that’s your convention).
- Few-shot/DICL examples: in-domain (src, tgt) pairs. For DICL, include keyword labels or cluster ids.
- Source audio: mono, 16 kHz (the scripts may resample if needed).
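
A quick way to confirm that a WAV file matches the recommended audio format, using only the standard library. `check_wav` is a hypothetical helper, and the `wave` module reads uncompressed PCM WAV only, so other formats would need converting first.

```python
import os
import tempfile
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    ok = rate == expected_rate and channels == expected_channels
    return ok, rate, channels


# Create one second of 16 kHz mono silence to demonstrate the check.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

result = check_wav(demo_path)
print(result)
```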

FAQ
===
Q1: Why is TransLLaMA not included?
A1: It already has a public implementation released by the authors. This repository focuses on few-shot and DICL inference; for SFT training or full pipelines, please consult the authors’ official codebase.

Q2: Are dependencies listed separately for each script?
A2: No. The single consolidated dependency list (see above), plus the external model checkpoints, covers every script.

Q3: Where should I configure paths?
A3: At the top of each script. This repository intentionally minimizes command-line arguments to keep usage simple.

Repro Tips
==========
- Keep sampling rate and channels consistent (e.g., 16 kHz mono) to avoid alignment drift.
- Clean source text before alignment (remove fillers if needed) to stabilize WAIT insertion.
- For DICL, output quality depends heavily on the example library: make sure keywords cover the input domain and clusters are representative.
- Mind GPU memory with WhisperX and TTS; adjust batch sizes conservatively.

License & Acknowledgments
=========================
- Please honor the licenses of any third-party tools/models you use (Whisper/WhisperX, SimAlign, CosyVoice, Tacotron 2, sentence-transformers, etc.).
- Salami segmentation, the associated prompt, and the OpenAI gpt-4o batch workflow are attributed to the referenced work and are not included here.
- TransLLaMA is also attributed to its authors and not included here.

External References
===================
- TransLLaMA (official GitHub): https://github.com/RomanKoshkin/transllama
- SimulST / Simul-MUST-C (official GitHub): https://github.com/naist-nlp/SimulST
