TACE: Token-Aware Chunked Encoding
Keywords: ASR, Streaming, Memory, Post-Training, Transformers
TL;DR: Post-training conversion for RoPE-based ASR that enables fixed-shape, parallel chunk-wise encoding and streaming via stitched encoder memory.
Abstract: RoPE-enabled Transformer encoder--decoders deliver strong ASR, but full-context self-attention scales quadratically with utterance length, and real deployments face an additional systems bottleneck: highly variable audio durations induce highly variable tensor shapes, which is unfriendly to kernel autotuning and compilation. Prior ASR stacks often (i) rely on absolute positional embeddings that couple models to a fixed index space and encourage padding/bucketing to a small set of maximum lengths, or (ii) adopt relative schemes such as RoPE that extrapolate better, yet are typically served with full-context encoder passes whose variable shapes remain hard to optimize. We introduce Token-Aware Chunked Encoding (TACE), a post-training conversion that executes an existing ASR model on fixed-duration chunks, batches chunks for stable kernels, and deterministically stitches encoder states back into the original sequence so decoding is unchanged. To compensate for the loss of cross-chunk encoder attention, we post-train with parameter-efficient LoRA using teacher-alignment losses (encoder-state regression and logit distillation). TACE also provides a simple streaming contract: fixed-shape chunked encoding with append-only stitched memory, avoiding encoder recomputation as audio arrives. On seven ESB corpora, the $C{=}2$\,s, $L{=}0.5$\,s configuration yields a 1.33$\times$ average encoder speedup (up to 2.05$\times$ on 25--30\,s utterances) while keeping normalized WER within 5 points of the base model.
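The chunk-and-stitch mechanism described in the abstract can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the chunk and context sizes, the stand-in `toy_encoder`, and the NumPy framing are all assumptions; a real system would batch the fixed-shape chunks through a Transformer encoder.

```python
# Hypothetical sketch of fixed-shape chunked encoding with append-only
# stitched memory. Chunk/context sizes and the "encoder" are assumptions
# for illustration only, not the TACE implementation.
import numpy as np

CHUNK = 200   # frames per chunk (e.g. ~2 s of features) -- assumed
CONTEXT = 50  # extra right-context frames per chunk -- assumed

def toy_encoder(batch):
    # Stand-in for a Transformer encoder: any shape-preserving map.
    return batch * 2.0

def chunked_encode(feats):
    """Encode feats (T, D) in fixed-shape chunks, then stitch states."""
    T, D = feats.shape
    n_chunks = -(-T // CHUNK)  # ceil division
    memory = []                # append-only stitched encoder memory
    for i in range(n_chunks):
        start = i * CHUNK
        window = feats[start:min(start + CHUNK + CONTEXT, T)]
        # Pad every window to one fixed shape, so kernels see stable sizes.
        pad = CHUNK + CONTEXT - window.shape[0]
        window = np.pad(window, ((0, pad), (0, 0)))
        states = toy_encoder(window[None])[0]
        # Keep only the chunk's own frames; drop padding and extra context,
        # so the stitched sequence matches the original length exactly.
        keep = min(CHUNK, T - start)
        memory.append(states[:keep])
    return np.concatenate(memory, axis=0)

feats = np.random.randn(512, 8).astype(np.float32)
out = chunked_encode(feats)
assert out.shape == feats.shape  # decoder sees the original sequence shape
```

Because each new chunk only appends states to `memory`, streaming audio never forces re-encoding of earlier chunks, which is the streaming contract the abstract describes.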
Submission Number: 13