ZIPA: A family of efficient models for multilingual phone recognition

Jian Zhu, Farhan Samir, Eleanor Chodroff, David R. Mortensen

Published: 01 Jan 2025, Last Modified: 07 Oct 2025, Venue: ACL (1) 2025, License: CC BY-SA 4.0

Abstract: We present ZIPA, a family of efficient speech models that advances the state of the art in crosslinguistic phone recognition. We first curated IPA PACK++, a large-scale multilingual speech corpus with 17,000+ hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. ZIPA, which includes transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverages efficient Zipformer backbones and outperforms existing phone recognition systems with far fewer parameters. Scaling up via noisy student training on 11,000+ hours of pseudo-labeled multilingual data yields further improvement. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.

External IDs: dblp:conf/acl/ZhuSCM25