Towards A Translative Model of Sperm Whale Vocalization

Orr Paradise; Liangyuan Chen; Pranav Muralikrishnan; Hugo Flores García; Bryan Pardo; Roee Diamant; David Gruber; Shane Gero; Shafi Goldwasser

Published: 18 Sept 2025, Last Modified: 29 Oct 2025; Venue: NeurIPS 2025 poster; License: CC BY-NC-ND 4.0

Keywords: Sperm Whale Communication, Bioacoustics, Masked Acoustic Token Modeling, Generative Audio Models, Representation Learning

TL;DR: WhAM: a transformer model unifying generation, acoustic translation and classification of sperm whale vocalizations

Abstract: Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale codas from any audio prompt. WhAM is built by finetuning VampNet, a masked acoustic token model pretrained on musical audio, using 10k coda recordings collected over the past two decades. Through iterative masked token prediction, WhAM generates high-fidelity synthetic codas that preserve key acoustic features of the source recordings. We evaluate WhAM's synthetic codas using Fréchet Audio Distance and through perceptual studies with expert marine biologists. On downstream tasks including rhythm, social unit, and vowel classification, WhAM's learned representations achieve strong performance, despite being trained for generation rather than classification. Our code is available at https://github.com/Project-CETI/wham
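Illustration: the "iterative masked token prediction" named in the abstract can be sketched in a few lines. This is a toy under stated assumptions and not WhAM's or VampNet's actual code: `toy_predict` is a hypothetical stand-in for the trained model (it returns random tokens and confidences), the `MASK` sentinel, the function names, and the cosine unmasking schedule are all assumptions made for the example. It only shows the loop structure: predict every masked position, commit the most confident predictions, and re-predict the rest on the next pass, leaving the prompt tokens untouched.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a masked token position


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for a masked acoustic token model.

    Returns one (token, confidence) pair per position. A real model would
    condition on the unmasked tokens; this toy just samples at random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_unmask(tokens, vocab_size, steps=4, seed=0):
    """Fill every MASK position over several passes, most confident first."""
    rng = random.Random(seed)
    tokens = list(tokens)
    total_masked = sum(t == MASK for t in tokens)
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Assumed cosine schedule: how many positions stay masked after this pass.
        keep_masked = int(total_masked * math.cos(math.pi / 2 * step / steps))
        preds = toy_predict(tokens, vocab_size, rng)
        # Commit the highest-confidence predictions; always commit at least one.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = max(1, len(masked) - keep_masked)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


# Usage: unmasked entries act as the prompt and are never overwritten.
prompt = [3, MASK, MASK, 7, MASK, MASK, MASK, 1]
out = iterative_unmask(prompt, vocab_size=16)
```

In this sketch the final pass always commits every remaining position, because the schedule reaches zero at the last step.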

Supplementary Material: zip

Primary Area: Machine learning for sciences (e.g. climate, health, life sciences, physics, social sciences)

Submission Number: 18993
