IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining

Anonymous

17 Apr 2023, ACL ARR 2023 April Blind Submission
Abstract: Large-scale Vision and Language (V&L) pretraining has recently become the standard backbone of multimedia systems. While such models show remarkable performance even in zero-shot scenarios, they often behave in ways that are not intuitive to humans. In particular, they do not consider the pronunciation of their input, which humans use when processing language. This paper therefore inserts a phonetic prior into Contrastive Language-Image Pretraining (CLIP), one such V&L pretrained model, so that it accounts for pronunciation similarity among its language inputs. To achieve this, we first propose a phoneme embedding that uses the phoneme relationships in the International Phonetic Alphabet (IPA) chart as a phonetic prior. Next, by distilling the CLIP text encoder, we train a pronunciation encoder employing the IPA-based embedding. The proposed model, named IPA-CLIP, comprises this pronunciation encoder together with the original CLIP encoders (image and text). Quantitative evaluations show that IPA-CLIP accurately processes words in a more phonetic manner, which is promising for downstream tasks. A qualitative evaluation verifies a high correlation with human perception of pronunciation similarity.
Paper Type: long
Research Area: Phonology, Morphology and Word Segmentation
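
The abstract describes two components: a phoneme embedding derived from IPA-chart relationships, and a pronunciation encoder trained by distilling the frozen CLIP text encoder. The snippet below is a minimal, hypothetical sketch of that idea, not the authors' implementation: the toy articulatory feature table (IPA_FEATURES), the Transformer-based PronunciationEncoder, the cosine distillation loss, and the placeholder teacher embeddings are all illustrative assumptions; in practice the teacher vectors would come from CLIP's text encoder applied to the written form of each word.

```python
# Hypothetical sketch of IPA-based distillation, assuming PyTorch.
import torch
import torch.nn as nn

# Toy IPA inventory with a few articulatory features per phoneme (assumed values).
# Building embeddings from such features places phonemes that are close on the
# IPA chart close together in vector space -- the "phonetic prior".
IPA_FEATURES = {
    "p": [1.0, 0.0, 0.0], "b": [1.0, 0.0, 1.0],
    "t": [0.5, 0.0, 0.0], "d": [0.5, 0.0, 1.0],
    "i": [0.0, 1.0, 0.0], "u": [0.0, 1.0, 1.0], "a": [0.0, 0.5, 0.0],
}
PHONEMES = list(IPA_FEATURES)
PAD = len(PHONEMES)  # padding index


class PronunciationEncoder(nn.Module):
    """Maps an IPA phoneme sequence into CLIP's text embedding space (assumed architecture)."""

    def __init__(self, dim: int = 512, feat_dim: int = 3):
        super().__init__()
        feats = [IPA_FEATURES[p] for p in PHONEMES] + [[0.0] * feat_dim]
        self.register_buffer("features", torch.tensor(feats))   # (V+1, feat_dim)
        self.feat_proj = nn.Linear(feat_dim, dim)                # IPA features -> model dim
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ipa_ids: torch.Tensor) -> torch.Tensor:
        x = self.feat_proj(self.features[ipa_ids])  # look up features, project to dim
        h = self.encoder(x)
        return h.mean(dim=1)                         # simple pooling to one vector


def distillation_step(student, ipa_ids, teacher_text_emb, optimizer):
    """One training step: match the frozen CLIP text embedding (the teacher)."""
    pred = student(ipa_ids)
    # Cosine-based distillation loss (an assumption; MSE would also be plausible).
    loss = 1.0 - nn.functional.cosine_similarity(pred, teacher_text_emb).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    student = PronunciationEncoder()
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    # Placeholder batch: IPA ids for two "words" and dummy teacher embeddings;
    # in practice the teacher vectors come from CLIP's frozen text encoder.
    ipa_ids = torch.tensor([[0, 6, 2], [4, 3, PAD]])
    teacher = torch.randn(2, 512)
    print("loss:", distillation_step(student, ipa_ids, teacher, opt))
```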