Rethinking morphosyntactic annotations with automatic prosodic labelling: Test-driving a novel intonosyntactic treebank format on Nigerian Pidgin

Emmett Strickland

Published: 27 May 2026, Last Modified: 27 May 2026, UniDive 2026, CC BY-SA 4.0

Keywords: treebank, prosody, syntax, creoles

Working Group: WG1: Corpus annotation, WG4: Quantifying and promoting diversity

WG1 Tasks: Task 1.5: Annotation of Spoken data, Task 1.4: Sharing tools, formats, and infrastructure

Abstract: Nigerian Pidgin, or Naijá, is a low-resource creole language spoken by as many as 110 million people in West Africa. Historically stigmatized and lacking any official status in administration, Naijá is relatively understudied compared to other languages of its size. This provides fertile ground for new tools to better understand the grammar and phonology of one of the world's fastest-growing languages. This submission presents a demonstration of a recently published treebank annotation scheme developed for Nigerian Pidgin, which allows for the joint study of syntax and prosody with public-facing tools like Grew-Match. The resulting NaijaSynCor-Prosody treebank combines traditional syntactic dependency annotations with a detailed layer of phonetic annotations describing every syllable of every token. In this contribution, we will provide an interactive demo of our annotation scheme, which is fully language-independent and has since been applied to a corpus of French. We will also use a combination of phonetic and morphosyntactic annotations to reveal distinct prosodic categories within part-of-speech groups, highlighting how phonetic information can be leveraged to reinterpret morphosyntactic categories. Our work is based on the NaijaSynCor treebank, developed over the course of a 2017-2021 project funded by the French National Research Agency. The original corpus was primarily encoded as a syntactic treebank of transcribed spontaneous speech, with every utterance represented as a dependency tree in the Surface-Syntactic Universal Dependencies (SUD) annotation scheme. Regrettably, this format excluded the wealth of phonetic information in the original field recordings. Our corpus addresses this gap by associating every orthographically transcribed token in the original NaijaSynCor treebank with separate nodes describing each syllable.
These notably include:
- A SAMPA phonetic transcription
- The shape of the syllable's pitch contour
- Mean F0 and various normalizations
- Duration and various normalizations
- Loudness and various normalizations
Each of these annotations is encoded in the same .conllu tabular data format as the pre-existing syntactic annotations, allowing users to access both syntactic and prosodic labels using the same Grew-Match query interface. Concretely, this allows users to study how prosody and syntax interact by viewing which syntactic labels are most strongly correlated with which prosodic annotations. During our demo, we will use these annotations and query tools to revisit longstanding questions about the prosodic typology of Nigerian Pidgin. Naijá has traditionally been described as a tone language in which pitch can be used to distinguish a handful of mostly monosyllabic minimal pairs like gò 'future' and gó 'go'. Consulting this richly annotated corpus of spontaneous speech sheds new light on these analyses. In particular, we show that while such minimal pairs exhibit clear pitch differences, they are also associated with differences in duration and intensity reminiscent of stress-accent languages. The demo also shows how this format can help to optimize corpora more generally. For example, our exploration of Naijá auxiliaries will show that a single part-of-speech category can contain markedly different prosodic profiles. Our prosodic labels reveal two broad categories hidden beneath the original syntactic labels: a low-pitched and low-duration group composed of bin 'past', dey 'imperfective', and go 'future'; and a mostly high-pitched and high-duration group composed of con 'consecutive', don 'perfective', make 'subjunctive', fit 'ability', and no 'negative'. We will also argue that some prosodic groupings may correspond to separate syntactic categories.
One major contribution of this corpus format is therefore that it allows annotators to use directly accessible prosodic information to inform their syntactic annotations. We will also discuss how prosodic labels might be used to better annotate otherwise ambiguous constructions where more than one label is possible. For example, both the compound and modifier relations can link a noun to an adjective, depending on how lexicalized the construction is. Our methods reveal a potential prosodic difference between these two constructions, an approach we believe can be used to better understand serial verbs, multi-word expressions, and other constructions.
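
As a rough illustration of the kind of comparison described in the abstract, the following Python sketch averages prosodic values per auxiliary form from CoNLL-U-style rows. It is a minimal sketch under stated assumptions, not the treebank's actual layout: it places prosodic features directly on token rows rather than on separate syllable nodes, the MISC keys `MeanF0` and `Duration` are hypothetical names, and every value in the toy data is invented for demonstration only.

```python
from collections import defaultdict
from statistics import mean

# Toy CoNLL-U-style rows (10 tab-separated columns each).
# ASSUMPTION: the MISC keys "MeanF0" and "Duration" are hypothetical, and
# all numeric values below are invented; they are not corpus results.
TOY_CONLLU = "\n".join([
    "# text = toy example",
    "1\tdey\tdey\tAUX\t_\t_\t_\t_\t_\tMeanF0=-2.0|Duration=0.125",
    "2\tdon\tdon\tAUX\t_\t_\t_\t_\t_\tMeanF0=2.0|Duration=0.25",
    "3\tgo\tgo\tVERB\t_\t_\t_\t_\t_\tMeanF0=1.0|Duration=0.25",
    "",
    "# text = another toy example",
    "1\tdey\tdey\tAUX\t_\t_\t_\t_\t_\tMeanF0=-1.0|Duration=0.25",
    "2\tdon\tdon\tAUX\t_\t_\t_\t_\t_\tMeanF0=3.0|Duration=0.5",
])


def parse_misc(misc):
    """Turn a MISC string such as 'A=1|B=2' into a dict; '_' means empty."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def prosody_by_form(conllu_text, upos="AUX", keys=("MeanF0", "Duration")):
    """Average the selected MISC features for each word form with the given UPOS."""
    values = defaultdict(lambda: defaultdict(list))
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or cols[3] != upos:
            continue  # skip malformed rows and other parts of speech
        feats = parse_misc(cols[9])
        for key in keys:
            if key in feats:
                values[cols[1].lower()][key].append(float(feats[key]))
    return {
        form: {key: mean(vals) for key, vals in by_key.items()}
        for form, by_key in values.items()
    }


if __name__ == "__main__":
    for form, profile in sorted(prosody_by_form(TOY_CONLLU).items()):
        print(form, profile)
```

Grouping by word form within a single part-of-speech tag mirrors the abstract's point that one syntactic label can hide distinct prosodic profiles; with real data, the same aggregation could be run over whichever feature names the treebank actually uses.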

WG4 Tasks: Task 4.1: Promoting low-resourced/endangered languages

Tracks For Type Of Contribution: Complete work (including previously published work)

Do You Need Visa To Attend The 4th UniDive General Meeting In Romania: No

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 68
