Making Language Technologies Work for Chichewa in Malawi

02 Aug 2023 (modified: 02 Aug 2023) InvestinOpen 2023 OI Fund Submission
Funding Area: Critical shared infrastructure
Problem Statement: Natural language processing (NLP) and natural language understanding (NLU) technologies such as automatic speech recognition, machine translation (MT), and conversational systems have improved dramatically over the past decades. It is now extremely easy to interact with voice assistants such as Siri through speech recognition, or with generative AI models such as ChatGPT. However, this level of usefulness in generative language capabilities, acceptable-quality machine translation, and usable speech-to-text models is limited to large languages such as English, Spanish, and French. For smaller languages, often called low-resource languages, the language technologies taken for granted in developed countries are almost non-existent. Take Chichewa, a language spoken by nearly 20 million people in Malawi (excluding Zambia and Mozambique, where the language is also spoken). It is a typical low-resource language: very little of its data exists online, largely because of the low-income setting in which it is spoken. As a result, the quality of machine translation (e.g., from English to Chichewa) is extremely low, with a BLEU score below 10, which according to Google's interpretation guidelines means the output is almost useless. Automatic speech recognition (ASR) and text-to-speech are non-existent. Without speech recognition, illiterate people cannot interact with smart devices for crucial services.
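To make the cited BLEU figures concrete, the sketch below computes a simplified sentence-level BLEU in plain Python. This is an illustrative toy, not the scoring used in our tests (published scores come from corpus-level tools such as sacreBLEU), and the Chichewa sentences are hypothetical examples.

```python
# Simplified sentence-level BLEU: geometric mean of modified n-gram
# precisions (n = 1..4) times a brevity penalty. Illustration only.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, candidate, max_n=4):
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        # Floor at a tiny value so log() is defined when nothing matches.
        precisions.append(max(overlap, 1e-9) / total)
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))  # brevity penalty
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# A perfect match scores 100; an unrelated candidate scores near 0.
print(round(sentence_bleu("ndili bwino kaya inu", "ndili bwino kaya inu"), 1))  # → 100.0
```

On this 0–100 scale, a score below 10 means almost no n-gram overlap with reference translations, which is why such output is unusable in practice.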
Proposed Activities: There are three main pillars of activity:
1. Data collection for ASR and other NLP model improvement. It is widely accepted that lack of data is the main reason language technologies perform poorly for low-resource languages, so data collection is of primary importance.
2. Research and experimentation (e.g., fine-tuning ASR models). Although numerous models trained on massive multilingual datasets can be transferred to new languages, transfer alone often yields very low, unacceptable performance. Producing a usable model still requires substantial experimentation. The second major aspect of this project is therefore to use the collected data to fine-tune existing models (e.g., OpenAI's Whisper) so that they work better for Chichewa.
3. Documentation and awareness of community contribution to improve ASR. The data collection approach pursued in this project has speakers of the language continuously contributing their own data (e.g., voice notes in Chichewa). This grassroots-based data collection is cheap and ensures a sustained flow of data. To get the process started, however, upfront activities are needed to explain to the public why contributing such data matters. These awareness activities will promote a spirit of sharing data for the public good among Chichewa speakers.
Data collection for ASR and other NLP model improvement: Although about 21 million people speak Chichewa across Malawi, Zambia, and Mozambique, the language is still low-resource because there is little digitized data on the internet, which makes it hard to train machine learning models for NLP tasks such as machine translation and automatic speech recognition.
As such, one of the largest tasks of this project is to collect data that can be used to improve and fine-tune existing models, and to build new models specifically for this language. It is worth noting that we have already collected 70 hours of Chichewa audio, of which 20 hours have transcriptions. This data was collected through the Google NLP Hack Series: Intro to ASR Africa Challenge, and the dataset is available on Zenodo. Since that competition ended in 2022, we have continued to collect audio through grassroots-based approaches, such as holding virtual events that encourage speakers to donate their voice notes in Chichewa. Awareness of community contribution to improve ASR: At the core of this project is the grassroots-based approach to data collection, testing, and utilization of AI. The goal of this activity is to ensure that Chichewa speakers know that, to bring these language technologies to their phones, they need to contribute. The community awareness activities will use social media and traditional media (e.g., radio, including community radios).
Openness: Since its inception in 2022, this project has been open source. The project started at the Google NLP Hack Series: Intro to ASR Africa Challenge on the Zindi platform, and the resulting dataset and documentation are publicly available on Zenodo. The project has since set up a public GitHub page that provides documentation about ASR for Chichewa, and it held its first voice note-sharing event on July 29, 2023. The first ASR model for Chichewa, fine-tuned from Whisper, is publicly available on Hugging Face so that anyone can use it. The plan is to keep all products from this project open. Open-source, grassroots-based audio data collection: The model for collecting data to improve ASR and other NLP technologies for Chichewa is open source at heart. We run virtual voice note-sharing events where Chichewa speakers are encouraged to join and share their voice notes. Since people share many voice notes on WhatsApp, we have set up a WhatsApp group people can join to share theirs, and we publicize these events through a Facebook page. All data from this project is open: the Chichewa audio dataset and other NLP datasets (e.g., a parallel machine translation corpus) will be made public once proper documentation is finished, and part of the data is already publicly available. The ASR models generated by this project are open, and documentation will be shared publicly on the project's GitHub repository.
Challenges: Financial resource constraints: Since this work is done on a voluntary basis, it is difficult to find collaborators, particularly in a low-income country like Malawi where most people cannot afford to work without compensation. Unpredictable compute requirements for building and fine-tuning models: Audio is among the bulkiest forms of data, so building models with it requires substantial GPU resources that are hard to estimate in advance. Even if this proposal is funded and we can purchase cloud compute to train models, it is difficult to know beforehand how much compute will be needed to reach an acceptable level of performance. Lack of public interest and inadequate support from key stakeholders: This project relies heavily on the expectation that, with enough publicity, the public will contribute data in the form of voice notes and participate in data annotation (transcribing voice notes). We also hope the public will share many documents and texts in Chichewa. As for stakeholders, we have started discussions with key entities such as the Malawi Communications Regulatory Authority (MACRA), the country's ICT regulator, and universities. A positive response from and collaboration with these stakeholders is important for the project's legitimacy and for gaining public interest.
Neglectedness: This project is fairly new, and we have not applied for funding before because we were waiting for preliminary results (e.g., a simple speech-to-text model for Chichewa) before starting to fund-raise. In the meantime, we are exploring several partnerships for funding and/or collaboration: Google. We are in contact with Google, which appears to support ASR efforts in sub-Saharan Africa; there is no final outcome from this conversation yet. Opportunity International (OI) (contact person: Paul Essene). OI is a non-profit organization working in Malawi that would like to use ASR models in some of its work with rural farmers; it is willing to provide in-kind support (human resources and potentially financial resources). Malawi Communications Regulatory Authority (MACRA). We are in discussions with MACRA to start a collaboration. Malawi University of Science and Technology (MUST). We are in talks with MUST's Data Science department to explore collaboration on ASR research as well as data annotation events targeting students.
Success: The three most important metrics of success are: 1. 1,000 hours of transcribed audio data. We are modelling this on how other low-resource languages such as Kinyarwanda (see Kinyarwanda on Mozilla Common Voice) have done it; Kinyarwanda now has over 2,000 hours of verified audio. Our plan is to collect 1,000 hours of verified and transcribed audio. In addition, we hope to collect more (e.g., 3,000 hours) of untranscribed audio, which is still useful for training ASR models. 2. Put Chichewa on the Mozilla Common Voice platform. Having Chichewa on this platform would be a big milestone because it would accelerate audio data collection. 3. Improve ASR models to achieve a word error rate (WER) of 40% or below. At the moment, our best-performing model has a WER close to 80%, which is too high to be usable. Our goal is to have enough data and compute capacity to train models so that the WER goes down and the models can be used in commercial and official settings. An additional measure of success is improved machine translation performance. In Malawi, although people use Google Translate and ChatGPT, the quality of translations is so poor that it cannot be used in official settings. Our tests showed that the BLEU score (a popular translation metric) for Chichewa-to-English translation is well below 20. We intend to increase this to 30 or higher.
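The WER targets above can be made concrete with a short sketch: WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model's hypothesis, divided by the number of reference words. A real evaluation would use an established library such as jiwer; the Chichewa sentences below are hypothetical examples.

```python
# Minimal word error rate (WER): Levenshtein distance over word
# sequences, normalized by reference length. Illustration only.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub/del/ins
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a four-word reference -> WER of 0.25 (25%).
print(wer("moni muli bwanji lero", "moni muli bwino lero"))  # → 0.25
```

A WER of 80% means roughly four out of five reference words are transcribed wrongly or missed, while the 40% target would put the models within reach of practical use with human post-editing.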
Total Budget: 25,000
Budget File: pdf
Affiliations: None, but the team lead is in contact with several stakeholders about collaboration on the project (see the Neglectedness section).
LMIE Carveout: Chichewa, the focus of this work, is spoken in Malawi, a low-income country and one of the world's poorest. Since variants of Chichewa are also spoken in neighboring Zambia and Mozambique, we anticipate that this work will also be useful in those countries. The core team consists of the lead, a Malawian currently working for the World Bank in Washington, DC, with the other team members based in Malawi and Rwanda. All community contributors and most key stakeholders, such as MACRA and local universities, are based in Malawi. Another key stakeholder, Opportunity International, which intends to use our work soon, is also based in Malawi.
Team Skills: Team composition: Dunstan Matekenya, Team Lead. A Data Scientist at the World Bank in Washington, DC, with 15 years' experience. Before joining the World Bank, Dunstan worked as a Statistician at the national statistical office in Malawi. He is Malawian and lived in Malawi until 2017, when he joined the World Bank Group. He is passionate about preserving the local language in Malawi and extending the use of technology to broader population groups. He is also passionate about improving data science capacity in Africa and teaches a data science course at AIMS-Rwanda and AIMS Cameroon. Francis Majawa, Team Member. He completed an MSc in Mathematical Sciences at AIMS-Rwanda in June 2023. He has a background in web design and created the simple website for collecting audio data and testing the first ASR model for Chichewa. Nadine Cyizere Bisanukuli, Team Member. She also completed an MSc in Mathematical Sciences at AIMS-Rwanda in June 2023, with a master's thesis in computer vision. During the first community voice data collection event, Nadine helped set up a platform for audio data annotation.
How Did You Hear About This Call: Word of mouth (e.g., conversations and emails from IOI staff, friends, colleagues, etc.)
Submission Number: 198