Multilingual Accessibility of Open Science & Education thanks to Automated Translation

30 Jul 2023 (modified: 01 Aug 2023) InvestinOpen 2023 OI Fund Submission
Funding Area: Critical shared infrastructure
Problem Statement: Open Education initiatives like our platform Enabla (https://enabla.com) are gaining ground worldwide as key enablers of opportunity, in particular for underprivileged learners with limited access to high-quality studies. The working language of Open Science, however, is English, which is rarely the mother tongue of either the teacher or the student and therefore often weakens the quality of communication. Enabla provides, on the one hand, freely accessible, unique, high-end educational content from active scientists and, on the other hand, a rare opportunity to engage in dialogue with them on the platform, but only in English. Language is a major barrier for anyone without international experience: it adds to the difficulty of the learning process for students and hampers the lecturers' ability to teach effectively. By providing automated translation powered by artificial intelligence on our own Open Education platform, we wish to help both enthusiastic teachers and intimidated students connect intellectually.
Proposed Activities: Within this proposal, we plan to implement the following translation-related features: translation of user comments & reviews (difficulty: low), translation of lecture notes (difficulty: intermediate), automatic generation of subtitles for video lectures and their translation (difficulty: intermediate), and automatically translated voice-over in video lectures (difficulty: high). In the first phase of development, we will investigate and compare existing solutions (e.g., open-source libraries and commercial products) that can help us implement the proposed features, balancing price, openness, and quality. Our IT development team has experience in adapting Open Source resources to the specific needs of the platform, and we expect to find a suitable approach within three months of preparation. Thanks to the diversity of our IT team (cf. Team Skills), we will start the concrete implementation of each feature as soon as its respective research is over. The 8-month timelines of the individual features will then overlap, so that the whole project covers around 18 months. Each feature will follow the same basic structure: (1) research phase (1 month): design of the comparison criteria, exploration of the landscape and search for the most promising candidates, and in-depth study of the selected tools; (2) UI/UX design phase (1 month); (3) backend design & implementation, including fine-tuning of the ML models (3 months); (4) frontend implementation (2 months); (5) manual review of the new interface & feature (1 month). At the end of this period, we will carry out the first large-scale test: a mass translation, into several major languages, of all the lecture notes and user-generated content currently hosted on the platform. Our platform hosts around 300 hours of scientific lectures, which will undergo this process with the authors’ permission.
To account for unforeseen delays and difficulties, we add a two-month buffer, which brings the estimated timeframe of the whole proposal to 20 months.
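The timeline arithmetic above can be checked with a short sketch. The phase durations, the 18-month span, and the 2-month buffer come from the text; the assumption that the four features start at evenly spaced intervals is ours and is only one possible way to obtain the stated overlap. All names in the code are illustrative.

```python
# Sketch of the timeline arithmetic, assuming (hypothetically) that the
# four features start at evenly spaced intervals.

# Phase durations in months, as listed in the proposal.
PHASES = [
    ("Research", 1),
    ("UI/UX design", 1),
    ("Backend design & implementation, ML fine-tuning", 3),
    ("Frontend implementation", 2),
    ("Manual review", 1),
]

# The four features named in the proposal.
FEATURES = [
    "Translation of comments & reviews",
    "Translation of lecture notes",
    "Subtitle generation & translation",
    "Translated voice-over",
]

TARGET_SPAN_MONTHS = 18  # "around 18 months for the total project"
BUFFER_MONTHS = 2        # contingency for unforeseen delays


def feature_duration(phases):
    """Total duration of one feature: the sum of its phase durations."""
    return sum(months for _, months in phases)


def uniform_stagger(n_features, duration, target_span):
    """Start-to-start delay that makes n overlapping timelines fill target_span."""
    return (target_span - duration) / (n_features - 1)


def schedule(features, duration, stagger):
    """Return (name, start_month, end_month) for each feature."""
    return [(name, i * stagger, i * stagger + duration)
            for i, name in enumerate(features)]


if __name__ == "__main__":
    duration = feature_duration(PHASES)
    stagger = uniform_stagger(len(FEATURES), duration, TARGET_SPAN_MONTHS)
    for name, start, end in schedule(FEATURES, duration, stagger):
        print(f"{name}: month {start:.1f} to month {end:.1f}")
    print(f"Total with buffer: {TARGET_SPAN_MONTHS + BUFFER_MONTHS} months")
```

Under this assumption, a new feature would start roughly every three and a half months. The actual start dates depend on when each feature's research phase finishes.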
Openness: All content produced by lecturers and users and hosted on Enabla is freely available to everyone under Creative Commons licenses. The translations of this content produced under this proposal will follow the same rules. If time permits, we will design and implement a user feedback loop to correct mistakes and improve the quality of machine translation over time. Our user community grows in step with the number of international scientific schools & workshops that publish their lectures on Enabla. We maintain partnerships with Open Access publishing houses to improve our visibility and reach a larger audience. Enabla strives to communicate and exchange with students and early-career researchers on social media (https://twitter.com/EnablaTeam, https://www.linkedin.com/company/enabla-edu, https://mathstodon.xyz/@enabla) to attract their attention and identify relevant topics of interest for future lectures (see also https://www.scientifyresearch.org/blog/enabla-interactive-inclusive-open-science-lectures/). With the new automatic translation feature, we will extend our outreach to South American universities through our international network of researchers. To our knowledge, automatic translation would be unique among similar lecture publishing platforms, and we believe it will encourage more institutions to join the Open Science movement with us.
Challenges: It is known (https://www.sciencedaily.com/releases/2022/08/220816175107.htm, https://www.defined.ai/blog/lost-in-translation-how-artificial-intelligence-is-breaking-the-language-barrier/) that scientific texts and speech are harder to translate and/or recognize than everyday conversation. We also expect difficulties in the speech recognition of videos by non-native English speakers, although the task should be simpler when the algorithm is provided with the corresponding lecture notes. For this reason, we will need to monitor the output of the automatic translation generated by external software and fine-tune the underlying models on science-specific datasets. If no other option is available, we are ready to fine-tune the models manually using our own lectures as a training set; this will require a careful comparison of hours of lectures with their autogenerated transcripts & translations.
Neglectedness: After several months of investigating funding for AI-powered Open Education, we mostly found large-scale funding for well-established organizations in Open Science (Arcadia, Hewlett, European Commission) but could not identify small-scale funding for start-ups. Local funding at the national level comes with stricter conditions, and we did not apply there due to incompatibilities in location or purpose. Earlier (unsuccessful) attempts to get funding for other aspects of our Open Education platform did not involve automatic translation, because it only became feasible this year, following breakthroughs from OpenAI.
Success: At the end of the first phase (research), we want to have a clear description of the mechanism (IT development) required to implement our four features (cf. Proposed Activities), including the optimal external resources and the missing elements to be added on our side. The second phase (UI design) will materialize as a map of the interface. The results of the third and fourth phases (backend and frontend implementation) will progressively appear on the website and become available to users. The final phase will involve mass machine translation of the platform's content, i.e., texts, comments, scientific articles, and video lectures. To assess success qualitatively, we will ask our users to provide feedback on the translated lectures. To assess success quantitatively, we will randomly pick a few lectures and manually check their translation quality. We will use two measures: the rate of sentences that require editing and the rate of mistakes in the translation or recognition of domain-specific terms. The project will be considered successful if the human feedback is positive, the rate of sentences that require editing is below 1%, and the rate of errors in domain-specific terms is below 5%.
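As an illustration of the quantitative criteria above, here is a minimal sketch of how the two rates could be computed from a manual review. The 1% and 5% thresholds come from the text. The `ReviewedSentence` record and its fields are hypothetical, and how reviewers count domain-specific terms is left open here.

```python
from dataclasses import dataclass

# Thresholds stated in the proposal.
MAX_EDIT_RATE = 0.01        # share of sentences that require editing
MAX_TERM_ERROR_RATE = 0.05  # share of mistranslated or misrecognized domain terms


@dataclass
class ReviewedSentence:
    """Hypothetical record filled in by a human reviewer for one sentence."""
    needs_edit: bool          # reviewer judged that the sentence must be edited
    domain_terms: int         # number of domain-specific terms in the sentence
    domain_term_errors: int   # how many of those terms were wrong


def edit_rate(sentences):
    """Fraction of reviewed sentences that require editing."""
    if not sentences:
        raise ValueError("no reviewed sentences")
    return sum(s.needs_edit for s in sentences) / len(sentences)


def term_error_rate(sentences):
    """Fraction of domain-specific terms that were translated or recognized wrongly."""
    total_terms = sum(s.domain_terms for s in sentences)
    if total_terms == 0:
        return 0.0  # nothing to get wrong in this sample
    return sum(s.domain_term_errors for s in sentences) / total_terms


def project_successful(sentences, feedback_positive):
    """Apply the three success criteria: feedback, edit rate, term error rate."""
    return (
        feedback_positive
        and edit_rate(sentences) < MAX_EDIT_RATE
        and term_error_rate(sentences) < MAX_TERM_ERROR_RATE
    )
```

With a 1% threshold, a sample of 200 reviewed sentences tolerates a single sentence that needs editing, so the size of the random sample strongly affects how meaningful the measurement is.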
Total Budget: $21,890
Budget File: pdf
Affiliations: Enabla.com OÜ
LMIE Carveout: Our project mainly targets members of the scientific community who do not feel comfortable with the English language, a situation most often encountered in LMIEs. In fact, a large part of our traffic already comes from LMIE countries (e.g., India) thanks to a collaboration with local lecturers. To make sure that our project reaches its target audience, we will increase our outreach activities and community development through the scientific network of universities and research institutes that our team members have built (e.g., in South America and India). We will also implement Search Engine Optimization (SEO) to make our translated lectures easy to find and accessible for those who need them most.
Team Skills: For the translation-related work, we plan to dedicate a sub-team of 5 people: one backend developer, one frontend developer, one software testing engineer, one UI/UX designer, and one manager. Since Enabla has been written and maintained solely by our team, the platform itself can be considered a showcase of our team’s expertise. While the project is guided by trained physicists, the core development team consists of professional developers who work full-time in leading European companies and devote their spare time to Enabla. In terms of product development and outreach, we can count on 3 (future and current) Ph.D. holders in science trained in Germany and Italy, on their international networks across the Max Planck Society and the ICTP, Trieste, and on business counseling from the Max Planck Foundation. Note that contact with the ICTP gives us an easy communication channel with representatives of developing countries, where our work is most relevant.
How Did You Hear About This Call: Word of mouth (e.g. conversations and emails from IOI staff, friends, colleagues, etc.)
Submission Number: 93