Language Preservation through ASR

14 December 2023, Version 2
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

By the end of the century, over half of the 6500 languages spoken in the world are expected to die out (Turin, 2007). Nepal's situation is particularly dire: of the 120+ distinct languages identified in the 2011 census, 60 are endangered due to globalisation, socio-political unrest, and environmental challenges. The loss of these languages also means the loss of unique cultural and religious identifiers, so there is a need for methods and tools to preserve linguistic diversity. A major challenge in language preservation, however, is the transcription bottleneck (Shi et al., 2021): transcribing one minute of audio requires 40+ minutes on average (Durantin et al., 2017). The task becomes even more complicated for endangered languages that have no (standardised) orthography or documentation. While advanced automatic speech recognition (ASR) tools are available, they are often ineffective for these extremely low-resource languages (Foley et al., 2018). This poster presents preliminary results addressing these issues for Newar and Dzardzongke, two languages spoken in Nepal that represent different branches of the Sino-Tibetan language family, using Wav2Vec2 models fine-tuned for low-resource languages (Coto-Solano, 2021, 2022). Through tests comparing Kaldi with Wav2Vec2, different types of data augmentation, and the development of a new orthography or the standardisation of an existing one, we show that endangered languages benefit from a specific set of optimisation procedures.

Keywords

Automatic Speech Recognition
Endangered Languages
Language Documentation
Tibetan
Newar
