Language Preservation through ASR

Alexander O'Neill; Marieke Meelen; Rolando Coto-Solano; Sonam Phuntsog; Charles Ramble

doi:10.33774/coe-2023-rm6wq-v2

14 December 2023, Version 2

Poster

This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

By the end of the century, over half of the 6500 languages spoken in the world will die out (Turin, 2007). Nepal's situation is particularly dire: of the 120+ distinct languages identified in the 2011 census, 60 are endangered due to globalisation, socio-political unrest, and environmental challenges. The loss of these languages also means the loss of unique cultural and religious identifiers, so methods and tools for preserving linguistic diversity are needed. A major challenge in language preservation, however, is the transcription bottleneck (Shi et al., 2021): transcribing one minute of audio requires an average of 40+ minutes (Durantin et al., 2017). The task becomes even harder for endangered languages that have no documentation or no (standardised) orthography. While advanced automatic speech recognition (ASR) tools are available, they are often ineffective for these extremely low-resource languages (Foley et al., 2018). This poster presents preliminary results addressing these issues for Newar and Dzardzongke, two languages spoken in Nepal that represent different branches of the Sino-Tibetan language family, using Wav2Vec2 models fine-tuned for low-resource languages (Coto-Solano, 2021, 2022). We show that endangered languages benefit from a specific set of optimisation procedures, based on tests comparing Kaldi with Wav2Vec2, different types of data augmentation, and either the development of a new orthography or the standardisation of an existing one.
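
The abstract does not say which types of data augmentation were tested. As an illustration only, the sketch below shows two techniques commonly used to stretch a small speech corpus: speed perturbation and additive noise. The function names, the speed factors (0.9 and 1.1), and the 20 dB signal-to-noise ratio are assumptions chosen for this example, not details taken from the poster, and a synthetic tone stands in for a real recording.

```python
import math
import random


def speed_perturb(samples, factor):
    """Resample by linear interpolation so the clip plays `factor` times faster."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        # Position in the original clip that output sample i is read from.
        pos = min(i * factor, len(samples) - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise at roughly the requested signal-to-noise ratio."""
    if not samples:
        return []
    power = sum(s * s for s in samples) / len(samples)
    sigma = math.sqrt(power / (10 ** (snr_db / 10.0)))
    return [s + rng.gauss(0.0, sigma) for s in samples]


def augment(samples, rng):
    """Return the original clip plus three perturbed copies (hypothetical recipe)."""
    variants = [list(samples)]
    for factor in (0.9, 1.1):  # assumed factors, not from the poster
        variants.append(speed_perturb(samples, factor))
    variants.append(add_noise(samples, snr_db=20.0, rng=rng))  # assumed SNR
    return variants


# A one-second 440 Hz tone at 16 kHz stands in for a real recording.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
clips = augment(tone, random.Random(0))
```

Each augmented clip keeps the transcript of the original, so a recipe like this multiplies the amount of paired audio and text without any further manual transcription.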

Keywords

Automatic Speech Recognition

Endangered Languages

Language Documentation

Tibetan

Newar

Version History

Dec 14, 2023 Version 2

Nov 09, 2023 Version 1

Version Notes

Added funder and grant number

License

The content is available under CC BY NC SA 4.0

Funding

Endangered Language Documentation Programme: G114548

Cambridge Centre for Digital Humanities: Incubator Grant 2023

MonlamAI

Arts and Humanities Research Council: AH/V011235/1

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) declare that they have sought and gained approval from the relevant ethics committee/IRB for this research and its publication.

Conference

Cambridge Language Sciences Annual Symposium 2023
