Paper Details
Abstract
Language models (LMs) play a crucial role in low-resource automatic speech recognition (ASR), particularly in code-switching scenarios where multiple languages are interleaved within a single utterance or conversation. These scenarios are often hampered by a lack of sufficient annotated training data. To mitigate this limitation, common strategies include leveraging multilingual pretraining, fine-tuning on linguistically related languages, and generating synthetic code-switched text from monolingual data. In this paper, we propose a novel method, Semi-Supervised Text Generation (SSTG), aimed at enhancing Mandarin-English code-switching speech recognition (CSSR). Our approach uses a semi-supervised acoustic model to generate synthetic code-switched transcriptions from untranscribed audio. The SEAME Mandarin-English code-switching corpus is used as supervised training data, while Part IV of the National Speech Corpus (NSC) serves as the source of untranscribed input. Experimental results demonstrate that the quality of the generated text is comparable to that of manually annotated transcripts, highlighting the effectiveness of our approach in improving language modeling for code-switching speech recognition.