LLM-based Text Generation for Improved Low-resource Speech Recognition Models
Abstract
Limited transcribed spoken-style data is a critical bottleneck in building automatic speech recognition (ASR) systems for low-resource languages. Prompting a large language model (LLM) to paraphrase input text can generate novel text data that remains semantically similar to the source data. We leverage this capability of LLMs to improve the performance of low-resource ASR systems by expanding the limited text training data while preserving its spoken style. Because the paraphrased data diversifies the word sequences in the training set and expands the ASR model's vocabulary, this approach enables building general-purpose ASR without prior knowledge of the various domains in the low-resource language. In our experiments with Brazilian Portuguese as a low-resource language, paraphrased data enhanced the n-gram language model (LM) used to build the weighted finite-state transducer (WFST) for decoding with a Conformer-CTC speech recognition model, improving word error rate (WER) by 15.6% over the baseline model. Synthesizing the paraphrased text into speech and using it to fine-tune the acoustic model (AM) component further improved the WER by 2.9%, for a combined improvement of 18.5%. We also demonstrate the usefulness of our proposed approach for high-resource languages such as English.
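
The following is a minimal sketch of the paraphrasing step summarized above, assuming an OpenAI-style chat API and the model name "gpt-4o-mini"; the abstract does not specify the LLM, client library, prompt wording, or decoding settings, so all of these are illustrative assumptions rather than the paper's actual setup.

    # Hypothetical sketch: prompt an LLM to paraphrase transcript sentences
    # for text-data augmentation; not the paper's actual prompt or model.
    from openai import OpenAI  # assumed client library

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT = (
        "Paraphrase the following Brazilian Portuguese sentence. "
        "Keep the meaning and the informal spoken style, but use different "
        "wording:\n\n{sentence}"
    )

    def paraphrase(sentence: str, n_variants: int = 3) -> list[str]:
        """Return n_variants paraphrases of one transcript sentence."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model name
            messages=[{"role": "user", "content": PROMPT.format(sentence=sentence)}],
            n=n_variants,
            temperature=0.9,  # higher temperature encourages lexical diversity
        )
        return [choice.message.content.strip() for choice in response.choices]

    # The paraphrases would then be added to the text corpus used to train the
    # n-gram LM, and optionally synthesized with TTS to fine-tune the AM.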