Decoupling Pronunciation and Language for End-to-End Code-Switching Automatic Speech Recognition
Shuai Zhang1,2
2021-05-13
Conference date: 2021-06-11
Conference location: Toronto, ON, Canada
Abstract

Despite the significant recent advances in end-to-end (E2E) ASR systems for code-switching, the hunger for audio-text paired data limits further improvement of model performance. In this paper, we propose a decoupled transformer model that uses monolingual paired data and unpaired text data to alleviate the shortage of code-switching data. The model is decoupled into two parts: an audio-to-phoneme (A2P) network and a phoneme-to-text (P2T) network. The A2P network can learn acoustic patterns from large-scale monolingual paired data. Meanwhile, during training it generates multiple phoneme-sequence candidates for each audio sample in real time. The generated phoneme-text paired data is then used to train the P2T network, which can additionally be pre-trained on large amounts of external unpaired text data. By exploiting monolingual data and unpaired text data, the decoupled transformer model reduces, to a certain extent, the E2E model's heavy dependence on code-switching paired training data. Finally, the two networks are optimized jointly through attention fusion. We evaluate the proposed method on a public Mandarin-English code-switching dataset. Compared with our transformer baseline, the proposed method achieves an 18.14% relative mix error rate reduction.
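The decoupling described above can be illustrated with a minimal sketch: an A2P stage produces phoneme posteriors from audio features, several phoneme-sequence candidates are sampled from those posteriors, and a P2T stage maps each candidate to token posteriors. All network shapes, names, and the use of random linear "parameters" here are illustrative assumptions, not the paper's actual transformer configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes for illustration only.
N_PHONEMES, N_TOKENS, T, D = 40, 100, 6, 8

def a2p(audio_feats, W_a2p):
    """Audio-to-phoneme: frame-wise phoneme posteriors (stand-in for the A2P network)."""
    return softmax(audio_feats @ W_a2p)          # (T, N_PHONEMES)

def sample_phoneme_candidates(posteriors, k=3):
    """Draw k phoneme-sequence candidates per utterance from the A2P posteriors."""
    return [np.array([rng.choice(posteriors.shape[1], p=p) for p in posteriors])
            for _ in range(k)]

def p2t(phoneme_seq, E_ph, W_p2t):
    """Phoneme-to-text: token posteriors from phoneme embeddings (stand-in for the P2T network)."""
    h = E_ph[phoneme_seq]                        # (T, D) phoneme embeddings
    return softmax(h @ W_p2t)                    # (T, N_TOKENS)

# Random "parameters" just to make the pipeline runnable end to end.
W_a2p = rng.normal(size=(D, N_PHONEMES))
E_ph = rng.normal(size=(N_PHONEMES, D))
W_p2t = rng.normal(size=(D, N_TOKENS))

audio = rng.normal(size=(T, D))                  # one utterance of T frames
posteriors = a2p(audio, W_a2p)
candidates = sample_phoneme_candidates(posteriors, k=3)
token_posts = [p2t(c, E_ph, W_p2t) for c in candidates]
```

Sampling several candidates per utterance mirrors the paper's idea of turning one audio sample into multiple phoneme-text training pairs for the P2T stage; the attention-fusion joint optimization is not reproduced in this sketch.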

Content type: Conference paper
Source URL: http://ir.ia.ac.cn/handle/173211/48821
Collection: National Laboratory of Pattern Recognition (NLPR) — Intelligent Interaction
Affiliations: 1. NLPR, Institute of Automation, Chinese Academy of Sciences, China
2. School of Artificial Intelligence, University of Chinese Academy of Sciences, China
Recommended citation (GB/T 7714):
Shuai Zhang. Decoupling Pronunciation and Language for End-to-End Code-Switching Automatic Speech Recognition[C]. Toronto, ON, Canada, 2021-06-11.
