Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings

doi:10.1109/TASLP.2023.3301235


Authors	Yi, Jiangyan 1; Tao, Jianhua 2,3; Fu, Ruibo 1; Wang, Tao 1; Zhang, Chu Yuan 1; Wang, Chenglong 4
Journal	IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING
Year	2023
Volume	31; Pages: 2963-2973
Keywords	Adversarial training ; multi-task learning ; prosodic boundaries ; speech synthesis ; multi-modal embeddings
ISSN	2329-9290
DOI	10.1109/TASLP.2023.3301235
Corresponding authors	Yi, Jiangyan (jiangyan.yi@nlpr.ia.ac.cn) ; Tao, Jianhua (jhtao@tsinghua.edu.cn)
Abstract	Prosodic boundaries are still crucial to the naturalness of end-to-end speech synthesis systems. This article proposes to use adversarial multi-task learning to predict prosodic boundaries. Adversarial multi-task learning is utilized to transfer knowledge from an auxiliary POS tagging task to a prosodic boundary prediction task. Furthermore, multi-modal embeddings are composed of contextual word and speech embedding features obtained from the pre-trained bidirectional encoder representations from transformers (BERT) model and Speech2Vec. We can utilize linguistic and acoustic information from large amounts of external text and speech data without prosodic boundary labels. At the inference stage, the prosodic boundary predicting model can use the syntactic features learnt from the POS tagging task without any extra computation cost, because only the prosodic boundary predicting task is employed to decode. We conducted experiments on Mandarin datasets. The results show that the models using multi-modal embeddings from the pre-trained BERT and Speech2Vec outperform the models trained with single modal embedding. Furthermore, the models trained with adversarial training obtain further performance gains by up to 2.95% in F-1 score.
Funding	National Natural Science Foundation of China (NSFC)[61831022] ; National Natural Science Foundation of China (NSFC)[U21B2010] ; National Natural Science Foundation of China (NSFC)[62101553] ; National Natural Science Foundation of China (NSFC)[61971419] ; National Natural Science Foundation of China (NSFC)[62006223] ; National Natural Science Foundation of China (NSFC)[62276259] ; National Natural Science Foundation of China (NSFC)[62201572] ; National Natural Science Foundation of China (NSFC)[62206278] ; Beijing Municipal Science and Technology Commission, Administrative Commission of Zhongguancun Science[Z211100004821013]
WOS keywords	SPEECH SYNTHESIS ; SEQUENCE ; MODEL
WOS research areas	Acoustics ; Engineering
Language	English
Publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
WOS accession number	WOS:001045259400002
Funding organizations	National Natural Science Foundation of China (NSFC) ; Beijing Municipal Science and Technology Commission, Administrative Commission of Zhongguancun Science
Content type	Journal article
Source URL	[http://ir.ia.ac.cn/handle/173211/53906]
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems
Author affiliations	1. Chinese Acad Sci, Univ Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence Syst, Beijing 101408, Peoples R China ; 2. Tsinghua Univ, Dept Automat, Beijing 100190, Peoples R China ; 3. Univ Sci & Technol China, Sch Artificial Intelligence, Hefei 230026, Peoples R China ; 4. Univ Chinese Acad Sci, Beijing 101408, Peoples R China
Recommended citation (GB/T 7714)	Yi, Jiangyan, Tao, Jianhua, Fu, Ruibo, et al. Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings[J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31: 2963-2973.
APA	Yi, Jiangyan, Tao, Jianhua, Fu, Ruibo, Wang, Tao, Zhang, Chu Yuan, & Wang, Chenglong. (2023). Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 31, 2963-2973.
MLA	Yi, Jiangyan, et al. "Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings". IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING 31 (2023): 2963-2973.
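
Note on the evaluation metric: the abstract reports gains in F-1 score for prosodic boundary prediction. As a reading aid, the sketch below shows how a per-label F-1 score can be computed over aligned tag sequences. It is a minimal illustration and not the authors' evaluation code. The function name `boundary_f1`, the tag names ("PW", "PPH", "O"), and the sample sequences are all assumptions made for this example; the record does not state which boundary levels or averaging scheme the paper uses.

```python
def boundary_f1(gold, pred, label):
    """Return (precision, recall, F1) for one boundary label.

    gold and pred are aligned sequences of tags, one tag per token.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Count true positives, false positives, and false negatives for `label`.
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical tags (assumed for illustration only):
# "PW" = prosodic word boundary, "PPH" = prosodic phrase boundary, "O" = none.
gold = ["O", "PW", "O", "PPH", "PW", "O"]
pred = ["O", "PW", "PW", "PPH", "O", "O"]

for tag in ("PW", "PPH"):
    p, r, f = boundary_f1(gold, pred, tag)
    print(f"{tag}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy data, one of the two gold "PW" boundaries is found and one spurious "PW" is predicted, so precision and recall for "PW" are both one half.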