Cross-Modality Synergy Network for Referring Expression Comprehension and Segmentation
Li, Qianzhong1,2; Zhang, Yujia1; Sun, Shiying1; Wu, Jinting1,2; Zhao, Xiaoguang1; Tan, Min1
Journal: Neurocomputing
Date: 2022-01-07
Volume: 467  Issue: /  Pages: 99-114
Keywords: Referring expression comprehension; Referring expression segmentation; Cross-modality synergy; Attention mechanism
ISSN: 0925-2312
DOI: 10.1016/j.neucom.2021.09.066
Abstract

Referring expression comprehension and segmentation aim to locate and segment a referred instance in an image according to a natural language expression. However, existing methods tend to ignore the interaction between visual and language modalities for visual feature learning, and establishing a synergy between the visual and language modalities remains a considerable challenge. To tackle the above problems, we propose a novel end-to-end framework, Cross-Modality Synergy Network (CMS-Net), to address the two tasks jointly. In this work, we propose an attention-aware representation learning module to learn modal representations for both images and expressions. A language self-attention submodule is proposed in this module to learn expression representations by leveraging the intra-modality relations, and a language-guided channel-spatial attention submodule is introduced to obtain the language aware visual representations under language guidance, which helps the model pay more attention to the referent-relevant regions in the images and relieve background interference. Then, we design a cross-modality synergy module to establish the inter-modality relations for modality fusion. Specifically, a language-visual similarity is obtained at each position of the visual feature map, and the synergy is achieved between the two modalities in both semantic and spatial dimensions. Furthermore, we propose a multi-scale feature fusion module with a selective strategy to aggregate the important information from multi-scale features, yielding target results. We conduct extensive experiments on four challenging benchmarks, and our framework achieves significant performance gains over state-of-the-art methods. (c) 2021 Elsevier B.V. All rights reserved.
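To make the abstract's core idea concrete, below is a minimal pure-Python sketch of the language-visual similarity step it describes: a similarity score is computed between a language embedding and the visual feature at each spatial position, and the normalized scores are used to pool a language-aware visual representation. All tensors, dimensions, and values here are toy placeholders for illustration, not the actual CMS-Net implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    """Dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy visual feature map: 3 spatial positions, each a 4-dim feature vector.
visual = [
    [1.0, 0.0, 0.5, 0.2],
    [0.1, 0.9, 0.4, 0.3],
    [0.2, 0.1, 0.8, 0.7],
]
# Toy language embedding, assumed projected into the same 4-dim space.
lang = [0.9, 0.1, 0.6, 0.2]

# Language-visual similarity at each spatial position, softmax-normalized.
scores = [dot(v, lang) for v in visual]
weights = softmax(scores)

# Language-aware visual representation: similarity-weighted sum over positions,
# so referent-relevant regions contribute more than background regions.
attended = [sum(w * v[c] for w, v in zip(weights, visual)) for c in range(4)]
```

In the paper this per-position similarity drives the cross-modality synergy in both the semantic and spatial dimensions; the sketch above shows only the spatial weighting side of that idea.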

Funding Projects: National Key Research and Development Project of China [2019YFB1310601]; National Key R&D Program of China [2017YFC0820203-03]; National Natural Science Foundation of China [62103410]
WOS Research Area: Computer Science
Language: English
Publisher: ELSEVIER
WOS Record Number: WOS:000710121100009
Funding Agencies: National Key Research and Development Project of China; National Key R&D Program of China; National Natural Science Foundation of China
Content Type: Journal article
Source URL: [http://ir.ia.ac.cn/handle/173211/46309]
Collection: Institute of Automation, State Key Laboratory of Management and Control for Complex Systems, Advanced Robot Control Team
Corresponding Author: Zhang, Yujia
Author Affiliations:
1. Institute of Automation, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences
Recommended Citation
GB/T 7714: Li, Qianzhong, Zhang, Yujia, Sun, Shiying, et al. Cross-Modality Synergy Network for Referring Expression Comprehension and Segmentation[J]. Neurocomputing, 2022, 467(/): 99-114.
APA: Li, Qianzhong, Zhang, Yujia, Sun, Shiying, Wu, Jinting, Zhao, Xiaoguang, & Tan, Min. (2022). Cross-Modality Synergy Network for Referring Expression Comprehension and Segmentation. Neurocomputing, 467(/), 99-114.
MLA: Li, Qianzhong, et al. "Cross-Modality Synergy Network for Referring Expression Comprehension and Segmentation". Neurocomputing 467./ (2022): 99-114.