Cross Face-Voice Matching via Double-Stream Networks and Bi-Quintuple Loss  Cited by: 1


Authors: Liu Xin [1,2,3]; Wang Rui; Zhong Bineng; Wang Nannan

Affiliations: [1] College of Computer Science and Technology, Huaqiao University, Xiamen, Fujian 361021 [2] State Key Laboratory of Integrated Services Networks (Xidian University), Xi'an 710071 [3] Xiamen Key Laboratory of Computer Vision and Pattern Recognition (Huaqiao University), Xiamen, Fujian 361021 [4] School of Computer Science and Information Engineering, Guangxi Normal University, Guilin, Guangxi 541004

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2022, No. 3, pp. 694-705 (12 pages)

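Funding: National Natural Science Foundation of China (61673185, 61922066, 61972167); Fund of the State Key Laboratory of Integrated Services Networks (ISN20-11); Natural Science Foundation of Fujian Province (2020J01084); Open Project of Zhejiang Lab (2021KH0AB01).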

Abstract: Facial appearance and voice are among the most direct and flexible channels in human-computer interaction, so intelligent cross-modal perception between faces and voices has attracted wide attention from researchers. However, because face and voice samples are heterogeneous and separated by a semantic gap, existing methods do not handle the more difficult cross face-voice matching tasks well. This paper proposes a cross face-voice feature learning framework that combines double-stream networks with a bi-quintuple loss; the learned features can be used directly for four different cross face-voice matching tasks. First, a new weight-sharing multi-modal weighted residual network is placed on top of the double-stream deep network to mine the semantic association between the face and voice modalities. Next, a bi-quintuple loss that integrates several sample-pair construction strategies is designed, which greatly improves data utilization and the generalization ability of the model. Finally, identity (ID) classification is learned during training to keep the cross-modal representations separable. Experimental results show that, compared with existing methods, the approach improves performance across all four cross face-voice matching tasks, with gains of nearly 5% on some evaluation metrics.
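The abstract names a bi-quintuple loss but does not give its formula. As a rough illustration only, the sketch below shows one plausible reading: a margin-based ranking loss applied in both directions (face-to-voice and voice-to-face), where each quintuple holds an anchor, one positive from the other modality, and three negatives. The number of negatives, the squared Euclidean distance, the margin value, and the function names `quintuple_loss` and `bi_quintuple_loss` are all assumptions made here and are not taken from the paper.

```python
import numpy as np

def quintuple_loss(anchor, positive, negatives, margin=0.5):
    """Hinge loss over one quintuple (assumed layout: anchor, 1 positive, 3 negatives).

    anchor, positive: 1-D embeddings of shape (d,)
    negatives: array of shape (3, d), embeddings of other identities
    """
    d_pos = np.sum((anchor - positive) ** 2)           # distance to the matching sample
    d_neg = np.sum((negatives - anchor) ** 2, axis=1)  # distances to each negative
    # Each negative should be farther from the anchor than the positive by `margin`.
    return float(np.sum(np.maximum(0.0, margin + d_pos - d_neg)))

def bi_quintuple_loss(face, voice, neg_voices, neg_faces, margin=0.5):
    """Sum of the face-anchored and voice-anchored quintuple losses (assumed form)."""
    face_to_voice = quintuple_loss(face, voice, neg_voices, margin)
    voice_to_face = quintuple_loss(voice, face, neg_faces, margin)
    return face_to_voice + voice_to_face

# Toy embeddings for one identity and three well-separated negatives per modality.
face = np.array([1.0, 0.0])
voice = np.array([1.0, 0.0])
easy_negatives = np.array([[0.0, 1.0]] * 3)
print(bi_quintuple_loss(face, voice, easy_negatives, easy_negatives))
```

Anchoring the loss in both modalities means every matched face-voice pair contributes two ranking constraints instead of one, which is one way to read the abstract's claim about improved data utilization.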

Keywords: face-voice association; cross-modal perception; double-stream network; bi-quintuple loss; weighted residual network

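Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]; TP391 [Automation and Computer Technology: Control Science and Engineering]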

 
