Cross-Modal Simplex Center Learning for Speech-Face Association  


Authors: Qiming Ma, Fanliang Bu, Rong Wang, Lingbin Bu, Yifan Wang, Zhiyuan Li

Affiliation: [1] School of Information Network Security, People's Public Security University of China, Beijing 100038, China

Source: Computers, Materials & Continua, 2025, Issue 3, pp. 5169-5184 (16 pages)

Funding: Funded by the Scientific Funding for China Academy of Railway Sciences Corporation Limited, China (No. 2023YJ125).

Abstract: Speech-face association aims to achieve identity matching between facial images and voice segments by aligning cross-modal features. Existing research primarily focuses on learning shared-space representations and computing one-to-one similarities between cross-modal sample pairs to establish their correlation. However, these approaches do not fully account for intra-class variations between the modalities or the many-to-many relationships among cross-modal samples, which are crucial for robust association modeling. To address these challenges, we propose a novel framework that leverages global information to align voice and face embeddings while effectively correlating identity information embedded in both modalities. First, we jointly pre-train face recognition and speaker recognition networks to encode discriminative features from facial images and voice segments. This shared pre-training step ensures the extraction of complementary identity information across modalities. Subsequently, we introduce a cross-modal simplex center loss, which aligns samples with identity centers located at the vertices of a regular simplex inscribed in a hypersphere. This design enforces an equidistant and balanced distribution of identity embeddings, reducing intra-class variations. Furthermore, we employ an improved triplet center loss that emphasizes hard sample mining and optimizes inter-class separability, enhancing the model's ability to generalize across challenging scenarios. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance across various speech-face association tasks, including matching, verification, and retrieval. Notably, in the challenging gender-constrained matching task, our method achieves a remarkable accuracy of 79.22%, significantly outperforming existing approaches. These results highlight the potential of the proposed framework to advance the state of the art in cross-modal identity association.
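To make the central idea of the abstract concrete, the short PyTorch sketch below illustrates how equidistant identity centers can be placed at the vertices of a regular simplex on the unit hypersphere and how face and voice embeddings of the same identity can be pulled toward the shared vertex. It is written for this summary only: the function names (simplex_vertices, simplex_center_loss), the zero-padding embedding of the vertices, and the squared-distance form of the loss are assumptions, not the authors' exact formulation or code.

```python
import torch
import torch.nn.functional as F


def simplex_vertices(num_classes: int, dim: int) -> torch.Tensor:
    """Vertices of a regular simplex inscribed in the unit hypersphere of R^dim.

    Requires dim >= num_classes so the vertices can be zero-padded into the
    embedding space (a tighter dim >= num_classes - 1 construction also exists).
    """
    assert dim >= num_classes
    eye = torch.eye(num_classes)
    # Centre the standard basis vectors at the origin, then rescale to unit norm:
    # the resulting points are pairwise equidistant on the hypersphere.
    centred = eye - eye.mean(dim=0, keepdim=True)
    vertices = F.normalize(centred, dim=1)
    out = torch.zeros(num_classes, dim)
    out[:, :num_classes] = vertices
    return out


def simplex_center_loss(face_emb, voice_emb, labels, centers):
    """Pull the L2-normalised face and voice embeddings of each identity toward
    that identity's fixed simplex vertex (hypothetical loss form)."""
    face_emb = F.normalize(face_emb, dim=1)
    voice_emb = F.normalize(voice_emb, dim=1)
    target = centers[labels]  # (batch, dim) vertex for each sample's identity
    return ((face_emb - target).pow(2).sum(dim=1)
            + (voice_emb - target).pow(2).sum(dim=1)).mean()


if __name__ == "__main__":
    num_ids, dim, batch = 5, 128, 16
    centers = simplex_vertices(num_ids, dim)
    face = torch.randn(batch, dim)    # stand-ins for face-encoder outputs
    voice = torch.randn(batch, dim)   # stand-ins for speaker-encoder outputs
    labels = torch.randint(0, num_ids, (batch,))
    print(simplex_center_loss(face, voice, labels, centers).item())
```

Because the vertices are fixed, equidistant, and shared by both modalities, minimizing this kind of loss encourages the equidistant, balanced distribution of identity embeddings described in the abstract; the paper additionally combines it with an improved triplet center loss with hard sample mining, which is not shown here.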

Keywords: Speech-face association; cross-modal learning; cross-modal matching; cross-modal retrieval

Classification: O15 [Science: Mathematics]

 
