Authors: Qiming Ma, Fanliang Bu, Rong Wang, Lingbin Bu, Yifan Wang, Zhiyuan Li
Source: Computers, Materials & Continua, 2025, Issue 3, pp. 5169-5184 (16 pages)
Funding: Funded by the Scientific Funding for China Academy of Railway Sciences Corporation Limited, China (No. 2023YJ125).
Abstract: Speech-face association aims to achieve identity matching between facial images and voice segments by aligning cross-modal features. Existing research primarily focuses on learning shared-space representations and computing one-to-one similarities between cross-modal sample pairs to establish their correlation. However, these approaches do not fully account for intra-class variations between the modalities or the many-to-many relationships among cross-modal samples, which are crucial for robust association modeling. To address these challenges, we propose a novel framework that leverages global information to align voice and face embeddings while effectively correlating identity information embedded in both modalities. First, we jointly pre-train face recognition and speaker recognition networks to encode discriminative features from facial images and voice segments. This shared pre-training step ensures the extraction of complementary identity information across modalities. Subsequently, we introduce a cross-modal simplex center loss, which aligns samples with identity centers located at the vertices of a regular simplex inscribed on a hypersphere. This design enforces an equidistant and balanced distribution of identity embeddings, reducing intra-class variations. Furthermore, we employ an improved triplet center loss that emphasizes hard sample mining and optimizes inter-class separability, enhancing the model's ability to generalize across challenging scenarios. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance across various speech-face association tasks, including matching, verification, and retrieval. Notably, in the challenging gender-constrained matching task, our method achieves a remarkable accuracy of 79.22%, significantly outperforming existing approaches. These results highlight the potential of the proposed framework to advance the state of the art in cross-modal identity association.
Keywords: Speech-face association; cross-modal learning; cross-modal matching; cross-modal retrieval
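The abstract describes aligning both modalities to identity centers placed at the vertices of a regular simplex inscribed in a hypersphere. The following is a minimal, hypothetical sketch of such a loss, assuming PyTorch; the class name SimplexCenterLoss, the cosine-based formulation, and all parameter names are illustrative assumptions, not the paper's actual formulation or code.

```python
# Hypothetical sketch of a simplex-center alignment loss (assumed formulation,
# not the paper's released implementation).
import torch
import torch.nn.functional as F


def simplex_vertices(num_ids: int) -> torch.Tensor:
    """Return num_ids equidistant unit vectors: a regular simplex on the hypersphere."""
    eye = torch.eye(num_ids)                        # standard basis vectors
    centered = eye - eye.mean(dim=0, keepdim=True)  # move the centroid to the origin
    return F.normalize(centered, dim=1)             # project vertices onto the unit sphere


class SimplexCenterLoss(torch.nn.Module):
    def __init__(self, num_ids: int, embed_dim: int):
        super().__init__()
        assert embed_dim >= num_ids, "need enough dimensions to host the simplex"
        # Fixed (non-learnable) identity centers: simplex vertices padded to embed_dim.
        centers = torch.zeros(num_ids, embed_dim)
        centers[:, :num_ids] = simplex_vertices(num_ids)
        self.register_buffer("centers", centers)

    def forward(self, voice_emb, face_emb, labels):
        # Pull both modalities toward the shared identity vertex via cosine alignment.
        target = self.centers[labels]               # (batch, embed_dim)
        v = F.normalize(voice_emb, dim=1)
        f = F.normalize(face_emb, dim=1)
        loss_v = 1.0 - (v * target).sum(dim=1)      # 1 - cosine similarity
        loss_f = 1.0 - (f * target).sum(dim=1)
        return (loss_v + loss_f).mean()


if __name__ == "__main__":
    loss_fn = SimplexCenterLoss(num_ids=8, embed_dim=128)
    voice = torch.randn(4, 128)
    face = torch.randn(4, 128)
    ids = torch.tensor([0, 3, 5, 7])
    print(loss_fn(voice, face, ids).item())
```

Because the simplex vertices are mutually equidistant on the hypersphere, pulling embeddings of the same identity toward a shared vertex yields the equidistant, balanced distribution of identity centers that the abstract attributes to the cross-modal simplex center loss; the improved triplet center loss with hard mining described in the abstract is not sketched here.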