Speech Portrait Method Based on Generative Adversarial Networks

Authors: QIN Haoming; BU Fanliang[1]; ZHONG Fanghao; MA Qiming (School of Information Network Security, People's Public Security University of China, Beijing 100038, China)

Affiliation: [1] School of Information Network Security, People's Public Security University of China, Beijing 100038, China

Source: Henan Science and Technology (《河南科技》), 2025, Issue 6, pp. 22-30 (9 pages)

Abstract: [Purpose] To address the challenges that current speech-driven facial image generation methods face in feature extraction and generation quality, in particular the insufficient exploration and use of the deep correlations between audio and facial features, this paper proposes an InceptionResNet-V1 audio feature extraction network based on Mel-frequency cepstral coefficients (MFCC). [Methods] The audio signals are augmented with SEGAN to achieve fine-grained extraction and effective transfer of features. To improve facial image generation quality, an auxiliary classifier generative adversarial network (AC-GAN) is adopted as the baseline model, and a Median-enhanced Spatial and Channel Attention Block (MECS) is introduced to strengthen feature extraction. In addition, an image super-resolution reconstruction module restores the generated images to high resolution. [Results] Experiments show that the proposed method significantly improves generation quality in the speech-driven facial image generation task: compared with mainstream models, FID is reduced by 36% and cosine similarity is increased by 22%, and face retrieval performance (Top-N) improves consistently, demonstrating the method's effectiveness. [Conclusions] By optimizing audio feature representation and introducing attention-enhanced mechanisms, this work improves the precision and visual realism of speech-driven face generation and offers a scalable technical path for cross-modal generation tasks.

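Keywords: speech-to-face generation; Mel-frequency cepstral coefficients; generative adversarial network; attention mechanism; image super-resolution reconstruction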

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
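
Note on the evaluation metrics: the abstract reports cosine similarity and Top-N face retrieval but does not describe the evaluation protocol. The following is a minimal sketch of how such metrics are commonly computed, assuming each face (generated or real) has already been mapped to a fixed-length embedding vector by some face encoder. The function names, the toy vectors, and the ranking rule are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_n_hit(query, gallery, true_index, n):
    """Return True if gallery[true_index] ranks among the n gallery
    embeddings most similar to the query embedding (assumed protocol)."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine_similarity(query, gallery[i]),
        reverse=True,
    )
    return true_index in ranked[:n]


def top_n_accuracy(queries, gallery, true_indices, n):
    """Fraction of queries whose true identity appears in the top n."""
    hits = sum(
        top_n_hit(q, gallery, t, n) for q, t in zip(queries, true_indices)
    )
    return hits / len(queries)


# Toy 2-D embeddings standing in for real face-encoder outputs.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]
true_indices = [0, 2]

top1 = top_n_accuracy(queries, gallery, true_indices, n=1)
top2 = top_n_accuracy(queries, gallery, true_indices, n=2)
```

In this toy setup the second query is closest to the wrong gallery entry, so it counts as a miss at Top-1 but as a hit at Top-2, which is why retrieval results are usually reported for several values of N.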
