Authors: QIN Haoming; BU Fanliang [1]; ZHONG Fanghao; MA Qiming (School of Information Network Security, People's Public Security University of China, Beijing 100038, China)
Affiliation: [1] School of Information Network Security, People's Public Security University of China, Beijing 100038
Source: Henan Science and Technology (《河南科技》), 2025, No. 6, pp. 22-30 (9 pages)
Abstract: [Purposes] To address the challenges facing current speech-driven facial image generation methods in feature extraction and generation quality, in particular the insufficient exploration and use of the deep correlations between audio and facial features, this paper proposes an InceptionResNet-V1 audio feature extraction network based on Mel-frequency cepstral coefficients (MFCC). [Methods] The audio signal is enhanced with SEGAN so that features can be extracted finely and passed on effectively. To improve facial image quality, an Auxiliary Classifier GAN (AC-GAN) is adopted as the baseline model, and a Median-enhanced Spatial and Channel Attention Block (MECS) is integrated to strengthen feature extraction. An image super-resolution reconstruction module then restores the generated image to high resolution. [Findings] Experimental results show that the proposed method significantly improves generation quality on the speech-driven facial image generation task: compared with mainstream models, FID is reduced by 36% and cosine similarity is increased by 22%, and face retrieval performance (Top-N) improves across the board, demonstrating the method's effectiveness. [Conclusions] By optimizing audio feature representation and introducing attention-enhancement mechanisms, this work improves the precision and visual realism of speech-driven facial generation and offers a scalable technical path for cross-modal generation tasks.
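Keywords: speech-to-face generation; Mel-frequency cepstral coefficients; generative adversarial network; attention mechanism; image super-resolution reconstruction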
Classification code: TP391 [Automation and Computer Technology - Computer Application Technology]
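Note on the evaluation metrics: the abstract reports cosine similarity and Top-N face retrieval without defining them. The sketch below shows one common way such metrics are computed on face embeddings. It is an illustration only, not the authors' evaluation code: the function names, the toy 3-dimensional vectors, and the gallery layout are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_n_hit(query, gallery, true_index, n):
    """Return True if the true identity is among the n gallery entries
    most similar to the query embedding."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine_similarity(query, gallery[i]),
        reverse=True,
    )
    return true_index in ranked[:n]


# Toy data (hypothetical): an embedding of a generated face and a
# gallery of embeddings of real faces.
generated = [0.9, 0.1, 0.2]
gallery = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.1],
    [0.5, 0.5, 0.5],
]

hit_at_1 = top_n_hit(generated, gallery, true_index=1, n=1)
hit_at_2_wrong_id = top_n_hit(generated, gallery, true_index=0, n=2)
```

Averaging `top_n_hit` over many query embeddings would give a Top-N retrieval rate; how the paper defines its gallery and embedding network is not stated in this record.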