Talking Portrait Synthesis Method Based on Regional Saliency and Spatial Feature Extraction


Authors: WANG Xingbo; ZHANG Hao; GAO Hao; ZHAI Mingliang; XIE Jiucheng (College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)

Affiliation: [1] College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

Source: Computer Science (《计算机科学》), 2025, No. 3, pp. 58-67 (10 pages)

Funding: National Natural Science Foundation of China (62301278, 62371254, 61931012); Natural Science Foundation of Jiangsu Province (BK20230362, BK20210594).

Abstract: Audio-driven talking portrait synthesis aims to convert arbitrary input audio sequences into realistic talking portrait videos. Recently, several talking portrait synthesis works based on neural radiance fields (NeRF) have achieved superior visual results. However, such works still generally suffer from poor audio-lip synchronization, torso jitter, and low clarity in the synthesized videos. To address these issues, a high-fidelity talking portrait synthesis method based on regional saliency features and spatial volume features is proposed. On the one hand, a regional saliency-aware module is developed for head modeling: it dynamically adjusts the volumetric features of spatial points in the head region using multimodal input data and optimizes hash-table-based feature storage, thereby improving both the precision of facial detail representation and rendering efficiency. On the other hand, a spatial feature extraction module is designed for independent torso modeling. Unlike existing methods, which typically estimate color and density directly from the coordinates of torso-surface spatial points, this module constructs a torso field from reference images to provide the corresponding texture and geometric priors, achieving sharper torso rendering and more natural torso motion. Experiments on multiple subjects demonstrate that, in self-reconstruction scenarios, the proposed method improves image quality (PSNR, LPIPS, FID, LMD) by 10.15%, 12.12%, 0.77%, and 1.09% respectively, and lip-sync accuracy (AUE) by 14.20%, compared with the current state-of-the-art baseline. Under cross-driving conditions (driven by audio outside the training set), lip-sync accuracy (AUE) improves by 4.74%.
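The hash-table-based feature storage mentioned in the abstract is commonly realized as an Instant-NGP-style spatial hash grid, where each 3D sample point is snapped to a grid cell and its cell coordinates are hashed into a learned feature table. A minimal single-level NumPy sketch of such a lookup is given below; the table size, feature width, and resolution are hypothetical, and the paper's actual multiresolution design and learned parameters are not reproduced here.

```python
import numpy as np

# Large primes from the Instant-NGP spatial hash (one per axis).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_grid_lookup(points, table, resolution):
    """Look up per-point features via a spatial hash.

    points:     (N, 3) coordinates in [0, 1]
    table:      (T, F) learned feature table (T entries, F features each)
    resolution: grid resolution of this level
    Returns (N, F) features, e.g. to condition a radiance MLP.
    """
    T = np.uint64(table.shape[0])
    # Snap each point to its enclosing grid cell.
    cells = np.floor(points * resolution).astype(np.uint64)      # (N, 3)
    # Multiply each axis by a prime (wraps mod 2^64), XOR, fold into table.
    h = cells * PRIMES[None, :]
    idx = (h[:, 0] ^ h[:, 1] ^ h[:, 2]) % T
    return table[idx]
```

In a full model there would be several such levels at increasing resolutions, with the per-level features concatenated; the lookup is deterministic, so the same spatial point always retrieves the same (trainable) feature entry.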

Keywords: talking portrait synthesis; 3D reconstruction; audio-visual synchronization; neural radiance field; attention mechanism

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
