Authors: 刘晶晶 (LIU Jingjing), 黄浩 (HUANG Hao) [1]
Affiliation: [1] School of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
Source: Computer Engineering (《计算机工程》), 2023, No. 3, pp. 128-133, 160 (7 pages in total)
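Funding: National Key Research and Development Program of China (2020AAA0107902); National Natural Science Foundation of China (61663044, 61761041); Open Project of the Xinjiang Key Laboratory of Multilingual Information Technology (2020D04047).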
Abstract: Estimating the fundamental frequency, or pitch, is a key sub-problem in many speech signal processing techniques. Recent studies mostly take a data-driven approach, extracting the fundamental frequency with a Convolutional Neural Network (CNN). However, a convolution operation processes only a local neighborhood of audio sample points at a time, so global dependencies among sample points can be captured only by applying convolutions repeatedly, which leads to computational inefficiency and optimization difficulties. Inspired by the strong performance of non-local modules in computer vision tasks, this study proposes a CNN with non-local modules for fundamental frequency extraction. Unlike a stack of ever more convolutional layers, a non-local module computes the relationship between any two positions directly, regardless of the distance between them, and can therefore capture long-range dependencies quickly. For pitch estimation, adding non-local modules to a CNN to compute the similarity between all audio sample points in each frame helps capture global frame-to-frame and sample-to-sample dependencies at a slight increase in computational complexity. Moreover, because non-local modules leave the input and output dimensions unchanged, they can be easily integrated into a CNN. Experimental results show that the proposed method achieves a Mean Absolute Error (MAE) of only 4.7, at least 0.7 lower than that of the baseline models, achieving state-of-the-art performance.
Keywords: fundamental frequency; speech signal processing; data-driven; Convolutional Neural Network (CNN); non-local module
Classification code: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]
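Illustrative note: the abstract describes a non-local module as computing the similarity between every pair of positions and returning an output with the same dimensions as its input. The sketch below is a simplified, parameter-free version of that idea, not the authors' implementation. It assumes dot-product similarity, softmax normalization, and a residual connection, and it omits the learned embeddings a trained module would normally have. The function name `non_local_1d` and the toy `frame` values are hypothetical.

```python
import math


def non_local_1d(x):
    """Apply a parameter-free non-local operation to a 1-D sequence.

    x is a list of positions, each a list of floats (a feature vector).
    The result has exactly the same shape as x.
    """
    n = len(x)
    out = []
    for i in range(n):
        # Dot-product similarity between position i and every position j,
        # independent of how far apart i and j are in the sequence.
        scores = [sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
        # Numerically stable softmax turns similarities into weights.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Aggregate all positions, weighted by similarity to position i.
        agg = [
            sum(w * x[j][c] for j, w in enumerate(weights))
            for c in range(len(x[i]))
        ]
        # Residual connection keeps the input and adds the global context.
        out.append([xi + yi for xi, yi in zip(x[i], agg)])
    return out


# Toy "frame" of four positions with two features each (made-up values).
frame = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
result = non_local_1d(frame)
print(len(result), len(result[0]))
```

Because the output shape matches the input shape, an operation of this kind can be placed between existing convolutional layers without changing the layers around it, which is the integration property the abstract points to.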