Authors: 胡从刚 (Hu Conggang), 申艺翔 (Shen Yixiang), 孙永奇 (Sun Yongqi), 赵思聪 (Zhao Sicong)[3] (Key Laboratory of Big Data & Artificial Intelligence in Transportation, Beijing 100044, China; School of Computer & Information Technology, Beijing Jiaotong University, Beijing 100044, China; Beijing Aerocim Technology Co., Ltd. of CASIC, Beijing 102308, China)
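Affiliations: [1] Key Laboratory of Big Data & Artificial Intelligence in Transportation, Ministry of Education, Beijing 100044; [2] School of Computer & Information Technology, Beijing Jiaotong University, Beijing 100044; [3] Beijing Aerocim Technology Co., Ltd. of CASIC, Beijing 102308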
Source: Application Research of Computers (《计算机应用研究》), 2024, Issue 7, pp. 2018-2024 (7 pages)
Funding: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0113002).
Abstract: The acoustic input network of the Conformer encoder extracts too little information from FBank speech features and lacks channel feature information. To address both problems, this paper proposes RepVGG-SE-Conformer, an end-to-end speech recognition method. First, the model uses the multi-branch structure of RepVGG to strengthen speech information extraction; at inference time, structural re-parameterization fuses the branches into a single branch, which reduces computational complexity and speeds up inference. Second, a channel attention mechanism based on the squeeze-and-excitation network supplies the missing channel feature information and improves recognition accuracy. Finally, experiments on the public Aishell-1 dataset show that the proposed method reduces the character error rate by 10.67% compared with Conformer, demonstrating its advantage. In addition, the RepVGG-SE acoustic input network improves the overall performance of end-to-end speech recognition models built on several Transformer variants, which indicates good generalization ability.
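The structural re-parameterization step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simplified 1-D setting with no batch normalization, equal input and output channel counts, and hypothetical helper names (`conv1d`, `multi_branch`, `fuse`). It only shows why a 3-tap branch, a 1-tap branch, and an identity branch can be folded into one 3-tap kernel that gives the same output.

```python
# Illustrative sketch only: 1-D, no batch norm, hypothetical helper names.

def conv1d(x, w):
    """Zero-padded 'same' 1-D convolution (cross-correlation form).

    x: [in_channels][length], w: [out_channels][in_channels][kernel_size],
    with an odd kernel_size.
    """
    length = len(x[0])
    k = len(w[0][0])
    pad = k // 2
    out = []
    for w_o in w:
        row = []
        for t in range(length):
            acc = 0.0
            for c, taps in enumerate(w_o):
                for j in range(k):
                    s = t + j - pad
                    if 0 <= s < length:
                        acc += taps[j] * x[c][s]
            row.append(acc)
        out.append(row)
    return out


def multi_branch(x, w3, w1):
    """Training-time form: 3-tap branch + 1-tap branch + identity branch."""
    y3 = conv1d(x, w3)
    y1 = conv1d(x, w1)
    return [[a + b + c for a, b, c in zip(r3, r1, rx)]
            for r3, r1, rx in zip(y3, y1, x)]


def fuse(w3, w1):
    """Fold the 1-tap and identity branches into the centre tap of a copy of w3."""
    fused = [[list(taps) for taps in w_o] for w_o in w3]
    for o, w_o in enumerate(fused):
        for c, taps in enumerate(w_o):
            taps[1] += w1[o][c][0]   # 1-tap kernel acts on the centre position
            if o == c:
                taps[1] += 1.0       # identity maps channel c to channel c
    return fused


# Made-up toy data: 2 channels, 4 time steps.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
w3 = [[[0.1, 0.2, 0.3], [0.0, -0.5, 0.25]],
      [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]]
w1 = [[[2.0], [0.0]], [[-1.0], [0.75]]]

y_train = multi_branch(x, w3, w1)      # three branches, as during training
y_infer = conv1d(x, fuse(w3, w1))      # one fused branch, as at inference
max_diff = max(abs(a - b)
               for ra, rb in zip(y_train, y_infer)
               for a, b in zip(ra, rb))
```

Because convolution is linear, the two forms agree up to floating-point rounding, so `max_diff` is essentially zero while the fused form performs a single convolution per layer. Networks in the RepVGG style additionally fold batch-normalization statistics into the fused kernel and bias, which this sketch omits.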