Authors: WU Lan[1], YANG Pan, LI Binquan, WANG Han
Affiliation: [1] School of Electrical Engineering, Henan University of Technology, Zhengzhou, Henan 450001, China
Source: Guangxi Sciences (《广西科学》), 2023, No. 1, pp. 52-60 (9 pages)
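Funding: Supported by the National Natural Science Foundation of China (61973103), the Natural Science Foundation of Henan Province (222300420039), and the Natural Science Project of the Zhengzhou Science and Technology Bureau (21ZZXTCX01).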
Abstract: Audio-Visual Speech Recognition (AVSR) exploits the correlation and complementarity between lip reading and speech recognition to improve character recognition accuracy. Three problems motivate this work: the recognition rate of lip reading is far lower than that of speech recognition, speech signals are easily corrupted by noise, and the recognition rate of existing AVSR methods drops sharply under large-vocabulary, noisy conditions. To address them, this paper proposes a Multi-modality Audio-Visual Speech Recognition (MAVSR) method. The method builds a dual-stream front-end encoding model based on the self-attention mechanism. It introduces a modality controller to correct the unbalanced recognition performance across modalities that arises when the audio modality dominates under environmental noise, which improves recognition stability and robustness. It also builds a multi-modal feature fusion network based on one-dimensional convolution to handle the heterogeneity of audio and video data and to strengthen the correlation and complementarity between the two modalities. Compared with existing mainstream methods, the proposed method improves recognition accuracy by more than 7.58% on the three tasks of audio-only, video-only, and audio-visual fusion.
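Keywords: attention mechanism; multi-modality; audio-visual speech recognition; lip reading; speech recognition
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]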
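
The abstract mentions a feature fusion network based on one-dimensional convolution but gives no architectural details. The sketch below is only an illustration of the general idea, not the paper's network: it aligns a video feature sequence to the audio frame count, concatenates the two along the channel axis, and applies a single 1D convolution over time. The function names (`resample_nearest`, `conv1d_same`, `fuse`), the nearest-neighbour alignment, and the toy weights are all assumptions made for this example.

```python
from typing import List

Matrix = List[List[float]]        # layout: [time][channels]
Kernel = List[List[List[float]]]  # layout: [out_channels][kernel_size][in_channels]


def resample_nearest(feats: Matrix, target_len: int) -> Matrix:
    """Stretch or shrink a sequence to target_len frames (nearest neighbour).

    Assumption: audio and video run at different frame rates, so one stream
    has to be aligned to the other before channel-wise concatenation.
    """
    src_len = len(feats)
    return [feats[min(src_len - 1, (t * src_len) // target_len)]
            for t in range(target_len)]


def conv1d_same(x: Matrix, weight: Kernel, bias: List[float]) -> Matrix:
    """1D convolution over the time axis with zero padding (odd kernel size)."""
    kernel_size = len(weight[0])
    pad = kernel_size // 2
    in_channels = len(x[0])
    out: Matrix = []
    for t in range(len(x)):
        frame = []
        for o, w_o in enumerate(weight):
            acc = bias[o]
            for k in range(kernel_size):
                src = t + k - pad
                if 0 <= src < len(x):  # frames outside the sequence count as zeros
                    acc += sum(w_o[k][c] * x[src][c] for c in range(in_channels))
            frame.append(acc)
        out.append(frame)
    return out


def fuse(audio: Matrix, video: Matrix, weight: Kernel, bias: List[float]) -> Matrix:
    """Hypothetical fusion step: align, concatenate channels, convolve over time."""
    video_aligned = resample_nearest(video, len(audio))
    joint = [a + v for a, v in zip(audio, video_aligned)]  # channel concatenation
    return conv1d_same(joint, weight, bias)


# Toy input: 4 audio frames with 2 channels, 2 video frames with 1 channel.
audio = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
video = [[10.0], [20.0]]
# One output channel, kernel size 3, three input channels, all weights 1.0.
weight = [[[1.0, 1.0, 1.0] for _ in range(3)]]
bias = [0.0]

fused = fuse(audio, video, weight, bias)
print(fused)  # [[23.0], [46.0], [59.0], [47.0]]
```

With all-ones weights the output is just a sliding-window sum over the concatenated channels, which makes the alignment and padding behaviour easy to check by hand. In a trained model the weights would be learned, and a deep-learning framework's convolution layer would replace the explicit loops.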