A Multi-modality Audio-Visual Speech Recognition Method under Large Vocabulary Environmental Noise (Cited by: 4)


Authors: WU Lan [1], YANG Pan, LI Binquan, WANG Han

Affiliation: [1] School of Electrical Engineering, Henan University of Technology, Zhengzhou 450001, Henan, China

Source: Guangxi Sciences, 2023, No. 1, pp. 52-60 (9 pages)

Funding: National Natural Science Foundation of China (61973103); Natural Science Foundation of Henan Province (222300420039); Natural Science Project of Zhengzhou Science and Technology Bureau (21ZZXTCX01).

Abstract: Audio-Visual Speech Recognition (AVSR) exploits the correlation and complementarity between lip reading and speech recognition to improve character recognition accuracy. To address the problems that the recognition rate of lip reading is far lower than that of speech recognition, that speech signals are easily corrupted by noise, and that the recognition rate of existing AVSR methods drops sharply under large-vocabulary environmental noise, this paper proposes a Multi-modality Audio-Visual Speech Recognition (MAVSR) method. The method builds a dual-stream front-end encoding model based on the self-attention mechanism and introduces a modality controller to resolve the imbalance in per-modality recognition performance caused by the audio modality dominating under environmental noise, improving recognition stability and robustness. A multi-modal feature fusion network based on one-dimensional convolution is then constructed to handle the heterogeneity of audio and video data and to strengthen the correlation and complementarity between the audio and video modalities. Compared with existing mainstream methods, the recognition accuracy of the proposed method improves by more than 7.58% across the audio-only, video-only, and audio-visual fusion tasks.
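The abstract describes three architectural components: a self-attention dual-stream front-end encoder, a modality controller that re-balances the streams when the audio is corrupted by noise, and a one-dimensional-convolution fusion network. The paper's exact implementation is not reproduced in this record, so the following PyTorch sketch only illustrates that general structure; the class names, layer sizes, frame alignment, and the softmax gating formulation are illustrative assumptions rather than the authors' design.

```python
# Hedged sketch (not the authors' code): dual-stream self-attention encoding,
# a modality-weighting controller, and 1D-convolution fusion.
# All names, dimensions, and the gating scheme are illustrative assumptions.
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """Self-attention front-end for one modality (audio or visual features)."""

    def __init__(self, in_dim: int, model_dim: int = 256, num_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # (B, T, in_dim)
        return self.encoder(self.proj(x))                  # (B, T, model_dim)


class ModalityController(nn.Module):
    """Predicts per-modality weights so a noisy audio stream cannot dominate."""

    def __init__(self, model_dim: int = 256):
        super().__init__()
        self.score = nn.Linear(model_dim, 1)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # Score each modality from its time-averaged representation,
        # then normalise the two scores with a softmax.
        scores = torch.cat(
            [self.score(audio.mean(dim=1)), self.score(video.mean(dim=1))], dim=-1
        )                                                   # (B, 2)
        return torch.softmax(scores, dim=-1)


class ConvFusion(nn.Module):
    """Fuses the weighted streams with a 1D convolution over the time axis."""

    def __init__(self, model_dim: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(2 * model_dim, model_dim, kernel_size=3, padding=1)

    def forward(self, audio, video, weights):
        # Apply the controller weights, concatenate on the feature axis,
        # and mix across time with the 1D convolution.
        a = audio * weights[:, 0:1].unsqueeze(-1)
        v = video * weights[:, 1:2].unsqueeze(-1)
        fused = torch.cat([a, v], dim=-1).transpose(1, 2)   # (B, 2*D, T)
        return self.conv(fused).transpose(1, 2)              # (B, T, D)


if __name__ == "__main__":
    # Toy shapes: 80-dim audio filterbanks, 512-dim lip-region features, 50 frames.
    audio_enc, video_enc = StreamEncoder(80), StreamEncoder(512)
    controller, fusion = ModalityController(), ConvFusion()
    a = audio_enc(torch.randn(2, 50, 80))
    v = video_enc(torch.randn(2, 50, 512))
    w = controller(a, v)
    print(fusion(a, v, w).shape)  # torch.Size([2, 50, 256])
```

In this reading, the controller's softmax weights play the role the abstract assigns to the modality controller: when the audio stream is degraded by environmental noise its weight shrinks, so the visual stream contributes more to the fused representation passed to the recogniser.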

Keywords: attention mechanism; multi-modality; audio-visual speech recognition; lip reading; speech recognition

CLC number: TP391 [Automation and Computer Technology—Computer Application Technology]

 
