Connectionism-Based Audio-Visual Speech Recognition Method (基于联结主义的视听语音识别方法)

Authors: CHE Na (车娜) [1,2,3]; ZHU Yi-ming (朱奕明); ZHAO Jian (赵剑) [1,2,3]; SUN Lei (孙磊); SHI Li-juan (史丽娟) [2,3,4]; ZENG Xian-wei (曾现伟)

Affiliations: [1] School of Computer Science and Technology, Changchun University, Changchun 130022, China; [2] Jilin Provincial Key Laboratory of Human Health State Identification and Function Enhancement, Changchun University, Changchun 130022, China; [3] Key Laboratory of Intelligent Rehabilitation and Barrier-free Access for the Disabled, Ministry of Education, Changchun University, Changchun 130022, China; [4] School of Electronic and Information Engineering, Changchun University, Changchun 130022, China

Source: Journal of Jilin University (Engineering and Technology Edition) (《吉林大学学报（工学版）》), 2024, No. 10, pp. 2984-2993 (10 pages)

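Funding: Key Project of the Science and Technology Program of the Jilin Provincial Department of Education (JJKH20230675KJ); Key Project of the Jilin Provincial Special Education Society (JT2022Z001); industry-sponsored project (2022JBH08L15); Jilin Provincial Department of Science and Technology (YDZJ202303CGZH010, YDZJ202301ZYTS496); Jilin Provincial Social Science Research Project (JJKH20231054SK); Key Project of the Jilin Provincial Education Science "14th Five-Year Plan" (ZD21100).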

Abstract: To address the large data requirements, audio-video alignment difficulties, and limited noise robustness of audio-visual speech recognition, this paper analyzes in depth the characteristics and advantages of four core models: connectionist temporal classification (CTC), long short-term memory (LSTM) networks, Transformer, and Conformer. It summarizes the scenarios to which each model is suited and proposes ideas and methods for improving model performance. Model performance is then analyzed quantitatively on mainstream datasets using commonly used evaluation criteria. The results show that CTC performance fluctuates considerably under noisy conditions, that LSTM effectively captures long-range temporal dependencies, and that Transformer and Conformer significantly reduce the recognition error rate in cross-modal tasks. Finally, future research directions are outlined from two perspectives: self-supervised training and noise robustness.

Keywords: computer application technology; audio-visual speech recognition; deep learning; connectionism

Classification: TN912.34 [Electronics and Telecommunications: Communication and Information Systems]
