Authors: CHE Na [1,2,3]; ZHU Yi-ming; ZHAO Jian [1,2,3]; SUN Lei; SHI Li-juan [2,3,4]; ZENG Xian-wei
Affiliations: [1] School of Computer Science and Technology, Changchun University, Changchun 130022, China; [2] Jilin Provincial Key Laboratory of Human Health State Identification and Function Enhancement, Changchun University, Changchun 130022, China; [3] Key Laboratory of Intelligent Rehabilitation and Barrier-free Access for the Disabled, Ministry of Education, Changchun University, Changchun 130022, China; [4] School of Electronic and Information Engineering, Changchun University, Changchun 130022, China
Source: Journal of Jilin University (Engineering and Technology Edition), 2024, No. 10, pp. 2984-2993 (10 pages)
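Funding: Key Project of the Science and Technology Program of the Jilin Provincial Department of Education (JJKH20230675KJ); Key Project of the Jilin Special Education Society (JT2022Z001); industry-commissioned project (2022JBH08L15); Jilin Provincial Department of Science and Technology (YDZJ202303CGZH010, YDZJ202301ZYTS496); Jilin Provincial Social Science Research Project (JJKH20231054SK); Key Project of the Jilin Provincial Education Science "14th Five-Year Plan" (ZD21100).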
Abstract: To address the large data requirements, audio-video alignment, and noise robustness problems in audio-visual speech recognition, this paper analyzes in depth the characteristics and advantages of four core models: connectionist temporal classification (CTC), long short-term memory (LSTM) networks, Transformer, and Conformer. It summarizes the scenarios to which each model is suited and proposes ideas and methods for optimizing model performance. Model performance is then analyzed quantitatively on mainstream datasets using commonly used evaluation criteria. The results show that CTC performance fluctuates considerably under noisy conditions, that LSTM effectively captures long-range temporal dependencies, and that Transformer and Conformer significantly reduce the recognition error rate in cross-modal tasks. Finally, future research directions are discussed from two perspectives: self-supervised training and noise robustness.
Keywords: computer application technology; audio-visual speech recognition; deep learning; connectionism
Classification: TN912.34 [Electronics and Telecommunications: Communication and Information Systems]