End-to-end continuous speech recognition with dual-channel decoding

Authors: ZHU Yang, ZENG Qingning [1], ZHAO Xuejun (School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, China)

Affiliation: [1] School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, Guangxi, China

Source: Journal of Guilin University of Electronic Technology, 2024, No. 2, pp. 167-173 (7 pages)

Funding: National Natural Science Foundation of China (61961009); Guangxi Key Laboratory of Wireless Wideband Communication and Signal Processing (GXKL06200107); Postgraduate Education Innovation Program of Guilin University of Electronic Technology (2022YCXS042).

Abstract: In end-to-end continuous speech recognition, the Transformer model, built entirely on the self-attention mechanism, achieves higher accuracy than traditional hybrid models. The Conformer model extends the Transformer with a convolution module that excels at extracting local features; this model serves as the encoder of the recognition system, while the decoder uses an attention mechanism. Because the attention model is only suited to short-utterance recognition and its training becomes unstable when the data set contains noise, the sequence-alignment property of CTC is added as an auxiliary training objective to help the model converge faster. Since single-channel decoding leaves room for further improvement in recognition accuracy, a dual-channel decoding model combining CTC and attention is proposed. Compared with CTC-only and attention-only decoding, dual-channel decoding improves recognition performance by 1%. To address the degradation of recognition in noisy environments, a language model is added to the end-to-end network. Experiments with an N-gram language model show that, in a high-noise environment with a signal-to-noise ratio of 10 dB, the language model reduces the word error rate by 3.5%, improving the robustness of the speech recognition system.
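The abstract describes two mechanisms: CTC-assisted training of the attention decoder, and dual-channel (CTC + attention) decoding with an external N-gram language-model score. The following is a minimal PyTorch sketch of these two ideas, not the authors' implementation; the interpolation weights (lam, ctc_weight, lm_weight), the tensor shapes, and the convention that token id 0 serves as both the CTC blank and the padding symbol are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

ctc_loss_fn = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def hybrid_loss(ctc_logits, att_logits, targets, feat_lens, target_lens, lam=0.3):
    """lam * CTC loss + (1 - lam) * attention cross-entropy loss (assumed weighting)."""
    # CTC branch expects (T, N, C) log-probabilities.
    log_probs = ctc_logits.log_softmax(dim=-1).transpose(0, 1)
    l_ctc = ctc_loss_fn(log_probs, targets, feat_lens, target_lens)
    # Attention branch: token-level cross-entropy; id 0 doubles as pad here (assumption).
    l_att = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                            targets.reshape(-1), ignore_index=0)
    return lam * l_ctc + (1.0 - lam) * l_att

def joint_score(att_score, ctc_score, lm_score, ctc_weight=0.3, lm_weight=0.2):
    """Dual-channel decoding: combine attention, CTC and N-gram LM log-scores
    for one hypothesis, e.g. when rescoring beam-search candidates."""
    return (1.0 - ctc_weight) * att_score + ctc_weight * ctc_score \
        + lm_weight * lm_score

# Toy usage with random tensors: batch=2, 50 frames, 6 target tokens, vocab=30.
enc_logits = torch.randn(2, 50, 30)    # frame-level logits for the CTC head
dec_logits = torch.randn(2, 6, 30)     # attention-decoder output logits
tokens = torch.randint(1, 30, (2, 6))  # reference token ids (0 = blank/pad)
loss = hybrid_loss(enc_logits, dec_logits, tokens,
                   feat_lens=torch.full((2,), 50),
                   target_lens=torch.full((2,), 6))
```

The weighted-sum form mirrors the usual hybrid CTC/attention recipe: the CTC term enforces monotonic alignment during training, while at decoding time the three log-domain scores are simply interpolated per hypothesis.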

Keywords: speech recognition; encoder; decoder; end-to-end; dual-channel; language model

CLC number: TN912.3 [Electronics and Telecommunications: Communication and Information Systems]

 
