Study on Keyword Search Framework Based on End-to-End Automatic Speech Recognition (基于端到端语音识别的关键词检索技术研究)  Cited by: 16


Authors: YANG Run-yan (杨润延), CHENG Gao-feng (程高峰), LIU Jian (刘建) [1] (Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China)

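Affiliations: [1] Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190 [2] University of Chinese Academy of Sciences, Beijing 100049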

Source: Computer Science (《计算机科学》), 2022, No. 1, pp. 53-58 (6 pages)

Funding: National Key Research and Development Program of China (2020AAA0108002).

Abstract: Over the past decade, end-to-end automatic speech recognition (ASR) frameworks have developed rapidly. End-to-end ASR differs markedly from traditional ASR based on hidden Markov models (HMMs), offers many new characteristics, and achieves comparable or better performance. It has therefore attracted growing attention and has become a second mainstream class of ASR framework alongside the traditional one. However, end-to-end ASR cannot provide the accurate keyword start and end times and the reliable confidence scores that keyword search (KWS) requires. To address this problem, a KWS framework based on end-to-end ASR and frame-level alignment is proposed and validated experimentally on a Vietnamese dataset. First, test utterances are decoded by an end-to-end ASR model to obtain N-best hypotheses. Next, per-frame phoneme probabilities are obtained from a phoneme classifier trained jointly with the ASR model, and a dynamic-programming alignment algorithm aligns each N-best hypothesis with these probabilities, yielding start and end times and a confidence score for every word in the hypotheses. Finally, keywords are matched within the N-best hypotheses, and duplicate matches are merged using the timestamps and confidence scores to produce the final search result. Experiments on a Vietnamese conversational telephone speech dataset show that the proposed KWS system achieves an F1 score of 77.6%, a relative improvement of 7.8% over a traditional HMM-based KWS system, and that it provides reliable keyword confidence scores.
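The three steps in the abstract (frame-level alignment, word timing and confidence, duplicate merging) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several details are assumptions made for illustration only: the alignment is a plain monotonic Viterbi pass without blank or silence states, word confidence is taken as the geometric mean of the aligned frame probabilities, duplicates are resolved by keeping the highest-confidence hit among overlapping ones, and all function and variable names (`align_phones`, `word_spans`, `merge_hits`) are hypothetical.

```python
import math

def align_phones(log_probs, phones):
    """Monotonic Viterbi alignment (assumed variant, no blank/silence state).
    log_probs[t][p] is the finite log-probability of phone id p at frame t;
    phones is the hypothesis as a list of phone ids. Returns, for each frame,
    the index into `phones` that the frame is assigned to."""
    T, S = len(log_probs), len(phones)
    if S == 0 or T < S:
        raise ValueError("need at least one frame per phone")
    NEG = float("-inf")
    score = [[NEG] * S for _ in range(T)]
    advanced = [[False] * S for _ in range(T)]  # True: came from phone s-1
    score[0][0] = log_probs[0][phones[0]]
    for t in range(1, T):
        for s in range(min(t + 1, S)):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else NEG
            advanced[t][s] = move > stay
            score[t][s] = max(stay, move) + log_probs[t][phones[s]]
    # Backtrace from the last phone at the last frame.
    path, s = [0] * T, S - 1
    for t in range(T - 1, -1, -1):
        path[t] = s
        if advanced[t][s]:
            s -= 1
    return path

def word_spans(path, log_probs, phones, words):
    """words: list of (word, number_of_phones) in hypothesis order.
    Returns (word, start_frame, end_frame_exclusive, confidence), where
    confidence is the geometric mean of the aligned frame probabilities
    (an illustrative choice, not necessarily the paper's definition)."""
    spans, first = [], 0
    for word, n in words:
        frames = [t for t, s in enumerate(path) if first <= s < first + n]
        mean_lp = sum(log_probs[t][phones[path[t]]] for t in frames) / len(frames)
        spans.append((word, frames[0], frames[-1] + 1, math.exp(mean_lp)))
        first += n
    return spans

def merge_hits(hits):
    """hits: (keyword, start, end, confidence) collected over all N-best
    hypotheses. Among time-overlapping hits of the same keyword, keep only
    the most confident one (assumed merging rule)."""
    kept = []
    for hit in sorted(hits, key=lambda h: -h[3]):
        kw, start, end, _ = hit
        if not any(k == kw and start < e and s < end for k, s, e, _ in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h[1])

# Toy usage: 4 frames, 2 phone classes, a two-word hypothesis.
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
log_probs = [[math.log(p) for p in frame] for frame in probs]
phones = [0, 1]
path = align_phones(log_probs, phones)
print(word_spans(path, log_probs, phones, [("xin", 1), ("chao", 1)]))
```

In a real system the frame indices would be converted to seconds using the frame shift of the acoustic front end, and the confidence would typically be thresholded before a keyword hit is reported.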

Keywords: keyword search; speech recognition; end-to-end; frame-level alignment

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
