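Label Synchronous Decoding for Speech Recognition  Cited by: 10

Original Chinese title: 标签同步解码算法及其在语音识别中的应用 (Label Synchronous Decoding Algorithm and Its Application in Speech Recognition)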

Authors: CHEN Zhe-Huai (陈哲怀); ZHENG Wen-Lu (郑文露); YOU Yong-Bin (游永彬); QIAN Yan-Min (钱彦旻) [1,2]; YU Kai (俞凯) [1,2]

Affiliations: [1] Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai 200240 [2] SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240 [3] Suzhou Institute of Artificial Intelligence, Shanghai Jiao Tong University, Suzhou, Jiangsu 215000 [4] AISpeech Ltd., Suzhou, Jiangsu 215000

Source: Chinese Journal of Computers (计算机学报), 2019, No. 7, pp. 1511-1523 (13 pages)

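Funding: Supported by the National Key R&D Program of China, "Intelligent Robots" key special project (2017YFB1302400); the National Natural Science Foundation of China (U1736202); and the Basic Research Program of Jiangsu Province (BE2016078)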

Abstract (translated from the Chinese): A distinctive feature of sequence labeling tasks such as automatic speech recognition (ASR) is that they model the temporal correlation between adjacent frames. The mainstream sequence models used for this temporal modeling are the hidden Markov model (HMM) and connectionist temporal classification (CTC). For these models, the prevailing inference method is frame-level Viterbi beam search, whose high complexity limits the wide application of speech recognition. Advances in deep learning make stronger context and history modeling possible. By introducing a blank unit, end-to-end modeling systems can directly predict the posterior probability of labels given the features. This paper systematically proposes a series of methods that use an efficient blank structure and post-processing to change the search and decoding process from frame synchronous to label synchronous. These general methods are validated on both HMM and CTC. The results show that, on the Switchboard dataset, a 2x to 4x speedup is obtained with no loss in performance. The paper also studies how the search space, hypothesis pruning, the transition model, and frame rate reduction affect the speedup, and obtains consistent acceleration in all cases.

Abstract (English): A unique phenomenon in human speech is the variable length of acoustic waves and linguistic words. Hence automatic speech recognition (ASR) requires both pattern classification and state alignment modeling between input and output sequences, which is called the sequence prediction problem. In the inference stage, a speech recognizer finds the sequence of labels whose corresponding acoustic and language models best match the input features. This step, called decoding, determines the recognition speed and precision in real applications. The most recent milestone in ASR is the application of deep neural networks (DNNs) to acoustic and language modeling. However, those successful applications are still based on the traditional formulation of speech recognition, and the inference stage is unchanged. In this paper, we aim to improve the decoding algorithm in the inference stage. The dominant decoding method today is frame synchronous Viterbi beam search, whose complexity is linear in the length of the acoustic waves. Despite its wide adoption, the approach has several weaknesses. (1) It is an equal-interval search algorithm and handles the variable length of the feature sequence inefficiently. (2) Because the label sequence is decomposed to the frame level to match the feature sequence, the model granularity is small and the search space is large, e.g., hidden Markov model states with different histories. (3) Greedy beam pruning is conducted at each frame, which makes it hard to balance search efficiency against search errors.

In this paper, based on deep-learning-based confusion blank symbol modeling, we systematically propose label synchronous decoding (LSD), which transforms the search process from the frame level to the label level and obtains significant speedups. The complexity of the label-level search is linear in the length of the linguistic word sequence. Namely, we use an effective blank structure and apply efficient post-processing of blank during inference before doing Viterbi search. The post-processin
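The idea of post-processing blank before the search can be sketched as follows. This is only a minimal illustration under stated assumptions, not the paper's implementation: it assumes CTC-style per-frame posteriors with the blank unit at index 0, it drops frames by a fixed blank-posterior threshold (the names `drop_blank_frames`, `label_sync_greedy`, and the value 0.9 are hypothetical), and it replaces the Viterbi beam search with a simple greedy readout so that the example stays short.

```python
from typing import List, Sequence, Tuple

BLANK = 0  # assumption: the blank unit has index 0


def drop_blank_frames(
    posteriors: Sequence[Sequence[float]], blank_threshold: float = 0.9
) -> List[Tuple[int, Sequence[float]]]:
    """Keep (frame_index, frame) pairs whose blank posterior is below the threshold."""
    return [
        (t, frame)
        for t, frame in enumerate(posteriors)
        if frame[BLANK] < blank_threshold
    ]


def label_sync_greedy(
    posteriors: Sequence[Sequence[float]], blank_threshold: float = 0.9
) -> List[int]:
    """Greedy readout that only visits the frames kept by drop_blank_frames.

    A label repeated on two retained frames is merged only when the frames
    were adjacent in time; a dropped blank frame in between separates them.
    """
    labels: List[int] = []
    prev_t, prev_label = None, None
    for t, frame in drop_blank_frames(posteriors, blank_threshold):
        label = max(range(len(frame)), key=lambda i: frame[i])
        if label != BLANK:
            is_repeat = prev_label == label and prev_t == t - 1
            if not is_repeat:
                labels.append(label)
        prev_t, prev_label = t, label
    return labels


# Toy posteriors over three units: blank (0), label 1, label 2.
example = [
    [0.98, 0.01, 0.01],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
    [0.97, 0.02, 0.01],
    [0.95, 0.03, 0.02],
    [0.10, 0.80, 0.10],
    [0.99, 0.005, 0.005],
    [0.05, 0.05, 0.90],
]
kept = drop_blank_frames(example)
print(len(example), "frames,", len(kept), "searched")
print(label_sync_greedy(example))
```

In this toy input half of the frames are dominated by blank, so the search loop visits four frames instead of eight. A real decoder would run its beam search over the retained frames rather than take the per-frame argmax.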

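Keywords: automatic speech recognition; hidden Markov model; connectionist temporal classification; frame synchronous decoding; label synchronous decoding; variable frame rate; pruning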

Classification number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

 
