The Algorithm of Text Error-Detecting Based on Improved N-gram Model and Knowledge Bases
Cited by: 9


Authors: Wang Qiong; Kuang Wenzhen [2]; Xu Li [1]

Affiliations: [1] College of Automation and Electrical Engineering, Lanzhou Jiaotong University, Lanzhou 730070, Gansu, China; [2] Gansu Research Center of Automation Engineering Technology for Industry & Transportation, Research Institute, Lanzhou Jiaotong University, Lanzhou 730070, Gansu, China

Source: Computer Applications and Software (《计算机应用与软件》), 2021, No. 10, pp. 310-315, 320 (7 pages in total)

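Funding: Key Project of the Science and Technology Research and Development Program of China Railway Corporation (2016X003-H); 2019 Open Fund Project of the Gansu Research Center of Automation Engineering Technology for Industry & Transportation (GSITA201904).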

Abstract: Text produced by a speech recognition engine is prone to scattered-string errors and homophone errors. To address these error types, an error-detection algorithm is proposed that combines an improved N-gram model with an error-detection knowledge base of professional terminology. Witten-Bell smoothing is used to address data sparsity during N-gram model training, and weight allocation is added to the N-gram model to raise its detection rate for scattered-string errors. To handle the prescribed special railway terminology and homophone errors, a keyword-adapted professional-terminology error-detection knowledge base is constructed, with automatic updating of the knowledge base. In comparative experiments, the algorithm achieves an error-detection accuracy of 87.9%, which is 52.8 percentage points higher than that of a general N-gram error-detection model. The algorithm lays a foundation for subsequent error correction and for improving speech recognition accuracy, and it is significant for applying speech recognition technology in railway train operation systems.
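As an illustration of the smoothing step named in the abstract, the following is a minimal sketch of a character-level bigram model with Witten-Bell interpolated smoothing that flags low-probability transitions as possible errors. It is not the authors' implementation: the class name `WittenBellBigram`, the add-one unigram fallback, the toy corpus, and the threshold value are all assumptions made for this example, and the weight allocation and knowledge-base lookup described in the abstract are not modelled because the abstract does not specify them.

```python
from collections import Counter, defaultdict


class WittenBellBigram:
    """Character-level bigram model with Witten-Bell interpolated smoothing.

    Hypothetical helper for illustration only; not taken from the paper.
    """

    def __init__(self, sentences):
        self.unigrams = Counter()           # c(w)
        self.bigrams = Counter()            # c(h, w)
        self.history_count = Counter()      # c(h): times h is followed by something
        self.followers = defaultdict(set)   # distinct successors seen after h
        for sentence in sentences:
            tokens = ["<s>"] + list(sentence)
            self.unigrams.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.history_count[prev] += 1
                self.followers[prev].add(cur)
        self.total = sum(self.unigrams.values())
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, cur):
        # Assumption: add-one smoothed unigram as the lower-order distribution,
        # with one extra slot reserved for unseen characters.
        p_uni = (self.unigrams[cur] + 1) / (self.total + self.vocab_size + 1)
        c_h = self.history_count[prev]
        if c_h == 0:
            return p_uni  # unseen history: back off entirely
        t_h = len(self.followers.get(prev, ()))
        # Witten-Bell: P(w|h) = (c(h,w) + T(h) * P_lower(w)) / (c(h) + T(h))
        return (self.bigrams[(prev, cur)] + t_h * p_uni) / (c_h + t_h)

    def suspicious_positions(self, sentence, threshold):
        """Return 0-based indices of characters whose P(cur | prev) < threshold."""
        tokens = ["<s>"] + list(sentence)
        return [
            i
            for i, (prev, cur) in enumerate(zip(tokens, tokens[1:]))
            if self.prob(prev, cur) < threshold
        ]


# Toy usage: "x" never follows "b" in the training strings, so index 2 is flagged.
model = WittenBellBigram(["abcd", "abce", "abcd"])
print(model.suspicious_positions("abxd", threshold=0.05))  # [2]
```

In practice the threshold would have to be tuned on held-out recognized text, and a real system would train on segmented domain text rather than a toy corpus.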

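Keywords: N-gram model; standard railway train-operation terminology; scattered-string errors; professional-terminology error-detection knowledge base; homophone errors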

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
