Detecting Spelling Errors in Traditional Chinese Sentences Using Statistical Dictionaries Based on N-gram Segmentation

Original title: 基于字串切分统计词典的繁体中文拼写检错方法


Authors: Wang Yong (王勇)[1], Gu Lei (顾磊)[1]

Affiliation: [1] College of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

Source: Application Research of Computers (《计算机应用研究》), 2016, No. 5, pp. 1370-1373, 1378 (5 pages)

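Funding: National Natural Science Foundation of China (61302157); Humanities and Social Sciences Research Youth Fund of the Ministry of Education of China (12YJC870008); University Philosophy and Social Sciences Fund of the Jiangsu Provincial Department of Education (2013SJB870004); Jiangsu Social Science Research Cultural Excellence Project (12SWC-030)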

Abstract: This paper studies the problem of automatically detecting spelling errors in traditional Chinese sentences and proposes a detection approach that uses statistical dictionaries based on n-gram segmentation. The frequency with which character strings occur in a corpus serves as the evidence for detection: the strings and their frequencies are collected into statistical dictionaries, and a detection algorithm is designed that judges each sentence against statistical rules. The approach was evaluated on the 1,000-sentence test set used for the error detection subtask of the SIGHAN7 Chinese spelling check task, and its results were compared with those submitted to that bake-off. It achieved relatively high error location accuracy and precision and a relatively low false-alarm rate. The experimental results show that, compared with detection methods based on complex language models, this approach is easier to implement while still achieving good detection performance.
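The abstract does not spell out the dictionary layout or the statistical rules, so the following Python fragment is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a dictionary that counts every character n-gram in a corpus and a single illustrative rule: a character is suspicious when none of the bigrams it takes part in was seen in the corpus. The function names, the `max_n` and `min_count` parameters, and the toy corpus are all hypothetical.

```python
from collections import Counter


def build_ngram_dictionary(corpus, max_n=3):
    """Count every character n-gram (n = 1..max_n) in the corpus sentences."""
    counts = Counter()
    for sentence in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts


def detect_suspect_positions(sentence, counts, min_count=1):
    """Return indices of characters whose adjacent bigrams are all unseen.

    Illustrative rule only (an assumption, not the paper's rule set):
    a character is flagged when every bigram that contains it occurs
    fewer than min_count times in the dictionary.
    """
    suspects = []
    for i in range(len(sentence)):
        left = sentence[i - 1:i + 1] if i > 0 else None
        right = sentence[i:i + 2] if i + 1 < len(sentence) else None
        neighbours = [g for g in (left, right) if g is not None]
        if neighbours and all(counts[g] < min_count for g in neighbours):
            suspects.append(i)
    return suspects


# Toy corpus (hypothetical); a real dictionary would be built from a large corpus.
corpus = ["我們今天去學校", "今天天氣很好", "我們去學校上課"]
counts = build_ngram_dictionary(corpus)

# "學笑" is a misspelling of "學校"; the bigram "學笑" never occurs in the corpus.
print(detect_suspect_positions("我們今天去學笑", counts))  # [6]
```

A rule this simple will raise false alarms on any rare but correct string, which is presumably why the paper combines frequency information with a more careful set of rules; the sketch only shows how corpus counts can drive error location.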

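Keywords: Chinese language processing; traditional Chinese spelling error detection; Chinese word segmentation; n-gram segmentation; statistical dictionary; confusion set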

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
