Research on Method of Constructing Chinese Character Confusion Set (汉字种子混淆集的构建方法研究)  Cited by: 7

Authors: Shi Hengli (施恒利), Liu Liangliang (刘亮亮)[1,2], Wang Shi (王石)[3], Fu Jianhui (符建辉)[3], Zhang Zaiyue (张再跃)[1], Cao Cungen (曹存根)[3]

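Affiliations: [1] School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003; [2] Graduate School, University of Chinese Academy of Sciences, Beijing 100049; [3] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190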

Source: Computer Science (《计算机科学》), 2014, No. 8, pp. 229-232, 253 (5 pages)

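Funding: Supported by the Key Program of the National Natural Science Foundation of China (91224006; 61173063; 61035004) and the General Program of the National Natural Science Foundation of China (61203284)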

Abstract: Chinese character confusion sets are an important resource for detecting wrongly written characters. In this study, we first manually compiled the possible wrongly written forms of 11,935 Chinese characters. Taking these characters as nodes and the "can be miswritten as" relation as edges, we then organized the confusion set into a confusion-set graph. Because manually compiled errors are inherently limited in coverage, we designed two algorithms on top of this seed graph: a self-expansion algorithm that extends the graph from its own contents, and an external supplementing algorithm that draws on open-source data from a large corpus. Together they extend the graph and discover new wrongly written character pairs. In the experiments, 15,133 new pairs were found; proofreading of a random sample gave an accuracy of 87.35%.
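The graph representation described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how the self-expansion algorithm works, so the two-hop rule below (if A can be miswritten as B and B as C, propose A to C as a candidate) is purely an illustrative assumption, not the authors' method. The function names and the three seed pairs are likewise hypothetical.

```python
from collections import defaultdict


def build_confusion_graph(seed_pairs):
    """Build a directed graph: character -> set of characters it can be miswritten as."""
    graph = defaultdict(set)
    for correct, wrong in seed_pairs:
        graph[correct].add(wrong)
    return graph


def two_hop_candidates(graph):
    """Propose new (correct, wrong) pairs by following two edges.

    ASSUMPTION: this transitive rule is only an illustration of expanding
    a graph from its own contents; the paper's actual algorithm is not
    described in the abstract.
    """
    candidates = set()
    for a, wrongs in list(graph.items()):
        for b in wrongs:
            # graph.get avoids creating empty entries in the defaultdict
            for c in graph.get(b, ()):
                if c != a and c not in wrongs:
                    candidates.add((a, c))
    return candidates


# Hypothetical seed pairs (correct character, possible wrong character)
seed_pairs = [("己", "已"), ("已", "巳"), ("未", "末")]
graph = build_confusion_graph(seed_pairs)
candidates = two_hop_candidates(graph)
print(candidates)
```

Candidates produced this way would still need verification, which matches the abstract's report that the newly found pairs were checked by proofreading a random sample.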

Keywords: wrongly written character confusion set; self-expansion; open-source data; rule-based and statistical

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
