Authors: Shi Hengli (施恒利), Liu Liangliang (刘亮亮) [1,2], Wang Shi (王石) [3], Fu Jianhui (符建辉) [3], Zhang Zaiyue (张再跃) [1], Cao Cungen (曹存根) [3]
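Affiliations: [1] School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003; [2] Graduate School, University of Chinese Academy of Sciences, Beijing 100049; [3] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190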
Source: Computer Science (《计算机科学》), 2014, No. 8, pp. 229-232, 253 (5 pages)
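Funding: Supported by the Key Program of the National Natural Science Foundation of China (91224006; 61173063; 61035004) and the General Program of the National Natural Science Foundation of China (61203284)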
Abstract: Chinese character confusion sets are an important resource for identifying wrongly written characters. In this study, we first manually compiled the possible wrongly written forms of 11,935 Chinese characters. Taking these characters as nodes and the "can be miswritten as" relation as edges, we then built the confusion sets into a confusion set graph. Because manual compilation of wrongly written characters is inherently limited, we used this seed graph as a basis and designed two algorithms to extend it and discover new pairs of wrongly written characters: a self-expansion algorithm, and an external supplementation algorithm that draws on open-source data from a large corpus. In the experiments, 15,133 new pairs of wrongly written characters were found. Proofreading of a random sample gave an accuracy of 87.35%.
Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]