Authors: CHEN Qian (陈倩), LE Hongbing (乐红兵); School of Internet of Things Engineering, Jiangnan University, Wuxi 214122
Source: Computer & Digital Engineering (《计算机与数字工程》), 2020, No. 9, pp. 2238-2243 (6 pages)
Abstract: The dictionary is the basis of automatic Chinese word segmentation, and reducing intersection-type (overlapping) ambiguity improves segmentation accuracy. In dictionary-based segmentation, a traditional Trie stores one character per node and produces many null pointers when it is built. To optimize the dictionary storage structure, a double-character hash mechanism is adopted on top of the Trie: the depth of the Trie index is limited to 2, and the remaining characters of each word are arranged in order into a dictionary body similar to that of a "whole-word binary search" dictionary. Word frequency and part-of-speech attributes are added to the leaf nodes of each word group for use in the subsequent identification of intersection-type ambiguity. The dictionary is loaded with 150,000 high-frequency words compiled from the Sogou Lab Chinese Internet corpus, and five test corpora from different domains, averaging 60 KB each, serve as test samples. The experimental results show that, compared with other dictionaries, the double-character hash dictionary segments significantly faster and reaches a segmentation accuracy of 93.1%, which is broadly sufficient for practical Chinese information processing systems.
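Keywords: dictionary; automatic word segmentation; ambiguity segmentation; Trie; double-character hash storage; word frequency; part of speech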
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
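The storage scheme outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `DoubleCharHashDict`, the `segment` helper, the sample words, and the frequency values are all hypothetical, and forward maximum matching is an assumed segmentation strategy, since this record does not say which matching method the paper uses. The first two characters of a word are hashed (index depth 2), and the remaining characters are kept in a sorted list searched by binary search, with frequency and part of speech stored on each entry.

```python
from bisect import bisect_left


class DoubleCharHashDict:
    """Two-level hash index; remaining suffixes kept sorted for binary search."""

    def __init__(self):
        self.single = {}   # one-character words -> (freq, pos)
        self.index = {}    # first char -> {second char -> sorted [(suffix, freq, pos)]}
        self.max_len = 1   # longest word seen, bounds the matching window

    def add(self, word, freq, pos):
        if not word:
            raise ValueError("empty word")
        if len(word) == 1:
            self.single[word] = (freq, pos)
            return
        bucket = self.index.setdefault(word[0], {}).setdefault(word[1], [])
        suffix = word[2:]
        # (suffix,) sorts just before any (suffix, freq, pos) entry.
        i = bisect_left(bucket, (suffix,))
        if i < len(bucket) and bucket[i][0] == suffix:
            bucket[i] = (suffix, freq, pos)        # update an existing word
        else:
            bucket.insert(i, (suffix, freq, pos))  # keep the bucket sorted
        self.max_len = max(self.max_len, len(word))

    def lookup(self, word):
        """Return (freq, pos) for a known word, else None."""
        if len(word) < 2:
            return self.single.get(word)
        bucket = self.index.get(word[0], {}).get(word[1])
        if bucket is None:
            return None
        suffix = word[2:]
        i = bisect_left(bucket, (suffix,))
        if i < len(bucket) and bucket[i][0] == suffix:
            return bucket[i][1:]
        return None


def segment(text, dictionary):
    """Forward maximum matching (assumed strategy, not taken from the paper)."""
    tokens, i = [], 0
    while i < len(text):
        longest = min(dictionary.max_len, len(text) - i)
        for n in range(longest, 0, -1):
            # A single character is always emitted, even if unknown.
            if n == 1 or dictionary.lookup(text[i:i + n]) is not None:
                tokens.append(text[i:i + n])
                i += n
                break
    return tokens


if __name__ == "__main__":
    d = DoubleCharHashDict()
    d.add("中国", 100, "n")    # frequencies here are made-up illustrative values
    d.add("中国人", 50, "n")
    d.add("人民", 80, "n")
    print(segment("中国人民", d))  # ['中国人', '民']
```

The sample input shows an intersection-type ambiguity: greedy matching yields 中国人 / 民 although 中国 / 人民 is also a valid split. The frequency and part-of-speech values stored on each entry are the kind of information the abstract says is used to identify such cases; that resolution step is not sketched here.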