Ambiguity Resolution Method Based on an Improved Trie Tree  Cited by: 1

Authors: CHEN Qian; LE Hongbing[1]

Affiliation: [1] School of Internet of Things Engineering, Jiangnan University, Wuxi 214122

Source: Computer & Digital Engineering (《计算机与数字工程》), 2020, No. 9, pp. 2238-2243 (6 pages)

Abstract: A dictionary is the foundation of automatic Chinese word segmentation, and reducing intersection-type (overlapping) ambiguity improves segmentation accuracy. In dictionary-based segmentation, a traditional Trie tree stores one character per node and produces many null pointers when it is built. To optimize the dictionary's storage structure, a double-character Hash mechanism is adopted on top of the Trie tree: the depth of the Trie index tree is limited to 2, and the remaining characters of each word are arranged in order into a dictionary body similar to that of a "whole-word binary search" dictionary. In addition, word-frequency and part-of-speech attributes are attached to the leaf node of each group of words, for use in the subsequent recognition of intersection-type ambiguity. The dictionary was loaded with 150,000 high-frequency words derived from the Sogou Lab Chinese Internet corpus, and five test corpora from different domains, averaging 60 KB in size, served as test samples. The experimental results show that, compared with other dictionaries, the double-character Hash dictionary segments significantly faster and reaches a segmentation accuracy of 93.1%, which is largely sufficient for practical Chinese information processing systems.
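The storage layout described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `DoubleCharHashDict`, the helper `overlap_ambiguities`, and the `(word, frequency, part of speech)` entry format are assumptions made for illustration, and Python dicts stand in for the two hash levels. Each word's first two characters select a bucket, the remaining characters are kept in sorted order and located by binary search, and each entry carries its frequency and part of speech.

```python
from bisect import bisect_left
from collections import defaultdict


class DoubleCharHashDict:
    """Two-level character index plus sorted remainders (illustrative sketch)."""

    def __init__(self, entries):
        # entries: iterable of (word, frequency, part_of_speech) tuples.
        self.single = {}  # one-character words -> (freq, pos)
        buckets = defaultdict(lambda: defaultdict(list))
        for word, freq, pos in entries:
            if len(word) == 1:
                self.single[word] = (freq, pos)
            elif len(word) >= 2:
                # Index depth is limited to 2: first char, then second char.
                buckets[word[0]][word[1]].append((word[2:], freq, pos))
        self.index = {}
        for first, seconds in buckets.items():
            self.index[first] = {}
            for second, items in seconds.items():
                items.sort(key=lambda item: item[0])
                suffixes = [item[0] for item in items]
                attrs = [(item[1], item[2]) for item in items]
                self.index[first][second] = (suffixes, attrs)

    def lookup(self, word):
        """Return (freq, pos) if word is in the dictionary, else None."""
        if not word:
            return None
        if len(word) == 1:
            return self.single.get(word)
        bucket = self.index.get(word[0], {}).get(word[1])
        if bucket is None:
            return None
        suffixes, attrs = bucket
        rest = word[2:]
        i = bisect_left(suffixes, rest)  # binary search over sorted remainders
        if i < len(suffixes) and suffixes[i] == rest:
            return attrs[i]
        return None


def overlap_ambiguities(dictionary, text, max_len=4):
    """List pairs of dictionary words in text that overlap without nesting."""
    spans = [
        (i, j)
        for i in range(len(text))
        for j in range(i + 2, min(len(text), i + max_len) + 1)
        if dictionary.lookup(text[i:j]) is not None
    ]
    return [
        (text[i:j], text[k:m])
        for (i, j) in spans
        for (k, m) in spans
        if i < k < j < m
    ]


# Toy entries; the frequencies and tags are made up for this example.
toy = DoubleCharHashDict([
    ("结合", 500, "v"), ("合成", 300, "v"), ("结合体", 20, "n"),
    ("结", 50, "n"), ("成", 80, "v"),
])
print(toy.lookup("结合体"))
print(overlap_ambiguities(toy, "结合成"))
```

In this toy dictionary, "结合成" can be read as "结合" + "成" or as "结" + "合成", which is the kind of intersection-type ambiguity the stored frequency and part-of-speech values are meant to help resolve. How the paper combines those attributes is not stated in the abstract, so the sketch stops at detection.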

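Keywords: dictionary; automatic word segmentation; ambiguity segmentation; Trie tree; double-character Hash storage; word frequency; part of speech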

Classification: TP391.1 [Automation and Computer Technology, Computer Application Technology]

 
