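Affiliation: [1] School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
Source: Information Technology (《信息技术》), 2017, No. 11, pp. 167-171 (5 pages)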
Abstract: Chinese word segmentation is an essential prerequisite for Chinese-language search engines. To address the problem of choosing the maximum match length in the string matching of classical mechanical (dictionary-based) word segmentation, this paper proposes a Hash-based dictionary structure that prevents the maximum match length from being too long or too short. To detect ambiguity, a backtracking mechanism is introduced: each time the algorithm finishes looking up a word, it starts a new lookup that takes the last character of that word as the first character. Because backtracking multiplies the number of lookups, the paper further proposes an algorithm that checks whether a word's last character can serve as the first character of a word, which reduces the number of lookups and the time complexity. Compared with other hybrid methods, the proposed method offers faster lookup and better ambiguity handling.
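The abstract gives no implementation details, so the following is only a minimal sketch of the idea under stated assumptions: the dictionary is hashed by first character, each bucket records the length of its longest word (so the match window is never longer than necessary), and after every multi-character match the last character is tested as a possible word head before a backtracking lookup is issued. The function names (`build_dictionary`, `longest_match`, `segment`), the bucket layout, and the toy word list are all hypothetical and are not taken from the paper; the sketch only reports overlapping ambiguities and does not resolve them.

```python
from collections import defaultdict


def build_dictionary(words):
    """Hash words by first character; each bucket stores (word set, longest length)."""
    index = defaultdict(set)
    for word in words:
        index[word[0]].add(word)
    return {head: (bucket, max(len(w) for w in bucket))
            for head, bucket in index.items()}


def longest_match(text, start, dictionary):
    """Return the longest dictionary word starting at text[start], else one character."""
    entry = dictionary.get(text[start])
    if entry is None:
        return text[start]
    bucket, max_len = entry
    # The window is bounded by the longest word sharing this first character.
    for length in range(min(max_len, len(text) - start), 1, -1):
        candidate = text[start:start + length]
        if candidate in bucket:
            return candidate
    return text[start]


def segment(text, dictionary):
    """Forward maximum matching with a backtracking check for overlapping ambiguity."""
    tokens, ambiguities = [], []
    pos = 0
    while pos < len(text):
        word = longest_match(text, pos, dictionary)
        tokens.append(word)
        last = pos + len(word) - 1
        # Backtrack only if the last character can begin a word (cheap hash test).
        if len(word) > 1 and text[last] in dictionary:
            overlap = longest_match(text, last, dictionary)
            if len(overlap) > 1:
                ambiguities.append((word, overlap))
        pos += len(word)
    return tokens, ambiguities


# Toy word list, assumed for illustration only.
dictionary = build_dictionary(["研究", "研究生", "生命", "命", "起源", "的"])
tokens, ambiguities = segment("研究生命的起源", dictionary)
print(tokens)       # ['研究生', '命', '的', '起源']
print(ambiguities)  # [('研究生', '生命')]
```

In this toy run the backtracking lookup is issued only once, after "研究生", because "源" never begins a dictionary word and single-character tokens are skipped; that filtering is the kind of saving the abstract attributes to the last-character check.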
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]