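Affiliations: [1] Department of Computer Science and Technology, Tsinghua University, Beijing 100084 [2] Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Beijing 100084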
Source: Journal of Tsinghua University (Science and Technology), 2009, No. S1, pp. 1328-1332 (5 pages)
Funding: Supported by the National Natural Science Foundation of China (60703051)
Abstract: Mainstream commercial search engines rely mainly on exact keyword matching. To improve retrieval when the user's input contains errors, an indexed approximate matching algorithm for Chinese is proposed. The algorithm uses three measures of similarity between Chinese strings: the character edit distance, the Pinyin edit distance, and the Pinyin improved edit distance. These measures are used to expand the user's query, so that the approximate matching task is converted into several exact matching sub-tasks. The results of the exact matches are then merged and sorted by their similarity to the original query string. In experiments on a webpage text corpus, the method with the Pinyin improved edit distance achieved 60.42% precision and 50.41% recall with only a small increase in time and space complexity.
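The query-expansion idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the `TOY_PINYIN` table is a tiny hypothetical lexicon, the Pinyin edit distance is computed here over whole syllables (the abstract does not say whether syllables or letters are the unit), the vocabulary stands in for the index, and the `max_distance` threshold is invented. The Pinyin improved edit distance is not reproduced because the abstract does not define it.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Classic Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical toy lexicon; a real system would need a full Pinyin dictionary.
TOY_PINYIN: Dict[str, str] = {
    "清": "qing", "青": "qing", "华": "hua", "花": "hua",
    "大": "da", "学": "xue", "北": "bei", "京": "jing",
}


def to_pinyin(text: str) -> List[str]:
    """Map each character to its toy Pinyin syllable (unknowns pass through)."""
    return [TOY_PINYIN.get(ch, ch) for ch in text]


def char_distance(a: str, b: str) -> int:
    """Edit distance over Chinese characters."""
    return edit_distance(a, b)


def pinyin_distance(a: str, b: str) -> int:
    """Edit distance over Pinyin syllables (assumed unit, see lead-in)."""
    return edit_distance(to_pinyin(a), to_pinyin(b))


def expand_query(query: str,
                 vocabulary: Sequence[str],
                 distance: Callable[[str, str], int],
                 max_distance: int = 1) -> List[Tuple[str, int]]:
    """Return indexed terms within max_distance of the query, closest first.

    Each returned term can then be looked up by exact matching.
    """
    scored = [(term, distance(query, term)) for term in vocabulary]
    kept = [(term, d) for term, d in scored if d <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])


# Hypothetical index vocabulary and a query typed with homophone errors.
VOCAB = ["清华大学", "北京大学", "清华"]
QUERY = "青花大学"

if __name__ == "__main__":
    print(expand_query(QUERY, VOCAB, char_distance))
    print(expand_query(QUERY, VOCAB, pinyin_distance))
```

With this toy data, the character-level measure finds no candidate within distance 1, because two characters differ, while the Pinyin-level measure maps the misspelled query to 清华大学 at distance 0. This shows why a pronunciation-based measure can help with homophone input errors.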
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]