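Affiliation: [1] National Research Center of Cultural Industries, Central China Normal University, Wuhan 430079
Source: Application Research of Computers (《计算机应用研究》), 2017, No. 6, pp. 1756-1761 (6 pages)
Funding: National Key Technology R&D Program of China (2012BAH83F00)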
Abstract: To address the low efficiency of searching massive place-name data in traditional relational databases, this paper proposes a fast retrieval method for place-name databases that combines the PanGu Chinese word segmenter with Lucene full-text search. First, a place-name table structure is designed, the Chinese segmentation performance of several common open-source analyzers is compared, and the best-performing one, PanGu, is selected; its dictionary is extended so that Chinese place names are segmented effectively. Second, in-memory indexing and multi-threaded parallel processing are used to speed up the creation of Lucene's inverted index, and the relevance ranking of search results is tuned according to the category and display-priority attributes of place names. Finally, a Web place-name search system with fast search and map-based location display is developed and tested on more than 5 million real place-name records: the average query time is under 1 s, nearly 15 times faster than fuzzy queries on a MySQL database, and the matches are more accurate. The method can provide an efficient and flexible public search service for massive place-name data.
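Keywords: Lucene; place names; full-text search; database; Chinese word segmentation; relevance ranking
Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]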
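
The pipeline summarized in the abstract (dictionary-based segmentation, an inverted index, and ranking that takes display priority into account) can be illustrated with a minimal sketch. This is not the paper's implementation: it uses neither PanGu nor Lucene, it substitutes plain forward maximum matching for the segmenter, and every name in it (PLACE_DICT, SAMPLE_RECORDS, segment, build_index, search, the "priority" field) is hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical place-name dictionary (assumption). The paper extends the
# PanGu dictionary; here a small set stands in for it.
PLACE_DICT = {"武汉", "武汉市", "洪山区", "华中师范大学", "大学"}
MAX_WORD_LEN = max(len(w) for w in PLACE_DICT)

# Hypothetical records: a lower "priority" value means a more prominent place.
SAMPLE_RECORDS = {
    1: {"name": "武汉市洪山区", "priority": 2},
    2: {"name": "华中师范大学", "priority": 3},
    3: {"name": "武汉市", "priority": 1},
}


def segment(text):
    """Forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in PLACE_DICT:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_index(records):
    """Inverted index: map each token to the ids of records containing it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for token in segment(rec["name"]):
            index[token].add(rec_id)
    return index


def search(query, index, records):
    """Rank by number of matched query tokens (descending), then by display
    priority (ascending), then by id so the order is deterministic."""
    hits = defaultdict(int)
    for token in set(segment(query)):
        for rec_id in index.get(token, ()):
            hits[rec_id] += 1
    return sorted(hits, key=lambda r: (-hits[r], records[r]["priority"], r))


INDEX = build_index(SAMPLE_RECORDS)
```

Calling search("武汉市", INDEX, SAMPLE_RECORDS) looks up one token in the index instead of scanning every row, which is the basic reason an inverted index can outperform a LIKE-style fuzzy query; the tie between the two matching records is then broken by the priority field. Lucene's actual scoring is considerably more elaborate than this match count.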