Research and Design of a Vertical Search Engine for Agricultural Scientific Research Office Work  Cited by: 1

On Design of Vertical Search Engine toward Agricultural Scientific Research Office


Authors: 李昀 (LI Yun); 邓颖 (DENG Ying); 吴华瑞 (WU Hua-rui) [2,3,4]

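Affiliations: [1] Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; [2] National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China; [3] Beijing Research Center for Information Technology in Agriculture, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; [4] Key Laboratory of Agri-informatics, Ministry of Agriculture and Rural Affairs, Beijing 100097, China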

Source: Journal of Southwest China Normal University (Natural Science Edition), 2020, No. 9, pp. 43-50 (8 pages)

Funding: 2020 Construction Project of the Key Laboratory of Agri-informatics, Ministry of Agriculture and Rural Affairs (PT2020-03).

Abstract: In agricultural scientific research office work, researchers search for information frequently and need highly precise results. When traditional general-purpose search engines are used to retrieve narrowly targeted agricultural information, such as practical agricultural techniques, policies and regulations, and thematic data, they usually return a very large volume of results spanning a broad range of topics. The content is therefore imprecise, the search scope is too wide, and the filtered results are not specialized enough. Moreover, the search strategies of current mainstream vertical search engines in the agricultural domain are built mainly on traditional text retrieval. Because the amount of data in their own domain is limited, their recall is low, and their results have no real ranking basis (most are sorted only by publication time).

This paper studies a search engine for agricultural Internet information. The relevant modules of the data sources are located first; these sources include the websites of agricultural administrative departments at all levels, agricultural colleges and research institutes, agricultural news and magazine websites, and agricultural commercial websites. A crawler then performs update detection and scheduled crawling on those modules, which effectively reduces irrelevant information at the source. Based on information extracted from the agriculture-related modules of several hundred Internet data sources, word2vec and the text-feature-based doc2vec proposed in this paper are used to build an agricultural word vector space and a document vector space, respectively. The two spaces handle search scenarios in which the query is an unordered set of words or an ordered sentence or paragraph, so that the vertical search is both intelligent and accurate.

Experiments show that the proposed doc2vec+tf-idf search algorithm achieves high accuracy in ordered (sequential) search. Combined with word2vec-based unordered (nonsequential) search, choosing the semantic search algorithm according to the query type can further improve the precision of the search engine and meet the growing demand for efficient, high-quality search of agricultural information.
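As a rough illustration of the TF-IDF part of the ranking idea mentioned in the abstract, the sketch below scores a few toy documents against a query by cosine similarity of TF-IDF vectors. It is not the paper's doc2vec+tf-idf algorithm, whose details the abstract does not give: the embedding component is left out entirely, and the corpus, the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions made for this example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); a real system would index crawled agricultural pages.
DOCS = [
    "wheat rust disease control techniques",
    "subsidy policy for agricultural machinery purchase",
    "greenhouse tomato irrigation and fertilization techniques",
]


def tokenize(text):
    # Assumption: whitespace tokenization; Chinese text would need a word segmenter.
    return text.lower().split()


def tfidf_vector(tokens, idf):
    # Term frequency times IDF, ignoring terms that never occur in the corpus.
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def build_index(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: count each term once per document
    # Smoothed IDF (an assumed variant) so every indexed term gets a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [tfidf_vector(toks, idf) for toks in tokenized]
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, idf, vectors, top_k=3):
    # Rank documents by similarity to the query; drop documents with no overlap.
    q = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored[:top_k] if score > 0.0]


IDF, VECTORS = build_index(DOCS)
RESULTS = search("tomato irrigation techniques", DOCS, IDF, VECTORS)
```

Here `RESULTS` lists the tomato irrigation document first and the wheat rust document second, because the latter shares only the common term `techniques` with the query. Ranking by similarity score rather than by publication time is the point of contrast with the time-ordered results the abstract criticizes.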

Keywords: agricultural information search engine; semantic similarity; word2vec; doc2vec; TF-IDF; intelligent text search

Classification code: S126 [Agricultural Sciences: Basic Agricultural Sciences]

 
