Application of Text Summary Technology in Web Crawler

Authors: GAO Wei; MA Hui; LI Da-zhou; WANG Huai-zhong (Shenyang University of Chemical Technology, Shenyang 110142, China)

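Affiliation: [1] School of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110142, Liaoning, China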

Source: Journal of Shenyang University of Chemical Technology, 2022, No. 1, pp. 82-89 (8 pages)

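Funding: Science and Technology Research Project of the Liaoning Provincial Department of Education (L2016011); Scientific Research Project of the Liaoning Provincial Department of Education (LQ2017008); Liaoning Provincial Doctoral Start-up Fund Project (201601196).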

Abstract: Traditional focused crawlers do not process crawled data well, which makes it difficult to extract valuable information from the raw data; at the same time, the large volume of redundant data strains the computer's storage capacity. This study proposes a crawler algorithm based on extractive text summarization that applies an improved TextRank algorithm to the web crawler, so that users can quickly browse and absorb the full content of news in a specific domain while conserving computer memory resources. The GloVe model is trained on the dataset to produce word-vector representations of the text, and the idea of the k-means algorithm is incorporated into TextRank to form the improved TextRank model. Experimental results show that the summaries extracted by the improved TextRank model are of higher quality than those of the traditional TextRank and TopicModel models: its comprehensive evaluation index reaches 52.21%, which is 10.29% higher than the TopicModel model and 15.55% higher than the traditional TextRank model. The ratio of the file space occupied by the focused crawler combined with extractive text summarization to that of the traditional focused crawler is 1:12, which addresses the problem of crawlers consuming large amounts of computer resources.
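The abstract does not say how the k-means idea is folded into TextRank, so the following is only a minimal sketch of one plausible combination, not the authors' method. It assumes that sentences are represented by averaged word vectors, ranked by TextRank over a cosine-similarity graph, and clustered by k-means, with the top-ranked sentence of each cluster kept as the summary. The tiny `toy_vectors` dictionary is a hypothetical stand-in for trained GloVe embeddings, and every function name is illustrative.

```python
import math

def sentence_vector(sentence, word_vectors, dim):
    """Average the vectors of known words (stand-in for GloVe embeddings)."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def textrank(vectors, damping=0.85, iters=50):
    """Weighted PageRank over the sentence-similarity graph."""
    n = len(vectors)
    # Negative similarities are clamped to 0 so edge weights stay non-negative.
    sim = [[max(cosine(vectors[i], vectors[j]), 0.0) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(sim[j][i] / out[j] * scores[j]
                            for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores

def kmeans(vectors, k, iters=20):
    """Plain k-means; the first k vectors serve as initial centroids (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def summarize(sentences, word_vectors, dim, k):
    """Keep the highest-ranked sentence of each cluster, in document order."""
    vectors = [sentence_vector(s, word_vectors, dim) for s in sentences]
    scores = textrank(vectors)
    labels = kmeans(vectors, k)
    best = {}
    for i, (label, score) in enumerate(zip(labels, scores)):
        if label not in best or score > scores[best[label]]:
            best[label] = i
    return [sentences[i] for i in sorted(best.values())]

# Hypothetical 2-dimensional "embeddings" for illustration only.
toy_vectors = {
    "crawler": [1.0, 0.0], "spider": [0.9, 0.1],
    "summary": [0.0, 1.0], "abstract": [0.1, 0.9],
}
sentences = ["crawler spider", "summary abstract", "crawler", "summary"]
summary = summarize(sentences, toy_vectors, dim=2, k=2)
print(summary)
```

Selecting one sentence per cluster is a common way to reduce redundancy in an extractive summary, since TextRank alone tends to rank near-duplicate sentences equally high; whether the paper uses this particular strategy is not stated in the abstract.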

Keywords: TextRank algorithm; GloVe model; k-means algorithm; extractive text summarization; focused crawler

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
