Research on structure-driven crawling paths for Web forums (cited by: 1)

Structure-driven based traversal strategy for Web forum crawling

Authors: Li Hengxun (李恒训)[1,2], Li Nanbo (李南波)[3], Qiu Yongqin (邱泳钦)[2,4], Xu Yan (徐燕)[2,4], Liu Jingang (刘金刚)[1,2]

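Affiliations: [1] Joint Institute of Computer Research, Capital Normal University, Beijing 100048; [2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [3] College of Information Engineering, Xiangtan University, Xiangtan, Hunan 411105; [4] College of Information Science, Beijing Language and Culture University, Beijing 100190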

Source: Application Research of Computers (《计算机应用研究》), 2011, No. 9, pp. 3284-3287 (4 pages)

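Funding: National Natural Science Foundation of China (60873166); Key Science and Technology Research Project of the Ministry of Education of China (109028); Beijing Education Science Foundation (AHA09110)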

Abstract: Web forums contain a large amount of information with practical and commercial value, and they are an important source of information for search engines and question answering systems. Crawling them is difficult because forums have complex structures and many kinds of links, and a crawler can easily fall into a crawling trap. To address these problems, this paper proposes a structure-driven method for selecting crawling paths. First, starting from a small amount of page-type data labeled by users, the sampled pages are clustered by structure using their DOM trees. Second, crawling paths are selected according to an evaluation of each node. Finally, page-flipping (pagination) links are identified and handled. Experiments show that the coverage and efficiency of the method are clearly better than those of traditional algorithms, and that it achieved good results when applied in the golaxy public opinion monitoring system of ICT (Institute of Computing Technology, Chinese Academy of Sciences).
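
The first step of the abstract, clustering sampled pages by DOM structure, can be illustrated with a minimal sketch. The abstract does not say which structural features or similarity measure the authors use, so everything here is an assumption made for illustration: each page is reduced to the set of its root-to-node tag paths, pages are compared with Jaccard similarity, and a greedy single-pass clustering with a hypothetical `threshold` of 0.7 stands in for the paper's actual clustering procedure.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collect the set of root-to-node tag paths seen in one HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural signature of a page (assumed feature set)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_structure(pages, threshold=0.7):
    """Greedy clustering: a page joins the first cluster whose
    representative signature is at least `threshold` similar to it."""
    clusters = []  # list of (representative_signature, [page keys])
    for key, html in pages.items():
        signature = tag_paths(html)
        for representative, members in clusters:
            if jaccard(signature, representative) >= threshold:
                members.append(key)
                break
        else:
            clusters.append((signature, [key]))
    return [members for _, members in clusters]
```

Under these assumptions, two thread-list pages built from the same table template share every tag path and land in one cluster, while a post page built from nested `div` elements shares only the `html` and `html/body` paths with them and forms its own cluster. A crawler could then treat each cluster as one page type when deciding which links to follow.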

Keywords: information retrieval; forum crawling; structure-driven; clustering; path selection

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
