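Affiliations: [1] School of Computer and Information Engineering, Henan University of Economics and Law, Zhengzhou 450002; [2] School of Computer Science, Wuhan University, Wuhan 430072
Source: Chinese Journal of Computers (《计算机学报》), 2015, No. 2, pp. 349-364 (16 pages)
Funding: Supported by the National Natural Science Foundation of China (61272109; 61202285); the National Spark Program of China (2012GA750007); the Basic and Frontier Technology Research Project of the Henan Provincial Department of Science and Technology (122300410378); and the Key Science and Technology Research Project of the Henan Provincial Department of Education (13A520032).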
Abstract: Extracting the important content of a Web page, such as its text and links, is valuable in many areas of Web mining research. The problem is currently addressed mainly through Web page segmentation and informative block recognition. Existing methods, however, treat the identification of important text and of important links as two independent problems. This ignores the inherent semantic relationship between the text and links in a page and also lowers processing efficiency. This paper proposes a unified framework for mining the important content of a Web page. The framework has three components. First, the Web page is converted into a DOM tree, which is then segmented into page blocks using node density entropy as the metric. Second, a semi-supervised method based on K-nearest-neighbor label propagation automatically extends the training set of page blocks. Third, an SVM classifier is trained on the extended training set and used to classify page blocks. The framework can distinguish multiple types of Web page blocks and is independent of the type and layout of the Web pages. Extensive experiments in a real Web environment show that the proposed method is effective.
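Keywords: page segmentation; node density; label propagation; DOM tree; block classification; social computing; social networks
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]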
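Illustrative note: the second component named in the abstract, K-nearest-neighbor label propagation, can be sketched roughly as follows. This is a simplified hard-label variant written only for illustration and is not the paper's actual algorithm. The function name `propagate_labels`, the two block features (text density, link density), and the toy values are all assumptions that do not come from the record above. The segmentation and SVM stages are not covered.

```python
import math
from collections import Counter


def propagate_labels(features, seed_labels, k=3, max_rounds=20):
    """Spread seed labels to unlabeled points over a k-nearest-neighbor graph.

    features:    list of numeric tuples, one per page block (hypothetical features).
    seed_labels: list of the same length; a class name for labeled blocks, None otherwise.
    Returns a new list of labels; points that are never reached stay None.
    """
    labels = list(seed_labels)
    for _ in range(max_rounds):
        updates = {}
        for i, point in enumerate(features):
            if labels[i] is not None:
                continue  # labels that are already set are kept fixed
            # k nearest neighbors of this point among all other points
            neighbours = sorted(
                (j for j in range(len(features)) if j != i),
                key=lambda j: math.dist(point, features[j]),
            )[:k]
            # majority vote among the neighbors that already carry a label
            votes = Counter(labels[j] for j in neighbours if labels[j] is not None)
            if votes:
                updates[i] = votes.most_common(1)[0][0]
        if not updates:
            break  # nothing new was labeled in this round
        # apply all updates at once so a round only uses the previous round's labels
        for i, label in updates.items():
            labels[i] = label
    return labels


# Toy page blocks as (text density, link density); the values are made up.
blocks = [(0.9, 0.1), (0.8, 0.15), (0.7, 0.2), (0.1, 0.9), (0.2, 0.8), (0.3, 0.7)]
seeds = ["text", None, None, "link", None, None]
print(propagate_labels(blocks, seeds, k=2))
# ['text', 'text', 'text', 'link', 'link', 'link']
```

In this sketch the enlarged label list would serve as the extended training set for a downstream classifier. Updates are applied once per round so that the result does not depend on the order in which blocks are visited.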