一种基于逆序匹配重复模式的主题信息提取方法

A THEME INFORMATION EXTRACTION METHOD BASED ON REPETITIVE PATTERN REVERSE MATCHING

机构地区：[1]广东工贸职业技术学院计算机工程系,广东广州510510 [2]中山大学信息科学与技术学院,广东广州510006

出　　处：《计算机应用与软件》2013年第4期88-91,共4页Computer Applications and Software

基　　金：国家自然科学基金项目(61003045)

摘　　要：网页中的信息主要以重复的HTML结构进行组织并形成一致的展现形式,主要研究具备复杂重复模式的网页主题信息块识别,提出一种改进的基于逆序匹配重复模式的算法。该算法依据HTML标签结构和class属性改进DOM树,重构页面的向量空间模型,逆序匹配重复结构模式并完成对主题信息的提取。实验结果表明,该方法能准确识别复杂页面结构中主题重复模式,有效避免非主题重复模式的干扰,有较好的召回率和准确率。The information in webpage is mainly arranged with repetitive HTML structure and presents in consistent display style.In the paper we put emphasis on studying the recognition of the webpage theme information with complicated repetitive pattern and propose an improved algorithm which is based on repetitive pattern reverse matching.The method improves document tree model in accordance with HTML tag structure and class property,reconstructs vector space model of the pages,reversely matches the repetitive structure pattern and then completes the extraction of the theme information.Experimental results suggest that this method can precisely recognise the theme repetitive pattern in complicated webpage structure,effectively avoid the disturbance from non-theme repetitive pattern blocks and performs well in precision and recall.

关键词：信息提取重复模式主题识别逆序匹配

分类号：TP391[自动化与计算机技术—计算机应用技术]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

一种基于逆序匹配重复模式的主题信息提取方法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

一种基于逆序匹配重复模式的主题信息提取方法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索