Novel Method of Web Page Segmentation Based on Title Machine Learning (基于标题机器学习的网页分割方法)

Cited by: 1

Authors: LI Jin-sheng [1], LE Hui-xiao [2], TONG Ming-wen [2]

Affiliations: [1] Modern Education Technical Center, The Open University of Wuhan, Wuhan 430033, China; [2] School of Education Information Technology, Central China Normal University, Wuhan 430079, China

Source: Computer Science (《计算机科学》), 2018, Issue B06, pp. 583-587 (5 pages)

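Funding: Supported by the Humanities and Social Sciences Fund of the Ministry of Education, project "Research on a Decision Model for Accessible Adaptation of Digital Learning Resources" (15YJA880062)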

Abstract: Existing web page segmentation methods are all built on the Document Object Model (DOM) and are difficult to implement. To address this, a new method is proposed that segments web pages using a string data model. The method learns the features of web page titles through machine learning and then uses the titles to segment the page. First, title features are learned from the page's line-block distribution function and its title tags. Second, the page is partitioned into content blocks at the detected titles. Finally, the content blocks are merged according to block depth, which completes the segmentation. Theoretical analysis and experimental results show that the algorithms in the method have O(n) time and space complexity, and that the method segments university portal pages, blog posts, and resource-site pages well. The method can be used in a variety of web page information management applications.
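The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The record does not give the learned title features, so HTML heading tags (`<h1>` to `<h6>`) stand in for the learned title detector, the heading level stands in for block depth, and the window size `k` and threshold `max_depth` are hypothetical parameters. Each function makes a single pass over the page lines, in keeping with the O(n) claim.

```python
import re
from dataclasses import dataclass, field

TAG = re.compile(r"<[^>]+>")                       # any HTML tag
HEADING = re.compile(r"<h([1-6])\b", re.IGNORECASE)  # stand-in title detector


def line_block_lengths(lines, k=3):
    """Text length of every k-line block, via a sliding window (one pass)."""
    text_len = [len(TAG.sub("", ln).strip()) for ln in lines]
    if len(text_len) < k:
        return []
    window = sum(text_len[:k])
    out = [window]
    for i in range(k, len(text_len)):
        window += text_len[i] - text_len[i - k]
        out.append(window)
    return out


@dataclass
class Block:
    title: str
    depth: int                      # assumption: heading level used as depth
    lines: list = field(default_factory=list)


def split_by_titles(lines):
    """Start a new content block at every line that looks like a title."""
    blocks = []
    for ln in lines:
        m = HEADING.search(ln)
        if m:
            blocks.append(Block(TAG.sub("", ln).strip(), int(m.group(1))))
        elif blocks:
            text = TAG.sub("", ln).strip()
            if text:
                blocks[-1].lines.append(text)
    return blocks


def merge_by_depth(blocks, max_depth=2):
    """Fold blocks deeper than max_depth into the preceding block."""
    merged = []
    for b in blocks:
        if merged and b.depth > max_depth:
            merged[-1].lines.append(b.title)
            merged[-1].lines.extend(b.lines)
        else:
            merged.append(b)
    return merged


page = """<h1>News</h1>
<p>Campus opens.</p>
<h3>Details</h3>
<p>Monday 9am.</p>
<h2>Events</h2>
<p>Concert on Friday.</p>""".splitlines()

blocks = merge_by_depth(split_by_titles(page))
for b in blocks:
    print(b.title, b.lines)
```

In this sketch the page is handled purely as a list of strings, with no DOM tree built, which is the property the abstract emphasizes. `line_block_lengths` is included only to show how a line-block distribution can be computed in linear time; how the paper combines it with title tags during learning is not stated in this record.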

Keywords: web page segmentation; title; line-block distribution function; block depth; machine learning

Classification code: TP37 [Automation and Computer Technology: Computer System Architecture]
