Content Extraction from News Web Pages Based on Decision Trees and Unit Distance

Authors: WANG Xiao (王晓); LUO Yong-lian (罗永莲) (School of Information Technology & Engineering, Jinzhong University, Jinzhong, Shanxi 030619, China)

Affiliation: [1] School of Information Technology & Engineering, Jinzhong University

Source: Journal of Jinzhong University (《晋中学院学报》), 2019, No. 3, pp. 66-71 (6 pages)

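Funding: Shanxi Province Education Science "13th Five-Year Plan" project, "Research on Teaching Models for Big-Data-Related Majors Based on the Concept of Innovation and Entrepreneurship Education" (GH-18091); Jinzhong University Teaching Reform and Innovation Project, "A Case Study on Integrating Innovation and Entrepreneurship Education into Professional Education for the Data Science and Big Data Technology Major" (Jg201807)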

Abstract: To address the problem of processing text in news web pages, a method is proposed that extracts news headlines with a decision tree and identifies the news body using unit distance. Text similarity, web page tags, and tag attributes serve as the test attributes for node selection in the decision tree. The information entropy of each attribute is computed from both headline-related and headline-unrelated factors. On this basis, a decision tree is built and the news headline is located according to the resulting rules. The nesting of web page tags is then used to narrow the search range, and the news body is located from the distinct boundaries between the information blocks of the page. Experimental results show that the method extracts news headlines with an accuracy above 87% and extracts news body text with an average accuracy of 76%. The approach may also serve as a reference for other kinds of web page text processing.
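The attribute selection step described in the abstract can be illustrated with a minimal sketch. It assumes standard Shannon entropy and information gain, not the paper's own entropy calculation, which the abstract does not spell out. The function names, the attribute names `tag` and `similar_to_page_title`, the label `is_title`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, attribute, label_key="is_title"):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([row[label_key] for row in rows])
    # Group the class labels by the value the attribute takes in each row.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label_key])
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical candidate nodes from a news page (assumed data, not from the paper).
candidates = [
    {"tag": "h1", "similar_to_page_title": True, "is_title": True},
    {"tag": "h1", "similar_to_page_title": True, "is_title": True},
    {"tag": "h2", "similar_to_page_title": True, "is_title": False},
    {"tag": "p", "similar_to_page_title": False, "is_title": False},
]

attributes = ["tag", "similar_to_page_title"]
gains = {a: information_gain(candidates, a) for a in attributes}
# The attribute with the largest gain would be tested at the current tree node.
best_attribute = max(gains, key=gains.get)
```

In this toy data, splitting on `tag` separates headline from non-headline nodes completely, so it would be chosen as the first test attribute. A real implementation would use the paper's modified entropy in place of `entropy`.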

Keywords: information gain; decision tree; news web pages; content extraction; web page information blocks

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
