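Authors: WANG Xiao, LUO Yong-lian (School of Information Technology & Engineering, Jinzhong University, Jinzhong, Shanxi 030619, China)
Affiliation: [1] School of Information Technology and Engineering, Jinzhong University
Source: Journal of Jinzhong University (《晋中学院学报》), 2019, No. 3, pp. 66-71 (6 pages)
Funding: Shanxi Province Education Science "13th Five-Year Plan" project "Research on Teaching Models for Big Data-Related Majors Based on the Concept of Innovation and Entrepreneurship Education" (GH-18091); Jinzhong University Teaching Reform and Innovation project "Case Study on Integrating Innovation and Entrepreneurship Education into Professional Education for the Data Science and Big Data Technology Major" (Jg201807)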
Abstract: To address the problem of processing text in news web pages, a method is proposed that extracts news headlines with a decision tree and identifies the body text using unit distance. The method takes text similarity, web page tags, and tag attributes as the test attributes for selecting decision tree nodes. The information entropy of each attribute is calculated with both headline-related and headline-unrelated factors taken into account. A decision tree is built on this basis, and the news headline is located according to rules. The nesting of web page tags is used to narrow the search range, and the news body is located by the salient boundaries between the information blocks of a page. Experimental results show that the method extracts news headlines with an accuracy above 87% and extracts body text with an average accuracy of 76%, and that it may serve as a reference for other web page text processing tasks.
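Keywords: information gain; decision tree; news web pages; content extraction; web page information blocks
Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]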
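
The abstract names information entropy as the basis for choosing decision tree nodes but does not give the formula, in particular how headline-related and headline-unrelated factors are combined. As a rough illustration only, the sketch below uses the standard ID3 information gain instead of the authors' variant. The attribute names (`tag`, `similar_to_page_title`), the label `is_title`, and the toy rows are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, attribute, label_key="is_title"):
    """Standard ID3 information gain from splitting rows on one attribute.

    Assumption: this is the textbook definition, not the modified
    entropy calculation described in the abstract.
    """
    labels = [row[label_key] for row in rows]
    # Group the class labels by the value the attribute takes.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label_key])
    # Weighted entropy remaining after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a))


# Hypothetical candidate text nodes from a news page (illustrative data only).
CANDIDATES = [
    {"tag": "h1", "similar_to_page_title": True, "is_title": True},
    {"tag": "h1", "similar_to_page_title": True, "is_title": True},
    {"tag": "h2", "similar_to_page_title": False, "is_title": False},
    {"tag": "p", "similar_to_page_title": False, "is_title": False},
    {"tag": "p", "similar_to_page_title": True, "is_title": False},
    {"tag": "div", "similar_to_page_title": False, "is_title": False},
]

if __name__ == "__main__":
    for name in ("tag", "similar_to_page_title"):
        print(name, round(information_gain(CANDIDATES, name), 4))
    print("root test attribute:", best_attribute(CANDIDATES, ["tag", "similar_to_page_title"]))
```

On this toy data the `tag` attribute separates headlines from non-headlines completely, so it would be chosen as the root test; real pages would need many more candidate nodes and attributes.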