News Title Extraction Algorithm Based on Density and Text Features  Cited by: 6


Authors: 彭圳生 (PENG Zhensheng); 巩青歌 (GONG Qingge); 高志强 (GAO Zhiqiang); 段妍羽 (DUAN Yanyu); 曾子贤 (ZENG Zixian)

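Affiliations: [1] Institute of Information Engineering, Engineering University of PAP, Xi'an, Shaanxi 710086, China [2] Key Laboratory of Military Big Data and Cloud Computing, Xi'an, Shaanxi 710086, China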

Source: Journal of Chinese Information Processing (《中文信息学报》), 2018, No. 10, pp. 78-86 (9 pages)

Funding: Natural Science Foundation for Young Scientists of Shaanxi Province, China (2015JQ6224)

Abstract: To automatically extract news titles from large numbers of Web pages with complex, nonstandard structure, this paper proposes a news title extraction algorithm based on density and text features (title extraction with density and text-features, TEDT). A corpus decision model that combines a page's text density distribution with its language features divides the page into a corpus area and a title candidate area. After the corpus is selected, the TextRank algorithm computes the corresponding key-value weight set, and an improved similarity calculation method then extracts the news title from the title candidate area. The algorithm effectively separates the corpus and title regions, reduces interference from page noise, and extracts news titles accurately. Experimental results show that TEDT achieves higher accuracy and recall than traditional rule-based and similarity-based news title extraction algorithms, and that it is effective not only on mainstream news websites but also widely applicable to complex, nonstandard Web pages.
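The pipeline outlined in the abstract (split the page into corpus and title candidates, weight corpus words with TextRank, then pick the candidate most similar to the corpus) can be illustrated with a minimal Python sketch. This is not the paper's TEDT implementation: the abstract does not specify the corpus decision model or the improved similarity measure, so the length-and-punctuation rule in `split_blocks` and the average-keyword-weight score in `pick_title` are simple stand-ins, and every function name and parameter here is hypothetical. The tokenizer also assumes English text; Chinese pages would need a word segmenter.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercased alphanumeric tokens (assumption: English text).
    return re.findall(r"[a-z0-9]+", text.lower())

def split_blocks(blocks, min_tokens=12):
    # Hypothetical stand-in for the corpus decision model: long blocks that
    # contain sentence punctuation are corpus, other non-empty blocks are
    # title candidates.
    corpus, candidates = [], []
    for block in blocks:
        n = len(tokenize(block))
        if n >= min_tokens and re.search(r"[.!?]", block):
            corpus.append(block)
        elif n > 0:
            candidates.append(block)
    return corpus, candidates

def textrank(tokens, window=3, damping=0.85, iterations=30):
    # Undirected co-occurrence graph: words within the same sliding window
    # are linked.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    # Iterate the PageRank-style update; returns a word -> weight mapping.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return scores

def pick_title(blocks):
    corpus, candidates = split_blocks(blocks)
    weights = textrank([t for block in corpus for t in tokenize(block)])

    def score(candidate):
        # Stand-in similarity: mean TextRank weight of the candidate's tokens.
        tokens = tokenize(candidate)
        return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)

    return max(candidates, key=score) if candidates else None

example_blocks = [
    "Home | News | Sports | Contact",
    "Flood closes river bridge",
    "A flood closed the main river bridge on Monday after heavy rain. "
    "Officials said the bridge will stay shut until the flood water drops.",
    "Copyright 2018 all rights reserved",
]
print(pick_title(example_blocks))
```

In this toy page the navigation bar and footer share no words with the body text, so they score zero and the headline is selected. The sketch applies no stop-word filtering, so on real pages common function words would need to be removed before ranking.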

Keywords: title extraction; density distribution; text features; information retrieval

Classification No.: TP391 [Automation and Computer Technology: Computer Application Technology]

 
