Authors: PENG Zhensheng (彭圳生) [1,2]; GONG Qingge (巩青歌) [1,2]; GAO Zhiqiang (高志强) [1,2]; DUAN Yanyu (段妍羽) [1,2]; ZENG Zixian (曾子贤) [1,2]
Affiliations: [1] Institute of Information Engineering, Engineering University of PAP (武警工程大学信息工程学院), Xi'an, Shaanxi 710086, China; [2] Key Laboratory of Military Big Data and Cloud Computing (军队大数据与云计算重点实验室), Xi'an, Shaanxi 710086, China
Source: Journal of Chinese Information Processing (《中文信息学报》), 2018, No. 10, pp. 78-86 (9 pages)
Funding: Natural Science Foundation of Shaanxi Province for Young Scholars (陕西省中国青年自然科学基金), grant 2015JQ6224
Abstract: To extract news titles automatically from large numbers of complex, non-standard Web pages, this paper proposes a news title extraction algorithm based on density and text features (title extraction with density and text-features, TEDT). A corpus decision model that combines a page's text density distribution with its language features divides the page into a corpus area and a title candidate area. After the corpus is selected, the TextRank algorithm computes the corresponding key-value weight set, and an improved similarity calculation method then extracts the news title from the title candidate area. The algorithm partitions the corpus and title regions effectively, reduces interference from page noise, and extracts news titles accurately. Experimental results show that TEDT achieves a higher accuracy rate and recall rate than traditional rule-based and similarity-based news title extraction algorithms, and that it is effective not only on mainstream news websites but also on complex, non-standard Web pages.
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
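
The last two stages described in the abstract (keyword weighting with TextRank, then similarity-based selection from the title candidates) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the page has already been split into a corpus text and a list of candidate strings, it replaces the paper's "improved similarity calculation" with a plain average of keyword weights, and it uses a naive ASCII tokenizer where a real system would need a Chinese word segmenter. The names `tokenize`, `textrank_weights`, and `pick_title` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Naive lowercase ASCII tokenizer (assumption); Chinese text would
    # need a proper word segmenter instead.
    return re.findall(r"[a-z0-9]+", text.lower())


def textrank_weights(tokens, window=3, damping=0.85, iterations=30):
    # Simplified TextRank: PageRank-style iteration over an undirected
    # word co-occurrence graph built with a sliding window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return scores


def pick_title(candidates, corpus_text):
    # Score each candidate by the mean TextRank weight of its distinct
    # words; words absent from the corpus contribute zero.
    weights = textrank_weights(tokenize(corpus_text))

    def score(candidate):
        words = set(tokenize(candidate))
        if not words:
            return 0.0
        return sum(weights.get(w, 0.0) for w in words) / len(words)

    return max(candidates, key=score)


corpus = (
    "A flood warning was issued on Monday for towns along the river. "
    "Officials said the river could flood low areas, and the warning "
    "covers three towns."
)
candidates = [
    "Home | Sports | Login",
    "Flood warning issued for river towns",
    "Copyright 2018",
]
print(pick_title(candidates, corpus))
```

Navigation and footer strings share no words with the body text, so they score zero, while the candidate whose words carry weight in the corpus is selected. `pick_title` raises `ValueError` on an empty candidate list, which a caller would need to handle.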