Single-Document Key-Phrase Extraction for Chinese Academic Texts  Cited by: 5

Extracting Key-phrases from Chinese Scholarly Papers


Author: Xia Tian (夏天)[1,2] (Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872, China; School of Information Resource Management, Renmin University of China, Beijing 100872, China)

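Affiliations: [1] Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872 [2] School of Information Resource Management, Renmin University of China, Beijing 100872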

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2020, No. 7, pp. 76-86 (11 pages)

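Funding: This work is one of the research outputs of the National Social Science Fund of China Major Project "Research on Archiving and Management of Government Information Resources in the Big Data Environment" (Grant No. 17ZDA293).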

Abstract: [Objective] To automatically extract key phrases from Chinese scholarly texts, providing phrase-level concept representation for academic text mining. [Methods] Two measures are introduced, internal cohesion and boundary freedom, which quantify how tightly the words inside a phrase bind together and how freely the words at its boundaries combine with other words, respectively. They are used to compute an authority score for Chinese two-word phrases, and the scored phrases are merged and ranked together with the results of position-weighted keyword extraction; the TopN elements of the fused ranking are output as key phrases. [Results] On a dataset of Chinese academic papers constructed for the study, the key-phrase extraction algorithm PhraseRank substantially outperforms the traditional keyword extraction algorithm WordRank in precision, recall, and R-MAP (a metric that takes ranking position into account), with a relative improvement in R-MAP of more than 128%. [Limitations] Key phrases consisting of three or more words are not identified. [Conclusions] Compared with keywords, the key phrases extracted by PhraseRank agree more closely with manually labeled results and better reflect how concepts are expressed in Chinese scholarly texts.
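
The abstract names the two measures but does not give their formulas. As a rough illustration only, the sketch below scores two-word phrases using two common stand-ins: pointwise mutual information for internal cohesion and the smaller of the left and right neighbor entropies for boundary freedom. Both choices, and every name in the code (`score_biword_phrases`, `neighbor_entropy`), are assumptions made for this example and should not be read as PhraseRank's actual definitions. How the paper combines the two values into an authority score and fuses them with the position-weighted keyword ranking is not stated in the abstract and is not attempted here. The input is assumed to be already word-segmented.

```python
import math
from collections import Counter, defaultdict


def neighbor_entropy(counts):
    """Shannon entropy (natural log) of a neighbor-word frequency table."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def score_biword_phrases(sentences):
    """Return {(w1, w2): (cohesion, freedom)} for every adjacent word pair.

    sentences: list of token lists (assumed to be word-segmented already).
    cohesion:  PMI of the pair (an assumed stand-in for internal cohesion).
    freedom:   min of left/right neighbor entropy (an assumed stand-in
               for boundary freedom).
    """
    unigrams = Counter()
    bigrams = Counter()
    left = defaultdict(Counter)   # words seen immediately before the pair
    right = defaultdict(Counter)  # words seen immediately after the pair

    for tokens in sentences:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left[pair][tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right[pair][tokens[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), n in bigrams.items():
        p_pair = n / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        cohesion = math.log(p_pair / (p_w1 * p_w2))
        freedom = min(neighbor_entropy(left[(w1, w2)]),
                      neighbor_entropy(right[(w1, w2)]))
        scores[(w1, w2)] = (cohesion, freedom)
    return scores
```

Under these assumptions, a pair that always appears at the start of a sentence has no left neighbors and so gets a freedom of zero, while a pair that appears between varied neighbors scores higher. A real system would also need frequency thresholds and smoothing, which are omitted here.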

Keywords: key-phrase extraction; academic text mining; TextRank; word graph

Classification code: G353 [Cultural Sciences - Information Science]

 
