Keyphrase Extraction Method Based on a Part-of-Speech Automaton

Keyphrase Extraction Algorithm via Tagging Finite Automation


Authors: WANG Lingxiao; WANG Yibo [1]; ZHU Lijun [1] (Institute of Scientific and Technical Information of China, Beijing 100083)

Affiliation: [1] Institute of Scientific and Technical Information of China, Beijing 100038

Source: China Science & Technology Resources Review (《中国科技资源导刊》), 2023, No. 5, pp. 31-40, 64 (11 pages in total)

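Funding: Innovation Research Fund of the Institute of Scientific and Technical Information of China, project "Method for Identifying Applications of Artificial Intelligence Technology in New Drug Discovery Based on Text Entity Mining" (QN2022-06).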

Abstract: Keyphrase extraction is a natural language processing task that identifies combinations of keywords with particular value in a target text, and it has significant practical value for mining intelligence from scientific and technical literature. Because sufficient labeled data, knowledge bases, and pre-trained models are lacking, extracting keyphrases for disruptive content in cutting-edge sub-disciplines still poses many challenges. This paper introduces the concept of the finite state automaton into the keyphrase extraction task and abstracts the part-of-speech tag combination patterns of keyphrases into a series of finite state automaton grammars. The resulting unsupervised keyphrase extraction algorithm, based on a part-of-speech automaton, extracts variable-length keyphrases in specialized sub-fields through highly customizable part-of-speech combination patterns, without relying on labeled data or high-performance computing equipment. The algorithm runs fast, has few environment dependencies, supports many matching patterns, and delivers good extraction quality. Using the SemEval-2017 dataset and literature abstracts from the field of intelligent new drug discovery as test data, the paper compares the proposed algorithm with several widely used keyphrase extraction algorithms. On all keywords, the algorithm reaches a precision of 30.8%, a recall of 34.1%, and an F1 score of 32.4%; on keyphrases, it reaches a precision of 30.8%, a recall of 52.0%, and an F1 score of 38.7%. Compared with open-source keyword extraction libraries, recall and F1 improve significantly.
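The idea of encoding part-of-speech patterns as finite state automata can be illustrated with a minimal sketch. It is not the authors' implementation: the pattern `JJ* NN+` (optional adjectives followed by one or more nouns), the transition table, the Penn Treebank style tags, and the names `coarse` and `extract_keyphrases` are all assumptions made for illustration. The input is assumed to be already tokenized and tagged.

```python
from typing import Dict, List, Tuple

# Hypothetical automaton for the tag pattern JJ* NN+ (assumed, not from the paper).
# States: 0 = start, 1 = reading adjectives, 2 = reading nouns (accepting).
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (0, "JJ"): 1, (0, "NN"): 2,
    (1, "JJ"): 1, (1, "NN"): 2,
    (2, "NN"): 2,
}
ACCEPTING = {2}


def coarse(tag: str) -> str:
    """Collapse fine-grained tags (NNS, NNP, JJR, ...) into NN, JJ, or O."""
    if tag.startswith("NN"):
        return "NN"
    if tag.startswith("JJ"):
        return "JJ"
    return "O"


def extract_keyphrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Scan left to right, emitting the longest accepted span at each start."""
    phrases: List[str] = []
    i = 0
    while i < len(tagged):
        state, j, last_accept = 0, i, -1
        while j < len(tagged):
            nxt = TRANSITIONS.get((state, coarse(tagged[j][1])))
            if nxt is None:  # no transition: the automaton rejects here
                break
            state = nxt
            j += 1
            if state in ACCEPTING:
                last_accept = j  # remember the longest accepted prefix
        if last_accept > i:
            phrases.append(" ".join(word for word, _ in tagged[i:last_accept]))
            i = last_accept  # continue after the extracted phrase
        else:
            i += 1
    return phrases


sentence = [
    ("Finite", "JJ"), ("state", "NN"), ("automata", "NNS"),
    ("extract", "VBP"), ("domain", "NN"), ("keyphrases", "NNS"),
    ("quickly", "RB"),
]
print(extract_keyphrases(sentence))
# ['Finite state automata', 'domain keyphrases']
```

Because the pattern lives entirely in the transition table, supporting another part-of-speech combination only means adding or swapping a table, which is one way to read the abstract's claim of highly customizable patterns with no training data.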

Keywords: named entity recognition; keyword extraction; keyphrase extraction; finite state automaton; part-of-speech tagging

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
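The F1 scores reported in the abstract can be checked against the stated precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The short sketch below does that arithmetic; the function name `f1_score` is chosen here for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures from the abstract, in percent.
all_keywords_f1 = round(f1_score(30.8, 34.1), 1)
keyphrases_f1 = round(f1_score(30.8, 52.0), 1)
print(all_keywords_f1, keyphrases_f1)
# 32.4 38.7
```

Both values match the abstract, which supports reading the reported "accuracy rate" as precision.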

 
