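Authors: WANG Lingxiao (王凌霄); WANG Yibo (王弋波)[1]; ZHU Lijun (朱礼军)[1] (Institute of Scientific and Technical Information of China, Beijing 100083)
Source: China Science & Technology Resources Review (《中国科技资源导刊》), 2023, No. 5, pp. 31-40, 64 (11 pages)
Funding: Innovation Research Fund of the Institute of Scientific and Technical Information of China, project "A Method for Identifying Applications of Artificial Intelligence Technology in New Drug Discovery Based on Text Entity Mining" (QN2022-06).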
Abstract: Keyphrase extraction is a natural language processing task that identifies keyword combinations of special value in a target text, and it has important practical value for mining scientific and technical literature. Because labeled data, knowledge bases, and pre-trained models are scarce, extracting keyphrases for disruptive content in cutting-edge sub-disciplines remains challenging. This paper introduces the concept of finite state automata into keyphrase extraction and abstracts the part-of-speech tag patterns of keyphrases into a series of finite state automaton grammars. The resulting unsupervised keyphrase extraction algorithm, based on part-of-speech automata, extracts variable-length keyphrases in narrow sub-fields through highly customizable part-of-speech patterns, without relying on labeled data or high-performance computing equipment. The algorithm runs fast, has few environment dependencies, supports many matching patterns, and extracts well. Using the SemEval-2017 dataset and literature abstracts from the field of AI-driven new drug discovery as test data, the paper compares the proposed algorithm with several widely used keyphrase extraction algorithms. Over all keywords, the algorithm reaches a precision of 30.8%, a recall of 34.1%, and an F1 score of 32.4%; over keyphrases, it reaches a precision of 30.8%, a recall of 52.0%, and an F1 score of 38.7%. Compared with open-source keyword extraction libraries, recall and F1 improve significantly.
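The abstract describes the core idea only at a high level, so the following is a minimal sketch of how a part-of-speech automaton could pick out candidate phrases. It is not the authors' grammar or implementation: the transition table (a hypothetical pattern of zero or more adjectives followed by one or more nouns), the state names, the `extract_keyphrases` function, and the `min_len` parameter are all assumptions made for illustration. The input is assumed to be already tokenized and tagged with coarse tags such as `ADJ` and `NOUN`.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical automaton for the pattern (ADJ)* (NOUN)+.
# This is an illustrative grammar, not one taken from the paper.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("START", "ADJ"): "IN_ADJ",
    ("START", "NOUN"): "IN_NOUN",
    ("IN_ADJ", "ADJ"): "IN_ADJ",
    ("IN_ADJ", "NOUN"): "IN_NOUN",
    ("IN_NOUN", "NOUN"): "IN_NOUN",
}
ACCEPTING: Set[str] = {"IN_NOUN"}


def extract_keyphrases(tagged: List[Tuple[str, str]], min_len: int = 2) -> List[str]:
    """Return the longest non-overlapping tag sequences accepted by the automaton."""
    phrases: List[str] = []
    i = 0
    while i < len(tagged):
        state = "START"
        last_accept = -1  # end index (exclusive) of the longest accepted match
        j = i
        while j < len(tagged):
            next_state = TRANSITIONS.get((state, tagged[j][1]))
            if next_state is None:
                break  # no transition on this tag: the run ends here
            state = next_state
            j += 1
            if state in ACCEPTING:
                last_accept = j
        if last_accept - i >= min_len:
            phrases.append(" ".join(token for token, _ in tagged[i:last_accept]))
            i = last_accept  # continue scanning after the matched phrase
        else:
            i += 1  # no usable match starting here; advance one token
    return phrases


tagged_sentence = [
    ("unsupervised", "ADJ"), ("keyphrase", "NOUN"), ("extraction", "NOUN"),
    ("uses", "VERB"),
    ("finite", "ADJ"), ("state", "NOUN"), ("automata", "NOUN"),
    ("quickly", "ADV"),
]
print(extract_keyphrases(tagged_sentence))
```

Because the pattern lives entirely in the transition table, supporting a different phrase shape in this sketch means adding or changing table entries rather than rewriting the scanning loop, which is one plausible reading of the "highly customizable part-of-speech patterns" mentioned in the abstract.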
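Keywords: named entity recognition; keyword extraction; keyphrase extraction; finite state automaton; part-of-speech tagging
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]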