Authors: 杨冬菊 (YANG Dongju) [1,2]; 胡成富 (HU Chengfu)
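Affiliations: [1] School of Information Science and Technology, North China University of Technology, Beijing 100144, China; [2] Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data (North China University of Technology), Beijing 100144, China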
Source: Journal of Computer Applications (《计算机应用》), 2024, No. 6, pp. 1720-1726 (7 pages)
Funding: Guangzhou Science and Technology Plan Project (202206030009).
Abstract: To address the poor extraction of words that occur infrequently but express the main idea of a text well in keyword extraction from scientific and technical texts, a keyword extraction method based on an improved TextRank is proposed. First, the Term Frequency-Inverse Document Frequency (TF-IDF) statistical features and positional features of words are used to optimize the probability transition matrix between words in the co-occurrence graph, and the initial word scores are obtained by iterative computation. Then, the K-Core (K-Core decomposition) algorithm is used to mine K-Core subgraphs and obtain the hierarchical features of words, and an average information entropy feature is used to measure each word's ability to represent the topic. Finally, the hierarchical feature and the average information entropy feature are fused with the initial scores to determine the keywords. Experimental results show that, on a public dataset, the mean F1 of the proposed method across experiments extracting different numbers of keywords is 6.5 and 3.3 percentage points higher than that of the TextRank method and the OTextRank (Optimized TextRank) method, respectively; on a science and technology service project dataset, the corresponding improvements are 7.4 and 3.2 percentage points. These results verify the effectiveness of the proposed method in extracting keywords that occur infrequently but express the main idea of the text well.
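The pipeline in the abstract (weighted transitions on a co-occurrence graph, then K-Core levels fused into the scores) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the function names, the window size, the per-word `weight` (standing in for a TF-IDF and position feature), and the `fuse` rule are all hypothetical and are not the authors' definitions. The average information entropy feature is left out because the abstract does not define it.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected graph linking words at most `window` positions apart."""
    adj = defaultdict(set)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + 1 + window]:
            if v != w:
                adj[w].add(v)
                adj[v].add(w)
    return dict(adj)


def weighted_textrank(adj, weight, d=0.85, iters=50):
    """TextRank in which a word passes its score to each neighbour in
    proportion to that neighbour's feature weight (assumed positive)."""
    score = {w: 1.0 for w in adj}
    for _ in range(iters):
        new = {}
        for w in adj:
            incoming = 0.0
            for v in adj[w]:
                total = sum(weight[u] for u in adj[v])
                incoming += weight[w] / total * score[v]
            new[w] = (1 - d) + d * incoming
        score = new
    return score


def core_numbers(adj):
    """K-Core decomposition by repeatedly peeling the lowest-degree word."""
    degree = {w: len(ns) for w, ns in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        w = min(remaining, key=lambda x: degree[x])
        k = max(k, degree[w])
        core[w] = k
        remaining.remove(w)
        for v in adj[w]:
            if v in remaining:
                degree[v] -= 1
    return core


def fuse(score, core):
    """Hypothetical fusion: scale each score by its normalised core number."""
    top = max(core.values())
    return {w: score[w] * (1 + core[w] / top) for w in score}


tokens = ["a", "b", "c", "a", "d"]            # toy token sequence
adj = cooccurrence_graph(tokens, window=1)
weight = {w: 1.0 for w in adj}                # placeholder feature weights
scores = weighted_textrank(adj, weight)
final = fuse(scores, core_numbers(adj))
```

With uniform weights this reduces to plain TextRank on an unweighted graph; supplying larger weights for rare but well-placed words is one way such words could receive more of their neighbours' score, which is the effect the abstract aims for.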
Keywords: scientific and technical text; keyword extraction; TextRank; K-Core graph; average information entropy
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]