Extracting Keywords Based on Sememe Similarity  (Cited by: 9)


Authors: Yan Qiang; Zhang Xiaoyan[2]; Zhou Simin

Affiliations: [1] School of Modern Post (School of Automation), Beijing University of Posts and Telecommunications, Beijing 100876, China; [2] School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China

Source: Data Analysis and Knowledge Discovery, 2021, Issue 4, pp. 80-89 (10 pages)

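Funding: This work is one of the research outcomes of the Key Program of the National Social Science Fund of China (Grant No. 17AGL026) and the Beijing University of Posts and Telecommunications Excellent Ph.D. Students Foundation (Grant No. CX2019128).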

Abstract: [Objective] This study introduces the semantic information of words into the TextRank algorithm to improve keyword extraction. [Methods] We used the sememe information that the HowNet knowledge base provides for words to calculate word similarity, and built a semantic word graph and matrix from the word pairs that pass a preset similarity threshold. We then combined the semantic matrix and the co-occurrence matrix with weights to obtain a new transition probability matrix over word nodes. [Results] On short texts, the improved algorithm outperformed traditional TextRank, TF-IDF and LDA, raising the F-score by 6.6%, 9.0% and 10.3% respectively. On long texts, it was inferior to TF-IDF and differed little from TextRank. [Limitations] The word segmentation program identified compound words, new words and entity nouns poorly, so the algorithm extracted incomplete keywords, which lowered the F-score. The sememe similarity algorithm could also be improved further. [Conclusions] TextRank combined with semantics lets keyword extraction take both word co-occurrence and semantic relations into account, offering a new approach to keyword extraction from short texts.
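The pipeline in the abstract (thresholded semantic matrix, co-occurrence matrix, weighted combination, TextRank iteration) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the abstract does not give the weighting scheme, window size, or threshold, so `alpha`, `window`, `threshold`, and `damping` are assumed parameters, and `toy_similarity` is a hypothetical stand-in for a real HowNet sememe similarity function.

```python
from itertools import combinations


def cooccurrence_matrix(tokens, vocab, window=3):
    """Symmetric 0/1 matrix linking words that fall within `window` consecutive tokens."""
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    m = [[0.0] * n for _ in range(n)]
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            a, b = index[w], index[v]
            if a != b:
                m[a][b] = m[b][a] = 1.0
    return m


def semantic_matrix(vocab, similarity, threshold):
    """Symmetric matrix keeping only similarities at or above the threshold."""
    n = len(vocab)
    m = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        s = similarity(vocab[i], vocab[j])
        if s >= threshold:
            m[i][j] = m[j][i] = s
    return m


def transition(m):
    """Row-normalize; a word with no edges jumps uniformly (an assumption)."""
    n = len(m)
    rows = []
    for row in m:
        total = sum(row)
        rows.append([x / total for x in row] if total else [1.0 / n] * n)
    return rows


def rank_keywords(tokens, similarity, threshold=0.5, alpha=0.5,
                  damping=0.85, iterations=50, top_k=3):
    """Rank words with TextRank over a weighted mix of the two transition matrices."""
    vocab = sorted(set(tokens))
    n = len(vocab)
    p_co = transition(cooccurrence_matrix(tokens, vocab))
    p_sem = transition(semantic_matrix(vocab, similarity, threshold))
    # Assumed linear weighting: alpha on co-occurrence, (1 - alpha) on semantics.
    p = [[alpha * p_co[i][j] + (1 - alpha) * p_sem[i][j] for j in range(n)]
         for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[i] * p[i][j] for i in range(n))
                  for j in range(n)]
    return sorted(zip(vocab, scores), key=lambda pair: -pair[1])[:top_k]


# Hypothetical similarity table standing in for HowNet sememe similarity.
TOY_SIMILARITY = {frozenset(("keyword", "term")): 0.9}


def toy_similarity(a, b):
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


tokens = ["keyword", "extraction", "graph", "keyword", "ranking", "graph", "term"]
print(rank_keywords(tokens, toy_similarity))
```

Because every row of the combined matrix sums to 1, the scores stay a probability distribution across iterations. Setting `alpha=1.0` in this sketch reduces it to co-occurrence-only TextRank, which gives a simple baseline for comparison.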

Keywords: TextRank; keyword extraction; sememe; word similarity

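Classification: TP393 [Automation and Computer Technology: Computer Application Technology]; G250 [Automation and Computer Technology: Computer Science and Technology]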

 
