A Method of Automatic Keyword Extraction Based on a Word Co-occurrence Model

Cited by: 4


Authors: Xiao Hong (肖红)[1], Xu Shaohua (许少华)[1]

Affiliation: [1] School of Computer and Information Technology, Daqing Petroleum Institute, Daqing, Heilongjiang 163318, China

Source: Journal of Shenyang Ligong University (《沈阳理工大学学报》), 2009, No. 5, pp. 38-41 (4 pages)

Funding: National Natural Science Foundation of China (60473051); Natural Science Foundation of Heilongjiang Province (11521013)

Abstract: Keyword extraction is a key step in Chinese information processing. This paper proposes a method for automatic keyword extraction. First, a general-purpose dictionary is extended and trained on a large set of training samples, yielding an optimized dictionary annotated with TF×IDF values and mutual information (MI). Next, the text is segmented paragraph by paragraph using this dictionary, and the segmentation results are ranked by word frequency, weight, co-occurrence relationship, and MI to select candidate keywords. Finally, candidate words are merged according to their hypernyms and hyponyms, a threshold is set, and n of the remaining words are taken as the keywords of the article. Extraction experiments on a small test sample set indicate that the method can improve keyword extraction accuracy to some extent and gives fairly satisfactory results.
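The abstract does not give the paper's scoring formulas, so the following is only a minimal sketch of how the ranking step might look, under stated assumptions: text is already segmented into paragraphs of tokens, TF×IDF uses a smoothed textbook form, MI is approximated by pointwise mutual information over paragraph-level co-occurrence, and the two are combined multiplicatively. The function names (`tf_idf`, `paragraph_pmi`, `extract_keywords`), the combination rule, and the sample data are hypothetical and not taken from the paper. The hypernym/hyponym merging step is omitted because it needs a thesaurus the abstract does not describe.

```python
import math
from collections import Counter
from itertools import combinations


def tf_idf(paragraphs, doc_freq, n_docs):
    """Smoothed TF x IDF per term; doc_freq and n_docs describe a training corpus."""
    tokens = [t for p in paragraphs for t in p]
    counts = Counter(tokens)
    total = len(tokens)
    return {
        t: (c / total) * math.log((1 + n_docs) / (1 + doc_freq.get(t, 0)))
        for t, c in counts.items()
    }


def paragraph_pmi(paragraphs):
    """Pointwise mutual information of term pairs that share a paragraph."""
    n = len(paragraphs)
    term_df = Counter()
    pair_df = Counter()
    for p in paragraphs:
        terms = sorted(set(p))  # sorted so each pair has one canonical order
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    return {
        (a, b): math.log((c / n) / ((term_df[a] / n) * (term_df[b] / n)))
        for (a, b), c in pair_df.items()
    }


def extract_keywords(paragraphs, doc_freq, n_docs, n=3):
    """Rank terms by TF x IDF, boosted by their strongest positive PMI (assumed rule)."""
    weights = tf_idf(paragraphs, doc_freq, n_docs)
    boost = Counter()  # missing terms read as 0
    for (a, b), value in paragraph_pmi(paragraphs).items():
        if value > 0:
            boost[a] = max(boost[a], value)
            boost[b] = max(boost[b], value)
    scores = {t: w * (1 + boost[t]) for t, w in weights.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


# Hypothetical, pre-segmented sample document and training-corpus statistics.
paragraphs = [
    ["keyword", "extraction", "dictionary"],
    ["keyword", "extraction", "the"],
    ["dictionary", "training", "the"],
]
doc_freq = {"the": 1000, "keyword": 20, "extraction": 30,
            "dictionary": 50, "training": 200}

print(extract_keywords(paragraphs, doc_freq, n_docs=1000, n=3))
```

In this sketch the cut-off `n` plays the role of the threshold mentioned in the abstract; a real system would replace the toy token lists with the output of a Chinese word segmenter driven by the trained dictionary.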

Keywords: automatic keyword extraction; co-occurrence relationship; mutual information; TF×IDF

Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

 
