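Affiliation: [1] College of Computer and Information Technology, Daqing Petroleum Institute, Daqing, Heilongjiang 163318
Source: Journal of Shenyang Ligong University (《沈阳理工大学学报》), 2009, No. 5, pp. 38-41 (4 pages)
Funding: National Natural Science Foundation of China (60473051); Natural Science Foundation of Heilongjiang Province (11521013)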
Abstract: Keyword extraction is a key step in Chinese information processing. This paper proposes an effective method for automatic keyword extraction. First, an ordinary dictionary is extended: starting from the ordinary dictionary, it is trained on a large number of training samples to obtain an optimized dictionary that carries TF×IDF values and mutual information (MI). Second, the text is segmented paragraph by paragraph using this dictionary, and the segmentation results are ranked by word frequency, weight, co-occurrence relationship, and mutual information to select candidate keywords. Finally, candidate words are merged according to their hypernyms and hyponyms, a threshold is set, and n of the resulting words are taken as the keywords of the article. Extraction experiments on a small test data set show that the method improves keyword extraction accuracy to some extent and gives fairly satisfactory results.
Classification No.: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
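
The ranking stage described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption: the function names (`idf_table`, `pmi_table`, `extract_keywords`), the document-level co-occurrence window, the additive combination of TF×IDF and mutual information, and the `mi_weight` parameter are all hypothetical. The input is assumed to be already segmented into words, and the hypernym/hyponym merging step is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def idf_table(corpus):
    """Inverse document frequency for each word in a pre-segmented corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def pmi_table(corpus):
    """Pointwise mutual information for word pairs that share a document."""
    n_docs = len(corpus)
    word_df = Counter(w for doc in corpus for w in set(doc))
    pair_df = Counter(
        pair for doc in corpus for pair in combinations(sorted(set(doc)), 2)
    )
    # PMI = log( P(a, b) / (P(a) * P(b)) ), with document-level probabilities.
    return {
        (a, b): math.log(df * n_docs / (word_df[a] * word_df[b]))
        for (a, b), df in pair_df.items()
    }


def extract_keywords(doc, idf, pmi, n=3, threshold=0.0, mi_weight=0.5):
    """Rank words by TF x IDF plus weighted average positive PMI (assumed scoring)."""
    tf = Counter(doc)
    vocab = sorted(tf)
    scores = {}
    for w in vocab:
        tfidf = tf[w] / len(doc) * idf.get(w, 0.0)
        others = [x for x in vocab if x != w]
        # Keys in the PMI table are stored in sorted order, so sort the lookup pair.
        mi_sum = sum(
            max(pmi.get(tuple(sorted((w, x))), 0.0), 0.0) for x in others
        )
        mi = mi_sum / len(others) if others else 0.0
        scores[w] = tfidf + mi_weight * mi
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep words scoring above the threshold, then take at most n of them.
    return [w for w, s in ranked if s > threshold][:n]


# Toy, already-segmented corpus (illustrative data only).
corpus = [
    ["keyword", "extraction", "the", "method"],
    ["dictionary", "training", "the", "sample"],
    ["keyword", "dictionary", "the", "information"],
]
idf = idf_table(corpus)
pmi = pmi_table(corpus)
keywords = extract_keywords(corpus[0], idf, pmi, n=2)
```

In this toy corpus a word that occurs in every document, such as "the", gets a score of zero under both measures and is filtered out by the threshold, which is the intended effect of combining document-frequency weighting with a cutoff.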