TF-IDF与规则相结合的中文关键词自动抽取研究  Cited by: 35

TF-IDF and Rules Based Automatic Extraction of Chinese Keywords


Authors: NIU Ping (牛萍)[1], HUANG Degen (黄德根)[1]

Affiliation: [1] School of Computer Science, Dalian University of Technology, Dalian, Liaoning 116024

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2016, No. 4, pp. 711-715 (5 pages)

Funding: Supported by the National Natural Science Foundation of China (61173100; 61173101; 61272375)

Abstract: Keyword extraction is widely used in natural language processing. For Chinese keyword extraction, the word segmentation results and the choice of candidate words strongly affect the final extraction results. For candidate selection, this paper proposes a method that recognizes unknown (out-of-vocabulary) words made up of consecutive single Chinese characters, as well as multi-word phrases. The method identifies unknown words whose frequency is greater than one more reliably, and it does not depend on the scale or domain of the corpus. In addition, building on traditional TF-IDF, the paper combines position and length features, takes into account the different parts of speech of words that belong to more than one category, and proposes an improved TF-IDF formula for extracting keywords and key phrases. Comparative experiments demonstrate the effect of candidate words on keyword extraction. Compared with traditional TF-IDF, the improved TF-IDF method raises precision (P), recall (R), and F-measure (F) by about 5%.
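The abstract names position and length features but this record does not give the paper's formula, so the following is only a minimal sketch of how such features might be folded into a TF-IDF score. The function names (`weighted_tfidf`, `top_keywords`), the title-based position boost `pos_boost`, the linear length factor `len_alpha`, and the smoothed IDF are all assumptions chosen for illustration; the part-of-speech handling mentioned in the abstract is not modelled.

```python
import math
from collections import Counter


def weighted_tfidf(doc_tokens, corpus, title_tokens=(), pos_boost=1.5, len_alpha=0.1):
    """Score each candidate term in doc_tokens.

    corpus is a list of token lists used for document frequency.
    pos_boost and len_alpha are illustrative values, not taken from the paper.
    """
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    title = set(title_tokens)
    scores = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF so that a term seen in every document still scores above zero.
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        # Position feature (assumed): boost terms that also appear in the title.
        position_w = pos_boost if term in title else 1.0
        # Length feature (assumed): longer candidates get a mild linear bonus.
        length_w = 1.0 + len_alpha * (len(term) - 1)
        scores[term] = (count / total) * idf * position_w * length_w
    return scores


def top_keywords(scores, k=3):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

In this sketch the candidate list `doc_tokens` is assumed to come from an upstream segmentation step that has already merged unknown words and multi-word phrases into single candidates, which is the stage the abstract identifies as having a strong effect on the final result.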

Keywords: extraction; unknown word recognition; candidate word extraction; TF-IDF

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
