Keyword Extraction Algorithm Based on Length and Frequency of Words or Phrases for Short Chinese Texts  Cited by: 14

Original title: 基于词或词组长度和频数的短中文文本关键词提取算法


Authors: Chen Weihe (陈伟鹤) [1], Liu Yun (刘云) [1]

Affiliation: [1] School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China

Source: Computer Science (《计算机科学》), 2016, No. 12, pp. 50-57 (8 pages)

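Funding: Supported by the National Natural Science Foundation of China (61300228) and the Natural Science Foundation of the Jiangsu Provincial Department of Education (09KJB520003)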

Abstract: Keyword extraction from Chinese text is a difficult problem in natural language processing. Most keyword extraction research, in China and abroad, targets English text, and those methods do not carry over to Chinese text. Existing algorithms designed for Chinese text mostly suit long texts. The focus of this work is how to accurately extract, from a short Chinese text, words or phrases that are meaningful and closely related to its topic. The paper proposes a keyword extraction algorithm for Chinese text based on the length and frequency of words or phrases. The algorithm first extracts the words or phrases that occur with relatively high frequency in the text, then computes a weight for each from its length and its frequency in the text, and finally selects the keywords or key phrases according to these weights. In this way the algorithm accurately extracts the relatively important words or phrases from a Chinese text, so that the topic of the text can be identified quickly and accurately. Experimental results show that, compared with other existing algorithms, the proposed algorithm can process Chinese text and achieves higher accuracy.
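The abstract describes the pipeline only in outline (frequent candidates, then a weight from length and frequency, then selection), so the following Python sketch is an illustration under stated assumptions rather than the authors' method. The weight formula (frequency times length), the character n-gram candidate generation, the fragment filter, and all function and parameter names (`candidate_phrases`, `score`, `extract_keywords`, `min_len`, `max_len`, `min_freq`, `top_k`) are hypothetical choices made here; the paper's actual formula is not given in the abstract.

```python
from collections import Counter


def candidate_phrases(text, min_len=2, max_len=6, min_freq=2):
    """Count character n-grams and keep those seen at least min_freq times."""
    # Split on punctuation and whitespace so n-grams never span a clause boundary.
    segments, current = [], []
    for ch in text:
        if ch.isalnum():          # True for Chinese characters, False for punctuation
            current.append(ch)
        elif current:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))

    counts = Counter()
    for seg in segments:
        for n in range(min_len, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    return {p: c for p, c in counts.items() if c >= min_freq}


def score(phrase, freq):
    # ASSUMPTION: weight = frequency * length. The abstract only says the
    # weight depends on both quantities; it does not give the formula.
    return freq * len(phrase)


def extract_keywords(text, top_k=3, **kwargs):
    """Return up to top_k candidates ranked by the assumed weight."""
    cands = candidate_phrases(text, **kwargs)
    # ASSUMPTION: drop a candidate when a longer candidate contains it and
    # occurs at least as often, since it is then only a fragment.
    kept = {
        p: c for p, c in cands.items()
        if not any(p != q and p in q and cands[q] >= c for q in cands)
    }
    # Sort by descending weight; ties are broken alphabetically for determinism.
    ranked = sorted(kept, key=lambda p: (-score(p, kept[p]), p))
    return ranked[:top_k]
```

For instance, `extract_keywords("关键词提取很重要。关键词提取算法处理短文本,短文本很短。")` ranks the repeated phrase 关键词提取 ahead of 短文本, because both occur twice and the longer phrase receives the larger assumed weight. Working on raw character n-grams avoids a dependency on a word segmenter, which may matter for short texts containing transliterated words or new Internet words, but it is a design choice of this sketch only.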

Keywords: keyword extraction; Chinese text processing; transliterated words; new Internet words

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
