Technology of Extracting Topical Keyphrases from Chinese Corpora (中文文本的主题关键短语提取技术)  Cited by: 5

Authors: Yang Yue (杨玥), Zhang Desheng (张德生) [1]

Affiliation: [1] School of Science, Xi'an University of Technology, Xi'an 710054, China

Source: Computer Science (《计算机科学》), 2017, Issue B11, pp. 432-436 (5 pages)

Abstract: In the big data era, the volume of information is growing explosively, and text is the form of information people encounter most: countless texts are uploaded to and downloaded from the Internet every day. Keyword extraction is one of the important ways to grasp the content of these texts quickly. However, traditional keyword extraction algorithms usually ignore two important aspects: the length of the extracted terms and the topic of the text. To address both, this paper proposes a technique for extracting topical keyphrases from Chinese text. It combines the LDA topic model with a frequent phrase discovery algorithm to generate frequent candidate phrases of different lengths, then filters and ranks the candidates with the proposed completeness filter and ranking function, and finally selects the topical keyphrases according to the ranking.
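The pipeline in the abstract (candidate generation, completeness filtering, ranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the completeness filter or the ranking function, so the rule "drop a phrase when a longer frequent phrase containing it is almost as frequent" and the score "frequency times length" are assumptions chosen for illustration. The names `frequent_phrases`, `completeness_filter`, `rank`, `min_support`, and `ratio` are hypothetical, and the input is assumed to be documents that were already word-segmented and assigned to one LDA topic upstream.

```python
from collections import Counter

def frequent_phrases(docs, min_support=2, max_len=3):
    """Count contiguous n-grams (1..max_len tokens) and keep the frequent ones."""
    counts = Counter()
    for tokens in docs:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

def is_subphrase(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def completeness_filter(freq, ratio=0.8):
    """Assumed rule: drop a phrase if a longer frequent phrase contains it
    and occurs at least `ratio` times as often (the short one is incomplete)."""
    kept = {}
    for p, c in freq.items():
        absorbed = any(
            len(q) > len(p) and is_subphrase(p, q) and fq >= ratio * c
            for q, fq in freq.items()
        )
        if not absorbed:
            kept[p] = c
    return kept

def rank(freq):
    """Assumed score: frequency times phrase length; ties broken by phrase."""
    return sorted(freq.items(), key=lambda kv: (-kv[1] * len(kv[0]), kv[0]))

# Toy documents for a single topic, already segmented into words.
docs = [
    ["主题", "模型", "提取", "关键", "短语"],
    ["主题", "模型", "生成", "候选", "短语"],
    ["关键", "短语", "排序"],
]

freq = frequent_phrases(docs)
kept = completeness_filter(freq)
ranked = rank(kept)

for phrase, count in ranked:
    print("".join(phrase), count)
```

On this toy input the single words 主题, 模型, and 关键 are absorbed by the two-word phrases that contain them, while 短语 survives because it is clearly more frequent than any longer phrase containing it. In a fuller version the per-topic document groups would come from an LDA model and the tokens from a Chinese word segmenter; both are left out here to keep the sketch self-contained.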

Keywords: keyword extraction; LDA topic model; frequent phrases; completeness filtering; ranking function

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
