Source: Computer Science (《计算机科学》), 2017, Issue B11, pp. 432-436 (5 pages)
Abstract: In the era of big data, the volume of information is growing explosively. Text is the kind of information people encounter most, and countless texts are uploaded to or downloaded from the Internet every day. Keyword extraction is one of the important ways to grasp the content of such texts quickly. Traditional keyword extraction algorithms, however, usually ignore two important aspects: the length of the extracted terms and the topic of the text. To address both, this paper proposes a technique for extracting topical keyphrases from Chinese text. It combines the LDA topic model with a frequent phrase discovery algorithm to generate frequent candidate phrases of different lengths; it then filters and ranks the candidates with the proposed completeness filter and ranking function; finally, it selects the final topical keyphrases according to the ranking.
Keywords: keyword extraction; LDA topic model; frequent phrases; completeness filtering; ranking function
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
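The abstract names the pipeline stages but does not define them, so the following is only a rough sketch of the idea, not the paper's method. It mines frequent contiguous phrases level by level, then applies an assumed stand-in for the completeness filter (drop a phrase when a longer phrase containing it is almost as frequent) and an assumed ranking score (frequency times length). The LDA step is omitted: the input is taken to be already-segmented tokens belonging to a single topic. All function names, the `ratio` threshold, and the scoring rule are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]


def mine_frequent_phrases(docs: List[List[str]], min_support: int,
                          max_len: int) -> Dict[Phrase, int]:
    """Count contiguous n-grams, keeping those seen at least min_support times."""
    frequent: Dict[Phrase, int] = {}
    for n in range(1, max_len + 1):
        counts: Counter = Counter()
        for tokens in docs:
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # Apriori-style pruning: both (n-1)-length sub-phrases must be frequent.
                if n > 1 and (gram[:-1] not in frequent or gram[1:] not in frequent):
                    continue
                counts[gram] += 1
        level = {p: c for p, c in counts.items() if c >= min_support}
        if not level:
            break
        frequent.update(level)
    return frequent


def is_contiguous_sub(short: Phrase, long: Phrase) -> bool:
    """True if `short` occurs as a contiguous run inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def completeness_filter(frequent: Dict[Phrase, int],
                        ratio: float = 0.8) -> Dict[Phrase, int]:
    """Assumed filter: drop a phrase if a longer phrase containing it
    reaches at least `ratio` of its frequency (it is likely a fragment)."""
    kept: Dict[Phrase, int] = {}
    for p, c in frequent.items():
        fragment = any(
            len(q) > len(p) and is_contiguous_sub(p, q) and cq >= ratio * c
            for q, cq in frequent.items()
        )
        if not fragment:
            kept[p] = c
    return kept


def rank_phrases(phrases: Dict[Phrase, int]) -> List[Phrase]:
    """Assumed ranking: frequency * length, longer phrases first on ties."""
    return sorted(phrases, key=lambda p: (-phrases[p] * len(p), -len(p), p))


# Toy, pre-segmented corpus (hypothetical data for illustration only).
docs = [
    ["lda", "topic", "model", "works"],
    ["lda", "topic", "model", "fits"],
    ["topic", "model", "and", "topic", "drift"],
]
frequent = mine_frequent_phrases(docs, min_support=2, max_len=3)
kept = completeness_filter(frequent, ratio=0.8)
ranked = rank_phrases(kept)
print(ranked)
```

On this toy corpus the filter discards fragments such as `("lda",)` and `("lda", "topic")`, because the longer phrase `("lda", "topic", "model")` occurs just as often. For Chinese text, the token lists would come from a word segmenter, and in a topic-aware version the counting would be done per LDA topic rather than over the whole corpus.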