Research on the Text Clustering Method of Science and Technology Reports Based on the Topic Model  Cited by: 16


Authors: Qu Jingye (曲靖野), Chen Zhen (陈震), Zheng Yanning (郑彦宁) [2]

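Affiliations: [1] School of Information Technology and Media, Beihua University, Jilin 132013; [2] Institute of Scientific and Technical Information of China, Beijing 100038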

Source: Library and Information Service (《图书情报工作》), 2018, No. 4, pp. 113-120 (8 pages)

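Funding: One of the research outputs of the Jilin Province Education Science "13th Five-Year" Plan project "Research on the Application of Project-Based Teaching in Basic Computer Courses at Universities" (Project No. GH170061)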

Abstract: [Purpose/significance] This paper explores a text clustering method that incorporates a topic model, using science and technology reports as the document type. It aims to extend technology monitoring services based on scientific literature into a new area and to propose a new method for semantic analysis of science and technology reports. [Method/process] Using reports from the National Science and Technology Report Service System as the data source, the authors first apply the LDA topic model to the preprocessed reports for topic mining. They then cluster the text vectors, which carry the topic distribution information, with an algorithm that combines Ward and K-means, in an attempt to arrive at a text mining method suited to clustering science and technology report documents. [Result/conclusion] The experimental results show that the LDA topic model mines the topic information in science and technology reports effectively and accurately, and that the proposed combination of Ward and K-means clusters these reports better than other traditional clustering algorithms.
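The clustering step named in the abstract can be illustrated with a minimal sketch. The abstract does not say how Ward and K-means are combined, so this sketch assumes one common arrangement: Ward's agglomerative clustering supplies the initial centroids, and K-means then refines them. The document-topic vectors below are made-up stand-ins for LDA output, and all function names are hypothetical. The LDA step itself is not shown.

```python
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ward_clusters(vectors, k):
    """Merge singleton clusters with Ward's criterion until k remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            ca = centroid([vectors[i] for i in clusters[a]])
            cb = centroid([vectors[i] for i in clusters[b]])
            na, nb = len(clusters[a]), len(clusters[b])
            # Ward cost: increase in within-cluster sum of squares on merging
            cost = na * nb / (na + nb) * sq_dist(ca, cb)
            if best is None or cost < best[0]:
                best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return clusters

def kmeans(vectors, centers, max_iter=100):
    """Standard K-means iterations starting from the given centers."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers

def ward_kmeans(vectors, k):
    """Assumed combination: Ward provides seeds, K-means refines them."""
    seeds = [centroid([vectors[i] for i in c]) for c in ward_clusters(vectors, k)]
    return kmeans(vectors, seeds)

# Hypothetical document-topic distributions (one row per report, 3 topics).
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.05, 0.90],
]
labels, centers = ward_kmeans(doc_topics, 3)
print(labels)
```

Seeding K-means from a hierarchical result avoids its sensitivity to random initialization, which is one plausible motivation for combining the two. Whether the paper uses this arrangement is not stated in the abstract.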

Keywords: science and technology reports; topic model; LDA; text clustering

Classification code: G203 [Cultural Sciences: Communication Studies]

 
