A Topic Modeling Method for Social Science Literature Based on LDA

Cited by: 6


Authors: LI Changya (李昌亚), LIU Fangfang (刘方方)[1]

Affiliation: [1] School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China

Source: Computer Technology and Development (《计算机技术与发展》), 2018, No. 2, pp. 182-187 (6 pages)

Funding: Natural Science Foundation of the Shanghai Science and Technology Commission (12zr1411000)

Abstract: With the growth of the Internet, text classification and topic extraction are used ever more widely, and topic models play a major role in extracting topics from text. LDA (latent Dirichlet allocation) is a mature and very widely used topic model; it is a probabilistic generative model that handles synonymy (many words, one meaning) and polysemy (one word, many meanings) well. However, when LDA is applied to document collections from the social science literature, the modeling procedure ignores the topical characteristics of the collection itself, and the extracted topic distributions are biased toward high-frequency words in the documents. As a result, the extracted topics drift away from what the documents are actually about, and the results are not accurate enough. Targeting the LDA topic modeling process and drawing on the characteristics of social science documents, this paper modifies that process and proposes a new topic modeling method, so that the final topics are more accurate and better match the topical characteristics of the document collection.

Keywords: topic model; LDA; social science literature; Gibbs sampling

Classification code: TP31 [Automation and Computer Technology: Computer Software and Theory]
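For orientation, the baseline that the abstract refers to can be sketched as a collapsed Gibbs sampler for standard LDA. This is a minimal illustration only, not the modified method proposed in the paper, whose details are not given in this record. The function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, and the toy corpus are all assumptions made for the example. Note that the sampling weight for a topic grows with the raw count `topic_word[t][w]`, which is one way to see why frequent words tend to dominate the extracted topics.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=100, seed=0):
    """Collapsed Gibbs sampling for standard LDA (illustrative sketch).

    docs is a list of documents, each a list of integer word ids in
    range(vocab_size). Returns the document-topic and topic-word counts.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                               # n(k)
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled topic.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


# Hypothetical toy corpus: word ids 0-2 and 3-5 form two rough themes.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 5, 3]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=6, iters=50)
```

Normalizing each row of `doc_topic` (after adding `alpha`) gives an estimate of a document's topic distribution, and normalizing each row of `topic_word` (after adding `beta`) gives an estimate of a topic's word distribution.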

 
