Optimal Selection Method for LDA Topics Based on Degree of Overlap and Completeness

Cited by: 4


Authors: BAI Zhi'an [1], ZENG Jianping [2]

Affiliations: [1] Computer Centre, Rui Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China; [2] School of Computer Science, Fudan University, Shanghai 200433, China

Source: Computer Engineering and Applications (《计算机工程与应用》), 2019, No. 12, pp. 155-161 (7 pages)

Funding: Natural Science Foundation of Shanghai (No. 15ZR1403700)

Abstract: Many LDA-based topic models can infer the number of topics and their descriptions from a text collection, but it remains difficult to determine the number of topics and to decide which feature words should describe each topic. To address this, the paper proposes an Overlap-Completeness scoring method for selecting well-differentiated topics. It combines LDA with TF-IDF weighting and takes into account both how well the topics cover the original text collection and how independent the topics are from one another. Building on LDA modeling, the method uses TF-IDF to obtain the most representative words for each topic, defines the degree of overlap between topic vocabularies and the degree of completeness of the topic description, and gives an evaluation procedure for topic selection. It yields both the best number of topics and the most suitable descriptive words for each topic. Experiments on a collection of information security news texts show that, compared with the basic LDA model, the method is better at selecting distinctive topics and representative words.
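The abstract names the ingredients of the score but does not give its formulas, so the following is only a rough sketch of the general idea and not the authors' definitions. It assumes, purely for illustration, that overlap is the mean pairwise Jaccard similarity between topic word sets, that completeness is the share of documents containing at least one topic word, and that the two are combined by simple subtraction. The function names (`overlap_degree`, `completeness_degree`, `oc_score`, `select_best`) are hypothetical, and the topic word sets are taken as given (in the paper they would come from LDA followed by TF-IDF ranking).

```python
from itertools import combinations


def overlap_degree(topic_words):
    """Mean pairwise Jaccard similarity between topic word sets.

    Assumed stand-in for the paper's overlap measure; 0.0 means the
    topic vocabularies are fully disjoint.
    """
    pairs = [(a, b) for a, b in combinations(topic_words, 2) if a | b]
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


def completeness_degree(topic_words, documents):
    """Share of documents that contain at least one topic word.

    Assumed stand-in for the paper's completeness measure.
    """
    if not documents:
        return 0.0
    vocab = set().union(*topic_words)
    covered = sum(1 for doc in documents if vocab & set(doc))
    return covered / len(documents)


def oc_score(topic_words, documents):
    """Illustrative combination: reward coverage, penalise overlap."""
    return completeness_degree(topic_words, documents) - overlap_degree(topic_words)


def select_best(candidates, documents):
    """Pick the topic count whose word sets score highest.

    candidates maps a topic count K to a list of K word sets.
    """
    return max(candidates, key=lambda k: oc_score(candidates[k], documents))


# Toy tokenised documents and two candidate topic descriptions.
docs = [
    ["virus", "malware", "attack"],
    ["password", "leak", "breach"],
    ["phishing", "email", "attack"],
]
candidates = {
    2: [{"virus", "malware"}, {"password", "leak"}],
    3: [{"virus", "malware"}, {"password", "leak"}, {"phishing", "email"}],
}
best_k = select_best(candidates, docs)
```

On this toy data the three-topic description covers every document with no shared words, so it scores higher than the two-topic one. A different weighting of the two terms would be equally plausible under the abstract's description.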

Keywords: LDA model; TF-IDF; topic identification; degree of overlap; degree of completeness

Classification code: TP274 [Automation and Computer Technology: Detection Technology and Automatic Equipment]

 
