Hot Topic Detection Model for Multiple Data Sources Based on Improved ccLDA  Cited by: 4

Multi-source Topic Detection Analysis Based on Improved ccLDA Model


Authors: Chen Xingshu, Ma Chenxi [2], Wang Wenxian, Gao Yue, Wang Haizhou [1]

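Affiliations: [1] College of Cybersecurity, Sichuan University, Chengdu, Sichuan 610065; [2] College of Computer Science, Sichuan University, Chengdu, Sichuan 610065; [3] Cybersecurity Research Institute, Sichuan University, Chengdu, Sichuan 610065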

Source: Advanced Engineering Sciences (工程科学与技术), 2018, No. 2, pp. 141-147 (7 pages)

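Funding: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Science and Technology Department of Sichuan Province Program (16ZHSF0483)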

Abstract: The existing cross-collection topic model (cross-collection LDA, ccLDA) is suitable only when the topics of the different data sources are highly similar. In addition, it forces the global topics to align with the local topics of each data source, which leads to word sparsity. To address these shortcomings, an improved cross-collection topic model (improved ccLDA, IccLDA) is proposed. During sampling, the model first determines whether a word belongs to a global topic or a local topic and then samples accordingly. This removes the requirement in ccLDA that global and local topics be aligned, reduces the dispersion of words across global and local topics, and makes the model applicable to multi-source scenarios. Topic discovery experiments on multi-source text collections were conducted on public datasets, together with a comparative analysis of the topics. The results show that, for every number of topics tested, the perplexity of IccLDA is lower than that of both LDA and ccLDA, indicating better modeling ability. Finally, further experiments on real-world datasets show that the improved model not only models the data better than the original model but also effectively discovers the common topics discussed across all data sources and the local topics discussed within each data source, making it better suited to topic discovery in multi-source scenarios.
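The perplexity comparison reported in the abstract can be illustrated with the standard definition, perplexity = exp(-(sum of log p(w)) / N), where N is the number of tokens in the evaluation set and lower values indicate a better fit. The following is a minimal sketch, not the authors' code: the function name `perplexity` and the toy probabilities are assumptions for illustration, and in practice the per-token probabilities would come from whichever fitted model (LDA, ccLDA, or IccLDA) is being evaluated.

```python
import math


def perplexity(doc_token_probs):
    """Compute corpus perplexity from per-token probabilities.

    doc_token_probs: list of documents, each a list of probabilities
    p(w | model) assigned to every token in that document (hypothetical
    input format chosen for this sketch).
    """
    total_log_prob = 0.0
    total_tokens = 0
    for token_probs in doc_token_probs:
        for p in token_probs:
            total_log_prob += math.log(p)  # accumulate log-likelihood
            total_tokens += 1
    if total_tokens == 0:
        raise ValueError("perplexity is undefined for an empty corpus")
    # Exponentiated negative average log-likelihood per token.
    return math.exp(-total_log_prob / total_tokens)


# Toy, made-up probabilities: model_b assigns higher probability to the
# same tokens, so its perplexity is lower.
model_a = [[0.25, 0.25], [0.25, 0.25, 0.25]]
model_b = [[0.5, 0.5], [0.5, 0.5, 0.5]]
print(perplexity(model_a), perplexity(model_b))
```

A model that assigns every token probability 1/k has perplexity k, which is why the value is often read as an effective number of equally likely word choices.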

Keywords: topic detection; topic model; LDA; multiple data sources; IccLDA

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
