Cross-Language Topic Discovery in Chinese and English Based on the ICE-LDA Model (cited by: 7)

Analysis and Research on Cross Language Topic Discovery in Chinese and English

Authors: Chen Xingshu (陈兴蜀) [1,2], Luo Liang (罗梁) [2], Wang Haizhou (王海舟) [1,2], Wang Wenxian (王文贤) [1,2], Gao Yue (高悦) [2]

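Affiliations: [1] Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan; [2] College of Computer Science, Sichuan University, Chengdu 610065, Sichuan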

Source: Advanced Engineering Sciences (《工程科学与技术》), 2017, No. 2, pp. 100-106 (7 pages)

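Funding: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Sichuan University Young Teachers Start-up Fund (2015SCU11079)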

Abstract: With the Internet developing rapidly amid globalization, cross-language mining of web data has become a focus of public opinion analysis in China and abroad. Detecting hot topics across Chinese and English web content effectively and in real time is crucial for tracking public opinion and how it develops. Online news is an important component of online public opinion and, with the spread of the Internet, has become a major source of fast, convenient information. First, Chinese and English online news is collected as the data source, and ICE-LDA, an improved model built on LDA, is proposed to discover co-occurring topics across the Chinese and English web. Topics produced by the model are vectorized, and pairs of topics are compared by JS distance and by the cosine similarity of their topic-text distributions. Second, a comparable parallel corpus and a non-comparable corpus are each constructed from the crawled mixed Chinese-English news and used for topic modeling; during modeling, the TF-IDF algorithm extracts feature words from each document, removing meaningless noise words and improving the topic feature representation. Finally, two different topic vectorization schemes are each used for cross-language co-occurring topic discovery. Experiments on a real dataset built with the crawler designed in this work show that the improved topic model can discover cross-language co-occurring topics in the comparable corpus without requiring prior topic pairs, and can also discover co-occurring topics when the corpus is imbalanced.
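The topic comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example vectors, and the `max_js` threshold are all hypothetical, and since the abstract does not say which form of "JS distance" is used, the sketch assumes the square root of the Jensen-Shannon divergence with base-2 logarithms, which lies in [0, 1]. Each topic is assumed to be represented as a probability vector over the same set of documents (a topic-text distribution).

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (assumed definition of 'JS distance')."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # clamp tiny negative rounding errors


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_topics(zh_topics, en_topics, max_js=0.3):
    """Pair each Chinese topic with its nearest English topic by JS distance.

    max_js is a hypothetical cut-off; pairs farther apart than this are dropped.
    Returns a list of (zh_index, en_index, distance) tuples.
    """
    pairs = []
    for i, zh in enumerate(zh_topics):
        j, dist = min(
            ((j, js_distance(zh, en)) for j, en in enumerate(en_topics)),
            key=lambda item: item[1],
        )
        if dist <= max_js:
            pairs.append((i, j, dist))
    return pairs


# Hypothetical topic-text distributions over three shared documents.
zh_topics = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
en_topics = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]

for zh_idx, en_idx, dist in match_topics(zh_topics, en_topics):
    sim = cosine_similarity(zh_topics[zh_idx], en_topics[en_idx])
    print(f"zh topic {zh_idx} <-> en topic {en_idx}: JS={dist:.3f}, cos={sim:.3f}")
```

A small JS distance together with a high cosine similarity suggests that a Chinese topic and an English topic are spread over the same documents, which is one plausible reading of a "co-occurring" topic pair.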

Keywords: topic discovery; cross Chinese-English text; ICE-LDA model; TF-IDF feature extraction; co-occurring topics

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
