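Authors: Chen Xingshu (陈兴蜀)[1,2], Luo Liang (罗梁)[2], Wang Haizhou (王海舟)[1,2], Wang Wenxian (王文贤)[1,2], Gao Yue (高悦)[2]
Affiliations: [1] Cybersecurity Research Institute, Sichuan University, Chengdu, Sichuan 610065; [2] College of Computer Science, Sichuan University, Chengdu, Sichuan 610065
Source: Advanced Engineering Sciences (《工程科学与技术》), 2017, No. 2, pp. 100-106 (7 pages)
Funding: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Sichuan University Start-up Fund for Young Teachers (2015SCU11079)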
Abstract: In recent years the Internet has developed rapidly against the backdrop of globalization, and cross-language mining of web data has become a prominent topic in public opinion analysis in China and abroad. Detecting hot topics effectively and in real time across Chinese and English web environments is crucial for tracking public opinion and how it develops. Online news, an important component of online public opinion, has become a major source from which people quickly and conveniently obtain information as Internet access has spread. First, this paper collects Chinese and English online news as its data source and proposes ICE-LDA, a model that improves on LDA, to discover co-occurrence topics across Chinese and English web environments. Topics produced by the model are vectorized, and the distance between two topics is measured by the JS distance and by the cosine similarity of their topic-text distributions. Second, from the mixed Chinese-English news collected by the crawler, a comparable parallel corpus and a non-comparable corpus are each constructed for topic modeling; during modeling, the TF-IDF algorithm extracts feature words from each document, which filters out meaningless noise words and improves the topic feature representation. Finally, two different topic vectorization methods are used for cross-language co-occurrence topic discovery. Experiments on a real dataset built with the crawler designed in this work show that the improved topic model can discover cross-language co-occurrence topics in the comparable corpus without requiring prior topic pairs, and can also discover co-occurrence topics when the corpus is imbalanced (the non-comparable corpus).
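Keywords: topic discovery; cross-language Chinese-English text; ICE-LDA model; TF-IDF feature extraction; co-occurrence topics
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]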
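The abstract describes comparing vectorized topics by JS distance and by the cosine similarity of their distributions. The record does not give the paper's exact formulas, so the following is only a minimal sketch of the standard Jensen-Shannon divergence (base-2 logarithm) and cosine similarity, assuming each topic is already represented as a probability vector of equal length. The function names and the sample vectors are hypothetical and are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded in [0, 1] with log base 2."""
    # The mixture m is positive wherever p or q is positive, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)


# Hypothetical topic vectors (assumed data, for illustration only).
topic_zh = [0.5, 0.3, 0.2, 0.0]
topic_en = [0.4, 0.4, 0.1, 0.1]

divergence = js_divergence(topic_zh, topic_en)
similarity = cosine_similarity(topic_zh, topic_en)
```

Under this sketch, a smaller divergence and a larger cosine similarity would both suggest that a Chinese topic and an English topic co-occur. Any threshold for declaring a match would have to come from the paper itself.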