An Online News Topic Discovery Method Based on Latent Semantic Analysis (基于隐含语义分析的在线新闻话题发现方法)  Cited by: 1

Online News Topics Extraction Based on Latent Semantic Analysis


Authors: WU Gaomin (武高敏), ZHANG Yuchen (张宇晨)[1], HAN Jingyu (韩京宇)[1,2]

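Affiliations: [1] School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003; [2] Key Laboratory of Computer Network and Information Integration (Ministry of Education), Southeast University, Nanjing, Jiangsu 211189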

Source: Computer Technology and Development (《计算机技术与发展》), 2016, No. 9, pp. 1-7 (7 pages)

Funding: National Natural Science Foundation of China, Key Program (61003040; 61100135; 61302157)

Abstract: With the rapid development of the Internet and the continuous growth of massive data, identifying current news hotspots quickly and effectively has become an urgent need, and online news topic detection has become an active research area. For representing news text features in an online setting, the dimensionality of the traditional Vector Space Model (VSM) keeps growing as data accumulates, which aggravates data sparsity and synonymy and makes text similarity hard to measure accurately. Latent semantic analysis based on feature weighting is used to map the high-dimensional, sparse term-document matrix into a latent k-dimensional semantic space, fully exploiting the semantic information between terms and documents to raise the semantic similarity between documents on the same topic and to overcome text sparsity and synonymy in the online setting. In addition, for continuously growing large-scale news data, traditional clustering algorithms suffer from high time complexity or input dependency and struggle to produce good results quickly and effectively. Based on the temporal order and correlation of news reports, an improved Single-pass online incremental clustering algorithm is proposed to detect topic clusters, and the concept of a topic heat value is introduced to select the hot topics currently receiving the most attention. Experimental results show that the method effectively improves the accuracy of topic detection and achieves online topic capture on a real news dataset.
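
For readers unfamiliar with the mapping the abstract refers to, the standard truncated-SVD formulation of latent semantic analysis is sketched below. This is the textbook form, not necessarily the exact variant used in the paper: the symbols A, U_k, Σ_k, V_k, q and d are introduced here for illustration, and the feature-weighting scheme (for example TF-IDF) is an assumption, since the abstract does not specify it.

```latex
% A: m x n weighted term-document matrix (m terms, n documents).
% The weighting scheme is assumed (e.g. TF-IDF); the abstract does not name one.
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% Folding a new document vector q (in term space) into the k-dimensional space:
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

% Cosine similarity between two documents in the semantic space:
\operatorname{sim}(\hat{q}, \hat{d}) =
  \frac{\hat{q}^{\top} \hat{d}}{\lVert \hat{q} \rVert \, \lVert \hat{d} \rVert}
```

Keeping only the k largest singular values is what collapses terms that co-occur in similar contexts onto shared dimensions, which is why documents on the same topic can score as similar even when they share few literal terms.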

Keywords: topic detection; vector space model; latent semantic analysis; text clustering; singular value decomposition

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
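
The clustering step named in the abstract can be illustrated with a minimal sketch of the basic Single-pass algorithm. It is not the authors' improved variant, whose details the abstract does not give. The sketch assumes documents arrive in publication order as vectors already projected into the semantic space; the names `single_pass`, `Topic`, `hot_topics`, the `threshold` value, and the use of cluster size as a stand-in for the topic heat value are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Topic:
    centroid: list
    doc_ids: list = field(default_factory=list)

    def add(self, doc_id, vec):
        # Update the centroid as the running mean of the member vectors.
        n = len(self.doc_ids)
        self.centroid = [(c * n + v) / (n + 1) for c, v in zip(self.centroid, vec)]
        self.doc_ids.append(doc_id)


def single_pass(stream, threshold=0.8):
    """Basic Single-pass clustering: each document is seen exactly once.

    `stream` yields (doc_id, vector) pairs in arrival order. A document joins
    the most similar existing topic if the similarity reaches `threshold`
    (an assumed value); otherwise it starts a new topic.
    """
    topics = []
    for doc_id, vec in stream:
        best, best_sim = None, -1.0
        for topic in topics:
            sim = cosine(vec, topic.centroid)
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            best.add(doc_id, vec)
        else:
            topics.append(Topic(centroid=list(vec), doc_ids=[doc_id]))
    return topics


def hot_topics(topics, min_docs=2):
    """Hypothetical heat filter: rank topics by document count, largest first.

    The paper's actual topic heat value is not defined in the abstract;
    cluster size is used here only as a placeholder.
    """
    kept = [t for t in topics if len(t.doc_ids) >= min_docs]
    return sorted(kept, key=lambda t: len(t.doc_ids), reverse=True)
```

Because each incoming document is compared only against the current topic centroids, the cost per document grows with the number of topics rather than with the number of documents seen, which is what makes this family of algorithms suitable for a continuous news stream. The result does depend on arrival order and on the chosen threshold.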

 
