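Institutions: [1] Key Laboratory of Network Data Science and Technology, Chinese Academy of Sciences, Beijing 100190; [2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [3] University of Chinese Academy of Sciences, Beijing 100049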
Source: Chinese Journal of Computers (《计算机学报》), 2018, No. 7, pp. 1490-1503 (14 pages)
Funding: Supported by the National Natural Science Foundation of China (61572473; 61472400) and the National Youth Science Foundation of China (61303156)
Abstract: News data about public opinion events is varied and complex. Even news data about a single public opinion event often contains different sub-topics (different aspects of the event). Generating tags that accurately describe the meaning of an event's sub-topics is therefore important for in-depth analysis of public opinion events, including identifying hot spots and monitoring how an event develops. Generating event sub-topic tags usually involves two key steps: first discovering the sub-topics, then generating an effective descriptive tag for each sub-topic from its keywords or document content. Traditional methods mostly discover topics by clustering or classification, grouping the documents of one topic into one cluster. However, because documents belonging to the same event are highly similar, existing methods have difficulty measuring the distance between them and so cannot be applied to the task of discovering event sub-topics. In addition, traditional methods usually generate sub-topic tags by extraction, which cannot guarantee the accuracy of the resulting tags. To address this, the paper proposes ET-TAG, a model that discovers the sub-topics within an event using PLSA with Background Language combined with keyword clustering, and then generates sub-topic tags based on knowledge bases such as Wikipedia. Experiments on datasets covering several categories of public opinion events show that ET-TAG outperforms existing sub-topic discovery methods such as K-means and LDA; in terms of tag generation, the tags produced by ET-TAG are also more accurate and better at summarizing than those of traditional methods. Finally, the paper applies the sub-topic tags generated by ET-TAG to event comparison and tracking; the results show that sub-topic tags can reveal commonalities between events and reflect trends in the popularity of event sub-topics.

English abstract: The public opinion system monitors trends in public opinion on the Web. Through it, we can understand hot spots on the Web and track their trends. Events are the focus of the public opinion system. News data about public opinion events is very complicated. Even data about a single event often contains different sub-topics (different perspectives on the event), which reflect its different aspects. For example, for an earthquake, the sub-topics include earthquake details, rescue work, post-disaster reconstruction, and so on. These sub-topics not only embody different aspects of the event but also reflect the hot spots that the public may be concerned about. Tags of event sub-topics can be regarded as attributes of events, which help us describe and comprehensively understand the events. Through sub-topics, we can compare the similarities and differences between events, and the sub-topic tags in a given period can reflect changes in which aspects of an event attract public attention. Detecting the sub-topics of events and generating accurate sub-topic tags is therefore significant for a public opinion system. Generating the sub-topic tags of a public opinion event usually involves two major steps: first discovering the sub-topics, and then generating effective tags for them based on their corresponding keywords and documents. Existing methods for discovering topics or sub-topics are usually based on clustering or classification, which put the documents about the same topic into the same cluster. However, because documents about the same event are similar to each other, it is very difficult for existing methods to measure the distance between them, and thus they cannot effectively differentiate the sub-topics within an event. Each document also contains many high-frequency background words, so ensuring the diversity of sub-topics is a major problem. In addition, traditional methods often employ an extraction-based approach to generate sub-topics
Keywords: sub-topic discovery; PLSA with Background Language; keyword clustering; sub-topic tag generation
Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
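
The abstract names PLSA with a background language model as the sub-topic discovery step but gives no equations. The following is a minimal sketch under one common formulation, in which each word occurrence is drawn from a fixed corpus-wide background distribution with probability lambda_b and from a per-document mixture of topics otherwise, fitted with EM. The function name `plsa_with_background`, the parameter `lambda_b`, its default value, and the toy documents are all illustrative assumptions, not details taken from the paper. The sketch does not cover the keyword clustering or the Wikipedia-based tag generation that ET-TAG also uses.

```python
import random
from collections import Counter


def plsa_with_background(docs, n_topics, lambda_b=0.8, n_iter=100, seed=0):
    """EM for a PLSA mixture with a fixed background word distribution.

    docs is a list of token lists; lambda_b (assumed 0 <= lambda_b < 1) is the
    probability that a word occurrence is drawn from the background model.
    Returns (topic_word, doc_topic): topic_word[k][w] estimates p(w | topic k)
    and doc_topic[d][k] estimates p(topic k | document d).
    """
    rng = random.Random(seed)
    counts = [Counter(doc) for doc in docs]
    vocab = sorted({w for c in counts for w in c})
    total = sum(sum(c.values()) for c in counts)
    # Background model: corpus-wide relative word frequency, held fixed.
    background = {w: sum(c[w] for c in counts) / total for w in vocab}

    # Random positive initialisation of the topic-word distributions.
    topic_word = []
    for _ in range(n_topics):
        raw = {w: rng.random() + 1e-3 for w in vocab}
        z = sum(raw.values())
        topic_word.append({w: v / z for w, v in raw.items()})
    doc_topic = [[1.0 / n_topics] * n_topics for _ in docs]

    for _ in range(n_iter):
        new_tw = [dict.fromkeys(vocab, 0.0) for _ in range(n_topics)]
        new_dt = [[0.0] * n_topics for _ in docs]
        for d, c in enumerate(counts):
            for w, n in c.items():
                # E-step: posterior that this word came from the background,
                # and the split of the remaining mass across topics.
                part = [doc_topic[d][k] * topic_word[k][w] for k in range(n_topics)]
                mix = sum(part)
                if mix == 0.0:
                    continue
                bg = lambda_b * background[w]
                p_bg = bg / (bg + (1.0 - lambda_b) * mix)
                for k in range(n_topics):
                    r = n * (1.0 - p_bg) * part[k] / mix
                    new_tw[k][w] += r
                    new_dt[d][k] += r
        # M-step: renormalise the expected counts into distributions.
        for k in range(n_topics):
            z = sum(new_tw[k].values()) or 1.0
            topic_word[k] = {w: v / z for w, v in new_tw[k].items()}
        for d in range(len(docs)):
            z = sum(new_dt[d]) or 1.0
            doc_topic[d] = [v / z for v in new_dt[d]]
    return topic_word, doc_topic


# Toy documents about one event (invented for illustration only).
docs = [
    "earthquake magnitude epicenter earthquake depth".split(),
    "earthquake rescue team rescue survivors".split(),
    "earthquake reconstruction housing funds reconstruction".split(),
]
topic_word, doc_topic = plsa_with_background(docs, n_topics=3)
for k, dist in enumerate(topic_word):
    top_words = sorted(dist, key=dist.get, reverse=True)[:3]
    print(k, top_words)
```

The background component is meant to absorb words that appear across all documents of an event (here, "earthquake"), leaving the topic distributions to the words that distinguish one sub-topic from another. How strongly this happens depends on `lambda_b`, which would need tuning on real data.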