Research on extracting hot topic words and their local features from massive online data (Cited by: 3)


Authors: YUAN Hua [1]; XU Hua-lin; QIAN Yu [1]; LUO Qian [3] (School of Management and Economics, University of Electronic Science and Technology of China, Chengdu 611731, China; Department of Information and Engineering, Sichuan Tourism University, Chengdu 610100, China; The Second Research Institute of CAAC, Chengdu 610041, China)

Affiliations: [1] School of Management and Economics, University of Electronic Science and Technology of China, Chengdu 611731, China; [2] School of Information and Engineering, Sichuan Tourism University, Chengdu 610100, China; [3] The Second Research Institute of CAAC, Chengdu 610041, China

Source: Journal of Industrial Engineering and Engineering Management (《管理工程学报》), 2018, No. 4, pp. 133-140 (8 pages)

Funding: National Natural Science Foundation of China (grants 71271044, U1233118, 71102055, 71572029, 71490723)

Abstract: In web documents within a specific information domain, extracting topics and their feature words has become a focus of recent natural language processing research, and the results carry clear value for management decision-making. This study proposes a new data mining method for discovering the associations between "hot topic words" and "local feature words" in massive UGC. First, a web crawler collects documents related to a given domain, and the document contents are segmented into words. Then, based on the segmentation results, the in-domain information words contained in the web documents are extracted to form a new dataset. Finally, we propose a dataset-splitting method based on hot topic words and semantic separators to obtain, for each hot topic word, a dataset of its local feature words; on this dataset, the dependence of feature words on each hot topic word can be analyzed, so that the most appropriate feature-word set is found for every topic word. The method is algorithmically simple and, more importantly, it effectively shields feature extraction from the influence of irrelevant high-frequency co-occurring words. It can be widely applied to text-based online information retrieval tasks in support of management decisions and e-commerce activities.

With the wide application and rapid development of social media, the massive UGC (user-generated content) released in the form of text has played an immeasurable role in information transmission and storage. Different from content traditionally provided by professionals, UGC is contributed by end users and thus usually contains diversified perspectives, which have a high possibility of meeting a potential user's needs. In this sense, UGC embodies richer information that can help users make better decisions. Meanwhile, however, UGC is generated continuously, resulting in a huge number of documents. In the absence of formal inspection, UGC varies widely in information quality. As a result, it is highly difficult to extract useful information from UGC. Therefore, how to efficiently and effectively retrieve useful information from this enormous number of documents is a challenge in the area of data mining. In the field of information retrieval, the classic bag-of-words model represents a document as a combination of words and has greatly promoted the automation of document processing (e.g., document classification). The bag-of-words model treats all words in a document as equally important. However, UGC is generated casually, sparsely, and in a non-standard format; therefore, directly applying the bag-of-words model to extract information from UGC faces the curse of dimensionality. In recent years, by considering the semantic relationships between words, researchers have begun to view a document as a combination of topics and each topic as a combination of words. This approach, together with the LDA (Latent Dirichlet Allocation) model, has found a wide range of applications. Nevertheless, it is usually implemented with probability models, and large-scale corpus training is needed to obtain a good result, implying a high level of computational complexity. Moreover, crossover problems and information interaction across different areas have become more and more common, resulting in frequent interdisciplinary …
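The splitting-and-scoring procedure described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the separator set, the confidence definition (co-occurrence count within topic segments divided by a word's global frequency), and the toy corpus are all assumptions.

```python
# Rough sketch of the abstract's idea: split word-segmented documents at
# semantic separators, keep the segments containing a hot topic word, and
# score candidate feature words by their confidence toward each topic word.
from collections import defaultdict

SEPARATORS = {",", ".", ";", "!", "?"}  # assumed semantic separators


def local_segments(tokens, topic_words):
    """Split a token stream at separators; yield (topic, segment) pairs
    for every segment that contains a topic word."""
    segments, current = [], []
    for tok in tokens:
        if tok in SEPARATORS:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return [(t, seg) for seg in segments for t in topic_words if t in seg]


def feature_confidence(docs, topic_words):
    """For each topic word, score every co-located word w by
    conf(w -> topic) = count(w in topic's segments) / count(w overall)."""
    topic_seg_words = defaultdict(list)  # topic -> words sharing its segments
    word_count = defaultdict(int)        # global word frequency
    for tokens in docs:
        for tok in tokens:
            if tok not in SEPARATORS:
                word_count[tok] += 1
        for topic, seg in local_segments(tokens, topic_words):
            topic_seg_words[topic].extend(w for w in seg if w != topic)
    scores = {}
    for topic, words in topic_seg_words.items():
        co = defaultdict(int)
        for w in words:
            co[w] += 1
        scores[topic] = {w: co[w] / word_count[w] for w in co}
    return scores


# Toy, pre-segmented corpus (hypothetical data).
docs = [
    ["the", "battery", "life", "is", "good", ",", "the", "screen", "is", "good"],
    ["good", "price", ",", "poor", "battery"],
]
scores = feature_confidence(docs, ["battery", "screen"])
# "life" and "poor" occur only near "battery" (confidence 1.0), while the
# high-frequency word "good" also appears in unrelated segments, so its
# confidence toward "battery" drops to 1/3.
```

Dividing by a word's global frequency is what penalizes words that co-occur with everything; this mirrors the abstract's claim that the method shields feature extraction from irrelevant high-frequency co-occurring words.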

Keywords: online information retrieval; frequent pattern mining; maximum confidence; information domain; feature extraction

Classification: TP311 (Automation and Computer Technology: Computer Software and Theory)

 
