Extracting Feature Phrases from Large-scale Word Sequences Based on Frequent Word Sets

Cited by: 1


Authors: YU Qin-qin (余琴琴), PENG Dun-lu (彭敦陆) [1], LIU Cong (刘丛) [1]

Affiliation: [1] School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2018, No. 5, pp. 1027-1032 (6 pages)

Funding: Supported by the National Natural Science Foundation of China (61003031) and the Natural Science Foundation of Shanghai (10ZR1421100)

Abstract: Most existing text feature extraction algorithms extract sets of feature words. Because text collections are large and their content is often ambiguous and complex, features extracted with the single word as the unit are usually ambiguous. To address this problem, this paper first converts each document into a word sequence. Taking into account the order, repeatability, and synonymy of words in the sequence, it then applies weighted association rule mining to combine frequent word sets into feature phrases. To improve computational efficiency on large-scale text data, the proposed algorithm is extended using the MapReduce computing model. Experimental results show that, compared with existing algorithms, the proposed algorithm extracts accurate feature phrases from large-scale text data more efficiently.
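The abstract does not give the algorithm itself, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes that a candidate phrase is a contiguous, ordered run of words, that synonyms are merged through a hypothetical `synonyms` lookup, and that weighted support is the mean word weight times the occurrence count. The names `map_phase`, `reduce_phase`, `frequent_phrases`, `weights`, and `min_support` are all invented for this example, and the map and reduce steps run in-process rather than on a MapReduce cluster.

```python
from collections import Counter
from itertools import chain


def map_phase(words, max_len=3):
    """Map step: emit (phrase, 1) for every contiguous ordered run of words.

    Repeated occurrences in the same document are emitted each time,
    so repetition contributes to the final count.
    """
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n]), 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted counts per phrase."""
    counts = Counter()
    for phrase, value in pairs:
        counts[phrase] += value
    return counts


def frequent_phrases(docs, weights, min_support, synonyms=None, max_len=3):
    """Return multi-word phrases whose weighted support reaches min_support.

    Weighted support here is an assumption made for illustration:
    (mean weight of the words in the phrase) * (occurrence count).
    Words missing from `weights` default to a weight of 1.0.
    """
    synonyms = synonyms or {}
    # Merge synonyms into one canonical word before counting.
    normalized = [[synonyms.get(w, w) for w in doc] for doc in docs]
    counts = reduce_phase(
        chain.from_iterable(map_phase(doc, max_len) for doc in normalized)
    )
    result = {}
    for phrase, count in counts.items():
        if len(phrase) < 2:
            continue  # single words are not phrases
        mean_weight = sum(weights.get(w, 1.0) for w in phrase) / len(phrase)
        support = mean_weight * count
        if support >= min_support:
            result[phrase] = support
    return result


docs = [
    ["big", "data", "mining"],
    ["huge", "data", "analysis"],
    ["data", "mining"],
]
phrases = frequent_phrases(
    docs,
    weights={"data": 2.0},
    min_support=3.0,
    synonyms={"huge": "big"},
)
print(phrases)
```

With the toy input above, only "big data" and "data mining" reach the threshold. In a real MapReduce deployment the map step would run per document split and the reduce step per phrase key, and the weighting scheme would follow the definitions in the paper.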

Keywords: MapReduce; word sequence; weighted association rules; frequent word sets; feature phrases

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
