Authors: YU Qin-qin (余琴琴), PENG Dun-lu (彭敦陆) [1], LIU Cong (刘丛) [1]
Affiliation: [1] School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2018, No. 5, pp. 1027-1032 (6 pages)
Funding: Supported by the National Natural Science Foundation of China (61003031) and the Natural Science Foundation of Shanghai (10ZR1421100)
Abstract: Most current algorithms for text feature extraction operate on sets of feature words. Because text data is large in volume and its content is ambiguous and complex, features extracted with the single word as the unit are often ambiguous. To address this problem, the paper first converts each document into a word sequence. Taking into account the order, repeatability, and synonymy of words in the sequence, it then applies weighted association rule mining to combine frequent word sets into feature phrases. To improve computational efficiency when extracting feature phrases from large-scale text data, the proposed algorithm is extended using the MapReduce computational model. Experimental results show that, compared with existing algorithms, the proposed algorithm runs efficiently and obtains fairly accurate feature phrases on large-scale text data.
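Keywords: MapReduce; word sequence; weighted association rules; frequent word sets; feature phrases
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]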
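The abstract describes the approach only at a high level, so the following is a minimal sketch of the general idea, not the authors' algorithm. It assumes that a candidate phrase is an ordered pair of adjacent words, that a pair's weight is the mean of its two word weights, and that support is the summed weight over all occurrences. The names `SYNONYMS`, `WEIGHTS`, `map_phase`, `reduce_phase`, and `feature_phrases`, as well as the sample data, are hypothetical, and the map and reduce steps run in a single process rather than on a real MapReduce cluster.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical synonym table: maps a word to its canonical form (synonymy).
SYNONYMS = {"mining": "extraction"}
# Hypothetical per-word weights; words not listed default to 1.0.
WEIGHTS = {"feature": 2.0, "extraction": 1.5}


def normalize(word):
    """Replace a word by its canonical synonym, if one is defined."""
    return SYNONYMS.get(word, word)


def map_phase(doc):
    """Emit ((first, second), weight) for each adjacent ordered word pair.

    Order is preserved and repeated pairs are emitted again, so both
    the ordering and the repeatability of the word sequence are kept.
    """
    words = [normalize(w) for w in doc]
    for a, b in zip(words, words[1:]):
        weight = (WEIGHTS.get(a, 1.0) + WEIGHTS.get(b, 1.0)) / 2
        yield (a, b), weight


def reduce_phase(pairs):
    """Sum the emitted weights per key, as a reducer would."""
    totals = defaultdict(float)
    for key, weight in pairs:
        totals[key] += weight
    return dict(totals)


def feature_phrases(docs, min_support):
    """Keep the pairs whose total weighted support reaches min_support."""
    emitted = chain.from_iterable(map_phase(d) for d in docs)
    totals = reduce_phase(emitted)
    return {" ".join(k): v for k, v in totals.items() if v >= min_support}


docs = [
    ["feature", "extraction", "of", "text"],
    ["feature", "mining", "of", "text"],
    ["text", "feature", "extraction"],
]
result = feature_phrases(docs, min_support=3.0)
print(result)
```

Keying the reducer on the ordered pair means each document can be mapped independently, which is the property that lets this kind of counting be distributed; longer phrases would need further passes that extend the surviving pairs.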