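Authors: Li Hongzhi [1,2], Li Xianlan, Zhao Shenghui (College of Computer and Information Engineering, Chuzhou University, Chuzhou 239000, China; College of Photonic and Electronic Engineering, Fujian Normal University, Fuzhou 350000, China)
Affiliations: [1] College of Computer and Information Engineering, Chuzhou University, Chuzhou, Anhui 239000; [2] College of Photonic and Electronic Engineering, Fujian Normal University, Fuzhou, Fujian 350000
Source: Journal of Hunan University of Science and Technology: Natural Science Edition (《湖南科技大学学报(自然科学版)》), 2020, No. 1, pp. 90-97 (8 pages)
Funding: General Program of the Natural Science Foundation of Anhui Province (1408085MF126).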
Abstract: Large-scale text classification with the KNN algorithm requires frequent iterative computation, which the existing Hadoop platform handles inefficiently. To address this, the paper proposes a parallel optimized KNN algorithm based on the Spark platform. The algorithm is optimized in three respects. First, a pruning algorithm limits the training dataset to its effective data, reducing the number of iterations. Second, for high-dimensional datasets, the ID3 algorithm uses information entropy to reduce the attribute dimension, cutting the cost of computing text similarity. Finally, the Spark parallel computing platform introduces in-memory computing, which minimizes the I/O incurred by iterative computation and raises processing speed. In experiments, compared with the commonly used KNN algorithm, the Spark-based parallel KNN text classification algorithm performs better on the main performance indicators, such as speedup and scalability, and is better suited to large-scale text classification.
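The record does not include the authors' code. As a rough illustration of two of the steps named in the abstract (entropy-based attribute reduction, then similarity-based KNN voting), the following single-process sketch may help. Everything in it is an assumption made for illustration: the function names, the token-list document format, cosine similarity as the text similarity measure, and the `k` and `keep` parameters. It does not implement the pruning step and does not use Spark; on Spark, the per-document similarity computation would be the part distributed across workers.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(docs, labels, term):
    """Entropy reduction from splitting documents on presence of `term`
    (the attribute-selection criterion used by ID3)."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_term, without) if part)
    return entropy(labels) - remainder

def select_terms(docs, labels, keep):
    """Hypothetical helper: keep the `keep` highest-gain terms."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return set(ranked[:keep])

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(train_docs, train_labels, query, k=3, keep=4):
    """Classify `query` by majority vote among its k most similar
    training documents, after reducing all documents to selected terms."""
    terms = select_terms(train_docs, train_labels, keep)

    def project(doc):
        return Counter(t for t in doc if t in terms)

    q = project(query)
    scored = sorted(((cosine(project(d), q), y)
                     for d, y in zip(train_docs, train_labels)),
                    reverse=True)
    votes = Counter(y for _, y in scored[:k])
    return votes.most_common(1)[0][0]
```

Reducing the vocabulary before computing similarities is what makes each comparison cheaper; the number of comparisons per query is unchanged, which is why the abstract pairs this step with pruning of the training set and with parallel execution.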
Keywords: KNN; parallelization; text classification; Spark; RDD; MapReduce
Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]