Large-Scale Text KNN Parallel Classification Algorithm Based on Spark  Cited by: 2

Authors: Li Hongzhi[1,2]; Li Xianlan; Zhao Shenghui

Affiliations: [1] College of Computer and Information Engineering, Chuzhou University, Chuzhou 239000, Anhui, China; [2] College of Photonic and Electronic Engineering, Fujian Normal University, Fuzhou 350000, Fujian, China

Source: Journal of Hunan University of Science and Technology (Natural Science Edition), 2020, No. 1, pp. 90-97 (8 pages)

Funding: General Program of the Anhui Provincial Natural Science Foundation (1408085MF126).

Abstract: Large-scale text classification with the KNN algorithm requires frequent iterative computation, which the existing Hadoop platform handles inefficiently. This paper proposes a parallel, optimized KNN algorithm based on the Spark platform. The algorithm is optimized in three ways. First, a pruning algorithm limits the training dataset to its effective data, reducing the number of iterations. Second, for high-dimensional datasets, the ID3 algorithm uses information entropy to reduce the attribute dimension, cutting the cost of computing text similarity. Third, Spark's in-memory computing minimizes the I/O performed during iteration and so speeds up processing. In experiments, compared with the commonly used KNN algorithm, the Spark-based parallel KNN text classification algorithm performs better on the main performance metrics, such as speedup and scalability, and is well suited to large-scale text classification.
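The pipeline outlined in the abstract can be illustrated on a single machine. What follows is a minimal sketch under stated assumptions, not the authors' implementation: this record contains only the abstract, so the paper's pruning rule, its exact ID3 procedure, and its Spark code are not known here. The sketch assumes that attributes are ranked by information gain on term presence (the entropy criterion ID3 uses), that text similarity is cosine similarity over term-frequency dictionaries, and that plain Python lists can stand in for RDD partitions. The pruning step is omitted, and all function names are hypothetical.

```python
import heapq
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy reduction from splitting the documents on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_term, without) if part)
    return entropy(labels) - remainder


def select_terms(docs, labels, m):
    """Attribute reduction: keep the m terms with the highest information gain."""
    vocab = sorted({t for d in docs for t in d})
    return set(heapq.nlargest(m, vocab, key=lambda t: information_gain(docs, labels, t)))


def project(doc, terms):
    """Drop every term that was not selected."""
    return {t: w for t, w in doc.items() if t in terms}


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency dicts (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def local_top_k(partition, query, k):
    """Map step: the k most similar (similarity, label) pairs within one partition."""
    return heapq.nlargest(k, ((cosine(d, query), y) for d, y in partition),
                          key=lambda pair: pair[0])


def knn_classify(partitions, query, k):
    """Reduce step: merge the local candidates, keep the global top k, majority vote."""
    candidates = [pair for part in partitions for pair in local_top_k(part, query, k)]
    top = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return Counter(y for _, y in top).most_common(1)[0][0]


# Toy corpus (made up for illustration): term-frequency dicts and class labels.
docs = [
    {"spark": 2, "rdd": 1, "the": 1},
    {"spark": 1, "memory": 1, "the": 1},
    {"goal": 2, "match": 1, "the": 1},
    {"match": 1, "team": 2, "the": 1},
]
labels = ["tech", "tech", "sport", "sport"]

terms = select_terms(docs, labels, m=2)
train = [(project(d, terms), y) for d, y in zip(docs, labels)]
partitions = [train[0::2], train[1::2]]  # stand-in for RDD partitions
query = project({"spark": 1, "cluster": 1, "the": 3}, terms)
predicted = knn_classify(partitions, query, k=3)
print(predicted)
```

Merging per-partition candidates is safe because each of the global k nearest neighbours must also be among the k nearest within its own partition. On this toy corpus the uninformative term "the" has zero information gain and is dropped, and the query is classified as "tech". In a Spark setting the same map and reduce steps would run over cached partitions rather than Python lists, which is where the in-memory advantage described in the abstract would apply.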

Keywords: KNN; parallelization; text classification; Spark; RDD; MapReduce

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
