Parallel Forestry Text Classification Technology Based on XGBoost in Spark Framework (cited by: 11)

Authors: CUI Xiaohui[1]; SHI Dongyu; CHEN Zhibo[1]; XU Fu[1] (College of Information Science and Technology, Beijing Forestry University, Beijing 100083, China)

Affiliation: [1] College of Information Science and Technology, Beijing Forestry University

Source: Transactions of the Chinese Society for Agricultural Machinery, 2019, No. 6, pp. 280-287 (8 pages)

Funding: National Natural Science Foundation of China (61772078); Beijing Forestry University Hot-Topic Tracking Project (2018BLRD18)

Abstract: The convergence of "Internet Plus" technology and forestry has produced a large volume of forestry-related text waiting to be mined, yet research on forestry text classification remains immature. Its shortcomings fall into two areas. First, the labels in existing classification systems are poorly designed, so the resulting models classify text from the web poorly. Second, classification algorithms are mostly trained on a single machine with no consideration of parallelism, so they cannot cope with large-scale classification in practice. To address both problems, a web crawler was used to collect forestry-related text from the internet, and the classification labels were rebuilt from the resulting corpus with reference to existing forestry information retrieval systems, improving the adaptability of the classification model. An XGBoost parallelization method was then implemented on the Spark computing framework, with training and prediction carried out in the RDD programming model. Under cross-validation, the parallel XGBoost classifier reached an accuracy of 0.9234; across categories, the lowest F1 was 0.8604 and the highest was 0.9984. On data sets of 21,000, 42,000 and 84,000 records, the training speedup ratios were 2.13, 3.47 and 3.82, respectively. The results show that the classification model built on the new labels adapts well to the forestry-related text currently on the internet. The parallel XGBoost algorithm on Spark was significantly more accurate, with better precision and recall, than Spark-based parallel implementations of four other machine learning algorithms (naive Bayes, gradient boosting decision tree (GBDT), back-propagation (BP) neural network and extreme learning machine (ELM)), and it ran far more efficiently than the stand-alone version. The speedup ratio rises as the data volume grows, so the method can handle real-time, accurate classification of massive forestry text.
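The abstract reports three kinds of figures: overall accuracy, per-category F1, and training speedup ratio. The sketch below shows how such figures are conventionally computed. It assumes the standard textbook definitions, since the abstract does not state its formulas; the labels, predictions and timings are hypothetical toy values, not data from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 per label, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def speedup(t_single, t_parallel):
    """Speedup ratio: single-machine time divided by parallel time."""
    return t_single / t_parallel


# Hypothetical toy data (category names and timings are illustrative only).
y_true = ["pest", "pest", "fire", "fire", "policy"]
y_pred = ["pest", "fire", "fire", "fire", "policy"]

acc = accuracy(y_true, y_pred)
f1 = per_class_f1(y_true, y_pred)
ratio = speedup(100.0, 40.0)

print(f"accuracy = {acc:.4f}")
for label in sorted(f1):
    print(f"F1[{label}] = {f1[label]:.4f}")
print(f"speedup = {ratio:.2f}")
```

Under these definitions, a speedup ratio of 3.82 would mean the parallel run took roughly a quarter of the single-machine training time on the same data.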

Keywords: forestry text; text classification; big data analysis; Spark; XGBoost

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

