Authors: 崔晓晖 (CUI Xiaohui) [1]; 师栋瑜 (SHI Dongyu); 陈志泊 (CHEN Zhibo) [1]; 许福 (XU Fu) [1] (College of Information Science and Technology, Beijing Forestry University, Beijing 100083, China)
Affiliation: [1] College of Information Science and Technology, Beijing Forestry University (北京林业大学信息学院)
Source: 《农业机械学报》 (Transactions of the Chinese Society for Agricultural Machinery), 2019, No. 6, pp. 280-287 (8 pages)
Funding: National Natural Science Foundation of China (61772078); Beijing Forestry University Hotspot Tracking Project (2018BLRD18)
Abstract: The convergence of "Internet Plus" technology and forestry has produced a large volume of forestry-related text awaiting analysis, while research on forestry text classification remains immature. To address this, web crawlers were used to collect forestry-related texts from the internet, the classification labels were rebuilt on the basis of this rich corpus, and an XGBoost parallelization method based on the Spark computing framework was proposed to classify forestry texts. Under cross-validation, the parallel XGBoost classifier reached an accuracy of 0.9234, with per-class F1 ranging from a minimum of 0.8604 to a maximum of 0.9984; its training speedup ratios on datasets of 21,000, 42,000, and 84,000 records were 2.13, 3.47, and 3.82, respectively. The results show that a classification model built on these labels adapts well to the forestry-related texts currently on the internet. The accuracy of the XGBoost parallel algorithm implemented on Spark is significantly better than that of the parallel versions of four other machine learning algorithms (naive Bayes, GBDT decision tree, BP neural network, and ELM neural network), it runs far more efficiently than the single-machine version, and its speedup ratio grows with data volume, so it can handle real-time, accurate classification of massive forestry text.

At present, the integration of computer technology with forestry has produced a large number of forestry texts to be explored, and the shortcomings of related research can be summarized in two aspects. First, the classification labels in existing classification systems are set unscientifically, so classification models lack the ability to classify texts found on the internet. Second, classification algorithms are mostly trained in a single-machine environment without considering parallelism, so they cannot handle real large-scale classification problems. It is therefore practical and urgent to design more scientific classification labels and to classify forestry texts on the Spark framework. A new crawler technology was used to collect forestry-related texts, and labels were reconstructed with reference to the existing forestry information retrieval system to improve the adaptability of classification models. An XGBoost parallelization method was then implemented on Spark, which carries out training and prediction in the RDD programming model. Under cross-validation, the accuracy of the XGBoost parallel algorithm reached 0.9234. The lowest F1-measure was 0.8604 and the highest was 0.9984. When trained on datasets of 21,000, 42,000, and 84,000 records, the speedup ratios reached 2.13, 3.47, and 3.82, respectively. The results showed that the new classification labels were more scientific and that the system adapted better to forestry-related texts on the existing internet. The precision and recall of the XGBoost algorithm were significantly better than those of four other Spark-based parallel algorithms (NB, gradient boosting decision tree, back propagation neural network, and extreme learning machine), and it ran more efficiently than the stand-alone version. As the amount of data increased, the speedup ratio improved, which means it is useful for dealing with the problem about the rea
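The abstract reports per-class F1 and a training speedup ratio without defining them. The sketch below shows the usual definitions, assuming F1 is computed one-vs-rest for each class and speedup is single-machine training time divided by parallel training time. The function names, class labels, and timings are hypothetical illustrations, not values or code from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def speedup(single_machine_seconds, parallel_seconds):
    """Speedup ratio: how many times faster the parallel run is."""
    return single_machine_seconds / parallel_seconds


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    y_true = ["pest", "pest", "fire", "policy"]
    y_pred = ["pest", "fire", "fire", "policy"]
    print(per_class_f1(y_true, y_pred))
    # Hypothetical timings, for illustration only.
    print(speedup(120.0, 40.0))
```

Under these definitions, a reported speedup of 3.82 would mean the parallel training run took roughly a quarter of the single-machine time on the same dataset.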
Keywords: forestry text; text classification; big data analysis; Spark; XGBoost
Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]