Study on High-dimensional Information of Atmospheric Pollutants Based on Improved K-means


Authors: HUANG Lecheng; CHEN Chao; HAN Cunxin; ZHAO Bin (School of Computer Science and Engineering, Sichuan University of Light Chemical Technology, Zigong 643000, Sichuan, China)

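Affiliation: [1] School of Computer Science and Engineering, Sichuan University of Light Chemical Technology, Zigong 643000, Sichuan, China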

Source: Research and Exploration In Laboratory, 2022, Issue 9, pp. 135-139 (5 pages)

Abstract: Applying traditional data mining methods to the open high-resolution air pollution reanalysis dataset for China covering 2013-2018 runs into problems of large data volume and low mining efficiency, so a K-means clustering method running on Spark is used instead to study the massive volume of air pollutant information. Taking six common atmospheric pollutants and five environmental impact factors as examples, a data-dimension model is built over PM2.5, PM10, SO2, NO2, CO, O3, Temp and other dimensions. For choosing the initial number of clusters K in K-means, the Gap Statistic algorithm is compared with the traditional approach of determining K from the sum of squared errors (SSE). The results indicate that, in the high-dimensional sample data model, the K value determined by the Gap Statistic algorithm is more reasonable and intuitive than the one determined with SSE.
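The abstract does not give the authors' implementation, so the following is only a minimal sketch of how a Gap Statistic can be used to pick K for K-means. It makes several assumptions that are not taken from the paper: plain Python instead of Spark, synthetic 3-D blobs instead of the pollutant dataset, a uniform reference distribution over the data's bounding box, k-means++ style seeding, and the common rule of choosing the smallest k with Gap(k) >= Gap(k+1) - s(k+1). The names `make_blobs`, `kmeans_sse` and `gap_statistic` are hypothetical helpers introduced for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_blobs(centers, n_per, spread, seed):
    """Hypothetical helper: Gaussian blobs around the given centers."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(c, spread) for c in center)
            for center in centers for _ in range(n_per)]


def init_centers(points, k, rng):
    """k-means++ style seeding: later centers are drawn with probability
    proportional to squared distance from the nearest chosen center.
    Assumes the points are not all identical."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans_sse(points, k, rng, iters=50, restarts=3):
    """Run Lloyd's K-means a few times and return the lowest SSE found."""
    best = float("inf")
    for _ in range(restarts):
        centers = init_centers(points, k, rng)
        for _ in range(iters):
            # Assignment step: group points by nearest center.
            groups = [[] for _ in range(k)]
            for p in points:
                idx = min(range(k), key=lambda j: sq_dist(p, centers[j]))
                groups[idx].append(p)
            # Update step: move each center to the mean of its group
            # (an empty group keeps its previous center).
            new_centers = [
                tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
                for j, g in enumerate(groups)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
        best = min(best, sse)
    return max(best, 1e-12)  # guard against log(0)


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (chosen_k, gaps), where gaps[k - 1] is Gap(k)."""
    rng = random.Random(seed)
    dims = list(zip(*points))
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_sse(points, k, rng))
        ref_logs = []
        for _ in range(n_refs):
            # Reference set: uniform over the bounding box of the data.
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                   for _ in points]
            ref_logs.append(math.log(kmeans_sse(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    # Smallest k whose gap is within one error term of the next gap.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


if __name__ == "__main__":
    data = make_blobs([(0, 0, 0), (10, 10, 0), (0, 10, 10)], 30, 0.5, seed=42)
    k, gaps = gap_statistic(data, k_max=5)
    print("chosen K:", k)
    print("gaps:", [round(g, 3) for g in gaps])
```

With an SSE-only approach one would inspect the SSE curve for a visual "elbow"; the sketch instead compares log(SSE) against the reference sets, which turns the choice of K into an explicit numeric rule. For a dataset of the size described in the abstract, the clustering step would presumably be delegated to a distributed K-means rather than the pure-Python loop used here.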

Keywords: air pollution data; cluster analysis; Gap Statistic algorithm; error analysis

Classification code: TP399 [Automation and Computer Technology: Computer Application Technology]

 
