Parallelization of Mutual Information Computation for Categorical Data on the Spark Platform (Cited by: 3)

Parallel Mutual-Information Computation of Categorical Data Based on Spark

Author: LI Junli (School of Information Technology and Engineering, Jinzhong University, Jinzhong, Shanxi 030619, China)

Affiliation: [1] School of Information Technology and Engineering, Jinzhong University, Jinzhong, Shanxi 030619, China

Source: Computer Engineering and Applications, 2021, No. 7, pp. 95-100 (6 pages)

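Funding: National Natural Science Foundation of China (61876122); National Natural Science Foundation of China, Young Scientists Fund (61602335); Jinzhong University "1331 Project" Innovation Team Project.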

Abstract: Computing mutual information over large-scale categorical data is computationally expensive. To address this, the paper proposes PMS, a parallel mutual information computation method for categorical data built on the Spark in-memory computing platform. The algorithm first applies a column transformation that converts the data set into multiple data subsets. It then caches intermediate results in two variable-length arrays, which addresses the heavy and highly repetitive computation of mutual information between pairs of categorical features. Finally, PMS is implemented and evaluated on a Spark cluster with 24 computing nodes, using both synthetic and real data sets. The experimental results show that PMS achieves high performance in terms of efficiency, scalability, and extensibility.
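The ideas named in the abstract (column transformation, cached intermediate results, pairwise mutual information) can be illustrated with a minimal single-machine sketch. This is not the paper's PMS implementation and does not use Spark. The function names `to_columns`, `mutual_information`, and `pairwise_mi` are hypothetical, and the cache here is simply a per-column table of value counts, since the abstract does not say what the two variable-length arrays hold. The sketch uses the standard definition I(X;Y) = sum over (x, y) of p(x,y) log2( p(x,y) / (p(x) p(y)) ).

```python
from collections import Counter
from itertools import combinations
from math import log2


def to_columns(rows):
    """Column transformation: turn a list of records into one tuple per feature."""
    return [tuple(col) for col in zip(*rows)]


def mutual_information(x_col, y_col, x_counts, y_counts):
    """Return I(X;Y) in bits, using precomputed marginal counts for both columns."""
    n = len(x_col)
    joint = Counter(zip(x_col, y_col))  # joint counts are specific to this pair
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of raw counts
        mi += (c / n) * log2(c * n / (x_counts[x] * y_counts[y]))
    return mi


def pairwise_mi(rows):
    """Compute mutual information for every pair of feature columns."""
    columns = to_columns(rows)
    # Assumption: marginal counts are the cached intermediate result.
    # They are computed once per column and reused for every pair.
    marginals = [Counter(col) for col in columns]
    return {
        (i, j): mutual_information(columns[i], columns[j], marginals[i], marginals[j])
        for i, j in combinations(range(len(columns)), 2)
    }


if __name__ == "__main__":
    data = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")]
    for pair, value in pairwise_mi(data).items():
        print(pair, round(value, 6))
```

In a distributed setting, each column (or block of columns) produced by the transformation could be handled by a separate worker, because the marginal counts of a column depend only on that column. How PMS actually partitions the work is described in the paper, not in the abstract.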

Keywords: column transformation; parallel mutual information computation; categorical data; Spark platform

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]
