A Survey on Correlation Analysis of Big Data (大数据相关分析综述)  Cited by: 245

Authors: Liang Jiye (梁吉业)[1], Feng Chenjiao (冯晨娇)[1,2], Song Peng (宋鹏)[1,3]

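Affiliations: [1] Key Laboratory of Computational Intelligence and Chinese Information Processing of the Ministry of Education, Shanxi University, Taiyuan 030006; [2] School of Applied Mathematics, Shanxi University of Finance and Economics, Taiyuan 030006; [3] School of Economics and Management, Shanxi University, Taiyuan 030006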

Source: Chinese Journal of Computers (《计算机学报》), 2016, No. 1, pp. 1-18 (18 pages)

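Funding: Supported by the National Natural Science Foundation of China (61432011; U1435212; 71301090); the National Key Basic Research Program of China ("973" Program) (2013CB329404); and the Shanxi Province Innovative Talent Support Program for Higher Education Institutions (2013052006)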

Abstract: In the era of big data, correlation analysis has attracted wide attention because it can quickly and efficiently reveal the inherent relationships among things, and it has been applied effectively in fields such as recommender systems, business analytics, public administration, and medical diagnosis. Addressing the complex characteristics of big data, such as nonlinearity and high dimensionality, and drawing on a semantic analysis of existing correlation analysis methods, this paper reviews existing research on correlation analysis for big data from four aspects: statistical correlation analysis, mutual information, matrix computation, and distance. After summarizing the classical correlation analysis theory in statistics, the paper describes mutual-information-based nonlinear correlation analysis between two variables from the perspective of generality and equitability on large-scale data, analyzes correlation coefficients based on matrix computation in terms of the computability of high-dimensional data, and examines distance-based correlation coefficients with respect to the complex structure of nonlinear, high-dimensional data. Furthermore, on the basis of analyzing and comparing existing correlation analysis methods, the paper discusses the research challenges of correlation analysis for big data, namely high-dimensional data, multivariable data, large-scale data, incremental data, and their computability.

Keywords: big data; correlation analysis; correlation coefficient; information entropy

Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

 
