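Authors: Zhu Yingqiu; Huang Danyang; Zhang Bo[2,3]
Affiliations: [1] School of Statistics, University of International Business and Economics; [2] Center for Applied Statistics, Renmin University of China; [3] School of Statistics, Renmin University of China
Source: Statistical Research (《统计研究》), 2024, No. 6, pp. 147-160 (14 pages)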
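Funding: National Natural Science Foundation of China, General Program, "Modeling, Computation, and Applications of Sparse Network Data" (12071477); National Natural Science Foundation of China, Young Scientists Program, "Clustering Analysis of the Behavior Patterns of Micro, Small, and Medium-Sized Enterprises Based on Payment Data" (72301070); Major Project of the Key Research Bases of Humanities and Social Sciences, Ministry of Education, "Research on Statistical Theory and Methods in the Digital Era" (22JJD110001); Fundamental Research Funds for the Central Universities, University of International Business and Economics, "Research on Distributed Clustering Algorithms Based on Distribution Functions" (22QD09).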
Abstract: With the rapid development of information technology, the data generated by human society are growing in both scale and complexity, which poses major challenges for clustering analysis. In a growing number of applications, the observed data are interrelated or hierarchically nested, so traditional clustering methods, most of which assume independent and identically distributed samples, can hardly be applied directly. The common solution is feature engineering: the observations of each object are compressed into a low-dimensional feature vector, and clustering is performed on these vectors. This compression inevitably loses information. To make full use of the observed data, this paper represents each clustering object by its distribution function, which greatly reduces information loss, and proposes a distributional factor model (DFM) based on the Gaussian mixture model. The model decomposes the observed data into two parts. The first is a set of common factors, expressed as Gaussian components, that reflect the typical patterns shared across the dataset. The second is a loading matrix, each row of which is a loading vector that captures the heterogeneous features of an individual object. Once the loading vectors are estimated, the objects can be clustered based on them. The proposed method has good statistical efficiency: under certain assumptions, the clustering error rate converges to 0 as the number of objects diverges. Experiments on simulated data and on real stock-return and air-pollution data show that DFM can distinguish objects with different feature patterns and can cluster the distribution functions of multidimensional data. DFM can therefore provide decision support for practical problems such as financial risk management and differentiated air-quality governance.
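The decomposition described in the abstract can be written out as a short sketch. The notation below is not taken from the paper: the symbols $f_i$, $w_{ik}$, $\mu_k$, $\Sigma_k$, $N$, and $K$ are illustrative assumptions, as is the constraint that each loading vector is a set of nonnegative mixture weights summing to one. It is one plausible reading of "common factors expressed as Gaussian components" plus "a loading matrix", not the authors' stated formulation.

```latex
% Assumed notation (hypothetical, not from the paper):
%   N objects, K shared Gaussian components (the "common factors"),
%   f_i = density of the observations belonging to object i.
\begin{align}
  f_i(x) &= \sum_{k=1}^{K} w_{ik}\,\phi(x;\mu_k,\Sigma_k),
    \qquad i = 1,\dots,N, \\
  w_{ik} &\ge 0, \qquad \sum_{k=1}^{K} w_{ik} = 1, \\
  W &= (w_{ik}) \in \mathbb{R}^{N \times K},
    \qquad w_i = (w_{i1},\dots,w_{iK})^{\top}.
\end{align}
% phi(. ; mu_k, Sigma_k) is the Gaussian density shared by all objects;
% row w_i of the loading matrix W is the loading vector of object i.
% Clustering is then carried out on the estimated rows of W.
```

Under this reading, two objects fall in the same cluster when their estimated loading vectors are close, that is, when their observations draw on the shared Gaussian components in similar proportions.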
Classification No.: O212 [Science: Probability Theory and Mathematical Statistics]