Optimal Subsampling for Big Data Logistic Regression Based on Robust Distance  Cited by: 1

Authors: Han Xiao; Wang Mingqiu [1]; Zhao Shengli [1] (School of Statistics and Data Science, Qufu Normal University, Qufu, Shandong 273165, China)

Affiliation: [1] School of Statistics and Data Science, Qufu Normal University, Qufu, Shandong 273165, China

Source: Statistics & Decision (《统计与决策》), 2024, No. 15, pp. 59-64 (6 pages)

Funding: National Natural Science Foundation of China, General Program (12271294, 12171277).

Abstract: Statistical analysis of big data is challenging under limited computing resources, so analyzing a subset of the data (subdata) in place of the full data has become one option. Based on the robust distance derived from the minimum covariance determinant, this paper proposes a more efficient subdata selection algorithm for big data logistic regression models. Extensive numerical simulations compare the proposed algorithm with existing algorithms under several criteria. The results show that the proposed algorithm attains high estimation efficiency and computational efficiency, with computation time significantly lower than that required for the full data. Compared with the other algorithms, the subdata selected by the proposed algorithm yields a larger determinant of the information matrix. The proposed algorithm is also robust when the covariates are highly correlated. Finally, an analysis of a real data set shows that the proposed algorithm gives a smaller prediction error.
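The abstract does not spell out the selection rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that the subdata consists of the k points with the largest robust distances, that the minimum covariance determinant (MCD) estimate is approximated by a few concentration steps from random starts, and that the logistic model is fitted by Newton's method. All function names, parameter values, and the simulated data are hypothetical.

```python
import numpy as np

def sq_mahalanobis(X, mu, cov):
    """Squared Mahalanobis distance of every row of X from (mu, cov)."""
    diff = X - mu
    return np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)

def mcd_robust_distances(X, h_frac=0.75, n_starts=3, n_csteps=20, seed=0):
    """Approximate MCD by concentration steps (assumed simplification).

    Returns squared robust distances. No consistency correction is applied,
    which does not change the ranking of the points.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    h = int(h_frac * n)
    best_det, best_mu, best_cov = np.inf, None, None
    for _ in range(n_starts):
        idx = np.sort(rng.choice(n, size=h, replace=False))
        for _ in range(n_csteps):
            mu, cov = X[idx].mean(axis=0), np.cov(X[idx], rowvar=False)
            # Concentration step: keep the h points closest to the current fit.
            new_idx = np.sort(np.argsort(sq_mahalanobis(X, mu, cov))[:h])
            if np.array_equal(new_idx, idx):
                break
            idx = new_idx
        mu, cov = X[idx].mean(axis=0), np.cov(X[idx], rowvar=False)
        det = np.linalg.det(cov)
        if det < best_det:
            best_det, best_mu, best_cov = det, mu, cov
    return sq_mahalanobis(X, best_mu, best_cov)

def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Newton's method for logistic regression with an intercept."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-np.clip(Z @ beta, -30, 30)))
        w = prob * (1.0 - prob)
        hess = (Z * w[:, None]).T @ Z + ridge * np.eye(Z.shape[1])
        step = np.linalg.solve(hess, Z.T @ (y - prob))
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta

def info_logdet(X, beta):
    """Log-determinant of the logistic information matrix on the rows of X."""
    Z = np.column_stack([np.ones(len(X)), X])
    prob = 1.0 / (1.0 + np.exp(-np.clip(Z @ beta, -30, 30)))
    w = prob * (1.0 - prob)
    return np.linalg.slogdet((Z * w[:, None]).T @ Z)[1]

# Simulated data (hypothetical): correlated covariates, known coefficients.
n, p, k = 20000, 3, 500
rng = np.random.default_rng(1)
Sigma = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
eta = 0.5 * X.sum(axis=1)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)

rd2 = mcd_robust_distances(X)
sub_idx = np.argsort(rd2)[-k:]          # assumed rule: k largest robust distances
beta_sub = fit_logistic(X[sub_idx], y[sub_idx])

unif_idx = rng.choice(n, size=k, replace=False)
print("log det, robust-distance subdata:", info_logdet(X[sub_idx], beta_sub))
print("log det, uniform subdata:        ", info_logdet(X[unif_idx], beta_sub))
```

The two printed log-determinants allow a rough comparison of the information carried by the robust-distance subdata and by a uniform subsample of the same size, which is the kind of criterion the abstract mentions. In practice a library implementation of the MCD estimator would normally replace the hand-written concentration steps.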

Keywords: minimum covariance determinant; information matrix; optimal subsampling

Classification: O212.2 [Science: Probability Theory and Mathematical Statistics]

 
