A Statistical Learning Query Strategy Based on Information Entropy Sampling Estimation

Active Learning through Sampling Estimation of Expected Error Reduction Based on Information Entropy

Authors: QU Yu-bin (曲豫宾) [1]; CHEN Xiang (陈翔) [2] (Jiangsu College of Engineering and Technology, Nantong 226007, China; Department of Information Science and Technology, Nantong University, Nantong 226019, China)

Affiliations: [1] Jiangsu College of Engineering and Technology, Nantong, Jiangsu 226007, China; [2] Nantong University

Source: Journal of Tonghua Normal University (《通化师范学院学报》), 2019, No. 12, pp. 66-72 (7 pages)

Funding: Nantong Municipal Science and Technology Project (JC2018134)

Abstract: Active learning query strategies help select, from unlabeled data, the examples that can improve the performance of a classification model, thereby reducing manual labeling cost. A query strategy based on minimizing expected loss helps select unlabeled instances, but it suffers from high computational complexity and unstable sampling performance. Because information entropy is a strong measure of the information content of unlabeled samples, a statistical learning query strategy based on information entropy sampling estimation is proposed. The strategy uses the model trained on the labeled examples to compute the information entropy of every instance in the unlabeled pool, selects the several instances with the highest uncertainty, computes the expected empirical risk of the corresponding data distribution for each of them, and queries the label of the instance that minimizes the expected empirical risk. An empirical study on public UCI machine learning datasets (including tic-tac-toe, transfusion, kr-vs-kp, diagnosis, and breast-cancer), with different proportions of queried instances (such as 20%, 40%, 60%, 80%, and 100%) and different classifiers (such as random forest and logistic regression), shows that, compared with the random sampling strategy, the strategy reduces computational complexity from O(N²) to O(Q × N) and improves the ACCURACY metric by up to 6% in the best case.
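The two-stage selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the classifier here is a hypothetical stand-in (a softmax over distances to class centroids) chosen only to keep the example dependency-free, whereas the paper evaluates random forest and logistic regression. The expected risk is approximated by the average of one minus the maximum posterior over the remaining pool, which is also an assumption, and all function names are illustrative.

```python
import math

def fit_centroids(X, y):
    """Stand-in 'training' (assumption): one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_proba(model, x):
    """Softmax over negative squared distances to each class centroid."""
    scores = {c: -sum((a - b) ** 2 for a, b in zip(x, mu))
              for c, mu in model.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def entropy(probs):
    """Shannon entropy (natural log) of a class-probability dict."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def expected_risk(X_lab, y_lab, pool, cand, model):
    """Expected risk on the rest of the pool if pool[cand] were labeled."""
    x = pool[cand]
    rest = [u for i, u in enumerate(pool) if i != cand]
    if not rest:
        return 0.0
    risk = 0.0
    for label, p_label in predict_proba(model, x).items():
        # Retrain with the candidate added under each possible label.
        retrained = fit_centroids(X_lab + [x], y_lab + [label])
        loss = sum(1 - max(predict_proba(retrained, u).values())
                   for u in rest) / len(rest)
        risk += p_label * loss  # weight by the current model's belief
    return risk

def query(X_lab, y_lab, pool, q):
    """Return the pool index to label next."""
    model = fit_centroids(X_lab, y_lab)
    # Stage 1: shortlist the q highest-entropy instances.
    ranked = sorted(range(len(pool)),
                    key=lambda i: entropy(predict_proba(model, pool[i])),
                    reverse=True)
    # Stage 2: among the shortlist, minimize the expected risk.
    return min(ranked[:q],
               key=lambda i: expected_risk(X_lab, y_lab, pool, i, model))

if __name__ == "__main__":
    X_lab, y_lab = [[0.0, 0.0], [4.0, 4.0]], [0, 1]
    pool = [[0.5, 0.2], [2.0, 2.1], [1.9, 2.0], [3.8, 4.1], [2.2, 1.8]]
    print("query index:", query(X_lab, y_lab, pool, q=3))
```

In this sketch the costly retraining step runs only for the Q shortlisted candidates, each scored against the N pool instances, which is one way to read the O(Q × N) cost stated in the abstract; scoring every pool instance in the same way would require on the order of N² evaluations.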

Keywords: information entropy; active learning; statistical learning

Classification code: TP311.5 [Automation and Computer Technology: Computer Software and Theory]

