Authors: QU Yu-bin (曲豫宾); CHEN Xiang (陈翔) [2]
Affiliations: [1] Jiangsu College of Engineering and Technology, Nantong 226007, Jiangsu, China; [2] Department of Information Science and Technology, Nantong University, Nantong 226019, China
Source: Journal of Tonghua Normal University (《通化师范学院学报》), 2019, No. 12, pp. 66-72 (7 pages)
Funding: Nantong Municipal Science and Technology Project (JC2018134)
Abstract: Active learning query strategies select, from unlabeled data, the examples most likely to improve a classification model's performance, which reduces manual labeling cost. Query strategies based on expected loss minimization help select unlabeled instances, but they suffer from high computational complexity and unstable performance under random sampling. Because information entropy is a strong measure of how informative an unlabeled sample is, this paper proposes a statistical-learning query strategy based on information-entropy sampling estimation. The strategy uses a model trained on the labeled examples to compute the information entropy of every instance in the unlabeled pool, selects the several instances with the highest uncertainty, computes the expected empirical risk of the corresponding data distribution for each of them, and labels the instance that minimizes the expected empirical risk. An empirical study was conducted on public UCI machine learning datasets (including tic-tac-toe, transfusion, kr-vs-kp, diagnosis, and breast-cancer) at different labeling ratios (such as 20%, 40%, 60%, 80%, and 100%) and with different classifiers (such as random forest and logistic regression). The results show that, compared with the random sampling strategy, the proposed strategy reduces computational complexity from O(N^2) to O(Q×N) and improves accuracy by up to 6% in the best case.
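The selection procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names (`prediction_entropy`, `top_q_uncertain`, `select_query`) are hypothetical, as is the `expected_risk` callback, which stands in for the paper's expected-empirical-risk estimate (not specified in the abstract). The sketch also assumes that Q is the number of shortlisted candidates and N the size of the unlabeled pool.

```python
import numpy as np

def prediction_entropy(proba):
    """Shannon entropy of each row of class probabilities (natural log)."""
    # Clip to avoid log(0); rows are assumed to already sum to 1.
    p = np.clip(np.asarray(proba, dtype=float), 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=1)

def top_q_uncertain(proba, q):
    """Indices of the q pool instances with the highest entropy."""
    entropy = prediction_entropy(proba)
    # Stable sort on negated entropy gives descending order.
    order = np.argsort(-entropy, kind="stable")
    return order[:q]

def select_query(proba, q, expected_risk):
    """Pick the shortlisted instance with the lowest estimated risk.

    `expected_risk` is a hypothetical callback mapping a pool index to
    an estimated expected empirical risk; only q candidates are scored,
    rather than the whole pool.
    """
    candidates = top_q_uncertain(proba, q)
    risks = [expected_risk(int(i)) for i in candidates]
    return int(candidates[int(np.argmin(risks))])

# Class probabilities for four unlabeled instances, e.g. the output of
# a trained classifier's probability prediction.
pool_proba = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0], [0.6, 0.4]]
shortlist = top_q_uncertain(pool_proba, q=2)
```

Restricting the costly risk estimate to the Q shortlisted candidates, instead of every pool instance, is what would bring the cost down to the O(Q×N) figure quoted above, assuming each risk estimate is itself linear in the pool size.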
Classification: TP311.5 [Automation and Computer Technology: Computer Software and Theory]