Self-training algorithm based on dynamic threshold and difference test

Authors: LYU Jia [1,2]; QIU Hongbo; XIAO Feng

Affiliations: [1] College of Computer and Information Sciences, Chongqing Normal University, Chongqing 401331, China; [2] Chongqing Digital Agriculture Service Engineering Technology Research Center, Chongqing 401331, China

Source: CAAI Transactions on Intelligent Systems (《智能系统学报》), 2024, No. 4, pp. 839-852 (14 pages)

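Funding: Major Program of the National Natural Science Foundation of China (11991024); Chongqing Municipal Education Commission Science and Technology Innovation Project for "Construction of the Chengdu-Chongqing Economic Circle" (KJCX2020024); Chongqing University Innovation Research Group Funding Project (CXQT20015).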

Abstract: When a self-training algorithm iteratively trains a classifier, it is difficult to select high-confidence samples effectively, and errors from mislabeled samples accumulate. To address these problems, this paper proposes a self-training algorithm based on a dynamic threshold and a difference test. First, the local outlier factor of each sample is introduced to remove outliers from the labeled samples and to assign the unlabeled samples to categories; the unlabeled samples are then processed in batches according to these categories, so that the model can more easily select high-confidence unlabeled samples. Second, a dynamic membership threshold function is designed from the number of newly added pseudo-labeled samples and the change in contrast membership, which improves the quality of the high-confidence samples. Finally, a dense distance is defined to measure the difference between samples: the sums of dense distances between a pseudo-labeled sample and the samples of the same class and of different classes are calculated separately to identify pseudo-labeled samples with high uncertainty, which are returned to the unlabeled sample set for the next round of training, alleviating the accumulation of errors from mislabeled samples. Experimental results show that the algorithm achieves good results on all 12 UCI benchmark datasets.
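The difference-test step can be illustrated with a minimal sketch. The abstract does not give the definition of the dense distance or the exact decision rule, so the code below makes two loudly stated assumptions: plain Euclidean distance stands in for the dense distance, and a pseudo-labeled sample is treated as uncertain when its summed distance to same-class samples is not smaller than its summed distance to different-class samples. The function names `euclidean` and `difference_test`, and the choice to compare against the labeled set, are hypothetical and not taken from the paper.

```python
import math


def euclidean(a, b):
    # Stand-in metric (assumption): the paper's "dense distance" is not
    # defined in the abstract, so plain Euclidean distance is used here.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def difference_test(pseudo, labeled, dist=euclidean):
    """Split pseudo-labeled samples into (kept, returned_to_unlabeled).

    pseudo  : list of (features, pseudo_label) pairs
    labeled : list of (features, label) pairs
    """
    kept, returned = [], []
    for x, y_hat in pseudo:
        # Sum of distances to samples sharing the pseudo-label.
        same = sum(dist(x, z) for z, y in labeled if y == y_hat)
        # Sum of distances to samples of every other class.
        diff = sum(dist(x, z) for z, y in labeled if y != y_hat)
        # Assumed rule: keep the pseudo-label only if the sample is
        # closer overall to its own class than to the other classes.
        if same < diff:
            kept.append((x, y_hat))
        else:
            # Uncertain sample: drop the label and send it back to the
            # unlabeled pool for the next training round.
            returned.append(x)
    return kept, returned


# Toy data: two well-separated, equally sized classes.
labeled = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((5.0, 5.0), 1), ((5.0, 6.0), 1)]
# The second pseudo-label is deliberately wrong (it lies in class 1's region).
pseudo = [((0.5, 0.5), 0), ((4.5, 5.5), 0)]

kept, returned = difference_test(pseudo, labeled)
print(kept)      # [((0.5, 0.5), 0)]
print(returned)  # [(4.5, 5.5)]
```

Comparing raw sums favors whichever side has fewer samples, so with unbalanced classes a mean distance may be a more reasonable stand-in; the toy data above uses equally sized classes to avoid that effect.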

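Keywords: self-training algorithm; mislabeled samples; high-confidence samples; dynamic threshold; difference test; local outlier factor; contrast membership; dense distance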

Classification number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
