检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:黄雅婕 翟俊海[1,2] 周翔 李艳[1,2,3] HUANG Yajie;ZHAI Junhai;ZHOU Xiang;LI Yan(College of Mathematics and Information Science,Baoding Hebei 071002,China;Key Laboratory of Machine Learning and Computational Intelligence(Hebei University),Baoding Hebei 071002,China;Research Center for Applied Mathematics and Interdisciplinary Sciences,Beijing Normal University at Zhuhai,Zhuhai Guangzhou 519087,China)
机构地区:[1]河北大学数学与信息科学学院,河北保定071002 [2]河北省机器学习与计算智能重点实验室(河北大学),河北保定071002 [3]北京师范大学珠海校区应用数学与交叉科学研究中心,广东珠海519087
出 处:《计算机应用》2022年第2期389-394,共6页journal of Computer Applications
基 金:河北省科技计划项目重点研发专项(19210310D);河北省自然科学基金资助项目(F2018201096);河北大学研究生创新资助项目(hbu2019ss077)。
摘 要:随着数据的海量型增长,如何存储并利用数据成为目前学术研究和工业应用等方面的热门问题。样例选择是解决此类问题的方法之一,它在原始数据中依据既定规则选出代表性的样例,从而有效地降低后续工作的难度。基于此,提出一种基于哈希学习的投票样例选择算法。首先通过主成分分析(PCA)方法将高维数据映射到低维空间;然后利用k-means算法结合矢量量化方法进行迭代运算,并将数据用聚类中心的哈希码表示;接着将分类后的数据按比例进行随机选择,在多次独立运行算法后投票选择出最终的样例。与压缩近邻(CNN)算法和大数据线性复杂度样例选择算法LSH-IS-F相比,所提算法在压缩比方面平均提升了19%。所提算法思想简单容易实现,能够通过调节参数自主控制压缩比。在7个数据集上的实验结果显示所提算法在测试精度相似的情况下在压缩比和运行时间方面较随机哈希有较大优势。With the massive growth of data,how to store and use data has become a hot issue in academic research and industrial applications.As one of the methods to solve these problems,instance selection effectively reduces the difficulty of follow-up work by selecting representative instances from original data according to the established rules.Therefore,a voting instance selection algorithm based on learning to hash was proposed.Firstly,the Principal Component Analysis(PCA)method was used to map high-dimensional data to low-dimensional space.Secondly,the k-means algorithm was used to perform iterative operations by combining with the vector quantization method,and the hash codes of the cluster center were used to represent the data.After that,the classified data were randomly selected according to the proportion,and the final instances were selected by voting after several times independent running of the algorithm.Compared with the Compressed Nearest Neighbor(CNN)algorithm and the instance selection algorithm of linear complexity for big data named LSH-IS-F(Instance Selection algorithm by Hashing with two passes),the proposed algorithm has the compression ratio improved by an average of 19%.The idea of the proposed algorithm is simple and easy to implement,and the algorithm can control the compression ratio automatically by adjusting the parameters.Experimental results on 7 datasets show that the proposed algorithm has a great advantage compared to random hashing in terms of compression ratio and running time with similar test accuracy.
关 键 词:样例选择 哈希学习 海明距离 矢量量化 投票方法
分 类 号:TP181[自动化与计算机技术—控制理论与控制工程]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.248