Authors: YANG Xiaoping [1], NI Ping, ZHUGE Tianqiu [3], LUO Yuexin [3], GUO Chunyu [3], PANG Yuelan [3], WU Yuting
Affiliations: [1] College of Information Science and Engineering, Guilin University of Technology, Guilin 541004, Guangxi, China; [2] Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin 541004, Guangxi, China; [3] Tea Science and Research Institute of Guilin, Guilin 541004, Guangxi, China
Source: Journal of Guangxi University (Natural Science Edition), 2024, No. 2, pp. 386-399 (14 pages)
Funding: Guangxi Science and Technology Program Project (桂科AD18281068); Guangxi Science and Technology Major Project (桂科AA203020184).
Abstract: To study the clustering of tea tree gene sequences, the paper designs a dimensionality-reduction clustering algorithm (KPCA-k-means++) that combines kernel principal component analysis (KPCA), improved on the basis of the cumulative variance contribution rate, with the k-means++ clustering algorithm. After the gene bank dataset is filtered and grouped, the k-mers algorithm extracts features from the gene data. The improvement to KPCA is to set the number of retained principal components by requiring that their cumulative variance contribution rate exceed 85%. The reduced data are then clustered with k-means++, and the results are evaluated with the Calinski-Harabasz (CH) index and response time. Among four treatments (clustering alone, KPCA clustering, improved PCA clustering, and improved KPCA clustering), the improved KPCA-k-means++ algorithm achieved the highest CH index across treatments and sample sizes, on average 33% higher than without the improvement. Compared with the similarly improved PCA-k-means++ algorithm, its response time was also shorter across different numbers of clusters and sample sizes. The authors conclude that the improved KPCA-k-means++ algorithm maintains both clustering accuracy and clustering speed on tea tree gene sequences and shows excellent clustering stability.
Keywords: kernel principal component analysis; cumulative variance contribution rate; k-means clustering algorithm; gene clustering
Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]
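The abstract names two steps that are easy to illustrate in isolation: k-mer feature extraction and choosing the number of principal components from the cumulative variance contribution rate. The following is a minimal sketch of those two steps only, not the authors' implementation. The function names, the choice of k = 3, and the four-letter alphabet are assumptions made for illustration; the record does not state the k value or the kernel used, and the KPCA eigendecomposition and the k-means++ clustering step are omitted here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of a sequence.

    Hypothetical helper: k=3 and the ACGT alphabet are illustrative
    assumptions, not values taken from the paper.
    """
    # Fixed ordering of all possible k-mers, so every sequence maps
    # to a vector of the same length (len(alphabet) ** k).
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def components_for_threshold(eigenvalues, threshold=0.85):
    """Smallest number of leading components whose cumulative variance
    contribution rate is strictly greater than `threshold`.

    `eigenvalues` stands in for the eigenvalues of a (centered) kernel
    matrix; how they are obtained is outside this sketch.
    """
    ordered = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no positive eigenvalues")
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total > threshold:
            return count
    return len(ordered)
```

In a full pipeline of the kind the abstract describes, each sequence's frequency vector would become one row of the feature matrix, and the count returned by `components_for_threshold` would fix how many kernel principal components are kept before clustering.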