Machine Learning-Based Clustering Algorithm for Tea Tree DNA

Authors: YANG Xiaoping [1], NI Ping, ZHUGE Tianqiu [3], LUO Yuexin [3], GUO Chunyu [3], PANG Yuelan [3], WU Yuting

Affiliations: [1] College of Information Science and Engineering, Guilin University of Technology, Guilin 541004, Guangxi, China; [2] Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin 541004, Guangxi, China; [3] Tea Science and Research Institute of Guilin, Guilin 541004, Guangxi, China

Source: Journal of Guangxi University (Natural Science Edition), 2024, No. 2, pp. 386-399 (14 pages)

Funding: Guangxi Science and Technology Plan Project (桂科AD18281068); Guangxi Science and Technology Major Project (桂科AA203020184).

Abstract: To study the clustering of tea tree gene sequences, this paper designs a dimensionality-reduction clustering algorithm (KPCA-k-means++) that combines kernel principal component analysis (KPCA), improved on the basis of the cumulative variance contribution rate, with the k-means++ clustering algorithm. The gene bank dataset is first filtered and grouped, and features are extracted from the gene data with the k-mers algorithm. KPCA is then improved by fixing the number of retained principal components according to the criterion that the cumulative variance contribution rate exceed 85%, and the reduced data are clustered with k-means++. The clustering results are evaluated by the Calinski-Harabasz (CH) index and by response time. The results show that, among four treatments (clustering alone, KPCA clustering, improved PCA clustering, and improved KPCA clustering), the improved KPCA-k-means++ algorithm had the highest CH index for every treatment and sample size compared, on average 33% higher than without the improvement. In terms of response time, the improved KPCA-k-means++ algorithm had shorter response times than the likewise improved PCA-k-means++ algorithm across the numbers of clusters and sample sizes compared. The improved KPCA-k-means++ algorithm thus maintains clustering accuracy and clustering speed for tea tree gene sequences and shows excellent clustering stability.

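Keywords: kernel principal component analysis; cumulative variance contribution rate; k-means clustering algorithm; gene clustering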

Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]
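
Three steps named in the abstract can be illustrated with a minimal sketch: k-mer feature extraction, choosing the number of principal components from the 85% cumulative variance contribution criterion, and the Calinski-Harabasz index. This is not the authors' implementation. The function names, the choice k=2, the ACGT alphabet, and the frequency normalization are all assumptions made for illustration. The KPCA eigendecomposition and the k-means++ clustering themselves are omitted and would normally come from a numerical library.

```python
from itertools import product


def kmer_features(seq, k=2, alphabet="ACGT"):
    """Normalized k-mer frequencies of seq, in fixed lexicographic order.

    Hypothetical helper: k, the alphabet, and the normalization are assumed.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def n_components_for_threshold(eigenvalues, threshold=0.85):
    """Smallest number of leading components whose cumulative variance
    contribution rate exceeds threshold (0.85 follows the abstract)."""
    vals = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(vals)
    running = 0.0
    for m, v in enumerate(vals, start=1):
        running += v
        if running / total > threshold:
            return m
    return len(vals)


def calinski_harabasz(points, labels):
    """CH index: between-cluster dispersion over within-cluster dispersion,
    each scaled by its degrees of freedom.

    Assumes at least 2 clusters, fewer clusters than points, and
    nonzero within-cluster dispersion.
    """
    n, dim = len(points), len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    between = within = 0.0
    for members in clusters.values():
        size = len(members)
        centroid = [sum(p[d] for p in members) / size for d in range(dim)]
        between += size * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        within += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members
        )
    k = len(clusters)
    return (between / (k - 1)) / (within / (n - k))
```

As an example of the selection rule, eigenvalues 5, 3, 1, 1 give cumulative contribution rates of 0.5, 0.8, and 0.9 for the first three components, so `n_components_for_threshold([5, 3, 1, 1])` keeps three. A higher CH value indicates tighter and better separated clusters, which is why the abstract reports it as the quality measure.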
