Application of a Transformer-Based Combinatorial Clustering Algorithm in Protein Data Analysis


Authors: CHEN Xianglong; LI Haijun [1,2]; ZHAO Fujun; YUAN Yuan

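Affiliations: [1] School of Information and Intelligent Engineering, University of Sanya, Sanya 572022, Hainan, China; [2] Academician Guoliang Chen Team Innovation Center, University of Sanya, Sanya 572022, Hainan, China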

Source: Wireless Internet Technology (《无线互联科技》), 2024, Issue 14, pp. 74-81 (8 pages)

Funding: University of Sanya "Industry-Education Integration" Research Project for Master's Supervisors, project No. USY23CJRH03.

Abstract: This study adapts the Transformer model to dimensionality reduction of protein features. Its self-attention mechanism lets the model capture long-range dependencies, and its multi-head attention design captures interactions between features from different perspectives, which further improves the expressiveness and robustness of the reduced representation. The paper proposes a new combinatorial clustering algorithm, GRKM, which extends the original K-means algorithm in three ways: the Grey Wolf Optimization algorithm determines the number of clusters K, a random walk algorithm selects the initial cluster centers, and the Mahalanobis distance measures the similarity between samples. Experiments on five representative protein datasets show that the improved algorithm achieves substantially better silhouette coefficient and Davies-Bouldin (DB) index values than the original algorithm. The final analysis takes APP protein data, clusters the proteins into eight classes, and discusses the biological function of each class, which also yields clear gains in interpretability. The proposed algorithm offers a reference tool for practical applications such as understanding protein function in depth, discovering potential biomarkers, and guiding drug design.
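As an illustration of the distance measure named in the abstract (马氏距离, the Mahalanobis distance), the following is a minimal sketch of K-means iterations that assign samples by Mahalanobis distance. It is not the authors' GRKM implementation. The function names, the use of a single covariance matrix pooled over all samples, and the toy data are assumptions made for this example. In the paper, K comes from Grey Wolf Optimization and the initial centers from a random walk; here both are simply passed in by hand.

```python
import numpy as np


def mahalanobis_distances(X, centers, cov_inv):
    """Return an (n_samples, n_centers) matrix of Mahalanobis distances."""
    diff = X[:, None, :] - centers[None, :, :]            # shape (n, k, d)
    sq = np.einsum("nkd,de,nke->nk", diff, cov_inv, diff)  # diff^T * cov_inv * diff
    return np.sqrt(np.maximum(sq, 0.0))                    # clip tiny negatives


def kmeans_mahalanobis(X, init_centers, n_iter=50):
    """Lloyd-style K-means whose assignment step uses Mahalanobis distance.

    Assumption: one covariance matrix estimated from all samples is shared
    by every cluster. The abstract does not state which covariance is used.
    """
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        labels = mahalanobis_distances(X, centers, cov_inv).argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:          # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Recompute labels so they always match the returned centers.
    labels = mahalanobis_distances(X, centers, cov_inv).argmin(axis=1)
    return labels, centers


# Toy data (hypothetical): two well-separated 2-D blobs standing in for
# reduced protein feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[10.0, 10.0], scale=1.0, size=(50, 2)),
])

# K = 2 and the initial centers are hand-picked stand-ins for the values
# that Grey Wolf Optimization and the random walk would supply.
labels, centers = kmeans_mahalanobis(X, init_centers=[[0.0, 0.0], [10.0, 10.0]])
print(np.bincount(labels), centers.round(2))
```

With a fixed shared covariance, the cluster mean still minimizes the summed squared Mahalanobis distance within a cluster, so the usual mean-update step remains valid. Per-cluster covariances would require a different update rule.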

Keywords: protein sequence; Transformer model; clustering algorithm; Mahalanobis distance; random walk; Grey Wolf Optimization algorithm

Classification number: TP301.6 [Automation and Computer Technology: Computer System Architecture]

 
