Microbial flora structure based on probability topic model


Authors: Wang Xialin, Zuo Zan, Zhou Lanping [3], Zhu Lin, Fan Hong, Kong Xiangyang, He Jianfeng [1]

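Affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; [2] Department of Gastroenterology, Affiliated Hospital of Kunming University of Science and Technology, The First People's Hospital of Yunnan Province, Kunming 650532; [3] Emergency Department, Yan'an Hospital of Kunming City, Yan'an Hospital Affiliated to Kunming Medical University, Kunming 650051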

Source: Scientia Sinica (Vitae), 2017, No. 11, pp. 1220-1234 (15 pages)

Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 81260077, 81560107, 11265007)

Abstract: The heterogeneity of microbial flora structure plays an important role in host health and disease. The temporal and spatial heterogeneity of flora structure has mainly been studied with unsupervised and supervised learning algorithms. Because flora data share key characteristics with text data, this paper applies latent Dirichlet allocation (LDA), an unsupervised probabilistic topic model, to study the temporal heterogeneity of flora structure, and compares it with hierarchical clustering and K-Means clustering. Using LDA with collapsed Gibbs sampling, a Monte Carlo algorithm, we analyzed temporally heterogeneous operational taxonomic unit (OTU) data sets from two sources: the vaginal flora of the northern pig-tailed macaque (Macaca leonina) (MVB) and the flora associated with minimal hepatic encephalopathy (MHE). LDA divided the OTU data sets of the 27 MVB samples and the 77 MHE samples into six topics and four topics, respectively. These numbers differ from the numbers of clusters produced by hierarchical clustering and K-Means clustering (5, 3 and 4, 3, respectively). In addition, when the physiological data (pH) of the MVB samples and the α diversity of the MHE samples are taken into account, the groupings by pH and by α value agree more closely with the sample groupings produced by LDA. LDA therefore classifies the OTU data sets more accurately with respect to how the samples aggregate. More importantly, LDA can also identify the representative OTUs in each topic. Compared with hierarchical clustering and K-Means clustering, LDA not only quantifies the heterogeneity of flora structure more effectively but also identifies the OTUs that drive that heterogeneity.
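The approach described in the abstract, treating each sample as a document and each OTU read as a word, can be sketched as follows. This is a minimal illustration of LDA with collapsed Gibbs sampling, not the authors' implementation: the function name `lda_collapsed_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy count matrix are all assumptions made for this example.

```python
import random


def lda_collapsed_gibbs(counts, n_topics, alpha=0.1, beta=0.1, n_iter=200, seed=0):
    """Fit LDA to a sample-by-OTU count matrix with collapsed Gibbs sampling.

    counts[d][w] is the read count of OTU w in sample d.
    Returns (theta, phi): sample-topic and topic-OTU probability tables.
    """
    rng = random.Random(seed)
    n_samples, n_otus = len(counts), len(counts[0])

    # Expand counts into one token per read so each read gets its own topic.
    tokens = [(d, w)
              for d in range(n_samples)
              for w in range(n_otus)
              for _ in range(counts[d][w])]
    z = [rng.randrange(n_topics) for _ in tokens]  # random initial topics

    # Count tables: sample-topic, topic-OTU, and topic totals.
    n_dk = [[0] * n_topics for _ in range(n_samples)]
    n_kw = [[0] * n_otus for _ in range(n_topics)]
    n_k = [0] * n_topics
    for (d, w), k in zip(tokens, z):
        n_dk[d][k] += 1
        n_kw[k][w] += 1
        n_k[k] += 1

    for _ in range(n_iter):
        for i, (d, w) in enumerate(tokens):
            # Remove the current assignment of this read from the counts.
            k = z[i]
            n_dk[d][k] -= 1
            n_kw[k][w] -= 1
            n_k[k] -= 1
            # Conditional topic weights with theta and phi integrated out.
            weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                       / (n_k[t] + n_otus * beta)
                       for t in range(n_topics)]
            k = rng.choices(range(n_topics), weights=weights)[0]
            z[i] = k
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1

    # Point estimates from the final state of the sampler.
    theta = [[(n_dk[d][k] + alpha) / (sum(n_dk[d]) + n_topics * alpha)
              for k in range(n_topics)] for d in range(n_samples)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + n_otus * beta)
            for w in range(n_otus)] for k in range(n_topics)]
    return theta, phi


# Hypothetical toy data: 4 samples x 6 OTUs (not taken from the paper).
toy_counts = [
    [20, 15, 10, 0, 0, 0],
    [18, 12, 14, 0, 1, 0],
    [0, 0, 1, 16, 20, 12],
    [0, 1, 0, 14, 18, 15],
]
theta, phi = lda_collapsed_gibbs(toy_counts, n_topics=2)

# Dominant topic per sample, and the three highest-probability OTUs per topic.
dominant = [max(range(len(row)), key=row.__getitem__) for row in theta]
top_otus = [sorted(range(len(row)), key=lambda w: -row[w])[:3] for row in phi]
print("dominant topic per sample:", dominant)
print("top OTU indices per topic:", top_otus)
```

Here `theta` plays the role of the per-sample topic mixture used to group samples, and the highest-probability entries in each row of `phi` correspond to what the abstract calls the representative OTUs of a topic. Choosing the number of topics (six and four in the paper) is a separate model-selection step that this sketch does not cover.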

Keywords: LDA model; Gibbs sampling; Monte Carlo algorithm; hierarchical clustering; K-Means clustering

Classification codes: R37 [Medicine and Health: Pathogen Biology]; TP311.13 [Medicine and Health: Basic Medicine]

 
