Comparison of Clustering Methods for Large-scale Single-cell RNA-sequencing Data


Authors: ZHU Xiaoshu; MENG Shuang; LONG Faning

Affiliations: [1] School of Computer Science and Engineering, Guangxi Normal University, Guilin, Guangxi 541004, China; [2] School of Computer Science and Engineering, Yulin Normal University, Yulin, Guangxi 537000, China

Source: Guangxi Sciences, 2023, No. 4, pp. 764-775 (12 pages)

Funding: Supported by the National Natural Science Foundation of China (62141207).

Abstract: Single-cell RNA-sequencing (scRNA-seq) data are highly sparse, noisy, and high-dimensional, and lack structural and positional information. Together with the rapid growth in data scale, these properties make single-cell clustering challenging. To help in choosing suitable analysis methods for different scRNA-seq data, this study compares methods for quality control, gene selection, and clustering of scRNA-seq data. First, the filtering and normalization methods used in quality control, and their threshold settings, are analyzed. Then, six typical gene selection methods are compared in terms of model factors, sequencing technology, limitations, and advantages. Finally, six typical single-cell clustering methods are described in detail, and the data scales they suit, along with their strengths and weaknesses, are analyzed. Fourteen gold-standard scRNA-seq datasets with ground-truth labels were collected, comprising 5 full-length sequencing datasets and 9 paired-end sequencing datasets; 5 of the datasets contain more than 3,000 cells. The six gene selection methods and six clustering methods were compared experimentally to analyze their differences in identifying highly differentially expressed genes and in clustering performance. The results show that the different gene selection methods detect 182 common genes on the Adam dataset and 124 on the Wang_Lung dataset, in addition to some genes unique to each method. Seurat, SC3, Monocle 3, and scDeepCluster show better clustering stability than the other methods; Seurat achieves the best clustering stability and accuracy across all datasets, and scDeepCluster achieves good clustering accuracy on most datasets. Selecting an appropriate scRNA-seq analysis method therefore requires weighing the sequencing platform, the data scale, and the distribution of gene expression together.
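
The filtering and normalization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a small cells-by-genes count matrix held as nested lists; the function name `filter_and_normalize` and the thresholds `min_genes`, `min_cells`, and `scale` are hypothetical values chosen for illustration, not settings taken from the paper. The sketch drops low-quality cells and rarely detected genes, then applies library-size normalization followed by a log transform, which is one common convention rather than the only one.

```python
import math

def filter_and_normalize(counts, min_genes=2, min_cells=2, scale=10_000):
    """Filter a cells x genes raw count matrix, then log-normalize it.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    # Keep cells that express at least `min_genes` genes.
    kept_cells = [
        i for i, row in enumerate(counts)
        if sum(1 for c in row if c > 0) >= min_genes
    ]
    # Keep genes detected in at least `min_cells` of the remaining cells.
    n_genes = len(counts[0]) if counts else 0
    kept_genes = [
        j for j in range(n_genes)
        if sum(1 for i in kept_cells if counts[i][j] > 0) >= min_cells
    ]
    normalized = []
    for i in kept_cells:
        row = [counts[i][j] for j in kept_genes]
        total = sum(row)
        # Scale each cell to the same total count, then apply log(1 + x).
        normalized.append(
            [math.log1p(c / total * scale) if total > 0 else 0.0 for c in row]
        )
    return kept_cells, kept_genes, normalized

# Toy matrix: 4 cells (rows) x 4 genes (columns).
counts = [
    [5, 0, 3, 0],
    [0, 0, 1, 0],
    [2, 4, 0, 0],
    [1, 1, 1, 0],
]
cells, genes, norm = filter_and_normalize(counts)
print(cells, genes)
```

On the toy matrix, the second cell is removed because it expresses only one gene, and the last gene is removed because no remaining cell expresses it. In practice, thresholds of this kind depend on the sequencing platform and the data scale.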

Keywords: single-cell RNA-sequencing data; quality control; gene selection; clustering; cell type identification

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
