Recognizing Sensitive Sequences from Genomic Data with High Error Rate Integrating Filter and Similarity Calculation

Authors: SUN Hui; ZHONG Cheng (School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China; Key Laboratory of Parallel Distributed Computing Technology in Guangxi Universities, Nanning 530004, China)

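Affiliations: [1] School of Computer, Electronics and Information, Guangxi University, Nanning 530004 [2] Key Laboratory of Parallel Distributed Computing Technology in Guangxi Universities, Nanning 530004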

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2023, No. 6, pp. 1227-1235 (9 pages)

Funding: Supported by the National Natural Science Foundation of China (61962004, 61462005) and the Innovation Project of Guangxi Graduate Education (YCSW2021020).

Abstract: Existing algorithms have difficulty identifying sensitive sequences in sequencing data with a high error rate. To address this, a sensitive sequence recognition algorithm that integrates filtering and similarity calculation is proposed. First, the sequence to be recognized is split into multiple short sequences, and a double Bloom filter is constructed to dynamically filter and de-duplicate the short sequences, avoiding repeated computation. Second, local fragments of the short sequences are encoded as k-mers, and the similarity measure for these local fragments is improved to accurately identify short tandem repeats. Third, the short sequences are k-mer encoded and compared against the sensitive sequences in the GWAS Catalog database to accurately identify disease-related sequences. Finally, based on the recognition results for the short sequences, two mask sequences are generated for the sequence to be recognized; these serve as the result of identifying sensitive sequences in the sequencing data. Experimental results show that, compared with the similar algorithms LRF and SRF, the proposed algorithm improves the average recognition accuracy on sequencing data with error rates of 2% to 20% by 1.96% and 3.66%, respectively, and the precision by 40.08% and 68.36%, respectively, effectively improving sensitive sequence recognition in high-error-rate genomic data.
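The abstract names three building blocks: Bloom-filter de-duplication of short sequences, k-mer encoding, and a similarity measure (the keywords mention the Pearson correlation coefficient). The sketch below illustrates these ideas only; it is not the authors' implementation. Assumptions: a single plain Bloom filter stands in for the paper's double Bloom filter, whose design the abstract does not describe; similarity is taken as the Pearson correlation of k-mer count vectors; and all names and parameters (`BloomFilter`, `split_and_dedupe`, `kmer_vector`, `pearson`, the filter size, `k=2`) are hypothetical.

```python
import hashlib
from itertools import product
from math import sqrt


class BloomFilter:
    """Plain Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def split_and_dedupe(sequence, length):
    """Cut a sequence into fixed-length pieces and skip pieces seen before."""
    seen = BloomFilter()
    unique = []
    for start in range(0, len(sequence) - length + 1, length):
        piece = sequence[start:start + length]
        if piece not in seen:  # a rare false positive may drop a novel piece
            seen.add(piece)
            unique.append(piece)
    return unique


def kmer_vector(seq, k=2):
    """Count every k-mer over the alphabet ACGT, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # ignore k-mers containing N or other symbols
            counts[kmer] += 1
    return [counts[m] for m in kmers]


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0
```

In such a scheme, a short sequence could be flagged when `pearson(kmer_vector(piece), kmer_vector(reference))` exceeds a chosen threshold. The threshold, the value of `k`, and how the two mask sequences are derived from the flags are left open here because the abstract does not specify them.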

Keywords: sensitive sequence recognition; Pearson correlation coefficient; filtering; similarity calculation; alignment

Classification: TP301 [Automation and Computer Technology: Computer System Architecture]
