Authors: SUN Hui [1,2]; ZHONG Cheng [1,2]
Affiliations: [1] School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China; [2] Key Laboratory of Parallel Distributed Computing Technology in Guangxi Universities, Nanning 530004, China
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2023, No. 6, pp. 1227-1235 (9 pages)
Funding: Supported by the National Natural Science Foundation of China (61962004, 61462005) and the Innovation Project of Guangxi Graduate Education (YCSW2021020).
Abstract: Existing algorithms have difficulty identifying sensitive sequences in sequencing data with a high error rate. To address this, the paper proposes a sensitive-sequence recognition algorithm that combines filtering with similarity calculation. First, the sequence to be examined is split into multiple short sequences, and a double Bloom filter is built to dynamically filter and de-duplicate them, which avoids repeated computation. Second, local fragments of each short sequence are encoded as k-mers, and the method for measuring the similarity of these local fragments is improved so that short tandem repeats are identified accurately. Third, each short sequence is encoded as k-mers and compared against the sensitive sequences in the GWAS Catalog database to identify disease-related sequences. Finally, based on the results for the short sequences, two mask sequences are generated for the input sequence; these masks are the output of the recognition step. Experimental results show that, on sequencing data with error rates of 2% to 20%, the proposed algorithm improves average recognition accuracy by 1.96% and 3.66% and precision by 40.08% and 68.36% over the comparable algorithms LRF and SRF, respectively, which indicates that it recognizes sensitive sequences in high-error-rate genome data more effectively.
Keywords: sensitive sequence recognition; Pearson correlation coefficient; filtering; similarity calculation; alignment
Classification code: TP301 [Automation and Computer Technology: Computer System Architecture]
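
Illustrative sketch (not from the paper): the abstract names three building blocks, namely Bloom-filter de-duplication of short sequences, k-mer encoding, and a similarity measure, and the keywords list the Pearson correlation coefficient. The Python below is a minimal sketch of those generic techniques only. It uses a single Bloom filter, because this record does not describe how the paper's double Bloom filter is constructed, and it compares plain k-mer count vectors, because the paper's improved similarity measure is not given here either. All names (`BloomFilter`, `split_reads`, `deduplicate`, `kmer_vector`, `pearson`) and all parameter values are hypothetical.

```python
import hashlib
import math
from itertools import product


class BloomFilter:
    """Plain Bloom filter over strings; the sizes are arbitrary assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def split_reads(sequence, length):
    """Cut a sequence into consecutive, non-overlapping short sequences."""
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


def deduplicate(reads):
    """Keep the first occurrence of each read, using a Bloom filter as the
    'seen' set. A Bloom filter can report false positives, so a rare unique
    read may be dropped; it never lets a true duplicate through."""
    seen = BloomFilter()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def kmer_vector(sequence, k):
    """Count every k-mer over the alphabet ACGT, in a fixed lexicographic order."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in index:  # skip k-mers containing N or other symbols
            counts[index[kmer]] += 1
    return counts


def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

In this sketch, a caller would split a read with `split_reads`, pass the pieces through `deduplicate`, and then score a pair of fragments with `pearson(kmer_vector(a, k), kmer_vector(b, k))`. A score near 1 means the two fragments have similar k-mer composition. Any threshold for deciding that a fragment is a short tandem repeat or matches a database entry would have to come from the paper itself.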