A Data Matching Algorithm Based on Instruction Pipelining

Authors: Yang Jiajia; Li Zheng; Zheng Er; Zhao Jing; Yan Wei; Liu Jin (The Sixth Research Institute of China Electronics Corporation, Beijing 100083, China)

Affiliation: [1] The Sixth Research Institute of China Electronics Corporation, Beijing 100083, China

Source: Application of Electronic Technique (《电子技术应用》), 2025, No. 2, pp. 81-85 (5 pages)

Abstract: Regular-expression-based data matching is valuable for basic data governance and data cleaning. In high-performance computing, however, the low throughput of matching algorithms cannot meet the performance demands of big data processing, which limits their range of application. To address this, a data matching algorithm based on instruction pipelining, called γFA, is proposed. It uses the vector instructions built into the Intel architecture to read in several character segments in a pipelined fashion, compares each segment against the untrusted character set with a wide vector comparison function, and converts the results into integer vectors. A position location function then accumulates over these integer vectors to locate the first untrusted character position and computes the total number of characters that can be skipped. This reduces the large time overhead that the regular expression matching engine incurs from accessing slow memory while processing the untrusted character set, and thereby improves the performance of regular expression matching. Experimental results show that the throughput of γFA is 15.88 to 53.06 times that of the original DFA algorithm and 35.12% to 63.26% higher than that of the βFA algorithm. After further optimization, performance approaches 100 Gb/s, which is 15.88 to 64.94 times that of the original DFA matching algorithm and 2.15% to 43.09% higher than that of the unoptimized γFA algorithm.
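The compare, convert, and locate steps described in the abstract can be illustrated with a small scalar emulation. This is only a sketch under stated assumptions, not the authors' implementation: the names `compare_mask`, `first_set_bit`, `skippable_prefix`, and the 32-byte `SEGMENT_WIDTH` are hypothetical, the vector comparison is emulated with a plain loop rather than Intel vector instructions, and the sketch assumes that every character before the first untrusted one may be skipped before control returns to the DFA engine.

```python
SEGMENT_WIDTH = 32  # assumed segment size in bytes (hypothetical, e.g. one 256-bit register)


def compare_mask(segment: bytes, untrusted: frozenset) -> int:
    """Emulate a vector compare: bit i is set when segment[i] is untrusted."""
    mask = 0
    for i, byte in enumerate(segment):
        if byte in untrusted:
            mask |= 1 << i
    return mask


def first_set_bit(mask: int) -> int:
    """Index of the lowest set bit (the role of a count-trailing-zeros step)."""
    return (mask & -mask).bit_length() - 1


def skippable_prefix(data: bytes, start: int, untrusted: frozenset) -> int:
    """Accumulate over segments until the first untrusted character is found.

    Returns how many characters, counted from `start`, can be skipped.
    """
    skipped = 0
    pos = start
    while pos < len(data):
        segment = data[pos:pos + SEGMENT_WIDTH]
        mask = compare_mask(segment, untrusted)
        if mask:
            # The first untrusted character lies in this segment.
            return skipped + first_set_bit(mask)
        # Whole segment is free of untrusted characters: skip it entirely.
        skipped += len(segment)
        pos += len(segment)
    return skipped


if __name__ == "__main__":
    untrusted = frozenset(b"<")
    data = b"a" * 40 + b"<tag>"
    print(skippable_prefix(data, 0, untrusted))
```

In a real vectorized version, `compare_mask` would correspond to a wide byte comparison followed by a mask extraction, and `first_set_bit` to a trailing-zero count. The scalar loop here only mirrors the control flow.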

Keywords: regular expression matching; instruction pipeline; high-performance data matching

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
