A Multiple Pattern Matching Algorithm for Specifications of Incremental Metadata for Sci-Tech Literature  Cited by: 2

Authors: Dong Mei [1,2]; Chang Zhijun [1,2]; Zhang Runjie [3]

Affiliations: [1] National Science Library, Chinese Academy of Sciences, Beijing 100190, China [2] Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China [3] Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2021, Issue 6, pp. 135-144 (10 pages)

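Funding: This work is one of the research outputs of the Literature and Information Capacity Building Project of the Chinese Academy of Sciences (Grant No. Y9100901).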

Abstract: [Objective] To normalize the institution information in the small daily increments of journal literature metadata against a large pattern set, this paper designs a Hash-based multiple pattern matching algorithm. [Methods] A Hash function locates the pattern strings, which reduces system memory usage. The first word (or character) of each pattern string is extracted and combined with word-level skip matching, which reduces the number of match attempts and enlarges the jump distance, improving the efficiency of multiple pattern matching. [Results] In experiments that used the 1.82 million entries of the CSCD institution library as the pattern set, the algorithm built the dictionary for the pattern set relatively quickly compared with the Aho-Corasick (AC) algorithm. With a text string set of about 10,000 entries it also showed better time performance, notably a 9.39% improvement on the English corpus. Unlike the Wu-Manber (WM) algorithm, it is not constrained by the shortest pattern string. [Limitations] The algorithm or the data must be adjusted for different pattern sets and text string sets. Neither the algorithm nor its extended headless (no-first-word) mode suits scenarios with a small pattern set and a large text string set. [Conclusions] The algorithm can be applied to Chinese, English, and mixed Chinese-English text. With a large pattern set (on the order of 10^6) and a small text string set (about 10,000), its time performance exceeds that of the classic AC algorithm (by 0.08%-30.41%) and of the WM algorithm.
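The abstract gives only an outline of the method, so the following is a minimal sketch of the general idea rather than the authors' implementation: patterns are indexed in a hash table keyed by their first token, the text is scanned token by token, and a successful match skips past the whole matched pattern. The tokenization rule (one token per ASCII word, one token per other non-space character such as a CJK character), the longest-match-first policy, and the names `tokenize`, `build_index`, and `find_matches` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed tokenization: an ASCII word/number is one token; any other
# non-space character (e.g. a CJK character or punctuation) is one token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into lower-cased tokens under the assumed rule above."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def build_index(patterns):
    """Hash each pattern by its first token.

    Returns a dict: first token -> list of (token tuple, original pattern),
    sorted so that longer patterns are tried first (an assumed policy).
    """
    index = defaultdict(list)
    for pattern in patterns:
        toks = tuple(tokenize(pattern))
        if toks:
            index[toks[0]].append((toks, pattern))
    for bucket in index.values():
        bucket.sort(key=lambda item: len(item[0]), reverse=True)
    return index


def find_matches(text, index):
    """Scan the text by token; on a match, jump past the matched pattern."""
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        step = 1  # default: advance by one token
        # Only patterns sharing this first token are compared at all.
        for toks, original in index.get(tokens[i], ()):
            if tuple(tokens[i:i + len(toks)]) == toks:
                matches.append((i, original))
                step = len(toks)  # skip the whole matched pattern
                break
        i += step
    return matches


if __name__ == "__main__":
    idx = build_index([
        "Chinese Academy of Sciences",
        "University of Southampton",
        "中国科学院",
        "中国科学院大学",
    ])
    print(find_matches("中国科学院大学, University of Southampton", idx))
```

Because only the bucket for the current first token is examined, most text positions cost a single hash lookup. Unlike the shift table in WM, nothing in this sketch depends on the length of the shortest pattern.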

Keywords: pattern matching; data normalization; name standardization; hash algorithm

Classification number: TP391 [Automation and Computer Technology - Computer Application Technology]

 
