Normalization of Chinese Institutional Names Based on Trie Tree Search and Unessential Words Elimination


Authors: ZHAO Jing (赵静) [1]; JIANG Shuming (姜树明) [2]; MA Qiyun (马启云) [2]

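Affiliations: [1] School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong 250000, China [2] Information Research Institute of Shandong Academy of Sciences, Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong 250000, China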

Source: Frontiers of Data & Computing (《数据与计算发展前沿(中英文)》), 2025, No. 2, pp. 141-148 (8 pages)

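Funding: Shandong Province Innovation Capability Improvement Project for Technology-Based Small and Medium-Sized Enterprises (2023TSGC0135).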

Abstract: [Background] Institution name data often contain inconsistent names for the same institution. Because of cognitive differences and subjective preferences among individuals, one institution may be given several non-standard names. These non-standard names are usually grounded in common knowledge and are widely understood and accepted, and a single non-standard name usually does not correspond to more than one standard name. [Methods] On this basis, the article proposes a normalization algorithm for Chinese institution names based on Trie tree search and unessential words elimination. The algorithm normalizes Chinese institution names automatically through steps that include unessential words elimination, Trie tree fuzzy matching, and a review step that selects the best candidate, which improves the accuracy and efficiency of data integration. [Conclusions] Experimental results show that the method performs well in both normalization accuracy and matching efficiency.
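The pipeline named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the unessential-word list, the example institution names, the edit-distance threshold `max_dist`, and the review rule (keep the candidate with the smallest edit distance) are all assumptions made for illustration. The fuzzy match walks the Trie while carrying one row of the Levenshtein table per node, so a branch is abandoned as soon as every entry in its row exceeds the threshold.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical unessential words; the article's actual list is not given in the abstract.
UNESSENTIAL_WORDS = ["股份有限公司", "有限责任公司", "有限公司", "集团"]


def strip_unessential(name: str) -> str:
    """Remove unessential words, longest first so a short word cannot split a longer one."""
    for word in sorted(UNESSENTIAL_WORDS, key=len, reverse=True):
        name = name.replace(word, "")
    return name


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.standard: Optional[str] = None  # standard name, set on terminal nodes only


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, key: str, standard: str) -> None:
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.standard = standard

    def fuzzy_search(self, query: str, max_dist: int) -> List[Tuple[str, int]]:
        """Return (standard name, edit distance) for every key within max_dist of query."""
        results: List[Tuple[str, int]] = []
        first_row = list(range(len(query) + 1))
        for ch, child in self.root.children.items():
            self._walk(child, ch, query, first_row, max_dist, results)
        return results

    def _walk(self, node: TrieNode, ch: str, query: str, prev_row: List[int],
              max_dist: int, results: List[Tuple[str, int]]) -> None:
        # Build the Levenshtein row for the key prefix that ends at this node.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost))
        if node.standard is not None and row[-1] <= max_dist:
            results.append((node.standard, row[-1]))
        # Prune: if every entry exceeds max_dist, no longer key on this branch can match.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                self._walk(child, next_ch, query, row, max_dist, results)


def build_trie(standard_names: List[str]) -> Trie:
    """Index each standard name under its key with unessential words removed."""
    trie = Trie()
    for name in standard_names:
        trie.insert(strip_unessential(name), name)
    return trie


def normalize(raw: str, trie: Trie, max_dist: int = 1) -> Optional[str]:
    """Strip unessential words, fuzzy-match in the Trie, then keep the closest candidate."""
    key = strip_unessential(raw)
    if not key:
        return None
    candidates = trie.fuzzy_search(key, max_dist)
    if not candidates:
        return None
    # Assumed review rule: the candidate with the smallest edit distance wins.
    return min(candidates, key=lambda c: c[1])[0]
```

With a Trie built from `["齐鲁工业大学", "示例科技股份有限公司"]` (the second name is invented), `normalize("齐鲁工大学", trie)` resolves to `齐鲁工业大学` through a one-character edit, and `normalize("示例科技有限公司", trie)` resolves to `示例科技股份有限公司` because both names reduce to the same key once unessential words are removed.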

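Keywords: normalization; unessential words elimination; data cleaning; Trie tree; edit-distance search; review and best-candidate selection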

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
