Chinese Sensitive Word Variant Detection Based on an Extended Trie Tree  (Cited by: 1)


Authors: ZHAO Tianshu; SHEN Ying; LI Baiyan[1]; LIU Xiaoqiang[1]; ZHU Min

Affiliations: [1] School of Computer Science and Technology, Donghua University, Shanghai 201620, China; [2] Shanghai Key Laboratory of Computer Software Testing and Evaluating, Shanghai 201112, China

Source: Intelligent Computer and Applications (《智能计算机与应用》), 2024, No. 4, pp. 215-221 (7 pages)

Abstract: The casual, unconstrained way people write online means that word variants appear frequently on web pages, which poses a challenge to web information security. This paper addresses the detection of Chinese sensitive word variants and proposes a fast detection method based on an extended Trie tree. First, the types of Chinese sensitive word variants are classified. Then, drawing on the characteristics of Chinese sensitive words, an extended Trie tree is built by enriching the information stored within nodes and the links between nodes. The Trie tree is then searched according to the generation rules of Chinese variants. Finally, a BERT-based binary classifier re-evaluates the retrieved results to reduce the false detection rate. Experiments show that the algorithm reaches 98.69% accuracy and 94.25% recall, recognizes common Chinese sensitive word variants, and meets application requirements for time efficiency.
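The abstract does not give the node layout or the variant rules, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes two illustrative variant rules: a substitution table mapping a variant character to its canonical character, and a small set of noise characters that may be inserted inside a word. The names `VariantTrie`, `variants`, `noise`, and `max_skip` are hypothetical, and the BERT-based second-pass classifier is not included.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # canonical char -> Node
    word: Optional[str] = None  # set when a sensitive word ends at this node


class VariantTrie:
    """Trie lookup that tolerates substituted and inserted characters."""

    def __init__(self, variants, noise, max_skip=2):
        self.root = Node()
        self.variants = variants  # assumed rule: variant char -> canonical char
        self.noise = noise        # assumed rule: chars that may split a word
        self.max_skip = max_skip  # max consecutive noise chars tolerated

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.word = word

    def find(self, text):
        """Return (start, end, word) for every match, end exclusive."""
        hits = []
        for start in range(len(text)):
            node, skipped = self.root, 0
            for end in range(start, len(text)):
                ch = text[end]
                canon = self.variants.get(ch, ch)  # undo a substitution
                if canon in node.children:
                    node = node.children[canon]
                    skipped = 0
                    if node.word is not None:
                        hits.append((start, end + 1, node.word))
                elif (node is not self.root and ch in self.noise
                      and skipped < self.max_skip):
                    skipped += 1  # ignore a noise char inside a word
                else:
                    break
        return hits


# "赌博" (gambling) is the dictionary word; "堵" is a homophone stand-in for "赌".
trie = VariantTrie(variants={"堵": "赌"}, noise={"*", " "})
trie.add("赌博")
print(trie.find("来堵*博吧"))  # [(1, 4, '赌博')]
```

In a pipeline like the one the abstract describes, each hit would then be passed with its surrounding context to a binary classifier that decides whether it is a true sensitive-word occurrence.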

Keywords: sensitive words; word variants; Trie tree; BERT

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
