Authors: Wu Shan (吴珊); Li Yingxiang (李英祥) [2]; Xu Hongyan (徐鸿雁) [1]; Zhang Shixia (张仕霞); Shi Yijun (施宜军)
Affiliations: [1] School of Intelligent Technology, Tianfu College of Southwest University of Finance & Economics, Mianyang, Sichuan 621000, China; [2] College of Communication Engineering, Chengdu University of Information Technology, Chengdu 610103, China; [3] The 5th Electronic Research Institute of the Ministry of Industry and Information Technology (MIIT), Guangzhou 510507, China
Source: Application Research of Computers (《计算机应用研究》), 2021, No. 6, pp. 1678-1682, 1688 (6 pages)
Funding: National Natural Science Foundation of China (61804032); Academician Fund project (ZHD201806).
Abstract: Based on a study of sensitive word filtering methods for text content and the related techniques, this paper proposes a sensitive word filtering algorithm built on an improved Trie tree and a DFA. The algorithm addresses key problems in sensitive word filtering, including deliberate human interference and word segmentation obstacles, and improves the accuracy and effectiveness of filtering. It consists of three steps. First, using the mathematics of permutations and combinations, Chinese words are expanded into mixed Chinese-pinyin words. Second, an improved Trie tree structure stores all states of the DFA, forming the sensitive word tree. Third, sensitive words in the text are detected and filtered using the sensitive word tree together with the minimum matching rule. Analysis gives a time complexity of O(n×len) for building the sensitive word tree and O(L) for detecting and filtering sensitive words. Experimental results show that the algorithm achieves a precision of 100% and a recall of about 87% to 100%.
Keywords: improved Trie tree; deterministic finite automaton (DFA); sensitive word filtering; minimum matching rule
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
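The three steps named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's improved Trie structure, which the record does not describe. It is a plain dictionary-based trie whose nodes act as DFA states, written under stated assumptions. The function names (`expand_mixed`, `build_trie`, `filter_text`), the caller-supplied `pinyin` mapping, and the `*` mask character are all hypothetical.

```python
from itertools import product

# Unique marker key: a node holding it ends a complete sensitive word.
# Using an object() guarantees it can never collide with a text character.
END = object()


def expand_mixed(word, pinyin):
    """Step 1 (assumed form): return every variant of `word` in which each
    character is either kept or replaced by its pinyin from `pinyin`."""
    choices = [(ch, pinyin[ch]) if ch in pinyin else (ch,) for ch in word]
    return {"".join(parts) for parts in product(*choices)}


def build_trie(words):
    """Step 2: store all words in a nested-dict trie; each dict is one state.
    The work is proportional to the total number of characters inserted."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def filter_text(text, root, mask="*"):
    """Step 3: scan `text` and mask sensitive words, stopping at the
    shortest match from each start position (minimum matching rule)."""
    out = []
    i = 0
    while i < len(text):
        node = root
        j = i
        matched = 0
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                matched = j - i
                break  # minimum matching: do not look for a longer word
        if matched:
            out.append(mask * matched)
            i += matched
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    variants = expand_mixed("赌博", {"赌": "du", "博": "bo"})
    tree = build_trie(variants)
    print(filter_text("禁止赌博和dubo", tree))
```

With the two-character word above, `expand_mixed` yields four variants (both characters kept, either one replaced, both replaced), so a writer who swaps a character for its pinyin is still caught. Because the scan works character by character, no word segmentation step is required.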