Chinese-Thai Cross-Lingual Sensitive Information Recognition Incorporating a Sensitive Dictionary and a Heterogeneous Graph

Authors: ZHU Xu-ran; YU Zheng-tao [1,2]; ZHANG Yong-bing

Affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China; [2] Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan, China

Source: Computer Engineering and Design (《计算机工程与设计》), 2024, No. 7, pp. 2150-2156 (7 pages)

Funding: National Natural Science Foundation of China (U21B2027, 61972186, 62266028); Yunnan Provincial Major Science and Technology Special Program (202202AD080003).

Abstract: General-purpose cross-lingual text classification models recognize sensitive information such as drugs, violence, and natural disasters inaccurately, and Chinese-Thai sensitive words have diverse representations that are hard to align, which weakens the aggregation of information across the two languages. To address these problems, a Chinese-Thai cross-lingual sensitive information recognition method that incorporates a sensitive dictionary and a heterogeneous graph is proposed. A Chinese-Thai sensitive dictionary is used to build a cross-lingual heterogeneous graph with both document alignment and word alignment: documents, together with the keywords and sensitive words they contain, serve as nodes, and bilingual alignment, similarity relations, and different parts of speech serve as edges. Document nodes and word nodes are represented with a multilingual pre-trained model. Input documents are encoded by a multi-layer graph convolutional neural network, and a sensitive information classifier predicts the class of each document. Experimental results show that the proposed method improves accuracy by 5.83% over the baseline model.
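The graph construction and graph convolution steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_graph`, `normalize`, `gcn_layer`), the toy documents, the single dictionary entry, the unweighted edges, and the identity matrices standing in for multilingual pre-trained embeddings and learned weights are all assumptions made for illustration. Similarity and part-of-speech edges are omitted, and the propagation rule is the standard symmetric-normalized GCN layer, which the abstract does not confirm as the exact variant used.

```python
import math

def build_graph(docs, dictionary):
    """Build nodes and a symmetric adjacency matrix (hypothetical helper).

    docs: {doc_id: [tokens]}; dictionary: [(zh_word, th_word), ...].
    Assumes document ids never coincide with tokens.
    """
    words = sorted({w for toks in docs.values() for w in toks})
    nodes = list(docs) + words
    index = {name: i for i, name in enumerate(nodes)}
    adj = [[0.0] * len(nodes) for _ in nodes]

    def connect(a, b):
        i, j = index[a], index[b]
        adj[i][j] = adj[j][i] = 1.0

    for doc_id, toks in docs.items():
        for w in toks:
            connect(doc_id, w)      # document-word edge
    for zh, th in dictionary:
        if zh in index and th in index:
            connect(zh, th)         # bilingual alignment edge from the dictionary
    return nodes, index, adj

def normalize(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the usual GCN normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(a_norm, h, w):
    """One propagation step: ReLU(a_norm @ h @ w)."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Toy data: one Chinese and one Thai document plus one dictionary entry.
docs = {"doc_zh": ["毒品", "警方"], "doc_th": ["ยาเสพติด", "ตำรวจ"]}
dictionary = [("毒品", "ยาเสพติด")]

nodes, index, adj = build_graph(docs, dictionary)
a_norm = normalize(adj)

# Placeholders: one-hot features instead of pre-trained embeddings,
# identity weights instead of learned parameters.
n = len(nodes)
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
h1 = gcn_layer(a_norm, identity, identity)
h2 = gcn_layer(a_norm, h1, identity)

# How much of the Thai word reaches the Chinese document after 1 and 2 layers.
print(h1[index["doc_zh"]][index["ยาเสพติด"]], h2[index["doc_zh"]][index["ยาเสพติด"]])
```

In this toy graph the Chinese document is two hops from the Thai sensitive word, so the word contributes to the document's representation only from the second layer onward. This is one way to see why a multi-layer network is paired with dictionary-based alignment edges.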

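Keywords: sensitive dictionary; cross-lingual; heterogeneous graph; graph convolutional neural network; sensitive information recognition; multilingual pre-trained model; bilingual alignment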

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
