Similarity Calculation of Chinese and Vietnamese Text by Combining Keywords and Semantic Features (original title: 融合关键词和语义特征的汉越文本相似度计算)  Cited by: 1


Authors: PAN Run-hai (潘润海), GAO Sheng-xiang (高盛祥) [1,2], YU Zheng-tao (余正涛) [1,2], LIU Yi-yang (刘奕洋), YOU Cong-cong (尤丛丛)

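Affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; [2] Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China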

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2022, No. 6, pp. 1309-1314 (6 pages)

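Funding: National Natural Science Foundation of China (61761026, 61972186, 61732005, 61672271, 61762056); National Key Research and Development Program of China (2019QY1802, 2019QY1801, 2019QY1800); Yunnan High-Tech Talent Project (201606, 202105AC160018); Yunnan Major Science and Technology Special Program (202002AD080001-5, 202103AA080015); Yunnan Basic Research Program (202001AS070014, 2018FB104); Kunming University of Science and Technology Provincial Talent Training Project (KKSY201703005).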

Abstract: Chinese-Vietnamese text similarity calculation is the basis for Chinese-Vietnamese text understanding and text classification. Neural networks are currently an effective way to compute text similarity, but the approach is limited in two ways: the texts are long and contain much redundant information, which makes it hard for a neural network to capture the similarity information between texts, and the scarcity of Chinese-Vietnamese parallel corpora leaves the model with only mediocre generalization performance. This paper therefore proposes a Chinese-Vietnamese text similarity calculation method that combines keywords and semantic features. To address long texts with redundant information, text keywords are used to capture the key information, compressing the text and reducing redundancy, and the keyword similarity information between texts is computed at the same time. To address the scarcity of parallel corpora, a neural network is trained by knowledge distillation to encode the texts and obtain contextual semantic features. Finally, the word-level similarity information and the contextual semantic features are fused to judge text relevance. Experiments show that the proposed method effectively improves the accuracy of Chinese-Vietnamese text similarity calculation.

Keywords: Chinese-Vietnamese; text similarity; BERT; keywords; neural network

Classification number: TP399 [Automation and Computer Technology: Computer Application Technology]
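
Illustrative sketch: the abstract describes fusing keyword similarity information with contextual semantic features, but it does not say how the two are combined. The Python below is only a minimal sketch under stated assumptions, not the authors' method. It assumes the keywords of both texts have already been extracted and mapped into a shared vocabulary (for example through a bilingual dictionary), that a sentence encoder has already produced one embedding per text, and that fusion is a simple weighted sum. The function names and the weight `alpha` are hypothetical.

```python
import math
from typing import Sequence, Set


def keyword_similarity(kw_a: Set[str], kw_b: Set[str]) -> float:
    """Jaccard overlap of two keyword sets (an assumed stand-in measure)."""
    if not kw_a and not kw_b:
        return 0.0
    return len(kw_a & kw_b) / len(kw_a | kw_b)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two embeddings of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fused_similarity(
    kw_a: Set[str],
    kw_b: Set[str],
    emb_a: Sequence[float],
    emb_b: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Weighted sum of keyword and semantic similarity.

    alpha is a hypothetical mixing weight; the paper may fuse the two
    signals differently (for example inside the network).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    kw_score = keyword_similarity(kw_a, kw_b)
    sem_score = cosine_similarity(emb_a, emb_b)
    return alpha * kw_score + (1.0 - alpha) * sem_score


# Toy usage with made-up keywords and embeddings.
score = fused_similarity(
    {"economy", "trade"}, {"trade", "border"},
    [1.0, 0.0], [1.0, 0.0],
    alpha=0.5,
)
```

A weighted sum is used here only because it keeps the two signals easy to inspect separately; in a trained model the fusion weight would normally be learned rather than fixed.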

 
