Character Vector Representation Method Combining Chinese Character Glyph and Character Semantics  Cited by: 7

Authors: TANG Shan-cheng[1], ZHANG Xue, ZHANG Pu-yue, WANG Han-bo, CHEN Ming

Affiliation: [1] College of Communication and Information Engineering, Xi'an University of Science and Technology, Xi'an 710054, China

Source: Science Technology and Engineering, 2021, No. 32, pp. 13787-13792 (6 pages)

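Funding: National Key Research and Development Program of China (2018YFC0808300); Shaanxi Province Science and Technology Program, Key Industry Innovation Chain (Cluster) Project (2020ZDLGY15-07); Xi'an Science and Technology Program, Science and Technology Innovation Guidance Project (201805036YD14CG20(4)).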

Abstract: The quality of character vector representations has an important influence on Chinese text processing methods. The commonly used Chinese character vector representation methods Word2Vec and GloVe do not consider the semantic information implied by the overall glyph structure of Chinese characters, and they do not use the linguistic knowledge contained in dictionaries. To overcome these shortcomings, a character vector representation method that combines Chinese character glyphs and character meanings, GnM2Vec (glyph and meaning to vector), is proposed. First, a glyph autoencoder automatically captures the semantics implied by each character's glyph, yielding glyph vectors. Then, every character in each meaning entry of a character is represented by its glyph vector, yielding glyph-based meaning vectors. Finally, a meaning autoencoder processes these to generate character vectors that integrate glyph and meaning. The experimental results show that, in named entity recognition, the F1 value is 2.25, 0.05, and 0.3 higher than that of GloVe, Word2Vec, and G2Vec (based on glyph vectors), respectively. In Chinese word segmentation, the F1 value is 0.3, 0.14, and 0.33 higher, respectively. In short-text semantic similarity computation, three models were used: a convolutional neural network (CNN), Self-Attention, and long short-term memory (LSTM). The mean F1 value is 3.24 and 1.99 higher than that of Word2Vec and GloVe, respectively.
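
The three-stage data flow described in the abstract (glyph vector, glyph-based meaning vector, fused character vector) can be sketched roughly as follows. This is only an illustration of the shapes involved, not the authors' implementation: the trained glyph and meaning autoencoders are replaced by fixed random linear maps, the bitmap renderer `toy_bitmap` is a hypothetical stand-in, all sizes and dictionary entries are made up, and averaging over meaning entries is an assumption, since the abstract does not say how several meanings of one character are combined.

```python
import random

# All sizes below are hypothetical; the abstract does not state them.
GLYPH_PIXELS = 16 * 16   # flattened glyph bitmap size
GLYPH_DIM = 8            # glyph-vector size
MAX_DEF_LEN = 6          # characters kept per meaning entry
CHAR_DIM = 12            # final character-vector size


def make_encoder(in_dim, out_dim, seed):
    """Fixed random linear map standing in for a trained encoder (assumption)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) / in_dim ** 0.5 for _ in range(in_dim)]
               for _ in range(out_dim)]

    def encode(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return encode


def toy_bitmap(ch):
    """Hypothetical stand-in for rendering a character to a pixel bitmap."""
    rng = random.Random(ord(ch))
    return [float(rng.random() < 0.3) for _ in range(GLYPH_PIXELS)]


glyph_encoder = make_encoder(GLYPH_PIXELS, GLYPH_DIM, seed=0)
meaning_encoder = make_encoder(MAX_DEF_LEN * GLYPH_DIM, CHAR_DIM, seed=1)


def glyph_vector(ch):
    # Stage 1: glyph bitmap -> glyph vector.
    return glyph_encoder(toy_bitmap(ch))


def meaning_input(definition):
    # Stage 2: replace each character of one meaning entry by its glyph
    # vector, truncate or zero-pad to MAX_DEF_LEN rows, then flatten.
    rows = [glyph_vector(c) for c in list(definition)[:MAX_DEF_LEN]]
    rows += [[0.0] * GLYPH_DIM] * (MAX_DEF_LEN - len(rows))
    return [v for row in rows for v in row]


def character_vector(definitions):
    # Stage 3: encode every meaning entry, then average across entries
    # (averaging is an assumption, not taken from the source).
    encoded = [meaning_encoder(meaning_input(d)) for d in definitions]
    return [sum(col) / len(encoded) for col in zip(*encoded)]


# Toy dictionary entries, invented for illustration only.
toy_dictionary = {"明": ["光亮", "清楚"], "林": ["成片的树木"]}
vectors = {ch: character_vector(defs) for ch, defs in toy_dictionary.items()}
print(len(vectors["明"]))  # prints 12 (CHAR_DIM)
```

In a real system the two `make_encoder` calls would be replaced by the encoder halves of autoencoders trained with a reconstruction loss; the keywords indicate that the paper uses convolutional autoencoders.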

Keywords: character vector representation; glyph; character meaning; convolutional autoencoder; natural language processing

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
