Authorship identification method based on the embedding of the syntax tree node


Authors: ZHANG Yang, JIANG Minghu [1]

Affiliation: [1] Computational Linguistics Laboratory, Department of Chinese, School of Humanities, Tsinghua University, Beijing 100084, China

Source: Journal of Tsinghua University (Science and Technology), 2023, No. 9, pp. 1390-1398 (9 pages)

Funding: Key Program of the National Natural Science Foundation of China (62036001).

Abstract: Authorship identification is an interdisciplinary field that infers the author of an unknown text by analyzing its writing style. Existing studies rely mostly on character and lexical features, and syntactic dependency information has rarely been explored. This paper proposes an authorship identification method based on syntax tree node embedding: each node of the syntax tree is represented as the sum of the embeddings of all its dependency arcs, which brings dependency relation information into the deep learning model. A syntactic attention network is then built to produce a syntax-aware vector that fuses dependency relations, part-of-speech tags, and words. Next, a sentence attention network yields the sentence representation, and finally a classifier performs the classification. In experiments on three English datasets, the proposed method ranks second or third in performance. More importantly, introducing dependency syntax combinations offers more directions for interpreting the model.

[Objective] Authorship identification infers the authorship of an unknown text by analyzing its stylometry, or writing style. Traditional research on authorship identification generally draws on empirical knowledge of literature or linguistics, whereas modern research mostly relies on mathematical methods to quantify an author's writing style. Researchers have proposed various feature combinations and neural network models. Some feature combinations achieve better results with traditional machine learning classifiers, while some neural network models autonomously learn the relationship between the input text and the corresponding author and thereby extract text features implicitly. However, current research focuses mostly on character and lexical features, and the exploration of syntactic features is limited. How to use the dependency relationships between the words of a sentence, and how to combine syntactic features with neural networks, remains unclear. This paper proposes an authorship identification method based on syntax tree node embedding, which introduces syntactic features into a deep learning model.

[Methods] We believe that an author's writing style is reflected mainly in how the author chooses words and constructs sentences. This paper therefore develops the authorship identification model from the perspectives of words and sentences, and uses the attention mechanism to construct sentence-level features. First, an embedding representation of the syntax tree node is proposed, in which the node is expressed as the sum of the embeddings corresponding to all its dependency arcs. Thus, information on sentence structure and on the association between words is introduced into the neural network model. Then, a syntactic attention network is constructed that uses different embedding methods to vectorize text features such as dependencies, part-of-speech tags, and words, and a syntax-aware vector is obtained through this network. Furthermore, the sentence attention network is used to e
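The node representation described in the abstract (a syntax tree node expressed as the sum of the embeddings of all its dependency arcs) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how an arc is mapped to an embedding, so the sketch assumes a lookup keyed by the relation label and the node's role on the arc (`"head"` or `"dep"`). The names `make_table` and `node_embeddings`, the toy dimension, the random initialisation, and the example sentence are all hypothetical.

```python
import random

DIM = 4  # toy embedding size (assumption; the abstract gives no dimension)


def make_table(keys, dim=DIM, seed=0):
    """Random lookup table standing in for trainable arc embeddings."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for k in keys}


def node_embeddings(n_tokens, arcs, table, dim=DIM):
    """Represent each token (tree node) as the sum of the embeddings
    of every dependency arc that touches it."""
    nodes = [[0.0] * dim for _ in range(n_tokens)]
    for head, dep, rel in arcs:
        # head == -1 marks the artificial ROOT, which has no token of its own
        if head >= 0:
            vec = table[(rel, "head")]
            nodes[head] = [a + b for a, b in zip(nodes[head], vec)]
        vec = table[(rel, "dep")]
        nodes[dep] = [a + b for a, b in zip(nodes[dep], vec)]
    return nodes


# "She reads books": "reads" is the root, "She" its subject, "books" its object.
tokens = ["She", "reads", "books"]
arcs = [(-1, 1, "root"), (1, 0, "nsubj"), (1, 2, "obj")]  # (head, dependent, relation)

relations = sorted({rel for _, _, rel in arcs})
table = make_table([(r, role) for r in relations for role in ("head", "dep")])
nodes = node_embeddings(len(tokens), arcs, table)

for tok, vec in zip(tokens, nodes):
    print(tok, [round(x, 3) for x in vec])
```

In this toy example "reads" takes part in three arcs, so its vector sums three arc embeddings, while "She" and "books" each carry one. According to the abstract, such node vectors are then combined with part-of-speech and word information in the syntactic attention network. That step is not sketched here.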

Keywords: authorship identification; syntax tree node; dependency relation; attention mechanism

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
