Research on Author Recognition of Chinese Short Text Based on Neural Networks (基于神经网络中文短文本作者识别研究)

Authors: LI Menglin (李孟林) [1], LUO Wenhua (罗文华) [1], LI Shaoming (李绍鸣)

Affiliations: [1] Department of Cyber Crime Investigation, Criminal Investigation Police University of China, Shenyang 110854, Liaoning, China; [2] Human-computer Intelligence Research Center, Shenyang Aerospace University, Shenyang 110136, Liaoning, China

Source: Journal of People's Public Security University of China (Science and Technology) (《中国人民公安大学学报(自然科学版)》), 2020, No. 2, pp. 61-67 (7 pages)

Abstract: As Internet applications become increasingly widespread, short texts are growing in importance as electronic evidence in forensic science, and courts urgently need to attribute authorship of large volumes of online chat content. Traditional machine learning methods are highly sensitive to feature selection; because features that accurately capture an author's writing habits are hard to extract in practice, these methods perform poorly in real use. To address the short length, sparse features, and difficult feature extraction of such texts, this paper proposes a multi-attribute fusion neural network method for Chinese short-text author recognition. First, attributes such as the text's structural features, semantic features, sending time, sending location, and sending frequency are fused into the text sequence, which is then represented as word vectors. A convolutional layer and a Bi-LSTM layer automatically extract local features and contextual features, an attention mechanism dynamically adjusts the feature weights, and a Softmax classifier outputs the text's author. A maximum entropy model serves as the baseline in comparative experiments. The results show that the convolutional and Bi-LSTM layers can "learn" the contextual features of short texts, that the attention mechanism can "learn" more of the key features at different positions in the text sequence, and that the multi-attribute fusion neural network method improves author recognition accuracy by about 5% over the traditional model.
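The abstract does not give implementation details, so the following is only a minimal sketch of two of the steps it names: fusing message attributes into the text sequence, and attention-weighted pooling followed by a softmax. Everything specific here is an assumption made for illustration: the attribute token format (`<hour_..>`, `<loc_..>`, `<freq_..>`, `<len_..>`), the bucket thresholds, the function names, and the use of plain dot-product attention with a fixed query vector. The convolutional and Bi-LSTM layers are not implemented; the feature vectors passed to `attention_pool` stand in for their output.

```python
import math
from datetime import datetime


def fuse_attributes(tokens, sent_at, location, msgs_per_hour):
    """Prepend hypothetical attribute tokens to a tokenized message.

    The token vocabulary and bucket thresholds are illustrative
    assumptions, not the encoding used in the paper.
    """
    hour_token = f"<hour_{sent_at.hour:02d}>"      # sending time
    loc_token = f"<loc_{location}>"                # sending location
    if msgs_per_hour < 5:                          # sending frequency bucket
        freq_token = "<freq_low>"
    elif msgs_per_hour < 20:
        freq_token = "<freq_mid>"
    else:
        freq_token = "<freq_high>"
    # One simple structural feature: message length in buckets of 5 tokens.
    len_token = f"<len_{min(len(tokens) // 5, 3)}>"
    return [hour_token, loc_token, freq_token, len_token] + list(tokens)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Dot-product attention over per-position feature vectors.

    Returns the attention weights and the weighted sum of the features.
    """
    scores = [sum(f * q for f, q in zip(vec, query)) for vec in features]
    weights = softmax(scores)
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, features))
        for d in range(len(query))
    ]
    return weights, pooled


if __name__ == "__main__":
    sequence = fuse_attributes(
        ["明天", "见"], datetime(2020, 1, 1, 23, 15), "shenyang", 12
    )
    print(sequence)
    # Stand-in features for three sequence positions (2 dimensions each).
    weights, pooled = attention_pool(
        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.5, 0.5]
    )
    print(weights, pooled)
```

In a trained model the query vector and the feature vectors would be learned rather than fixed, and the pooled vector would feed a final linear layer whose softmax output ranks the candidate authors.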

Keywords: short text; multi-attribute; Bi-LSTM; maximum entropy; author recognition

Classification code: TP393.08 [Automation and Computer Technology: Computer Application Technology]

 
