Title: Research on New Word Discovery Methods for Specific Domains (针对特定领域的新词发现方法研究). Cited by: 1

Research on New Word Discovery Methods Facing the Military Field

Authors: 申兆媛 (SHEN Zhao-yuan); 巢翌 (CHAO Yi) [1]; 李晓龙 (LI Xiao-long); 张伟 (ZHANG Wei). Beijing Institute of Control and Electronic Technology, Beijing 100038, China

Affiliation: [1] Beijing Institute of Control and Electronic Technology (北京控制与电子技术研究所), Beijing 100038

Source: Computer Simulation (《计算机仿真》), 2022, No. 6, pp. 269-273, 335 (6 pages in total)

Abstract: Accurately identifying domain-specific new words in text is an important task in safeguarding data security within enterprises and institutions. Based on the characteristics of domain-specific corpora, this paper proposes a new word discovery method for specific domains. First, the corpus is preprocessed. Second, Jieba, combined with a word-formation strategy for the domain, is used to segment the text, and an N-gram sliding window extracts candidate word strings. Third, pointwise mutual information, branch entropy (adjacency entropy), word frequency, and a normalized score are used to filter new words. Fourth, the new words are vectorized and reduced in dimensionality. Finally, K-means separates domain new words from common new words, yielding the domain new word set. The method addresses the poor fit of general-purpose new word discovery methods to specific domains, and comparative experiments on a corpus of about 100,000 lines from a certain domain verify its effectiveness.
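The candidate-scoring step named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, thresholds, or normalization details, so the function name `score_candidates`, the choice of the minimum PMI over all binary splits, and the use of the smaller of the left and right entropies are all assumptions made for illustration. The input is assumed to be already segmented (for example by Jieba) into lists of tokens.

```python
import math
from collections import Counter, defaultdict


def score_candidates(sentences, max_n=3):
    """Score token n-grams (2 <= n <= max_n) as new-word candidates.

    sentences: list of token lists (assumed to be pre-segmented text).
    Returns {ngram_tuple: (frequency, pmi, branch_entropy)}.
    Hypothetical helper for illustration only.
    """
    ngram_counts = Counter()
    left_neighbors = defaultdict(Counter)
    right_neighbors = defaultdict(Counter)
    total_tokens = 0

    # Slide windows of size 1..max_n over every sentence.
    for tokens in sentences:
        total_tokens += len(tokens)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                ngram_counts[gram] += 1
                if n >= 2:
                    if i > 0:
                        left_neighbors[gram][tokens[i - 1]] += 1
                    if i + n < len(tokens):
                        right_neighbors[gram][tokens[i + n]] += 1

    def prob(gram):
        # Relative frequency, using the token count as a shared denominator.
        return ngram_counts[gram] / total_tokens

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return sum((c / total) * math.log(total / c) for c in counter.values())

    scores = {}
    for gram, freq in ngram_counts.items():
        if len(gram) < 2:
            continue
        # Cohesion: the weakest binary split decides (assumed convention).
        pmi = min(
            math.log(prob(gram) / (prob(gram[:k]) * prob(gram[k:])))
            for k in range(1, len(gram))
        )
        # Boundary freedom: the less varied side decides (assumed convention).
        branch = min(entropy(left_neighbors[gram]), entropy(right_neighbors[gram]))
        scores[gram] = (freq, pmi, branch)
    return scores
```

A candidate with high PMI is internally cohesive, and a candidate with high branch entropy appears in varied contexts on both sides. Filtering would then keep candidates whose frequency, PMI, and entropy exceed chosen thresholds, which the abstract does not specify.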

Keywords: new word discovery; pointwise mutual information; branch entropy; clustering

Classification number: TP393.08 [Automation and Computer Technology: Computer Application Technology]
