Authors: SHEN Zhao-yuan (申兆媛); CHAO Yi (巢翌)[1]; LI Xiao-long (李晓龙); ZHANG Wei (张伟) (Beijing Institute of Control and Electronic Technology, Beijing 100038, China)
Source: Computer Simulation (《计算机仿真》), 2022, No. 6, pp. 269-273, 335 (6 pages)
Abstract: Accurately identifying domain-specific new words in text is an important part of ensuring data security within enterprises and institutions. Based on the characteristics of a domain-specific corpus, this article proposes a new word discovery method tailored to a specific domain. First, the corpus is preprocessed. Second, Jieba is combined with the domain's word-formation strategy to segment the text, and an N-gram sliding window extracts candidate word strings. Third, candidates are filtered using pointwise mutual information, adjacency entropy (branch entropy), word frequency, and a normalized score. Next, the remaining new words are vectorized and reduced in dimensionality. Finally, K-means separates domain new words from commonly used new words, yielding the set of domain new words. The method addresses the poor fit of general-purpose new word discovery methods to specific domains, and comparative experiments on a corpus of about 100,000 lines from one domain verify its effectiveness.
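The filtering step named in the abstract (pointwise mutual information, adjacency entropy, and word frequency over N-gram candidates) can be illustrated with a minimal sketch. It is not the authors' implementation: it works on raw character N-grams rather than Jieba output, and the function names, the `max_n` and `min_freq` parameters, and the use of total character count as the probability denominator are all assumptions made for illustration. The normalized score, vectorization, and K-means steps are not shown.

```python
import math
from collections import Counter, defaultdict


def ngram_counts(lines, max_n):
    """Count every character n-gram of length 1..max_n in each line."""
    counts = Counter()
    for line in lines:
        for n in range(1, max_n + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return counts


def entropy(counter):
    """Shannon entropy (natural log) of a neighbour-frequency Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum(c / total * math.log(total / c) for c in counter.values())


def score_candidates(lines, max_n=4, min_freq=2):
    """Return {candidate: (frequency, min PMI, min adjacency entropy)}.

    Assumption: probabilities are approximated as count / total characters
    for every n-gram length, which is a common simplification.
    """
    counts = ngram_counts(lines, max_n)
    total_chars = sum(len(line) for line in lines)

    # Collect the characters seen immediately left and right of each n-gram.
    left, right = defaultdict(Counter), defaultdict(Counter)
    for line in lines:
        for n in range(2, max_n + 1):
            for i in range(len(line) - n + 1):
                gram = line[i:i + n]
                if i > 0:
                    left[gram][line[i - 1]] += 1
                if i + n < len(line):
                    right[gram][line[i + n]] += 1

    scores = {}
    for gram, freq in counts.items():
        if len(gram) < 2 or freq < min_freq:
            continue
        p_gram = freq / total_chars
        # Cohesion: the weakest PMI over all two-way splits of the candidate.
        pmi = min(
            math.log(
                p_gram
                / ((counts[gram[:k]] / total_chars)
                   * (counts[gram[k:]] / total_chars))
            )
            for k in range(1, len(gram))
        )
        # Freedom: the smaller of the left and right adjacency entropies.
        adj_entropy = min(entropy(left[gram]), entropy(right[gram]))
        scores[gram] = (freq, pmi, adj_entropy)
    return scores
```

A candidate with high PMI and high adjacency entropy on both sides is internally cohesive and appears in varied contexts, which is the usual signal that it is a word. Thresholds on the three values would have to be tuned per corpus; the source does not state the ones the authors used.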
Classification number: TP393.08 [Automation and Computer Technology: Computer Application Technology]