Text Topic Recognition of Offensive Comments Based on CoSENT and Improved K-means


Authors: CHEN Jian-fei (陈健飞); BU Fan-liang (卜凡亮)[1]; WANG Yi-fan (王一帆)[1]

Affiliation: [1] School of Information Network Security, People's Public Security University of China, Beijing 100038, China

Source: Science Technology and Engineering (《科学技术与工程》), 2024, No. 31, pp. 13442-13449 (8 pages)

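Funding: Double First-Class Special Project in Security and Prevention Engineering, People's Public Security University of China (2023SYL08).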

Abstract: This paper aims to quickly identify trending user topics in offensive comment texts. It addresses two problems: traditional topic models, when applied to comment texts, describe semantics insufficiently, lose contextual information, and yield weak topic coherence; and the K-means clustering algorithm is sensitive to the choice of K and of the initial centroids. The CoSENT (cosine sentence) model is used to obtain sentence-level vector features of comment texts containing offensive language. The resulting vector matrix is reduced in dimensionality with UMAP (uniform manifold approximation and projection) and then partitioned into clusters by an improved K-means algorithm based on Canopy+. c-TF-IDF (class term frequency-inverse document frequency) is then used to identify the topic features of each cluster for topic modeling. Comparative experiments on an offensive comment text dataset and an ordinary comment dataset verify the effectiveness of the method. The results show that the proposed method achieves better topic consistency.
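The clustering and topic-labeling stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a classic single-threshold Canopy pass (not the paper's Canopy+ variant) to choose both K and the initial centroids, a plain K-means loop, and the c-TF-IDF weighting in the form popularized by BERTopic, tf(t, c) * log(1 + A / f(t)). The function names, the threshold `t2`, and the toy 2-D points (standing in for CoSENT embeddings after UMAP reduction) are all hypothetical.

```python
import math
from collections import Counter

def canopy_centers(points, t2):
    """Classic Canopy pass: a point farther than t2 from every center starts a new canopy."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > t2 for c in centers):
            centers.append(p)
    return centers

def kmeans(points, centers, iters=100):
    """Plain K-means seeded with the given centers, so K = len(centers)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(
                tuple(sum(x) / len(members) for x in zip(*members)) if members else old
            )
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels, centers

def c_tf_idf(docs, labels):
    """Per-cluster term scores: tf(t, c) * log(1 + A / f(t)) (assumed BERTopic-style form)."""
    per_class = {}
    for tokens, label in zip(docs, labels):
        per_class.setdefault(label, Counter()).update(tokens)
    total = Counter()
    for counts in per_class.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(per_class)  # A: average words per cluster
    scores = {}
    for label, counts in per_class.items():
        n = sum(counts.values())
        scores[label] = {t: (f / n) * math.log(1 + avg_words / total[t])
                         for t, f in counts.items()}
    return scores

# Toy stand-ins: 2-D points play the role of UMAP-reduced sentence vectors,
# and each document is already tokenized.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
docs = [["refund", "scam"], ["scam", "seller"], ["refund", "seller", "scam"],
        ["referee", "blind"], ["referee", "match"], ["match", "blind", "referee"]]

seeds = canopy_centers(points, t2=2.0)
labels, centers = kmeans(points, seeds)
scores = c_tf_idf(docs, labels)
top_terms = {c: max(s, key=s.get) for c, s in scores.items()}
print(labels, top_terms)
```

Seeding K-means from canopy centers removes the need to guess K in advance and makes the run deterministic for a given point order, which is the sensitivity issue the abstract mentions. The threshold `t2` then becomes the parameter that needs tuning.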

Keywords: natural language processing; topic model; CoSENT; K-means

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
