Research on Fault Text Clustering Method of Signal Equipment Based on IBTM-TMW

Authors: YANG Ni[1]; ZHANG Youpeng[1]; ZUO Jing[1]; ZHAO Bin[1]

Affiliation: [1] School of Automation and Electrical Engineering, Lanzhou Jiaotong University, Lanzhou, Gansu 730070, China

Source: China Railway Science, 2024, No. 6, pp. 194-201 (8 pages)

Funding: National Natural Science Foundation of China (51967010, 52262045); Natural Science Foundation of Gansu Province (21JR7RA292).

Abstract: Fault text data for signal equipment are short, highly specialized, and difficult to reuse intelligently. To address these issues, a signal equipment fault text clustering method based on an improved Biterm Topic Model and word vector fusion (IBTM-TMW) is proposed. First, to reduce data noise and improve data quality, a custom dictionary and gerund-retention processing are introduced during data preprocessing. Second, the differential importance of words is introduced as a weighting factor in the Gibbs sampling of word pairs (biterms), and the Improved Biterm Topic Model (IBTM) is used to improve the learning of text topic features. The Term Frequency-Modified Inverse Document Frequency (TF-MIDF) weight is embedded into the generation of Word2vec word vectors, so that the textual importance of each word is fused with its Word2vec vector and the word-level feature representation of the text is refined. Finally, the text topic feature vector and the word feature vector are fused to strengthen the text feature representation, and the K-means++ algorithm is used for fault cluster analysis. The results show that, on the same experimental data set, the text feature vectors generated by IBTM-TMW are of significantly higher quality than those of the LDA and Label-LDA models, and the diagnostic accuracy, measured by the Correct Classification Rate (CCR), reaches 89.9%, higher than that of the K-means, GMM, AGNES, and BIRCH clustering models (78.3%, 68.1%, 87.9%, and 81.7%, respectively). The proposed method improves the ability to identify the correlation between fault text features and fault categories, providing a reference for text-data-driven fault diagnosis.
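The fusion-then-cluster pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain smoothed TF-IDF stands in for TF-MIDF, the per-document topic vectors and the 2-D word vectors are hand-written hypothetical values (in the paper they would come from IBTM and Word2vec), the fusion is assumed to be simple concatenation, and all function names are illustrative.

```python
import math
import random

def tfidf_weights(docs):
    """Plain smoothed TF-IDF per document (assumed stand-in for TF-MIDF)."""
    n = len(docs)
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return [{w: (doc.count(w) / len(doc)) * (math.log((1 + n) / (1 + df[w])) + 1.0)
             for w in set(doc)} for doc in docs]

def weighted_word_vector(weights, word_vecs, dim):
    """Weight-normalized average of the word vectors of one document."""
    total = sum(weights.values())
    vec = [0.0] * dim
    for w, wt in weights.items():
        for i, x in enumerate(word_vecs[w]):
            vec[i] += wt * x / total
    return vec

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: each new center is drawn with probability ~ D(x)^2."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # all points coincide with existing centers
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def kmeans(points, k, iters=50, seed=0):
    """Lloyd iterations started from K-means++ seeds; returns cluster labels."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append([sum(c) / len(members) for c in zip(*members)]
                               if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

# Hypothetical tokenized fault records (not from the paper's data set).
docs = [["track", "circuit", "red", "band"],
        ["track", "circuit", "poor", "shunting"],
        ["switch", "machine", "no", "indication"],
        ["switch", "machine", "jammed"]]
# Hypothetical 2-D word vectors and per-document topic distributions.
word_vecs = {"track": [1.0, 0.0], "circuit": [0.9, 0.1], "red": [0.8, 0.2],
             "band": [0.8, 0.1], "poor": [0.7, 0.2], "shunting": [0.9, 0.0],
             "switch": [0.0, 1.0], "machine": [0.1, 0.9], "no": [0.2, 0.7],
             "indication": [0.1, 0.8], "jammed": [0.0, 0.9]}
topic_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

weights = tfidf_weights(docs)
# Fusion assumed here to be concatenation of topic and weighted word features.
fused = [t + weighted_word_vector(w, word_vecs, 2) for t, w in zip(topic_vecs, weights)]
labels = kmeans(fused, k=2)
print(labels)
```

On this toy input the two track-circuit records and the two switch-machine records should fall into separate clusters; which cluster receives label 0 depends on the seeding. With real data, the word vectors and topic vectors would be learned rather than written by hand, and a library implementation such as scikit-learn's `KMeans(init="k-means++")` would be a natural replacement for the hand-rolled clustering loop.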

Keywords: fault diagnosis; topic model; word vector; weight; text clustering

Classification code: U268.6 [Mechanical Engineering: Vehicle Engineering]

 
