Research on Pre-training Language Model Construction and Text Semantic Analysis Based on Power Equipment Big Data

Cited by: 9

Authors: JIA Jun; YANG Qiang [2]; FU Hui; YANG Jinggang; HE Yude

Affiliations: [1] Research Institute, State Grid Jiangsu Electric Power Co., Ltd., Nanjing 211103, Jiangsu Province, China; [2] College of Electrical Engineering, Zhejiang University, Hangzhou 310027, Zhejiang Province, China; [3] State Grid Jiangsu Electric Power Co., Ltd., Nanjing 210000, Jiangsu Province, China; [4] Big Data Center of State Grid Corporation, Xicheng District, Beijing 100031, China

Source: Proceedings of the CSEE (《中国电机工程学报》), 2023, No. 3, pp. 1027-1036 (10 pages)

Abstract: In the operation and maintenance of power equipment, a large volume of unstructured text is available. Building a text semantic analysis model that mines this text to speed up and improve the accuracy of defect and fault diagnosis, and thereby to support grid operation and maintenance decisions, is a problem of considerable practical value. This paper proposes PowerBERT, a semantic analysis model for power equipment text based on very-large-scale pre-training. The model is built on the multi-head attention mechanism and a multi-layer embedded semantic representation structure, has more than 110 million parameters in total, and is designed to understand and analyze the information contained in power-domain text. Pre-training uses a power-domain corpus of more than 1.862 billion characters drawn from power standards, management regulations, and maintenance records, together with several masking mechanisms (character masking, entity masking, and span masking) and a dynamic loading strategy. For power equipment text analysis, the model is then trained and optimized on three downstream tasks: entity recognition in power text, information extraction, and defect diagnosis. Comparative experiments against conventional deep learning algorithms show that, with very few task-specific samples, the proposed method improves recall and precision by 20% to 30% on the validation and test sets.
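The three masking mechanisms named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the masking rates, the example sentence, and the entity boundaries are all assumptions made for illustration, and the abstract does not specify how the "dynamic loading strategy" works, so it is not modeled here.

```python
import random

MASK = "[MASK]"  # placeholder token, as in BERT-style pre-training


def char_mask(chars, rate, rng):
    """Character masking: mask each character independently with probability `rate`."""
    return [MASK if rng.random() < rate else c for c in chars]


def entity_mask(chars, entity_spans, rate, rng):
    """Entity masking: each (start, end) entity is masked as a whole unit or left intact."""
    out = list(chars)
    for start, end in entity_spans:
        if rng.random() < rate:
            out[start:end] = [MASK] * (end - start)
    return out


def span_mask(chars, span_len, rng):
    """Span masking: mask one contiguous fragment of `span_len` characters."""
    out = list(chars)
    span_len = min(span_len, len(out))
    if span_len <= 0:
        return out
    start = rng.randrange(len(out) - span_len + 1)
    out[start:start + span_len] = [MASK] * span_len
    return out


# Hypothetical defect-record fragment: "main transformer oil temperature abnormal".
text = list("主变压器油温异常")
# Assumed entity boundaries: "主变压器" (main transformer) and "油温" (oil temperature).
entities = [(0, 4), (4, 6)]

rng = random.Random(0)
print(char_mask(text, 0.15, rng))
print(entity_mask(text, entities, 0.5, rng))
print(span_mask(text, 3, rng))
```

Entity masking hides a full domain term at once, so the model has to predict the term from its context instead of completing it from its remaining characters. That is a plausible motivation for combining it with character-level masking on a specialist corpus.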

Keywords: deep learning; pre-trained language model; power equipment; natural language processing; semantic analysis; defect grading

Classification code: TM72 [Electrical Engineering: Power Systems and Automation]

 
