Research on feature extraction of unstructured large power texts


Authors: WANG Jiakai; HUANG Peizhuo; LI Yongle; SHENG Shuang; LIU Yang; ZHENG Ling [2]; WEI Zhenhua [2]

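Affiliations: [1] Big Data Center of State Grid Corporation of China, Beijing 100052, China; [2] North China Electric Power University, Beijing 100026, China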

Source: Zhejiang Electric Power, 2024, No. 6, pp. 117-124 (8 pages)

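Funding: National Natural Science Foundation of China (62373150); Science and Technology Project of the Big Data Center of State Grid Corporation of China (SGSJ0000YYJS2310054).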

Abstract: Large power texts contain many irregular expressions, such as abbreviations and aliases of technical terms. Existing word segmentation tools cannot effectively recognize specialized vocabulary in electrical engineering, which severely hinders the analysis and use of unstructured texts. First, based on the characteristics of unstructured texts in electrical engineering, this paper proposes a vocabulary indexing rule for the field. Word segmentation with an index set built from this rule effectively improves segmentation quality and provides a basis for feature extraction from power texts. Second, an effective long-text segmentation algorithm is used to preserve the semantic information of the original text, and the text features extracted by the BERT model are jointly embedded with the power vocabulary features extracted by Word2Vec, yielding accurate features for large unstructured power texts. Finally, experiments demonstrate the effectiveness of the proposed feature extraction method.
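The three steps named in the abstract (lexicon-driven word segmentation, long-text splitting, and joint embedding) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, since the abstract does not give the paper's indexing rules, splitting algorithm, or fusion operator: forward maximum matching stands in for segmentation with the index set, fixed-size overlapping windows stand in for the long-text segmentation algorithm, and concatenation of a text-level vector with the mean term vector stands in for joint embedding. The function names and the toy vectors are hypothetical; in practice the vectors would come from a BERT model and a Word2Vec model.

```python
from typing import List, Sequence, Set


def segment_with_lexicon(text: str, lexicon: Set[str]) -> List[str]:
    """Forward maximum matching: take the longest lexicon entry at each position.

    Assumption: this stands in for segmentation with the paper's index set.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def split_long_text(tokens: Sequence[str], max_len: int, overlap: int) -> List[List[str]]:
    """Cut a token sequence into overlapping windows of at most max_len tokens.

    Assumption: overlap is a simple way to keep context across chunk borders.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def joint_embed(text_vec: Sequence[float], term_vecs: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate a text-level vector with the mean of the term-level vectors.

    Assumption: concatenation is one possible fusion; the paper may differ.
    """
    if not term_vecs:
        raise ValueError("term_vecs must contain at least one vector")
    dim = len(term_vecs[0])
    mean = [sum(v[k] for v in term_vecs) / len(term_vecs) for k in range(dim)]
    return list(text_vec) + mean


if __name__ == "__main__":
    # Hypothetical two-entry lexicon: "main transformer" and "oil temperature".
    tokens = segment_with_lexicon("主变压器油温异常", {"主变压器", "油温"})
    print(tokens)
    print(split_long_text(tokens, max_len=3, overlap=1))
    # Toy vectors in place of real BERT and Word2Vec outputs.
    print(joint_embed([0.1, 0.2], [[1.0, 3.0], [3.0, 5.0]]))
```

Without the lexicon, a general-purpose segmenter may split a term such as 主变压器 into smaller pieces, which is the failure mode the abstract attributes to existing tools.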

Keywords: large power texts; feature extraction; BERT; text segmentation; joint embedding

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
