Research on a Word Segmentation Method for Fault Data in General Aviation Aircraft Maintenance Text


Authors: FU Yao-ming[1]; CHEN Yu-jie; HOU Kuan-xin[1]; JIANG Zheng (College of Aviation Engineering, Civil Aviation Flight University of China, Guanghan, Sichuan 618300, China)

Affiliation: [1] College of Aviation Engineering, Civil Aviation Flight University of China, Guanghan, Sichuan 618300, China

Source: Computer Simulation (《计算机仿真》), 2025, No. 1, pp. 30-35, 125 (7 pages in total)

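Funding: National Natural Science Foundation of China (52105132); Young Scientists Fund of the National Natural Science Foundation of China (2022-01 to 2024-12); Fundamental Research Funds for the Central Universities (J2022-029).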

Abstract: Chinese word segmentation is a fundamental task in processing maintenance text data. Domain-specific corpora tend to contain more out-of-vocabulary (unregistered) words than general-domain corpora. General aviation corpora, for example, contain many colloquial or manually coined structure names, component names, fault names, and tool names, and these out-of-vocabulary words are the main cause of low segmentation accuracy. To address this problem, the paper proposes a Chinese word segmentation model for the general aviation domain based on BERT-BiLSTM-CRF. First, the BERT (Bidirectional Encoder Representation from Transformers) pre-trained model is used to obtain semantic features of the input text. Second, a bidirectional long short-term memory (BiLSTM) network learns contextual feature information. Finally, a conditional random field (CRF) predicts the optimal label sequence, which improves segmentation accuracy. Maintenance text data collected from the general aviation domain are processed and annotated to build a domain corpus, on which comparative experiments are conducted. Compared with traditional models such as BiLSTM and BiLSTM-CRF, the proposed method achieves an F1 score of 96.93%, an improvement of 1.41% over BiLSTM-CRF. These results verify the effectiveness of the proposed method for segmenting maintenance text data in the general aviation domain.

Keywords: general aviation; maintenance data; Chinese word segmentation; deep learning

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
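The final stage described in the abstract, choosing the optimal label sequence and scoring the resulting segmentation with F1, can be illustrated with a small sketch. It is not the authors' implementation. The BMES tag set, the hard 0/negative-infinity transition constraints (a trained CRF would learn real-valued transition scores), the function names, and the emission scores a caller passes in are all assumptions made for illustration. In the paper's pipeline the per-character scores would come from the BERT and BiLSTM layers.

```python
from typing import Dict, List, Set, Tuple

# Assumed BMES tag set (Begin, Middle, End, Single); the abstract does not name one.
TAGS = ["B", "M", "E", "S"]

# Which tags may follow each tag. This hard constraint is a stand-in for the
# learned transition scores of a real CRF layer.
ALLOWED = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}
NEG_INF = float("-inf")


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring valid BMES sequence for per-character scores."""
    if not emissions:
        return []
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}
    back: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in TAGS:
            # Best previous tag among those allowed to precede `cur`.
            best_prev = max(
                TAGS, key=lambda p: score[p] if cur in ALLOWED[p] else NEG_INF
            )
            prev = score[best_prev] if cur in ALLOWED[best_prev] else NEG_INF
            new_score[cur] = prev + emit[cur]
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # A sentence can only end with E or S.
    last = max(("E", "S"), key=lambda t: score[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def tags_to_words(text: str, tags: List[str]) -> List[str]:
    """Cut the text after every E or S tag."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # keep any unfinished trailing word
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmentation to the set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def f1_score(gold: List[str], pred: List[str]) -> float:
    """Word-level F1: a predicted word counts only if its span matches exactly."""
    g, p = word_spans(gold), word_spans(pred)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(p), correct / len(g)
    return 2 * precision * recall / (precision + recall)
```

Decoding over the whole sequence, rather than taking the best tag per character, is what rules out invalid outputs such as two consecutive B tags. That matters for long compound terms like component or fault names, where every character must fall inside one well-formed word.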

 
