Authors: 王洁宁 (WANG Jie-ning)[1], 侯海洋 (HOU Hai-yang), 贾奇 (JIA Qi)
Affiliations: [1] College of Air Traffic Management, Civil Aviation University of China, Tianjin 300300, China; [2] East China Regional Air Traffic Management Bureau, Civil Aviation Administration of China, Shanghai 200335, China
Source: Journal of Safety and Environment (《安全与环境学报》), 2022, No. 2, pp. 826-835 (10 pages)
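Funding: National Key Research and Development Program of China (2016YFB0502401); Science and Technology Project of the East China Regional Air Traffic Management Bureau of CAAC (KJ1804).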
Abstract: Class imbalance in the free-text hazard reports of the air traffic management (ATM) system causes classifiers to overfit the majority-class samples. To address this, the SMOTE algorithm is combined with an improved cascade model to raise the accuracy of hazard text classification. The hazard text set is first segmented into words and stripped of stop words. The TF-IDF algorithm then extracts the text features and vectorizes them, and SMOTE randomly generates synthetic samples for the vectorized minority-class texts so that the class distribution of the text set becomes more balanced. The cascade model is then improved in two respects, the base classifiers and the weights of the output class vectors, to improve classification of imbalanced ATM hazard texts. To verify the applicability of the model, ATM hazard reports are used as the data source and the model's classification performance on hazard texts is tested experimentally. The results show that, compared with traditional classification methods, Borderline-SMOTE combined with the improved cascade model effectively improves classification of minority-class samples and thereby the overall classification accuracy of ATM hazard texts.
The imbalance among the free-text categories of hazard sources in the Air Traffic Management (ATM) system causes the classifier to overfit the majority-class samples. To solve this problem, the SMOTE algorithm and the Improved Cascade Model (ICM) are combined to improve the accuracy of hazard text classification. First, the TF-IDF algorithm is used to extract features from the preprocessed hazard reports and vectorize them. The SMOTE and Borderline-SMOTE algorithms are then used to randomly generate vectorized minority-class data so that the category distribution of the text set tends to be balanced. To show the processing results of the SMOTE and Borderline-SMOTE algorithms, the hazard category with the most samples and the category with the fewest samples are taken as examples. Then, considering the types of base classifiers and the weights of the output categories, the ICM is built with SVM, Random Forest, Logistic Regression, and Multinomial Bayesian models as base classifiers, and the category vector weights of each base classifier are adjusted according to the accuracy of that base classifier and the number of samples. Finally, to verify the applicability of the proposed method to hazard reports, ATM system hazard reports are used as training data to compare the precision, recall, and F1 values of the Improved Cascade Model and the Cascade Forest model.
The results show that the ICM improves the overall classification of ATM system hazard reports compared with the Cascade Forest. Compared with the ICM alone, the SMOTE+ICM algorithm slightly reduces the Precision, Recall, and F1-score of some majority-class hazard samples, but it improves the classification performance on minority-class hazard samples and thereby the overall classification performance on the hazard reports. Compared with SMOTE+ICM, the Borderline-SMOTE+ICM algorithm reduces the mean Precision over all hazard reports by 0.3%, but increases the average Recall and F1-score by 1.7% and 0.8%, respectively. Therefore, the Borderline-SMOTE+ICM algorithm is the optimal c
Keywords: safety social engineering; ATM system; hazard classification; imbalanced data; SMOTE; improved cascade model
Classification code: X949 [Environmental Science and Engineering: Safety Science]
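
The oversampling step described in the abstract can be sketched roughly as follows. This is a minimal, standard-library-only illustration of the basic SMOTE interpolation idea, not the authors' implementation: the function name `smote`, its parameters, and the toy vectors are hypothetical, and in the described pipeline the inputs would be TF-IDF vectors of minority-class hazard texts. As a simplification, the seed sample for each synthetic point is picked at random rather than by iterating over every minority sample, and the Borderline-SMOTE variant is not shown.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic vectors interpolated between minority samples.

    minority: list of equal-length numeric vectors (hypothetical toy input).
    k: number of nearest minority neighbours considered per seed sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a seed sample at random (assumption: simplified selection).
        i = rng.randrange(len(minority))
        x = minority[i]
        # Find the k nearest other minority samples by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(x, minority[j]),
        )[:k]
        nb = minority[rng.choice(neighbours)]
        # Place the new point somewhere on the segment between x and nb.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy stand-ins for TF-IDF vectors of one minority hazard category.
minority_vectors = [[0.0, 0.2, 0.9], [0.1, 0.0, 0.8], [0.0, 0.3, 0.7]]
new_vectors = smote(minority_vectors, n_new=4, k=2)
```

Because every synthetic vector lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies, which is what lets the class distribution be balanced without duplicating reports verbatim.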