Authors: YAN Jiqing (闫吉庆); SHEN Zhiyuan (沈志远); LÜ Jing (吕靖); LIU Jinshuo (刘金硕) [2]
Affiliations: [1] China Shenhua International Engineering Company, Beijing 100007, China; [2] School of Cyber Science and Engineering, Wuhan University, Wuhan, Hubei 430072, China
Source: Engineering Journal of Wuhan University (《武汉大学学报(工学版)》), 2022, No. 3, pp. 310-318 (9 pages)
Abstract: To address the problem that sparse data in bidding documents makes feature extraction difficult and lowers classification accuracy, a classification method based on eXtreme gradient boosting (XGBoost) and a focused text representation model is proposed. The focused representation part extracts the key fields that significantly affect the classification result, segments them with N-Gram tokenization, and applies part-of-speech-level term frequency-inverse document frequency (TF-IDF) weighting to produce feature vectors for the bidding document text. The XGBoost-based classification part feeds the extracted features into an XGBoost model to classify bidding documents by industry and by project type. The results show that the focused representation model extracts features more effectively than the count-vector and TF-IDF text representation models. Validation on a manually annotated corpus shows an accuracy of 95.3% for classification into 8 industries and about 96.6% for classification by project type. Compared with other classification algorithms, the XGBoost classification algorithm performs better.
Classification number: TP391 [Automation and Computer Technology - Computer Application Technology]
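
The feature-extraction steps named in the abstract (select key fields, split into N-grams, weight with TF-IDF) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function names `focus_text`, `char_ngrams`, and `tfidf_vectors` are hypothetical, the N-grams are taken over characters, and plain smoothed TF-IDF stands in for the part-of-speech-level variant, whose details the abstract does not give. The classification step is not shown; in the described method the resulting vectors would be passed to an XGBoost model.

```python
import math
from collections import Counter


def focus_text(record, key_fields):
    """Join only the selected fields of a record (hypothetical field names)."""
    return " ".join(record[f] for f in key_fields if f in record)


def char_ngrams(text, n_min=1, n_max=2):
    """Return all character n-grams of length n_min..n_max, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(chars) - n + 1):
            grams.append("".join(chars[i:i + n]))
    return grams


def tfidf_vectors(docs, n_min=1, n_max=2):
    """Build sparse TF-IDF vectors (dicts) over character n-grams.

    Uses a smoothed IDF: log((1 + N) / (1 + df)) + 1.
    This is plain TF-IDF, not the part-of-speech-level variant.
    """
    tokenized = [char_ngrams(d, n_min, n_max) for d in docs]

    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    n_docs = len(docs)
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return vocab, vectors


if __name__ == "__main__":
    # Toy records; the field names "title" and "body" are made up.
    records = [
        {"title": "coal mine equipment procurement", "body": "..."},
        {"title": "railway bridge construction", "body": "..."},
    ]
    docs = [focus_text(r, ["title"]) for r in records]
    vocab, vectors = tfidf_vectors(docs)
    print(len(vocab), "n-gram features")
```

Restricting the input to a few informative fields before vectorizing keeps the vocabulary small, which is one plausible way to reduce the sparsity the abstract mentions.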