Data Mining of Engineering Project Maintenance Text Based on BERT and DBSCAN


Author: HU Xuecong (胡学聪) (Shanghai Company of Shanghai Zhongjian Dongfu Investment Development Co., Ltd., Shanghai 150000, China)

Affiliation: [1] Shanghai Company of Shanghai Zhongjian Dongfu Investment Development Co., Ltd., Shanghai 150000, China

Source: Journal of Anhui University of Science and Technology: Natural Science (《安徽理工大学学报(自然科学版)》), 2023, No. 4, pp. 87-93 (7 pages)

Abstract: As population growth slows, the construction industry is shifting its goal from high turnover to high quality. During project delivery and maintenance, customers' maintenance requests reveal the pain points they care about and the deficiencies in project construction. Mining the value of maintenance text effectively can therefore strengthen targeted supervision during construction and help companies deliver products that satisfy customers. Because customers lack a professional engineering background, the maintenance complaints they report are mostly short texts filled with irrelevant information. Traditional methods rely on customer service staff to classify the data manually by cause of damage, which entails a heavy workload and low efficiency. In this study, term frequency-inverse document frequency (TF-IDF) and density-based spatial clustering of applications with noise (DBSCAN) were used to build a keyword-based coarse text classifier, which clusters the text into classified texts with clear labels and noise that cannot be effectively classified. The pre-trained language representation model BERT was then fine-tuned on the classified texts to build a fine text classifier that reclassifies the noise. The approach was validated on 720 unlabeled customer complaint texts collected during the delivery and daily use of a project in Shanghai. The results show that the coarse classifier effectively divided 44.03% of the texts into six categories, and the fine classifier effectively classified 83.75% of the texts.
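
The coarse-classification stage described in the abstract (TF-IDF features clustered with DBSCAN, with unclustered texts kept as noise for a later fine classifier) can be illustrated with a minimal sketch. This is not the author's implementation: the abstract does not state the tokenizer, distance metric, or DBSCAN parameters, so the cosine distance, `eps = 0.6`, `min_samples = 3`, and the toy English tokens (standing in for segmented Chinese complaint text) are all assumptions made for illustration. DBSCAN is written out in plain Python only to keep the example dependency-free.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """L2-normalized TF-IDF vectors (as dicts) for non-empty, pre-tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF: log((1 + n) / (1 + df)) + 1
        vec = {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity; inputs are already unit length."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps, min_samples):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(vectors)
    neighbors = [[j for j in range(n)
                  if cosine_distance(vectors[i], vectors[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


# Hypothetical, already-tokenized complaints (illustrative only).
docs = [
    ["wall", "crack", "bedroom"],
    ["wall", "crack", "ceiling"],
    ["wall", "crack", "balcony"],
    ["pipe", "leak", "kitchen"],
    ["pipe", "leak", "bathroom"],
    ["pipe", "leak", "toilet"],
    ["please", "call", "back"],
]

labels = dbscan(tfidf_vectors(docs), eps=0.6, min_samples=3)  # assumed values

# Clustered texts would become training data for the fine classifier;
# noise texts are the ones it would later have to reclassify.
train = [(doc, lab) for doc, lab in zip(docs, labels) if lab != -1]
noise = [doc for doc, lab in zip(docs, labels) if lab == -1]
coverage = len(train) / len(docs)
print(labels, f"coarse coverage = {coverage:.2%}")
```

In this toy run the two groups of complaints that share keywords form two clusters, and the complaint with no shared keywords is left as noise. In the workflow the abstract describes, the `train` pairs would be used to fine-tune BERT, and the resulting fine classifier would then assign categories to the `noise` texts.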

Keywords: Chinese short text; customer complaints; cluster analysis; maintenance analysis; pre-trained language representation model

Classification code: TP301 [Automation and Computer Technology: Computer System Architecture]

 
