Authors: Yang Bin; Wang Zhengyang [2]; Cheng Zihang; Zhao Huiying; Wang Xin [1]; Guan Yu [2,3]; Cheng Xinzhou
Affiliations: [1] China Unicom Research Institute, Beijing 100048; [2] School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876; [3] Yunnan Key Laboratory of Software Engineering (Yunnan University), Kunming 650504
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2024, No. 2, pp. 324-337 (14 pages)
Funding: Open Fund Project of the Yunnan Key Laboratory of Software Engineering (2023SE202).
Abstract: In data mining, data imbalance commonly degrades model prediction accuracy, and user privacy protection is often overlooked. Generating synthetic data is an important remedy for both problems. However, in scenarios dominated by structured data, the features are high-dimensional and uncorrelated, which makes generating high-quality data challenging. Given the success of diffusion models in tasks such as image generation, we take customer churn prediction as a typical application scenario and apply diffusion models to it. For the numerical and categorical features in churn data, we use a Gaussian diffusion model and a multinomial diffusion model, respectively, to generate data, and we study the resulting predictive performance and privacy protection capability. Extensive experiments on customer churn data from multiple domains explore the possibility of fusing generated data with real data for data reconstruction. The results show that the diffusion model can generate high-quality data and yields some improvement for a variety of prediction methods, which helps alleviate data imbalance. In addition, the data generated by the diffusion model follow a distribution closer to that of the real data, which gives them potential value for user privacy protection.
Keywords: customer churn; diffusion model; user privacy; data generation; categorical features
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
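
Note on the abstract: it pairs Gaussian diffusion with numerical features and multinomial diffusion with categorical features. The record gives no implementation details, so the following is only a minimal sketch of the standard forward (noising) step for each feature type, not the authors' code. The function names, the linear noise schedule, and its parameter values are all assumptions made for illustration.

```python
import numpy as np

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention coefficients for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def gaussian_forward(x0, t, alpha_bar, rng):
    """Noise numerical features: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def multinomial_forward(x0_onehot, t, alpha_bar, rng):
    """Noise one categorical feature given as one-hot rows.

    Each row is resampled from Cat(a_t * x_0 + (1 - a_t) / K), i.e. the
    original category is kept with high probability early on and drifts
    toward a uniform distribution over the K categories as t grows.
    """
    num_classes = x0_onehot.shape[-1]
    probs = alpha_bar[t] * x0_onehot + (1.0 - alpha_bar[t]) / num_classes
    # Inverse-CDF sampling, one draw per row.
    cdf = probs.cumsum(axis=-1)
    u = rng.random((probs.shape[0], 1))
    idx = np.minimum((u > cdf).sum(axis=-1), num_classes - 1)
    return np.eye(num_classes)[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = linear_alpha_bar(1000)
    numeric = rng.standard_normal((5, 3))        # 5 customers, 3 numerical features
    category = np.eye(4)[[0, 1, 2, 3, 0]]        # 5 customers, one 4-level feature
    print(gaussian_forward(numeric, 500, alpha_bar, rng).shape)
    print(multinomial_forward(category, 500, alpha_bar, rng).sum(axis=1))
```

A generative model of this kind would then be trained to reverse these two noising processes; that reverse step is not sketched here because the record does not describe it.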