Research on Crawler Data Sampling Method for Price Index Compilation


Authors: Lei Bing[1], Liang Kaikai, Liu Wei[1] (School of Management, Henan University of Technology, Zhengzhou 450000, China)

Affiliation: [1] School of Management, Henan University of Technology, Zhengzhou 450000

Source: Statistics & Decision (《统计与决策》), 2024, Issue 12, pp. 24-28 (5 pages)

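Funding: National Social Science Fund of China, General Program (18BGL268); Henan Province University Philosophy and Social Science Innovation Team Support Program (2019-CXTD-04).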

Abstract: To address the high cost of compiling price indices from full web-crawler data, this paper proposes a sampling method. Following a "big data, small data" approach, the method uses web crawlers in the base period to capture all commodity transaction data on an e-commerce platform, which forms the sampling frame. Subsequent continuous surveys rely on sampling: following the idea of stratified sampling, a clustering algorithm together with its silhouette coefficient is used to stratify the population, and representative samples are drawn from each stratum by unequal-probability random sampling. Because selected samples may fail to respond during continuous surveys, the paper introduces formal and backup samples: for each formal sample, nearest-neighbor matching selects several backup samples, and when a formal sample does not respond, a backup sample substitutes for it in compiling the price index. The method is validated on the grain and oil category of Tmall. The base-period full crawl contains 18,351 records; across the continuous surveys in periods 2 to 8, the average sampling ratio is 10.18% and the average relative sampling error is 0.59%, indicating that the method is feasible.
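The sampling and substitution steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the clustering features, the size measure for unequal-probability sampling, the matching distance, or the index formula. Here each record is assumed to already carry a `stratum` label from a prior clustering step, `sales` is assumed to be the size measure, matching uses Euclidean distance on log price and log sales, and an unweighted geometric mean of price relatives stands in for the index. All function and field names are hypothetical.

```python
import math
import random


def pps_sample(items, n, rng):
    """Draw up to n distinct items, each draw proportional to 'sales' (assumed size measure)."""
    pool = list(items)
    chosen = []
    for _ in range(min(n, len(pool))):
        weights = [it["sales"] for it in pool]
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick))  # remove so the same item is not drawn twice
    return chosen


def nearest_backups(formal, candidates, k):
    """Return the k candidates closest to the formal sample in (log price, log sales) space."""
    def dist(a, b):
        return math.hypot(
            math.log(a["price"]) - math.log(b["price"]),
            math.log(a["sales"]) - math.log(b["sales"]),
        )
    return sorted(candidates, key=lambda c: dist(formal, c))[:k]


def build_panel(frame, n_per_stratum, k_backups, seed=0):
    """Pick formal samples per stratum and attach backups from the same stratum."""
    rng = random.Random(seed)
    strata = {}
    for it in frame:
        strata.setdefault(it["stratum"], []).append(it)
    panel = []
    for label in sorted(strata):
        members = strata[label]
        formal = pps_sample(members, n_per_stratum, rng)
        formal_ids = {f["id"] for f in formal}
        rest = [m for m in members if m["id"] not in formal_ids]
        for f in formal:
            panel.append({"formal": f, "backups": nearest_backups(f, rest, k_backups)})
    return panel


def price_relative(entry, current_prices):
    """Current/base price ratio; falls back to the first responding backup."""
    for item in [entry["formal"]] + entry["backups"]:
        if item["id"] in current_prices:
            return current_prices[item["id"]] / item["price"]
    return None  # formal sample and all of its backups are non-responding


def geometric_mean_index(panel, current_prices):
    """Unweighted geometric mean of price relatives (an assumed stand-in index formula)."""
    rels = [price_relative(e, current_prices) for e in panel]
    rels = [r for r in rels if r is not None]
    if not rels:
        return None
    return math.exp(sum(math.log(r) for r in rels) / len(rels))
```

In use, `frame` would be the base-period crawl as a list of dicts with `id`, `stratum`, `price`, and `sales` (positive values), and `current_prices` a mapping from item id to the price observed in the current period; items missing from that mapping are treated as non-responding. In this sketch a backup may be shared by several formal samples within a stratum.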

Keywords: price index; crawler data; stratified sampling; clustering algorithm; sample matching

CLC number: C813 [Sociology: Statistics]

 
