基于网络爬虫技术的健康医疗大数据采集整理系统  (Cited by: 31)

A collecting and processing system for health care big data based on web crawler technology

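Authors: Bian Weiwei (卞伟玮), Wang Yongchao (王永超)[2,3], Cui Lizhen (崔立真)[2,4], Guo Wei (郭伟)[2,4], Li Hui (李晖)[2,4], Zhou Miao (周苗)[1,2], Xue Fuzhong (薛付忠)[1,2], Liu Jing (刘静)[1,2]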

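Affiliations: [1] Department of Biostatistics, School of Public Health, Shandong University, Jinan 250012, Shandong, China; [2] Qilu Biomedical Big Data Research Center, Shandong University, Jinan 250012, Shandong, China; [3] Kangping Health Care Big Data Technology Co., Ltd. (康评健康医疗大数据科技有限公司), Jinan 250101, Shandong, China; [4] School of Computer Science and Technology, Shandong University, Jinan 250101, Shandong, China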

Source: Journal of Shandong University: Health Sciences (《山东大学学报(医学版)》), 2017, No. 6, pp. 47-55 (9 pages)

Funding: National Natural Science Foundation of China (81273177)

Abstract: Objective To obtain medical data from public health service systems quickly and accurately, and to organize those data as a basis for building population health risk assessment models. Methods A focused web crawler was designed and implemented, with algorithmic improvements in three areas: automatically recording and correcting URL anomalies, archiving the original data, and maintaining the login state. The crawler was used to collect medical data from websites for which authorization had been obtained, and the data were then parsed, organized, and exported through a medical database system. Results Data were obtained from several public health service bases, data analysis reports were provided to local government departments, and multiple health risk assessment models were built from the processed data. Conclusion A data collection and processing system based on web crawler technology can address the difficulty of acquiring and organizing data that one is permitted to access on the web. Applied to the medical and health field, this technology allows the existing rich medical data resources to be fully used and improves the efficiency of their use.
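The abstract names three crawler improvements (recording and correcting URL anomalies, archiving original data, keeping the login state) but does not give the algorithm. The following is only a minimal sketch of how such a loop might look, not the authors' implementation. Every name here (`CrawlerSketch`, `fetch`, `login`, the `.raw` archive naming, treating HTTP 401 as an expired session) is a hypothetical choice made for illustration, and "correcting" an anomaly is assumed to mean retrying or logging in again.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# Hypothetical fetcher signature: (url, session_token) -> (status_code, raw_body).
# It is injected so the sketch runs without any network access.
Fetcher = Callable[[str, str], Tuple[int, bytes]]


@dataclass
class CrawlerSketch:
    fetch: Fetcher
    login: Callable[[], str]          # assumed to return a fresh session token
    archive_dir: Path                 # raw responses are stored here unmodified
    max_retries: int = 2
    anomalies: List[Dict[str, object]] = field(default_factory=list)
    token: str = ""

    def _record(self, url: str, attempt: int, error: str) -> None:
        # Improvement 1 (assumed form): keep a log of every URL anomaly.
        self.anomalies.append({"url": url, "attempt": attempt, "error": error})

    def _archive(self, url: str, body: bytes) -> Path:
        # Improvement 2 (assumed form): save the raw bytes before any parsing.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".raw"
        path = self.archive_dir / name
        path.write_bytes(body)
        return path

    def crawl_one(self, url: str) -> bool:
        """Return True if the page was archived, False if it stayed anomalous."""
        if not self.token:
            self.token = self.login()
        for attempt in range(1, self.max_retries + 2):
            try:
                status, body = self.fetch(url, self.token)
            except OSError as exc:
                self._record(url, attempt, repr(exc))
                continue  # "correct" by retrying
            if status == 401:
                # Improvement 3 (assumed form): the session expired, so log in again.
                self._record(url, attempt, "session expired")
                self.token = self.login()
                continue
            if status != 200:
                self._record(url, attempt, f"HTTP {status}")
                continue
            self._archive(url, body)
            return True
        return False

    def crawl(self, urls: List[str]) -> List[str]:
        """Crawl every URL and return the ones that could not be fetched."""
        return [u for u in urls if not self.crawl_one(u)]
```

In practice `fetch` would wrap a real HTTP client and `login` would authenticate against the authorized site. The URLs returned by `crawl`, together with the `anomalies` log, could then be reviewed or re-queued.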

Keywords: web crawler; database system; focused crawler; data collection; data parsing; data organization

Classification number: R319 [Medicine and Health: Basic Medicine]

 
