ETL method for unstructured yearbook Excel tables based on semantic tree  (Cited by: 3)


Authors: ZHAO Le[1], ZHAO Hongyu[1], LIU Bin[2], CHEN Yanru[3]

Affiliations: [1] School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China; [2] China Railway SiYuan Survey and Design Group Corporation Limited, Wuhan, Hubei 430063, China; [3] School of Economics and Management, Southwest Jiaotong University, Chengdu, Sichuan 610031, China

Source: Journal of Computer Applications, 2021, Issue S02, pp. 131-135 (5 pages)

Funding: National Key Research and Development Program of China (2018YFB1601402).

Abstract: Computer programs achieve low accuracy and recall when performing Extract-Transform-Load (ETL) on Chinese city yearbook data stored in massive numbers of Excel tables, because the tables are in an unstructured format. To address this problem, an ETL method based on semantic trees was proposed. First, two semantic tree models were built from the correspondence between the data and the row names and column names of the table, and the two trees were used to generate a metadata set containing each data item together with the row name and column name to which it belongs. Then, regular expressions were used to semantically match the row and column names of each metadata entry, and unneeded sub-item or summary-item metadata were deleted from the set. Next, the data were cleaned further through three dictionary-based filtering strategies, and the remaining metadata were loaded into the data warehouse. Finally, 604 tables were randomly sampled from a total of 300,000 statistical yearbook tables to compare program ETL with manual ETL. The experimental results show that the proposed program ETL reaches 86.51% accuracy and 95.15% recall relative to manual ETL, which can meet the needs of investigating the current state of local development and of drawing up and developing future plans.
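The pipeline described in the abstract (two header trees, a metadata set, a regular-expression filter, a dictionary filter) can be illustrated with a minimal Python sketch. It is not the authors' implementation, since the abstract gives no construction details. Everything here is an assumption made for illustration: the `Node` class, the `(depth, label)` encoding of header indentation, the `SUMMARY` pattern, the `INDICATORS` dictionary, and the sample table. Only one hypothetical dictionary filter is shown, whereas the paper uses three strategies that the abstract does not specify.

```python
import re


class Node:
    """One header label in a semantic tree; children are its sub-headers."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Labels from the top-level header down to this node (root excluded)."""
        labels, node = [], self
        while node.parent is not None:
            labels.append(node.name)
            node = node.parent
        return tuple(reversed(labels))


def build_tree(headers):
    """Build a tree from (depth, label) pairs given in table order.

    Assumption: depth 0 is a top-level header and depth never grows by more
    than 1 from one header to the next. Returns one node per header.
    """
    stack = [Node("<root>")]       # stack[d] is the current parent for depth d
    nodes = []
    for depth, label in headers:
        del stack[depth + 1:]      # climb back up to the parent level
        node = Node(label, parent=stack[depth])
        stack.append(node)
        nodes.append(node)
    return nodes


def extract_metadata(row_headers, col_headers, cells):
    """Pair every non-empty cell with its row path and column path."""
    rows, cols = build_tree(row_headers), build_tree(col_headers)
    return [
        {"value": cells[r][c], "row": row.path(), "col": col.path()}
        for r, row in enumerate(rows)
        for c, col in enumerate(cols)
        if cells[r][c] is not None
    ]


# Hypothetical pattern for summary labels; the paper's expressions are not given.
SUMMARY = re.compile(r"^(total|subtotal|合计|总计|小计)$", re.IGNORECASE)


def drop_summaries(records):
    """Remove records whose own row or column label is a summary label."""
    return [
        rec for rec in records
        if not SUMMARY.match(rec["row"][-1]) and not SUMMARY.match(rec["col"][-1])
    ]


def keep_known_indicators(records, indicators):
    """Keep records whose column path contains a label from the dictionary."""
    return [rec for rec in records if any(lbl in indicators for lbl in rec["col"])]


# Sample table (invented): sub-rows are indented under "Total",
# and "Male"/"Female" are sub-columns of "Population".
row_headers = [(0, "Total"), (1, "Urban districts"), (1, "Counties")]
col_headers = [(0, "Population"), (1, "Male"), (1, "Female"), (0, "Notes")]
cells = [
    [100, 52, 48, "est."],
    [60, 31, 29, None],
    [40, 21, 19, "est."],
]
INDICATORS = {"Population"}

records = extract_metadata(row_headers, col_headers, cells)
no_summaries = drop_summaries(records)
loaded = keep_known_indicators(no_summaries, INDICATORS)
```

In this sketch the filters test only the last label of each path, so a sub-item such as "Urban districts" survives even though its parent is "Total". Whether summary items or sub-items should be dropped for a given table is a choice the abstract leaves to the matching rules, so the direction used here is illustrative only.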

Keywords: Excel table; Extract-Transform-Load (ETL); semantic tree; regular expression; data warehouse

Classification number: TP391.13 [Automation and Computer Technology: Computer Application Technology]

 
