Chinese Text-to-SQL model for industrial production  Cited by: 2

Authors: LYU Jianqing; WANG Xianbing[1]; CHEN Gang[1]; ZHANG Hua[2]; WANG Minggang

Affiliations: [1] Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education (Wuhan University), Wuhan, Hubei 430072, China; [2] School of Computer Science, Wuhan University, Wuhan, Hubei 430072, China; [3] Zunyi Aluminum Industry Company Limited, Zunyi, Guizhou 563100, China

Source: Journal of Computer Applications (《计算机应用》), 2022, No. 10, pp. 2996-3002 (7 pages)

Funding: National Natural Science Foundation of China (51977155).

Abstract: When models for translating English natural language questions into Structured Query Language (SQL) statements (Text-to-SQL) are migrated to the Chinese industrial Text-to-SQL task, exact match accuracy drops. Industrial datasets are poorly interpretable and scattered, so table and column names in the database are often represented differently from the key information in the questions, and column names are often only implied by the semantics of a question. To address these migration problems, corresponding solutions were proposed and a modified model was constructed. First, factory metadata was incorporated during data use to resolve the inconsistent representations and the implicit column names. Then, based on the characteristics of Chinese expression, a self-attention model based on relative position was used to identify the values of the WHERE clause directly from the question and the database schema information. Finally, based on the characteristics of industrial query content, a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model was used to classify questions and thereby improve the accuracy of SQL statement structure prediction. An industrial dataset based on the aluminum smelting industry was constructed, and experiments were carried out on it. The results show that the proposed model achieves an exact match accuracy of 74.2% on the industrial test set. Comparison with the results of mainstream models from each stage on the English dataset Spider indicates that the proposed model can effectively handle the Chinese industrial Text-to-SQL task.
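The first step in the abstract, using factory metadata to bridge the gap between how a question names a quantity and how the database names the column, can be illustrated with a minimal sketch. This is not the authors' implementation: the table and column names, the alias dictionary, and the `[COLUMNS]` marker are all hypothetical, and simple substring matching stands in for whatever linking method the paper actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """Hypothetical metadata record for one database column."""
    table: str
    column: str
    aliases: tuple  # plant-floor phrases assumed to refer to this column


# Invented example entries; a real plant would maintain its own dictionary.
METADATA = (
    ColumnMeta("pot_status", "avg_voltage", ("average voltage", "mean voltage")),
    ColumnMeta("pot_status", "al_level", ("aluminum level", "metal height")),
    ColumnMeta("pot_status", "pot_id", ("pot number", "cell number")),
)


def link_columns(question, metadata=METADATA):
    """Return table.column names whose aliases occur in the question."""
    text = question.lower()
    hits = []
    for meta in metadata:
        if any(alias in text for alias in meta.aliases):
            hits.append(f"{meta.table}.{meta.column}")
    return hits


def enrich_question(question, metadata=METADATA):
    """Append linked column names so a downstream encoder sees them explicitly."""
    linked = link_columns(question, metadata)
    if not linked:
        return question  # nothing matched; leave the question unchanged
    return f"{question} [COLUMNS] {' '.join(linked)}"
```

For example, `enrich_question("What is the metal height of cell number 102?")` appends the two hypothetical columns `pot_status.al_level` and `pot_status.pot_id` to the question, so that a phrase like "metal height", which never appears in the schema, is tied to a concrete column before the question reaches the model.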

Keywords: Chinese Text-to-SQL task; industrial dataset; metadata; self-attention model; Bidirectional Encoder Representations from Transformers (BERT)

Classification number: TP391.2 [Automation and Computer Technology: Computer Application Technology]

 
