Chinese Text-to-SQL Model for Postgraduate Admissions Consultation

Authors: WANG Qingfeng; LI Xu; YAO Chunlong[1]; CHENG Tengteng (School of Information Science and Engineering, Dalian Polytechnic University, Dalian 116034, Liaoning, China; Innovation and Entrepreneurship Center, Dalian Polytechnic University, Dalian 116034, Liaoning, China)

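Affiliations: [1] School of Information Science and Engineering, Dalian Polytechnic University, Dalian 116034, Liaoning, China; [2] Engineering Training Center, Dalian Polytechnic University, Dalian 116034, Liaoning, China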

Source: Computer Engineering (《计算机工程》), 2025, No. 3, pp. 362-368 (7 pages)

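Funding: Liaoning Provincial Department of Education Young Scientific and Technological Talents "Seedling" Project (J2020113); Scientific Research Project of the Liaoning Provincial Department of Education (LJKZ0537); 2024 Special Fund for Basic Scientific Research of Undergraduate Universities in Liaoning Province (LJ212410152070).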

Abstract: Postgraduate admissions consultation is a representative question-and-answer (Q&A) scenario in which queries arrive at high frequency over a short period. Existing admissions Q&A systems based on word vectors and similar methods return answers that are not precise enough, and their question banks must be updated every year. To address these problems, this paper introduces the RESDSQL model, which is based on Text-to-Structured Query Language (Text-to-SQL) technology: it converts a natural-language question into an SQL statement, runs that statement against a structured database, and returns the answer. High-frequency consultation questions from postgraduate admissions scenarios are collected, question and SQL statement templates are built from the real admissions data of three universities, and a dataset is constructed by filling the templates, yielding 1,501 training examples and 386 test examples. Within RESDSQL, the RoBERTa model is replaced with XLM-RoBERTa, which has stronger multilingual capability, and the T5 model is replaced with mT5; the models are then fine-tuned on the target-domain dataset. The resulting model achieves high accuracy on admissions questions: with mT5-large, the execution accuracy is 0.95 and the exact match rate is 1. Compared with the C3SQL method, which uses the ChatGPT3.5 model with zero-shot prompting, the proposed model is better in both performance and cost.
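The template-filling construction and the two metrics named in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `admission` table, its columns and rows, and the two templates are hypothetical, since the abstract does not show the paper's schema or templates. The `exact_match` helper is a whitespace-normalized string comparison, a simplification of the component-wise exact match commonly used in Text-to-SQL benchmarks.

```python
import sqlite3

# Hypothetical schema and rows; the paper's real admissions database is not shown.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE admission (university TEXT, major TEXT, year INTEGER, quota INTEGER)"
)
conn.executemany(
    "INSERT INTO admission VALUES (?, ?, ?, ?)",
    [
        ("University A", "Computer Science", 2024, 40),
        ("University A", "Software Engineering", 2024, 25),
        ("University B", "Computer Science", 2024, 30),
    ],
)

# Hypothetical paired templates sharing the same slots.
QUESTION_TEMPLATE = "How many students will {university} admit to {major} in {year}?"
SQL_TEMPLATE = (
    "SELECT quota FROM admission "
    "WHERE university = '{university}' AND major = '{major}' AND year = {year}"
)


def fill_templates(rows):
    """Build (question, SQL) pairs by filling both templates with the same slot values."""
    pairs = []
    for university, major, year in rows:
        slots = {"university": university, "major": major, "year": year}
        pairs.append((QUESTION_TEMPLATE.format(**slots), SQL_TEMPLATE.format(**slots)))
    return pairs


def execution_match(conn, predicted_sql, gold_sql):
    """True when the predicted query runs and returns the same rows as the gold query."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted) == sorted(gold)


def exact_match(predicted_sql, gold_sql):
    """Simplified exact match: compare the SQL strings after collapsing whitespace."""
    return " ".join(predicted_sql.split()) == " ".join(gold_sql.split())


# Slot values are drawn from the database itself, so every generated query has an answer.
slot_rows = conn.execute(
    "SELECT DISTINCT university, major, year FROM admission ORDER BY university, major"
).fetchall()
dataset = fill_templates(slot_rows)
```

In this sketch a fine-tuned model would supply `predicted_sql` for each generated question, and averaging `execution_match` and `exact_match` over a held-out split would give the two scores.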

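Keywords: Chinese Text-to-Structured Query Language (Text-to-SQL); natural language query; Chinese SQL statement generation; pre-trained model; Text-to-SQL dataset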

Classification code: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]

 
