Authors: SUN Ze-Chen; XIAO Yi-Sheng; LI Jun-Tao; ZHANG Min [1]; ZHOU Guo-Dong [1] (School of Computer Science and Technology, Soochow University, Suzhou 215008, China)
Affiliation: [1] School of Computer Science and Technology, Soochow University, Suzhou 215008, Jiangsu, China
Source: Journal of Software (《软件学报》), 2025, No. 4, pp. 1604-1619 (16 pages)
Funding: National Natural Science Foundation of China (62206194); Natural Science Foundation of Jiangsu Province (BK20220488).
Abstract: Previous pre-trained language models (PLMs) have demonstrated excellent performance on numerous natural language understanding (NLU) tasks. However, they often suffer from shortcut learning: they learn spurious correlations between non-robust features and labels, which leads to poor generalization in test scenarios that differ from the training distribution (out-of-distribution, OOD). Recently, the outstanding performance of generative large language models (LLMs) on understanding tasks has attracted widespread attention, but the extent to which they are affected by shortcut learning has not been fully studied. This paper investigates, for the first time, the shortcut learning behavior of generative LLMs on multiple NLU tasks, using the LLaMA series and FLAN-T5 models as representatives. The results show that the shortcut learning problem persists in these recent generative LLMs. As a mitigation strategy, a hybrid data augmentation framework based on controllable explanations is proposed. The framework is data-centric: it constructs a small-scale mixed dataset from model-generated controllable explanation data and a portion of the original prompting data, and fine-tunes the model on this dataset. Extensive experiments on three representative NLU tasks show that training on datasets constructed with this framework effectively mitigates shortcut learning and improves the robustness and generalization of the model in OOD test scenarios, without sacrificing, and in some cases improving, in-distribution performance. The code is publicly available at https://github.com/Mint9996/HEDA.
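The abstract only sketches the mixed-dataset construction step. The following is a minimal, illustrative Python sketch of that data-centric idea, not the authors' implementation (see the HEDA repository above for the actual code). The task template, field layout, explanation wording, and the keep_ratio parameter are all assumptions made for illustration.

# Illustrative sketch of the hybrid data augmentation idea described in the
# abstract: mix (a) explanation-augmented examples, where a model-generated
# controllable explanation ties the label to robust evidence, with (b) a
# sampled portion of the original prompting data, then fine-tune on the mix.
# All names and the mixing ratio below are assumptions, not from the paper.
import random
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str   # task input rendered as an instruction-style prompt
    target: str   # expected output (label, optionally with an explanation)

def make_explanation_example(text: str, label: str, explanation: str) -> Example:
    """Wrap an instance so the target links the label to an explicit rationale
    rather than to shortcut features."""
    prompt = f"Classify the sentiment and explain your reasoning.\nText: {text}"
    target = f"Label: {label}\nExplanation: {explanation}"
    return Example(prompt, target)

def make_original_example(text: str, label: str) -> Example:
    """Keep a slice of the original prompting data unchanged."""
    return Example(f"Classify the sentiment.\nText: {text}", f"Label: {label}")

def build_mixed_dataset(originals, explained, keep_ratio=0.5, seed=0):
    """Combine explanation-augmented data with a sampled portion of the
    originals. keep_ratio (an assumed hyperparameter) controls how much of
    the original data is retained in the small-scale mixed set."""
    rng = random.Random(seed)
    kept = rng.sample(originals, k=int(len(originals) * keep_ratio))
    mixed = [make_explanation_example(*e) for e in explained] + \
            [make_original_example(*o) for o in kept]
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    originals = [("The plot was gripping.", "positive"),
                 ("A tedious, overlong film.", "negative")]
    explained = [("The plot was gripping.", "positive",
                  "The word 'gripping' directly praises the plot.")]
    for ex in build_mixed_dataset(originals, explained):
        print(ex.prompt, "->", ex.target.replace("\n", " | "))

Retaining part of the unmodified prompting data alongside the explanation-augmented examples is what keeps in-distribution performance from degrading while the explanations discourage reliance on spurious features.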
Keywords: shortcut learning; generative pre-trained language models; natural language understanding
Classification Code: TP18 [Automation and Computer Technology - Control Theory and Control Engineering]