Research of Imbalanced Classification for Technical Documents Based on Few-shot Data Augmentation

Original title: 基于小样本数据增强的科技文档不平衡分类研究

Cited by: 2


Authors: HUANG Jin-feng (黄金凤); GAO Yan (高岩)[1]; XU Tong (徐童)[1]; CHEN En-hong (陈恩红)[1] (School of Computer Science, University of Science and Technology of China, Hefei 230027, China)

Affiliation: [1] School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230027, China

Source: Frontiers of Science and Technology of Engineering Management (《工程管理科技前沿》), 2022, No. 3, pp. 23-30 (8 pages)

Funding: National Key Research and Development Program of China (2018YFB1402600).

Abstract: The rapid development of science and technology has produced a vast number of technical documents, and managing and retrieving them effectively depends on accurate automatic document classification. However, because disciplines are numerous and develop at different rates, the number of documents per category is severely imbalanced, which weakens the effectiveness of classification techniques. Although prior studies have shown that pre-trained language models perform well on text classification tasks, the strong domain specificity of technical documents makes it difficult for general-purpose pre-trained models to achieve good results. More importantly, the number of documents accumulated in different fields differs significantly, and the resulting imbalanced classification problem remains unresolved. To address these issues, this paper introduces and improves multiple data augmentation strategies to increase the data diversity and classification robustness of few-shot categories, and then uses multiple groups of experiments to examine the best combination of data augmentation strategies under different pre-trained models. The results show that the proposed framework effectively improves the accuracy of imbalanced classification of technical documents, laying a foundation for automatic classification and intelligent applications of technical documents.
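The abstract does not name the specific augmentation strategies used in the paper. Purely as an illustration of the general idea (augmenting few-shot categories until the class distribution is balanced), the sketch below uses two generic token-level operations, random swap and random deletion. These operations, the function names, and the assumption that documents are already tokenized are all hypothetical choices made for this example, not the authors' method.

```python
import random
from collections import Counter


def random_swap(tokens, rng):
    """Return a copy of tokens with two randomly chosen positions swapped."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, rng, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def balance_by_augmentation(samples, seed=0):
    """Add augmented copies of minority-class documents until every class
    has as many samples as the largest class.

    samples: list of (token_list, label) pairs (assumed pre-tokenized).
    Returns the original samples followed by the augmented ones.
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in samples)
    target = max(counts.values())

    # Group documents by label so augmentation draws from the right class.
    by_label = {}
    for tokens, label in samples:
        by_label.setdefault(label, []).append(tokens)

    balanced = list(samples)
    for label, docs in by_label.items():
        for _ in range(target - counts[label]):
            base = rng.choice(docs)
            op = rng.choice([random_swap, random_delete])
            balanced.append((op(base, rng), label))
    return balanced
```

In a setup like the one the abstract describes, the balanced sample list would then be used to fine-tune a pre-trained classifier; which operations work best, and in what combination, is exactly what the paper's experiments are said to examine.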

Keywords: text classification; pre-trained model; class imbalance; data augmentation

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
