Text data augmentation method based on semantic context awareness (基于语义上下文感知的文本数据增强方法研究)


Authors: ZHANG Jun (张军) [1]; KUANG Ze (况泽); LI Yubin (李钰彬) (School of Information Engineering, East China University of Technology, Nanchang 330013, China)

Affiliation: [1] School of Information Engineering, East China University of Technology, Nanchang, Jiangxi 330013, China

Source: Modern Electronics Technique (《现代电子技术》), 2024, No. 17, pp. 159-165 (7 pages)

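Funding: National Natural Science Foundation of China (62162002); National Natural Science Foundation of China (61662002); Natural Science Foundation of Jiangxi Province (20212BAB202002).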

Abstract: In text classification, the quality and quantity of training data strongly affect model performance, yet obtaining large-scale labeled data in real-world scenarios is often expensive and difficult. Data augmentation (DA), a low-cost way to address data scarcity, has achieved notable results in a variety of deep learning and machine learning tasks. Because text is discrete, augmenting it while preserving semantics is difficult. This paper therefore proposes a DA method based on semantic context awareness. First, a Gloss selection model that integrates the word sense definitions (Glosses) in WordNet 3.0 with the pre-trained model BERT identifies the actual sense of each target word (especially polysemous words) in its context. Next, following the next sentence prediction strategy, the actual sense of the target word and the sentence with the target word masked are combined into a sentence pair, and a masked language model (MLM) performs prediction sampling on the pair. Finally, semantic textual similarity is computed, and the method is validated on three benchmark classification datasets. The experimental results show that, while preserving semantics, the proposed method improves average accuracy on the three datasets compared with the selected baseline data augmentation methods, which demonstrates its effectiveness.
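The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `[CLS]`/`[SEP]`/`[MASK]` string layout, the gloss-first ordering of the pair, and the threshold value are all assumptions made for illustration. The Gloss selection model and the masked language model are not reproduced here; the gloss and the candidate replacement words are passed in as plain inputs. A toy bag-of-words cosine stands in for a real semantic textual similarity model, so the scores only show the shape of the filtering step.

```python
import math
import re
from collections import Counter
from typing import Callable, List

MASK = "[MASK]"  # assumed BERT-style mask token


def build_sentence_pair(gloss: str, sentence: str, target: str) -> str:
    """Mask the first occurrence of `target` and pair the sentence with its gloss.

    The gloss-first ordering is an assumption; the abstract only says the
    sense and the masked sentence are combined into a sentence pair.
    """
    pattern = rf"\b{re.escape(target)}\b"
    masked, count = re.subn(pattern, lambda m: MASK, sentence, count=1)
    if count == 0:
        raise ValueError(f"target word {target!r} not found in sentence")
    return f"[CLS] {gloss} [SEP] {masked} [SEP]"


def bow_cosine(a: str, b: str) -> float:
    """Toy stand-in for semantic similarity: cosine over word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def filter_candidates(
    sentence: str,
    target: str,
    candidates: List[str],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
    similarity: Callable[[str, str], float] = bow_cosine,
) -> List[str]:
    """Substitute each candidate for `target` and keep sufficiently similar results."""
    pattern = rf"\b{re.escape(target)}\b"
    kept = []
    for cand in candidates:
        if cand == target:
            continue  # an unchanged sentence is not an augmentation
        augmented = re.sub(pattern, lambda m, c=cand: c, sentence, count=1)
        if similarity(sentence, augmented) >= threshold:
            kept.append(augmented)
    return kept
```

In practice, the candidates would come from sampling the masked language model's predictions for the `[MASK]` position, and `similarity` would be replaced by a sentence-level semantic similarity model; both are left as pluggable inputs here.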

Keywords: artificial intelligence; natural language processing; text classification; data augmentation; Gloss; low-resource

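Classification codes: TN919-34 [Electronics and Telecommunications: Communication and Information Systems]; TP391 [Electronics and Telecommunications: Information and Communication Engineering]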

 
