Authors: ZHANG Wenbo [1]; HUANG Hao [1,2]; WU Di; TANG Minjie
Affiliations: [1] College of Computer Science and Technology, Xinjiang University, Urumqi 830046, Xinjiang, China; [2] Xinjiang Key Laboratory of Multi-language Information Technology, Urumqi 830046, Xinjiang, China
Source: Computer Engineering, 2024, Issue 12, pp. 396-406 (11 pages)
Funding: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0107902).
Abstract: Punctuation restoration, also known as punctuation prediction, is the classic Natural Language Processing (NLP) task of adding appropriate punctuation marks to unpunctuated text to improve its readability. With the development of pretrained models and deepening research on punctuation restoration, the performance of this task has continuously improved. However, Transformer-based pretrained models are limited in extracting local information from long input sequences, which hinders the prediction of the final punctuation marks. In addition, previous studies have treated punctuation labels simply as symbols to be predicted, overlooking the contextual attributes of different punctuation marks and the relationships between them. To address these issues, this study introduces a Moving average Equipped Gated Attention (MEGA) network as an auxiliary module to enhance the model's ability to extract local information. Moreover, a hierarchical prediction module is constructed to fully exploit the contextual attributes of different punctuation marks and their relationships for the final classification. Experiments are conducted with various Transformer-based pretrained models on datasets in different languages. The results on the English punctuation dataset IWSLT show that applying the MEGA and hierarchical prediction modules yields performance gains for most pretrained models; with DeBERTaV3 xlarge, the F1 score on the IWSLT REF test set reaches 85.5%, an improvement of 1.2 percentage points over the baseline. The model also achieves high accuracy on a Chinese punctuation dataset.
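To make the two ideas named in the abstract concrete, below is a minimal PyTorch sketch, not the authors' code: a gated moving-average layer standing in for the MEGA auxiliary module, and a two-stage head standing in for the hierarchical prediction module (stage 1: is a token followed by punctuation; stage 2: which mark). All class names, dimensions, the punctuation inventory, and the simple one-pole EMA are illustrative assumptions; the actual MEGA network uses a richer multidimensional damped EMA with gated attention.

```python
# Illustrative sketch only; module names and the simple per-channel EMA
# are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class SimpleEMAGate(nn.Module):
    """Damped exponential moving average over the sequence, gated
    against the raw encoder states (a toy stand-in for MEGA)."""
    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), 0.5))  # per-channel decay
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, dim) hidden states from a pretrained encoder
        alpha = torch.sigmoid(self.alpha)
        out, ema = [], torch.zeros_like(h[:, 0])
        for t in range(h.size(1)):  # recurrent EMA scan injects local context
            ema = alpha * h[:, t] + (1 - alpha) * ema
            out.append(ema)
        ema_seq = torch.stack(out, dim=1)
        g = torch.sigmoid(self.gate(torch.cat([h, ema_seq], dim=-1)))
        return g * ema_seq + (1 - g) * h  # gated mixture of local and global

class HierarchicalPunctHead(nn.Module):
    """Stage 1: does any punctuation mark follow this token?
       Stage 2: if so, which mark (e.g., comma / period / question mark)?"""
    def __init__(self, dim: int, n_marks: int = 3):
        super().__init__()
        self.has_punct = nn.Linear(dim, 2)
        self.which = nn.Linear(dim, n_marks)

    def forward(self, h: torch.Tensor):
        return self.has_punct(h), self.which(h)

# Toy usage: random features stand in for pretrained encoder output
# (e.g., hidden states of size 768 from a Transformer encoder).
if __name__ == "__main__":
    batch, seq_len, dim = 2, 16, 768
    h = torch.randn(batch, seq_len, dim)
    h = SimpleEMAGate(dim)(h)
    p_exist, p_mark = HierarchicalPunctHead(dim)(h)
    print(p_exist.shape, p_mark.shape)  # (2, 16, 2) (2, 16, 3)
```

In the paper's setup, as described in the abstract, the MEGA module sits alongside the pretrained encoder as an auxiliary branch and the hierarchical head replaces a flat token classifier; the factored decision mirrors how punctuation labels carry both a presence signal and a type signal.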
Keywords: punctuation restoration; natural language processing; pretrained model; Transformer architecture; hierarchical prediction
Classification: TP391 [Automation and Computer Technology - Computer Application Technology]