Authors: ZHANG Wenbo [1]; HUANG Hao [1,2]; WU Di; TANG Minjie
Affiliations: [1] College of Computer Science and Technology, Xinjiang University, Urumqi 830046, Xinjiang, China; [2] Xinjiang Key Laboratory of Multi-language Information Technology, Urumqi 830046, Xinjiang, China
Source: Computer Engineering, 2024, Issue 12, pp. 396-406 (11 pages)
Funding: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0107902).
Abstract: Punctuation restoration, also known as punctuation prediction, is the classic Natural Language Processing (NLP) task of adding appropriate punctuation marks to unpunctuated text to improve its readability. With the development of pretrained models and deepening research on punctuation restoration, the performance of this task has continuously improved. However, Transformer-based pretrained models are limited in extracting local information from long input sequences, which hinders the prediction of the final punctuation marks. In addition, previous studies have treated punctuation labels simply as symbols to be predicted, overlooking the contextual attributes of different punctuation marks and the relationships between them. To address these issues, this study introduces a Moving average Equipped Gated Attention (MEGA) network as an auxiliary module to enhance the model's ability to extract local information. Moreover, a hierarchical prediction module is constructed to fully exploit the contextual attributes of different punctuation marks and their relationships for the final classification. Experiments are conducted with various Transformer-based pretrained models on datasets in different languages. The results on the English punctuation dataset IWSLT show that applying the MEGA and hierarchical prediction modules yields performance gains for most pretrained models; with DeBERTaV3 xlarge, the F1 score on the IWSLT REF test set reaches 85.5%, an improvement of 1.2 percentage points over the baseline. The model also achieves high accuracy on a Chinese punctuation dataset.
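To make the two ideas named in the abstract concrete, below is a minimal PyTorch sketch, not the authors' code: a gated moving-average layer standing in for the MEGA auxiliary module, and a two-stage head standing in for the hierarchical prediction module (stage 1: is a token followed by punctuation; stage 2: which mark). All class names, dimensions, the punctuation inventory, and the simple one-pole EMA are illustrative assumptions; the actual MEGA network uses a richer multidimensional damped EMA with gated attention.

```python
# Illustrative sketch only; module names and the simple per-channel EMA
# are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class SimpleEMAGate(nn.Module):
    """Damped exponential moving average over the sequence, gated
    against the raw encoder states (a toy stand-in for MEGA)."""
    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), 0.5))  # per-channel decay
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, dim) hidden states from a pretrained encoder
        alpha = torch.sigmoid(self.alpha)
        out, ema = [], torch.zeros_like(h[:, 0])
        for t in range(h.size(1)):  # recurrent EMA scan injects local context
            ema = alpha * h[:, t] + (1 - alpha) * ema
            out.append(ema)
        ema_seq = torch.stack(out, dim=1)
        g = torch.sigmoid(self.gate(torch.cat([h, ema_seq], dim=-1)))
        return g * ema_seq + (1 - g) * h  # gated mixture of local and global

class HierarchicalPunctHead(nn.Module):
    """Stage 1: does any punctuation mark follow this token?
       Stage 2: if so, which mark (e.g., comma / period / question mark)?"""
    def __init__(self, dim: int, n_marks: int = 3):
        super().__init__()
        self.has_punct = nn.Linear(dim, 2)
        self.which = nn.Linear(dim, n_marks)

    def forward(self, h: torch.Tensor):
        return self.has_punct(h), self.which(h)

# Toy usage: random features stand in for pretrained encoder output
# (e.g., hidden states of size 768 from a Transformer encoder).
if __name__ == "__main__":
    batch, seq_len, dim = 2, 16, 768
    h = torch.randn(batch, seq_len, dim)
    h = SimpleEMAGate(dim)(h)
    p_exist, p_mark = HierarchicalPunctHead(dim)(h)
    print(p_exist.shape, p_mark.shape)  # (2, 16, 2) (2, 16, 3)
```

In the paper's setup, as described in the abstract, the MEGA module sits alongside the pretrained encoder as an auxiliary branch and the hierarchical head replaces a flat token classifier; the factored decision mirrors how punctuation labels carry both a presence signal and a type signal.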
Keywords: punctuation restoration; natural language processing; pretrained model; Transformer architecture; hierarchical prediction
Classification: TP391 [Automation and Computer Technology - Computer Application Technology]