Authors: Li Dongwen; Zhong Zhenyu; Sun Yufei; Shen Junyu; Ma Zizhi; Yu Chuanyue; Zhang Yuzhi (College of Software, Nankai University, Tianjin 300450; Haihe Laboratory of Information Technology Application Innovation, Tianjin 300450)
Affiliations: [1] College of Software, Nankai University, Tianjin 300450; [2] Haihe Laboratory of Information Technology Application Innovation (Advanced Computing and Critical Software), Tianjin 300450
Source: 《计算机研究与发展》 (Journal of Computer Research and Development), 2025, No. 3, pp. 682-693 (12 pages)
Abstract: In recent years, large-scale autoregressive Chinese pre-trained language models (PLMs) have demonstrated outstanding performance on various natural language processing (NLP) tasks. However, these models are computationally expensive, and their word-based vocabularies pose significant challenges for practical applications. In addition, most of them use only unidirectional context information, which may result in performance degradation on many tasks, especially tasks requiring a nuanced understanding of context. To address these challenges, we introduce LingLong, a high-quality small-scale Chinese pre-trained language model. LingLong stands out due to its modest scale, comprising only 317 million parameters, making it highly deployable and resource-efficient. We tokenize the training corpus with a character-based vocabulary to mitigate the negative impacts of unknown tokens and word segmentation errors. Moreover, we go beyond the conventional unidirectional context by introducing a novel backward model, trained by reversing the input order of the training data. Combining LingLong and its backward version allows for the use of bidirectional information on downstream tasks. Extensive experimental results validate the effectiveness of LingLong across a diverse set of NLP tasks: it outperforms similar-sized Chinese PLMs on six downstream tasks and surpasses popular large-scale Chinese PLMs on four downstream tasks. These findings underscore the versatility and efficiency of LingLong, opening up possibilities for practical applications and advancements in the Chinese NLP field.
Keywords: Chinese pre-trained language model; small-scale; character-based model; backward model; bidirectional information
Classification: TP183 [Automation and Computer Technology - Control Theory and Control Engineering]
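
The abstract above describes three implementation-level ideas: a character-based vocabulary, a backward model trained on input whose token order is reversed, and the combination of the forward and backward models to obtain bidirectional information on downstream tasks. The following minimal Python sketch shows how a reader might prototype those ideas; the function names, the averaging rule in bidirectional_score, and the toy log-probabilities are illustrative assumptions, not the paper's actual training or inference code.

    # -*- coding: utf-8 -*-
    # Illustrative sketch only (not the authors' released code): character-level
    # tokenization, reversed-order examples for the backward model, and one
    # assumed way to combine forward and backward model scores.

    def char_tokenize(text: str) -> list:
        """Character-based tokenization: every Chinese character is its own token,
        which sidesteps word-segmentation errors and most out-of-vocabulary tokens."""
        return [ch for ch in text if not ch.isspace()]

    def make_backward_example(tokens: list) -> list:
        """The backward model is trained on the same corpus with token order reversed,
        so it autoregressively predicts each token from its right-hand context."""
        return list(reversed(tokens))

    def bidirectional_score(forward_logprob: float, backward_logprob: float) -> float:
        """Assumed combination rule for downstream scoring: average the sentence
        log-probabilities produced by the forward and backward models."""
        return 0.5 * (forward_logprob + backward_logprob)

    if __name__ == "__main__":
        sample = "玲珑是一个小型中文预训练语言模型"
        forward_tokens = char_tokenize(sample)                    # ['玲', '珑', '是', ...]
        backward_tokens = make_backward_example(forward_tokens)   # ['型', '模', '言', ...]
        print(forward_tokens)
        print(backward_tokens)
        print(bidirectional_score(-12.3, -11.8))                  # -12.05 with toy numbers

In the paper's setting, the two log-probabilities would come from LingLong and its reversed-order counterpart; here they are placeholder values used only to show the combination step.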