LingLong (玲珑): A High-Quality Small-Scale Chinese Pre-trained Language Model

Authors: Li Dongwen (李东闻); Zhong Zhenyu (钟震宇); Sun Yufei (孙羽菲); Shen Junyu (申峻宇); Ma Zizhi (马子智); Yu Chuanyue (于川越); Zhang Yuzhi (张玉志)

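Affiliations: [1] College of Software, Nankai University, Tianjin 300450; [2] Haihe Laboratory of Information Technology Application Innovation, Tianjin 300450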

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2025, No. 3, pp. 682-693 (12 pages)

Abstract: In recent years, large-scale autoregressive Chinese pre-trained language models (PLMs) have demonstrated outstanding performance on various natural language processing (NLP) tasks. However, these models are computationally expensive, and their reliance on word-segmented Chinese data poses significant challenges for practical applications. In addition, most autoregressive models use only unidirectional (preceding) context, which may degrade performance on context-sensitive tasks. To address these challenges, we introduce LingLong, a high-quality small-scale Chinese pre-trained language model. With only 317 million parameters, LingLong is easy to deploy and apply. We tokenize the training corpus with a character-based vocabulary, which mitigates the negative impact of unknown tokens and word segmentation errors and improves performance on downstream tasks. Moreover, we train a backward version of LingLong by reversing the input order of each training example. Combining LingLong with its backward version allows bidirectional information to be used on downstream tasks. Experimental results across a diverse set of NLP tasks show that LingLong handles downstream tasks well: it outperforms similar-sized Chinese PLMs on six downstream tasks and surpasses popular large-scale Chinese PLMs on four (the Chinese abstract counts five datasets for the latter). These findings underscore the versatility and efficiency of LingLong for practical Chinese NLP applications.
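The two data-preparation ideas in the abstract, character-based tokenization and reversed input order for the backward model, can be illustrated with a minimal sketch. It is not the authors' implementation: the `<unk>` symbol, the function names, and the toy vocabulary built from a one-line corpus are all assumptions made for illustration, and the abstract does not say how the forward and backward models are combined, so that step is not shown.

```python
from typing import Dict, List

UNK = "<unk>"  # hypothetical unknown-token symbol, not taken from the paper


def char_tokenize(text: str) -> List[str]:
    """Split text into single characters (no word segmentation step)."""
    return list(text)


def build_vocab(corpus: List[str]) -> Dict[str, int]:
    """Assign an integer id to every character seen in the corpus."""
    vocab = {UNK: 0}
    for line in corpus:
        for ch in char_tokenize(line):
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(text: str, vocab: Dict[str, int], reverse: bool = False) -> List[int]:
    """Map text to ids; reverse=True yields the order a backward model would read."""
    tokens = char_tokenize(text)
    if reverse:
        tokens = tokens[::-1]
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


vocab = build_vocab(["玲珑模型"])
forward_ids = encode("玲珑", vocab)
backward_ids = encode("玲珑", vocab, reverse=True)
```

Because every character is its own token, only characters absent from the vocabulary map to `<unk>`, and no segmenter is involved that could split a word incorrectly.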

Keywords: Chinese pre-trained language model; small-scale; character-based model; backward model; bidirectional information

Classification code: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]

 
