Authors: Hu Haotian, Deng Sanhong [1,2], Zhang Yiqin, Zhang Qi, Kong Jia, Wang Dongbo
Affiliations: [1] School of Information Management, Nanjing University, Nanjing, Jiangsu 210023; [2] Jiangsu Key Laboratory of Data Engineering and Knowledge Service; [3] School of Information Management, Nanjing Agricultural University, Nanjing, Jiangsu 210095
Source: Library Journal (《图书馆杂志》), 2022, No. 8, pp. 76-83 (8 pages)
Funding: A research output of the National Social Science Fund of China Key Project "Research on Domain Knowledge Processing and Organization Models in the Big Data Environment" (Project No. 20ATQ006).
Abstract: Automatic word segmentation is a foundational and key step in digital humanities research on intangible cultural heritage (ICH), and a prerequisite for in-depth exploration of the information ICH texts contain. We constructed automatic word segmentation models for the application texts of national-level ICH projects. We examined the segmentation performance on ICH texts of the machine learning model CRF incorporating domain knowledge, the deep learning model Bi-LSTM-CRF, and the pre-trained language models BERT, RoBERTa, and ALBERT, and compared them with the general-purpose Chinese word segmentation tools HanLP, Jieba, and NLPIR. Of the 14 models evaluated, RoBERTa performed best, with an F-score of 97.28%; among the pre-trained models, ALBERT trained fastest under the same conditions. Using the segmentation model, we built an ICH domain vocabulary and a full-text segmented corpus, and analyzed the distribution of vocabulary in ICH texts. We also developed the Chinese Intangible Cultural Heritage Text Automatic Segmentation System (CITS), which provides a tool for automatic segmentation of ICH texts and multi-dimensional visual analysis of the segmentation results.
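The abstract reports an F-score for segmentation but does not state how it was computed. The following is a minimal sketch of one common way to score word segmentation, assuming word-level precision and recall over character spans; the function names and the example sentence are hypothetical and are not taken from the paper.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return (precision, recall, F1) for one segmented sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the prediction merges the last two gold words.
gold = ["昆曲", "是", "传统", "戏剧"]
pred = ["昆曲", "是", "传统戏剧"]
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

In this example two of the three predicted words match the four gold words, so precision is 2/3, recall is 1/2, and the F-score is 4/7. A corpus-level score would normally sum the span counts over all sentences before computing the ratios.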