Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2017, No. 5, pp. 62-70 (9 pages)
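Funding: National Social Science Fund of China Major Project "Research on the Construction of a Corpus for the Study of the History of Chinese" (Grant No. 10&ZD117); one of the research outcomes of the National Social Science Fund of China Major Project "Construction of a Classical Texts Knowledge Base Based on the Sinological Index Series (《汉学引得丛刊》) and Research on Humanities Computing" (Grant No. 15ZDB127); Ministry of Education Humanities and Social Sciences Youth Project "Construction and Quantitative Study of a Diachronic Chinese Lexical Database" (Grant No. 16YJC740034)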
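Abstract (translated from the Chinese): [Objective] To verify how segmentation consistency and corpus type affect the efficiency of CRFs-based word segmentation for Middle Ancient Chinese, and on that basis to further improve segmentation efficiency and reduce the workload of manual proofreading. [Methods] Taking historical records, Buddhist scriptures, and novels of the Middle Ancient period as examples, the study addresses automatic word segmentation of Middle Ancient Chinese by optimizing the segmentation principles and combining the CRFs model with a dictionary to eliminate the segmentation inconsistencies that tend to arise in manual segmentation. It also introduces two features into CRFs segmentation, character classification and dictionary information, and selects the most suitable segmentation template for each feature through comparison experiments. [Results] The overall F-score exceeded 99% in the closed test and reached 89%-95% in the combined open test. [Limitations] The study of segmentation inconsistency focuses mainly on two-character words, so recognition of words with three or more characters (multi-character words) is somewhat weaker. [Conclusions] Provided that segmentation consistency is effectively improved, the character classification and dictionary tagging features can effectively raise the precision of CRFs segmentation for Middle Ancient Chinese. The proposed segmentation system can also serve Chinese corpora of multiple types from the Middle Ancient period.

Abstract (published English version): [Objective] The purpose of this paper is to explore the influence of word segmentation consistency and corpus types in Middle Ancient Chinese (MAC). It tries to improve the accuracy and efficiency of automatic word segmentation, a basic procedure in processing ancient Chinese, based on the CRFs model. [Methods] First, we optimized the segmentation principles for MAC historical records, Buddhist scriptures and novels. Then, we combined the CRFs model with a dictionary to reduce the segmentation inconsistency in the manual procedures. Finally, we added two features to the CRFs model (i.e., character classification and dictionary information), and identified the best word segmentation template by comparison experiments. [Results] The F-score was higher than 99% in the closed test, while it ranged from 89% to 95% in the open test. [Limitations] The segmentation consistency was improved on words with two characters, and more studies are needed on the segmentation of words with three or more characters. [Conclusions] The proposed method could effectively improve the accuracy of automatic word segmentation for Middle Ancient Chinese corpora.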
Classification No.: TP391.1 [Automation and Computer Technology: Computer Application Technology]