Active Learning in Chinese Word Segmentation Based on Nearest Neighbor (cited by: 1)

Source: Computer Science (《计算机科学》), 2015, No. 6, pp. 228-232, 261 (6 pages)

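Funding: Supported by the National Natural Science Foundation of China (61302157); the Ministry of Education Humanities and Social Sciences Research Youth Fund (12YJC870008); the Jiangsu Provincial Department of Education Philosophy and Social Sciences Fund for Universities (2013SJB870004); and the Jiangsu Social Science Research Cultural Quality Project (12SWC-030).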

Abstract: Word segmentation is a key foundational technique in Chinese natural language processing. To address the shortage of training samples and the time and effort required to obtain large numbers of labeled samples, an active learning word segmentation method based on the nearest neighbor rule is proposed. The method adopts CRFs as its basic framework and uses the proposed sampling strategy to select the most valuable instances for annotation from a large pool of unlabeled samples. The annotated instances are then added to the training set, which is used to train the segmenter. The method was tested on the PKU, MSR, and Shanxi University datasets and compared with the traditional uncertainty-based sampling strategy. The experimental results show that the proposed nearest-neighbor active learning method selects more valuable samples, effectively reduces the cost of manual annotation, and also improves segmentation accuracy.
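The abstract does not spell out the selection rule, so the following is only a minimal sketch of one pool-based selection step under stated assumptions, not the authors' method. It assumes sentences are compared by character-bigram Jaccard distance and that the "most valuable" unlabeled sentences are those farthest from their nearest labeled neighbor. All function names are hypothetical, and the CRF training step is omitted.

```python
from typing import List, Set, Tuple


def char_bigrams(sentence: str) -> Set[str]:
    """Represent a sentence by its set of character bigrams (assumed feature choice)."""
    if len(sentence) < 2:
        return {sentence} if sentence else set()
    return {sentence[i:i + 2] for i in range(len(sentence) - 1)}


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_neighbor_distance(sentence: str, labeled: List[str]) -> float:
    """Distance from a sentence to its closest sentence in the labeled set."""
    if not labeled:
        return 1.0  # nothing labeled yet: every sentence is maximally novel
    feats = char_bigrams(sentence)
    return min(jaccard_distance(feats, char_bigrams(s)) for s in labeled)


def select_by_nearest_neighbor(pool: List[str], labeled: List[str], k: int) -> List[str]:
    """Pick the k pool sentences least covered by the labeled set (assumed rule)."""
    ranked = sorted(pool,
                    key=lambda s: nearest_neighbor_distance(s, labeled),
                    reverse=True)
    return ranked[:k]


def active_learning_step(labeled: List[str], pool: List[str],
                         k: int) -> Tuple[List[str], List[str]]:
    """Move the k selected sentences from the pool into the labeled set.

    In a real system a human would annotate the chosen sentences and the
    segmenter would be retrained on the enlarged set; both are omitted here.
    """
    chosen = select_by_nearest_neighbor(pool, labeled, k)
    new_pool = [s for s in pool if s not in chosen]
    return labeled + chosen, new_pool


if __name__ == "__main__":
    labeled = ["我爱北京"]
    pool = ["我爱北京天安门", "今天天气很好", "我爱北京"]
    print(select_by_nearest_neighbor(pool, labeled, 1))  # ['今天天气很好']
```

The uncertainty-sampling baseline mentioned in the abstract would instead rank pool sentences by the segmenter's confidence. This sketch needs no trained model, which is the only reason it uses a distance-based rule.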

Keywords: Chinese word segmentation; active learning; uncertainty sampling; nearest neighbor rule

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
