Source: Computer Applications and Software (《计算机应用与软件》), 2015, No. 3, pp. 37-40, 64 (5 pages)
Abstract: In research on information extraction based on semi-supervised learning and active learning, the selection of the initial sample set rarely takes into account how samples are distributed in the dataset. Taking the extraction of pest and disease entities in the forestry domain as an example, we propose a clustering-based method that obtains the distribution of samples in the dataset and uses it to guide the selection of both the initial sample set and the samples annotated during iteration. Experimental results show that, compared with a randomly chosen initial training set, the clustering-based method improves the model's F-value for every size of annotated sample set tested. Compared with a single active learning method, it saves about 30% of the manual annotation effort at similar performance.
Keywords: information extraction; text clustering; forestry pest and disease entities; active learning; semi-supervised learning
Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
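The idea of using clustering to choose an initial sample set, as summarized in the abstract, can be illustrated with a minimal sketch. This record does not state the clustering algorithm, the text features, or the exact selection rule, so everything below is an assumption made for illustration: plain k-means over toy 2-D vectors, with the sample nearest each cluster centroid taken as the one to annotate first. The function names (`kmeans`, `select_initial_samples`) and the data are hypothetical and are not the authors' implementation.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(points, k, iters=20):
    """Plain k-means (assumed here; the source does not name an algorithm)."""
    # Deterministic farthest-point initialisation, so runs are repeatable.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        assign = [min(range(k), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Recompute centroids; keep the old one if a cluster is empty.
        new = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            new.append(mean(members) if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    return centroids, assign

def select_initial_samples(points, k):
    """Return indices of the sample closest to each cluster centroid.

    Assumed selection rule: one representative per cluster, so the
    initial labelled set covers the regions found by clustering.
    """
    centroids, assign = kmeans(points, k)
    chosen = []
    for j, c in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == j]
        if members:
            chosen.append(min(members, key=lambda i: dist(points[i], c)))
    return sorted(chosen)

# Toy feature vectors standing in for vectorised text samples (made up).
points = [
    (0.0, 0.0), (0.2, 0.0), (0.1, 0.1),
    (5.0, 5.0), (5.2, 5.0), (5.1, 5.1),
    (9.0, 0.0), (9.2, 0.0), (9.1, 0.1),
]
initial = select_initial_samples(points, 3)
print(initial)
```

A random draw of three samples could easily take all three from one group. Picking one representative per cluster ensures each dense region contributes a labelled sample, which is the intuition behind letting the distribution guide the choice.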