Affiliations: [1] Nanjing Normal University, 210097; [2] Peking University, 102600
Source: Contemporary Linguistics (《当代语言学》), 2009, No. 2, pp. 136-146 (11 pages)
Funding: Supported by the National 973 Program (2004CB318102), the Jiangsu Provincial Social Science Fund (06JSBYY001), the National Natural Science Foundation of China (60773173), and the National Social Science Fund of China (07BYY050)
Abstract: Based on an examination and analysis of the tagging quality of existing lexical taggers, this paper proposes methods for the fine-grained refinement of corpora. Using these methods, the sample corpus covering the first half of 1998, purchased from the People's Daily office, was re-proofread; more than 50,000 segmentation and POS-tagging errors or inconsistencies were detected and corrected, improving the quality of the sample corpus. The disambiguation model proposed here, RFR_SUM (the sum of relative frequency ratios of context words), shows very good classification performance. Using the re-proofread sample corpus as training data, RFR_SUM models were trained to resolve more than 400 common types of ambiguity, and applying the resulting models to the fine-grained processing of very large corpora also yielded good results.

This paper first examines critically the existing automatic proofreading technologies used in processing Chinese texts. It draws a distinction between shallow tagging and deep tagging. Shallow tagging refers to the use of existing POS taggers to process texts without human correction of errors; deep tagging, on the other hand, refers to a method of automatic tagging that improves on shallow tagging. The proposed technology has been tested and found able to detect and correct more than 50,000 errors or inconsistencies in segmentation and POS tagging in the template corpora. The proposed disambiguation model RFR_SUM (sum of relative frequency ratios of words in context) shows excellent classification performance, detecting a large number of errors in the template corpora and improving the efficiency of corpus proofreading. The model also performs well in resolving more than 400 types of common ambiguity when trained on the proofread template corpora and applied to large-scale corpora.
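The abstract describes RFR_SUM only as "the sum of relative frequency ratios of context words", scored per candidate analysis of an ambiguous string. As a rough illustration of that idea only, the sketch below assumes a particular definition of the ratio (frequency of a context word given a label divided by its overall frequency) and ignores smoothing and window size, none of which are specified in this record; it is not the authors' implementation.

```python
from collections import Counter

def train_rfr(samples):
    """samples: list of (context_words, label) pairs for one ambiguity type,
    where label identifies the chosen segmentation/POS analysis."""
    wl = Counter()     # count of word w in contexts carrying a given label
    w_all = Counter()  # count of word w over all contexts
    tok_l = Counter()  # total context-word tokens per label
    for words, label in samples:
        for w in words:
            wl[(w, label)] += 1
            w_all[w] += 1
            tok_l[label] += 1
    return wl, w_all, tok_l

def rfr_sum(context_words, label, model):
    """Sum of assumed relative frequency ratios f(w|label)/f(w) over the context."""
    wl, w_all, tok_l = model
    n_all = sum(w_all.values())
    score = 0.0
    for w in context_words:
        if w_all[w] == 0 or tok_l[label] == 0:
            continue  # unseen word or label: contributes nothing in this sketch
        f_w_given_label = wl[(w, label)] / tok_l[label]
        f_w = w_all[w] / n_all
        score += f_w_given_label / f_w
    return score

def disambiguate(context_words, model):
    """Choose the analysis whose RFR_SUM over the context is largest."""
    _, _, tok_l = model
    return max(tok_l, key=lambda lab: rfr_sum(context_words, lab, model))
```

Under this reading, one such model would be trained separately for each of the 400-plus common ambiguity types mentioned in the abstract, using the re-proofread sample corpus as training data.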