Research on deep processing technologies for very large-scale corpora (超大规模语料库精加工技术研究)  Cited by: 4

Authors: Qu Weiguang (曲维光)[1], Tang Xuri (唐旭日)[1], Yu Jingsong (俞敬松)[2]

Affiliations: [1] Nanjing Normal University, 210097; [2] Peking University, 102600

Source: Contemporary Linguistics (《当代语言学》), 2009, No. 2, pp. 136-146 (11 pages)

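Funding: Supported by the National 973 Program (2004CB318102); the Jiangsu Provincial Social Science Fund (06JSBYY001); the National Natural Science Foundation of China (60773173); and the National Social Science Fund of China (07BYY050)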

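Abstract: Based on an examination and analysis of the annotation quality of existing lexical taggers, this paper proposes methods for the deep processing of corpora. Using these methods, the sample corpus for the first half of 1998, purchased from the People's Daily (《人民日报》), was proofread again; more than 50,000 errors or inconsistencies in segmentation and part-of-speech tagging were found and corrected, improving the quality of the sample corpus. The disambiguation model proposed in this paper, RFR_SUM, which is based on the sum of the relative frequency ratios of context words, classifies well. With the re-proofread sample corpus as training data, RFR_SUM models were trained to resolve more than 400 common ambiguity phenomena, and the resulting models were applied to the deep processing of a very large-scale corpus, also with good results.

English abstract: This paper first critically examines the existing automatic proofreading technologies used in processing Chinese characters. It distinguishes between shallow tagging and deep tagging. Shallow tagging refers to the use of existing POS taggers to process texts without human correction of errors. Deep tagging, on the other hand, refers to automatic tagging methods that improve on shallow tagging. The proposed technology has been tested and found able to detect and correct more than 50,000 errors or inconsistencies in segmentation and POS tagging in the template corpora. The proposed disambiguation model, RFR_SUM (sum of relative frequency ratios of words in context), shows excellent classification performance; it detects a large number of errors in the template corpora and improves the efficiency of corpus proofreading. The model also performs well in resolving more than 400 types of common ambiguities when trained on the proofread template corpora and applied to large-scale corpora.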

Keywords: deep corpus processing; RFR_SUM model; sample corpus; coarsely tagged corpus

Classification: H03 [Language and Script: Linguistics]
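
Illustrative sketch: the abstract names RFR_SUM only as "the sum of relative frequency ratios of words in context" and does not give a formula, so the Python below is one possible reading and not the authors' implementation. The assumptions are these: each ambiguous item has its own set of labelled training contexts; a context word's relative frequency under a label is its add-one-smoothed share of all context tokens seen with that label; the ratio compares that share with the word's share under all other labels; and the label with the largest sum of ratios wins. The function names `train`, `rfr_sum_scores`, and `disambiguate`, and the `smoothing` parameter, are hypothetical.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count context words per label for ONE ambiguous item.

    examples: iterable of (context_words, label) pairs (hypothetical format).
    """
    counts = defaultdict(Counter)
    for context, label in examples:
        counts[label].update(context)
    return counts


def rfr_sum_scores(counts, context, smoothing=1.0):
    """Score each label by summing relative frequency ratios over the context.

    Assumed definition (not taken from the paper):
      ratio(w, label) = rel_freq(w | label) / rel_freq(w | all other labels)
    """
    vocab_size = len({w for c in counts.values() for w in c}) or 1
    totals = {label: sum(c.values()) for label, c in counts.items()}
    grand_total = sum(totals.values())
    scores = {}
    for label, label_counts in counts.items():
        rest_total = grand_total - totals[label]
        score = 0.0
        for w in context:
            rest_count = sum(counts[o][w] for o in counts if o != label)
            # Add-one style smoothing keeps unseen words from dividing by zero.
            rf_label = (label_counts[w] + smoothing) / (
                totals[label] + smoothing * vocab_size
            )
            rf_rest = (rest_count + smoothing) / (
                rest_total + smoothing * vocab_size
            )
            score += rf_label / rf_rest
        scores[label] = score
    return scores


def disambiguate(counts, context):
    """Return the label with the highest summed ratio."""
    scores = rfr_sum_scores(counts, context)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy data: two candidate tags "X" and "Y" for a single ambiguous token.
    toy = [
        (["a", "b"], "X"),
        (["a", "c"], "X"),
        (["d", "e"], "Y"),
        (["d", "b"], "Y"),
    ]
    model = train(toy)
    print(disambiguate(model, ["a"]))
    print(disambiguate(model, ["d", "e"]))
```

In a setting like the one the abstract describes, one such model would presumably be trained per ambiguity type on the proofread sample corpus and then applied to the coarsely tagged text; the window size, smoothing, and any decision threshold are left open here because the record does not state them.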

 
