Resolution of Overlapping Ambiguity Strings Based on a Smoothed Maximum Entropy Model with Character Features

Cited by: 3


Authors: Ren Hui (任惠)[1], Lin Hongfei (林鸿飞)[1], Yang Zhihao (杨志豪)[1]

Affiliation: [1] School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2010, No. 4, pp. 18-24 (7 pages)

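Funding: National Natural Science Foundation of China (60673039; 60973068); National Social Science Foundation of China (08BTQ025); National High-Tech R&D Program of China (863 Program) (2006AA01Z151); Doctoral Program Foundation of the Ministry of Education (20090041110002)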

Abstract: Segmenting overlapping ambiguity strings (OAS) is one of the difficulties in automatic Chinese word segmentation. This paper casts OAS resolution as a classification task and solves it with a maximum entropy model that integrates rich character features. To overcome data sparseness in maximum entropy modeling, it introduces inequality smoothing and Gaussian smoothing. Gaussian smoothing, inequality smoothing, and frequency discounting are compared on the four datasets of the Second International Chinese Word Segmentation Bakeoff. The results show that inequality smoothing and Gaussian smoothing both improve significantly on frequency discounting and perform on par with each other, but inequality smoothing embeds feature selection seamlessly into parameter estimation and thereby compresses the model significantly. On the four test sets the method reaches disambiguation accuracies of 96.27%, 96.83%, 96.56%, and 96.52%. Comparative experiments show that the rich features improve disambiguation performance by 5.87%, 5.64%, 5.00%, and 5.00%, that smoothing improves it by 0.99%, 0.93%, 1.02%, and 1.37%, and that inequality smoothing compresses the classification models by 38.7, 19.9, 44.6, and 9.7, respectively.
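The classification view described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature templates in `char_features`, the toy samples, and the hyperparameters `sigma2`, `lr`, and `epochs` are all assumptions made for illustration. It treats a three-character OAS "ABC" as a binary decision between "AB/C" (label 1) and "A/BC" (label 0) and fits a binary maximum entropy (logistic) model with a Gaussian prior on the weights. Inequality smoothing is not shown.

```python
import math

def char_features(left, oas, right):
    """Hypothetical character features for a 3-character OAS 'ABC'
    with one character of context on each side."""
    a, b, c = oas
    return {
        f"A={a}", f"B={b}", f"C={c}",
        f"AB={a}{b}", f"BC={b}{c}",
        f"L={left}", f"R={right}",
        f"L+A={left}{a}", f"C+R={c}{right}",
    }

def predict_proba(weights, feats):
    """Probability of label 1 ('AB/C') under the binary maxent model."""
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(samples, sigma2=1.0, lr=0.5, epochs=200):
    """Gradient ascent on the mean log-likelihood minus the Gaussian-prior
    penalty sum(w^2) / (2 * sigma2 * n), where n is the sample count."""
    weights = {}
    n = len(samples)
    for _ in range(epochs):
        grad = {}
        for feats, label in samples:
            p = predict_proba(weights, feats)
            for f in feats:
                grad[f] = grad.get(f, 0.0) + (label - p)
        for f in set(grad) | set(weights):
            w = weights.get(f, 0.0)
            weights[f] = w + lr * (grad.get(f, 0.0) / n - w / (sigma2 * n))
    return weights

# Made-up toy samples for the OAS "从小学": label 1 = 从小/学, label 0 = 从/小学.
samples = [
    (char_features("他", "从小学", "钢"), 1),
    (char_features("我", "从小学", "画"), 1),
    (char_features("他", "从小学", "毕"), 0),
    (char_features("我", "从小学", "到"), 0),
]
weights = train_maxent(samples)
p_ab_c = predict_proba(weights, char_features("她", "从小学", "钢"))
p_a_bc = predict_proba(weights, char_features("她", "从小学", "毕"))
```

In this sketch a smaller `sigma2` pulls the weights more strongly toward zero, which is the sense in which the Gaussian prior smooths sparse features. The prior shrinks weights but does not by itself remove features, whereas the abstract credits inequality smoothing with the model compression.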

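Keywords: computer application; Chinese information processing; word segmentation; overlapping ambiguity; integration of rich character features; maximum entropy model; smoothing techniques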

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
