Collocation extraction with multiple hybrid strategies (多策略融合的搭配抽取方法)  Cited by: 9


Authors: WANG Daliang (王大亮)[1], TU Xuyan (涂序彦)[1], ZHENG Xuefeng (郑雪峰)[1], TONG Zijian (佟子健)

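Affiliations: [1] School of Information Engineering, University of Science and Technology Beijing, Beijing 100083, China; [2] Sohu R&D Center, Beijing 100084, China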

Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》), 2008, No. 4, pp. 608-612 (5 pages)

Funding: Supported by the National Natural Science Foundation of China (60675006)

Abstract: The statistical evaluation methods previously used for lexical collocation extraction perform roughly equally well overall, but each has its own strengths and weaknesses, so they can complement one another. This paper proposes a collocation extraction method that combines multiple strategies. First, mutual information is used to measure the independence of the two words in a candidate bigram and to discard unrelated bigrams. Second, a comparison of the χ² test with the t test shows that the χ² test more reasonably reflects the co-occurrence and expectedness of a collocation. Third, the log-likelihood ratio test is used to handle the sparse-data problem that the other methods cannot overcome. Finally, heuristic rules based on word formation are added, yielding a single multi-strategy method. Experimental results show that the method achieves relatively high accuracy and works well in practical applications.
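The four statistics named in the abstract can all be computed from the same bigram counts, which makes it easier to see how they might complement one another. The following is a minimal sketch using the standard textbook definitions, not the authors' implementation: the function name `association_scores`, its parameters, and the returned keys are assumptions made for illustration, and the abstract does not state the thresholds or the order in which the paper combines the scores.

```python
import math


def association_scores(c12, c1, c2, n):
    """Score a candidate bigram (w1, w2) with four standard statistics.

    Hypothetical helper for illustration only.
    c12: count of the bigram (w1, w2)
    c1:  count of bigrams whose first word is w1
    c2:  count of bigrams whose second word is w2
    n:   total number of bigrams in the corpus
    """
    # Observed 2x2 contingency table.
    o11 = c12                  # w1 followed by w2
    o12 = c1 - c12             # w1 followed by another word
    o21 = c2 - c12             # another word followed by w2
    o22 = n - c1 - c2 + c12    # neither w1 first nor w2 second
    observed = [[o11, o12], [o21, o22]]
    rows = [c1, n - c1]
    cols = [c2, n - c2]

    # Bigram count expected if w1 and w2 were independent.
    expected = c1 * c2 / n

    # Pointwise mutual information (base 2).
    pmi = math.log2(c12 / expected) if c12 > 0 and expected > 0 else float("-inf")

    # t-score, using the usual approximation that the sample variance
    # equals the sample mean for rare events.
    t = (c12 - expected) / math.sqrt(c12) if c12 > 0 else 0.0

    # Pearson chi-square statistic for a 2x2 table.
    denom = rows[0] * rows[1] * cols[0] * cols[1]
    chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom > 0 else 0.0

    # Log-likelihood ratio (G^2); cells with zero observations contribute 0.
    llr = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                llr += o * math.log(o * n / (rows[i] * cols[j]))
    llr *= 2.0

    return {"pmi": pmi, "t": t, "chi2": chi2, "llr": llr}
```

When the counts match independence exactly (for example `association_scores(10, 100, 100, 1000)`), all four scores are zero. A combined filter in the spirit of the abstract would then apply cut-offs to these scores, but any such cut-offs would have to be chosen empirically because the abstract does not give them.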

Keywords: information processing; collocation extraction; statistical evaluation; natural language processing

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
