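Affiliations: [1] School of Information Engineering, University of Science and Technology Beijing, Beijing 100083 [2] Sohu R&D Center, Beijing 100084
Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》), 2008, No. 4, pp. 608-612 (5 pages)
Funding: National Natural Science Foundation of China (60675006)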
Abstract: Earlier statistical evaluation methods for lexical collocation extraction achieve broadly similar results, yet each has its own strengths and weaknesses, so they can complement one another. This paper proposes a collocation extraction method that fuses multiple strategies. First, mutual information is used to measure the independence of word pairs and to discard unrelated candidate bigrams. Second, a comparison of the χ² test with the t-test shows that the χ² test more reasonably reflects the co-occurrence and expectedness of collocations. Third, the log-likelihood ratio test is used to address the sparse-data problem that the other methods cannot overcome. Finally, heuristic word-formation rules are added, yielding the multi-strategy fusion method. Experimental results show that the method achieves relatively high accuracy and works well in practical applications.
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
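
The four statistical measures named in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation: the record gives no formulas, thresholds, or fusion order details, so the code below uses the common textbook definitions of each measure for a candidate bigram, and the function name `association_scores`, its parameters, and the example counts are all assumptions made for illustration.

```python
import math


def association_scores(c12, c1, c2, n):
    """Score a candidate bigram (w1, w2) with four standard measures.

    Hypothetical helper, not taken from the paper.
    c12: count of the bigram "w1 w2"
    c1:  count of w1 in first position
    c2:  count of w2 in second position
    n:   total number of bigrams in the corpus
    """
    expected = c1 * c2 / n  # expected bigram count if w1 and w2 were independent

    # Pointwise mutual information (base 2): 0 means independence.
    pmi = math.log2(c12 / expected) if c12 > 0 else float("-inf")

    # t-score, using the usual approximation variance ~ sample mean.
    t_score = (c12 - expected) / math.sqrt(c12) if c12 > 0 else 0.0

    # 2x2 contingency table of observed counts.
    o11 = c12                 # w1 followed by w2
    o12 = c1 - c12            # w1 followed by something else
    o21 = c2 - c12            # something else followed by w2
    o22 = n - c1 - c2 + c12   # neither
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)

    # Pearson chi-square statistic, closed form for a 2x2 table.
    denom = rows[0] * rows[1] * cols[0] * cols[1]
    chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

    # Log-likelihood ratio (G^2): 2 * sum O * ln(O / E), with 0 * ln 0 = 0.
    llr = 0.0
    for i, row in enumerate(((o11, o12), (o21, o22))):
        for j, observed in enumerate(row):
            if observed > 0:
                llr += observed * math.log(observed * n / (rows[i] * cols[j]))
    llr *= 2.0

    return {"pmi": pmi, "t": t_score, "chi2": chi2, "llr": llr}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(association_scores(c12=30, c1=200, c2=150, n=100000))
```

In a fused pipeline of the kind the abstract describes, one plausible arrangement (again an assumption) is to filter candidates by `pmi` first and then rank the survivors by `chi2` or `llr`, with `llr` preferred when counts are small.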