Application of improved point-wise mutual information in term extraction  (Cited by: 19)


Authors: Du Liping (杜丽萍)[1], Li Xiaoge (李晓戈)[1], Zhou Yuanzhe (周元哲)[1], Shao Chunchang (邵春昌)

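Affiliations: [1] School of Computer Science, Xi'an University of Posts and Telecommunications, Xi'an 710121; [2] College of Science, Minzu University of China, Beijing 100081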

Source: Journal of Computer Applications (《计算机应用》), 2015, No. 4, pp. 996-1000, 1005 (6 pages)

Funding: National Natural Science Foundation of China (61373116); Graduate Innovation Fund of Xi'an University of Posts and Telecommunications (ZL2013-31)

Abstract: The traditional point-wise mutual information (PMI) method overestimates the association strength between two low-frequency strings that always occur together. This work aims to determine which values of the parameter k allow the improved PMI method (PMIk) to overcome that shortcoming, to solve the problem that some terms cannot be extracted from a segmented corpus because of segmentation errors, and to improve the portability of term extraction systems. To these ends, it proposes an algorithm that combines PMIk with two basic filtering rules to extract terms from an unsegmented corpus. First, PMIk is used to compute the association strength between two adjacent characters, which determines the 2-gram seeds to be expanded. Second, PMIk is used to compute the association strength between a 2-gram seed and the character to its left and the character to its right, which determines whether the 2-gram can be expanded to a 3-gram; iterating this expansion yields multi-gram candidate terms. Finally, the two basic filtering rules remove garbage strings from the candidate terms to give the final result. Theoretical analysis shows that PMIk overcomes the shortcoming of PMI when k≥3 (k∈N+). Experiments on a 1 GB Sina Finance blog corpus and a 300 MB Baidu Tieba corpus confirm the theoretical analysis: PMIk achieves higher precision than PMI, and the algorithm shows good portability.
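The abstract does not state the PMIk formula, so the sketch below assumes the commonly used definition log(p(x,y)^k / (p(x)·p(y))), with k = 1 giving plain PMI. Under that assumption, a pair that always co-occurs with probability p scores -log p under PMI, which grows as p shrinks, and (k-2)·log p under PMIk, which shrinks with p once k≥3. The function names, the base-2 logarithm, and the toy string are illustrative choices, not taken from the paper. The thresholds, the iterative expansion, and the two filtering rules are not specified in the abstract and are left out.

```python
import math
from collections import Counter


def pmi_k(p_xy, p_x, p_y, k=1):
    """Assumed definition: log2(p_xy**k / (p_x * p_y)); k=1 is plain PMI."""
    return k * math.log2(p_xy) - math.log2(p_x) - math.log2(p_y)


def bigram_scores(text, k=3):
    """Score every adjacent character pair of an unsegmented string.

    Probabilities are simple relative frequencies: single characters over
    len(text), adjacent pairs over len(text) - 1.
    """
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars, n_pairs = len(text), len(text) - 1
    return {
        pair: pmi_k(count / n_pairs,
                    chars[pair[0]] / n_chars,
                    chars[pair[1]] / n_chars,
                    k)
        for pair, count in pairs.items()
    }


if __name__ == "__main__":
    # Made-up toy string: "股票" occurs three times, "上涨" only once.
    text = "股票股票股票上涨"
    for k in (1, 3):
        scores = bigram_scores(text, k)
        ranked = sorted(scores, key=scores.get, reverse=True)
        print(f"k={k}: pairs ranked by score: {ranked}")
```

On this toy string, plain PMI (k = 1) scores the pair seen once above the pair seen three times, while k = 3 reverses that order, which is the behaviour the abstract attributes to k≥3.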

Keywords: term extraction; technical terms; knowledge acquisition; mutual information

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
