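An Experimental Study of Automatic Chinese Word Extraction Based on the Internal Associative Strength of Character Strings  Cited by: 32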

Chinese Word Extraction Based on the Internal Associative Strength of Character Strings


Authors: LUO Shengfen (罗盛芬) [1], SUN Maosong (孙茂松) [1]

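Affiliation: [1] State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084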

Source: Journal of Chinese Information Processing (《中文信息学报》), 2003, No. 3, pp. 9-14 (6 pages)

Funding: National 973 Program (G1998030507)

Abstract: Automatic word extraction is an important task in text information processing. The prevailing approach judges whether a candidate character string is a word by estimating the internal associative strength among its characters. This paper first evaluates nine commonly used statistical measures of this kind individually on Chinese word extraction, then tries combining them to improve performance, using a genetic algorithm to tune the combination weights automatically. Experiments on two-character word extraction show that mutual information is the strongest of the nine measures, reaching an F-measure of 54.77%, whereas the combination reaches 55.47%, only 0.70 percentage points higher, which is not a significant gain. The authors conclude that (1) these measures do not complement one another well, and (2) in general, using mutual information directly is the recommended choice for automatic word extraction, being both simple and effective.
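To make the approach in the abstract concrete, the following is a minimal Python sketch of scoring two-character candidates by mutual information and evaluating the result with the F-measure. It is an illustration under stated assumptions, not the authors' implementation: this record does not give the exact formulation, corpus, or thresholds used in the paper, so the pointwise mutual information estimate, the fixed threshold, and the function names `bigram_mutual_information`, `extract_words`, and `f_measure` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def bigram_mutual_information(text):
    """Score each adjacent character pair by pointwise mutual information.

    Assumption: probabilities are plain relative frequencies taken from
    the input text; the paper's own estimation details are not given here.
    """
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        # High values mean the two characters co-occur far more often
        # than they would if they were independent.
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


def extract_words(scores, threshold):
    """Keep candidates whose score reaches a (hypothetical) threshold."""
    return {pair for pair, score in scores.items() if score >= threshold}


def f_measure(extracted, gold):
    """Harmonic mean of precision and recall against a gold word set."""
    correct = len(extracted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(extracted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In use, one would compute the scores over a large corpus, pick the threshold on held-out data, and compare the extracted set with a reference lexicon via `f_measure`. A combined measure of the kind the abstract describes would replace the single score with a weighted sum of several such statistics, with the weights tuned by a search procedure such as a genetic algorithm.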

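Keywords: computer application; Chinese information processing; automatic word extraction; combination of statistical measures; genetic algorithm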

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
