Chinese Compound Word Extraction Algorithm Based on Word Co-occurrence Directed Graph (基于词共现有向图的中文合成词提取算法)  Cited by: 4


Authors: Liu Xinglin (刘兴林)[1,2], Zheng Qilun (郑启伦)[1], Ma Qianli (马千里)[1]

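Affiliations: [1] School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640; [2] School of Computer Science, Wuyi University, Jiangmen, Guangdong 529020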

Source: Computer Engineering (《计算机工程》), 2011, Issue 23, pp. 177-180 (4 pages)

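Funding: Natural Science Foundation of Guangdong Province (9451064101003233; S2011010003681); Guangdong Provincial Science and Technology Plan Project (2010B010600039); Fundamental Research Funds for the Central Universities, South China University of Technology (2009ZM0125; 2009ZM0189; 2009ZM0255)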

Abstract: Word segmentation systems cannot recognize compound words because these words are not included in their dictionaries. To address this problem, this paper proposes a Chinese compound word extraction algorithm based on a word co-occurrence directed graph. The algorithm obtains word strings from a text by part-of-speech detection and generates a word co-occurrence directed graph from those word strings. Borrowing the idea of the Bellman-Ford algorithm, it then searches the graph, from multiple source vertices, for the longest paths whose weight values satisfy given conditions. The word strings corresponding to these paths are taken as compound words. Experimental results show that the algorithm achieves a compound word extraction precision of 91.16%.

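Keywords: compound word extraction; part-of-speech detection; word co-occurrence directed graph; natural language processing; Bellman-Ford algorithm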

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
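
The pipeline outlined in the abstract (word strings, then a word co-occurrence directed graph, then a longest-path search under a weight condition) can be illustrated with a minimal Python sketch. The abstract does not give the edge weight definition, the weight condition, or the part-of-speech detection rules, so everything here is an assumption for illustration: the edge weight is a plain adjacency count, the condition is a hypothetical `min_weight` threshold, the input is already-segmented word strings, and a simple depth-first search stands in for the Bellman-Ford-style search used in the paper. The function names are hypothetical as well.

```python
from collections import defaultdict


def build_graph(word_strings):
    """Count how often word b directly follows word a across all word strings."""
    graph = defaultdict(lambda: defaultdict(int))
    for words in word_strings:
        for a, b in zip(words, words[1:]):
            graph[a][b] += 1
    return graph


def longest_path_from(graph, source, min_weight):
    """Longest simple path from source using only edges with weight >= min_weight.

    Assumption: a plain depth-first search, not the paper's actual search.
    """
    best = [source]
    stack = [[source]]
    while stack:
        path = stack.pop()
        if len(path) > len(best):
            best = path
        for nxt, weight in graph.get(path[-1], {}).items():
            if weight >= min_weight and nxt not in path:
                stack.append(path + [nxt])
    return best


def is_contained(short, long):
    """True if short is a strictly shorter contiguous slice of long."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def extract_candidates(word_strings, min_weight=2):
    """Return candidate compound words as concatenated longest paths."""
    graph = build_graph(word_strings)
    # Treat every vertex with outgoing edges as a source (multiple sources).
    paths = [longest_path_from(graph, v, min_weight) for v in list(graph)]
    paths = [p for p in paths if len(p) > 1]
    # Keep only paths that are not part of a longer reported path.
    kept = [p for p in paths if not any(is_contained(p, q) for q in paths)]
    return ["".join(p) for p in kept]


# Hypothetical segmented input, not data from the paper.
sample = [
    ["自然", "语言", "处理"],
    ["自然", "语言", "处理"],
    ["语言", "处理", "技术"],
]
print(extract_candidates(sample, min_weight=2))  # ['自然语言处理']
```

With `min_weight=2`, the edge from 处理 to 技术 (seen once) is rejected, so the search stops at 自然语言处理. Lowering the threshold to 1 lets the path extend to 自然语言处理技术, which shows how the weight condition controls where a candidate compound word ends.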

 
