Chinese Term Extraction Based on Improved C-value Method  Cited by: 23


Authors: 胡阿沛[1], 张静[1], 刘俊丽[1]

Affiliation: [1] Institute of Scientific and Technical Information of China, Beijing 100038

Source: 《现代图书情报技术》 (New Technology of Library and Information Service), 2013, No. 2, pp. 24-29 (6 pages)

Abstract: This paper proposes IC-value, an improved C-value method for term extraction. The domain-specific corpus is first preprocessed with a stop word list, and candidate terms are then extracted by an algorithm based on string frequency statistics (the co-occurrence frequency of multi-character strings). The candidates are filtered with linguistic rules. IC-value improves C-value in three respects: inverse document frequency, fragmented (meaningless) substrings, and term length. It is used to compute the termhood of each candidate term. An empirical study on 1,000 abstracts of articles about hepatitis B shows that IC-value outperforms C-value, TF-IDF, and V-value in both precision and recall, is good at discovering long terms, and is markedly effective at identifying fragmented substrings.
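For orientation, the baseline that the abstract builds on can be sketched as follows. This is a minimal illustration of the classic C-value score only, not the paper's IC-value: the record does not give the IC-value formula, so the inverse document frequency, fragmented-substring, and term-length adjustments are not modeled. The function name `c_value`, the use of character count as term length, and the toy frequencies are assumptions made for the example.

```python
import math

def c_value(freq):
    """Classic C-value for each candidate string.

    freq maps a candidate string to its corpus frequency.
    Assumption: term length is counted in characters, which suits
    unsegmented Chinese text; word counts are used for English.
    """
    scores = {}
    candidates = list(freq)
    for a in candidates:
        # Longer candidates that contain `a` as a substring.
        containers = [b for b in candidates if b != a and a in b]
        length_weight = math.log2(len(a))
        if not containers:
            # `a` is not nested in any longer candidate.
            scores[a] = length_weight * freq[a]
        else:
            # Discount the average frequency of the containing candidates.
            nested = sum(freq[b] for b in containers) / len(containers)
            scores[a] = length_weight * (freq[a] - nested)
    return scores

# Toy frequencies, invented for illustration only.
freq = {"乙型肝炎": 10, "乙型肝炎病毒": 6, "肝炎病毒": 7}
scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 2))
```

In this toy input the substring "肝炎病毒" occurs almost only inside "乙型肝炎病毒", so its score is discounted heavily, while the longer string is rewarded by the length weight.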

Keywords: term extraction; string frequency statistics; linguistic rules; termhood

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
