Extracting Terms from Spanish Corpora Based on DC-Value  Cited by: 1


Authors: YU Juan (于娟); YAN Yu-Ling (颜煜铃); JIAN Zi-Wei (简梓炜); ZHANG Chen (张晨) (School of Economics and Management, Fuzhou University, Fuzhou 350108, China)

Affiliation: [1] School of Economics and Management, Fuzhou University, Fuzhou 350108

Source: Computer Systems & Applications (《计算机系统应用》), 2021, No. 6, pp. 271-277 (7 pages)

Funding: National Natural Science Foundation of China (71771054).

Abstract: Spanish is the world's second-largest native language after Chinese and one of the six official languages of the United Nations. Its complex morphology and grammatical rules make classic term extraction methods such as C-value unreliable, which in turn degrades Spanish text mining. This study therefore proposes a term extraction method for Spanish texts that provides a complete lexicon for structured text modeling. Given a Spanish text or corpus, the method extracts a term set in three steps: text preprocessing, candidate term extraction, and DC-value term-hood calculation. The candidate term set produced by the first two steps can be used directly as a lexicon for text mining. The term-hood scores from the third step indicate how likely each candidate is to be a genuine term, which reduces the manual effort of validating candidates. Experiments show that the automatically extracted Spanish term set reaches an accuracy of 80% with a recall far higher than that of classic methods, providing an effective lexicon for Spanish text mining.

Keywords: Spanish text; text mining; term extraction; DC-value

CLC number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
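The abstract names a three-step pipeline (preprocessing, candidate extraction, term-hood scoring) but this record does not give the DC-value formula. The sketch below therefore illustrates only the classic C-value baseline that the abstract compares against, not the paper's DC-value. The function names (`candidate_ngrams`, `is_nested`, `c_value`), the regex tokenizer, and the use of all word n-grams as candidates are assumptions made for illustration; a real system would apply a Spanish-specific linguistic filter.

```python
import math
import re
from collections import Counter


def candidate_ngrams(text, max_len=3):
    """Simplified steps 1-2: lowercase, tokenize, and count word n-grams.

    Assumption: every n-gram up to max_len words is a candidate term.
    """
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of a longer `long`."""
    n = len(short)
    if n >= len(long):
        return False
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(counts):
    """Step 3 stand-in: classic C-value for multiword candidates.

    C-value(a) = log2|a| * f(a)                        if a is not nested
               = log2|a| * (f(a) - mean f(b), b in Ta) otherwise,
    where Ta is the set of longer candidates that contain a.
    """
    scores = {}
    for term, freq in counts.items():
        if len(term) < 2:
            continue  # log2(1) = 0, so single words always score zero
        containers = [f for other, f in counts.items() if is_nested(term, other)]
        if containers:
            freq -= sum(containers) / len(containers)
        scores[" ".join(term)] = math.log2(len(term)) * freq
    return scores


if __name__ == "__main__":
    sample = "red neuronal profunda red neuronal profunda red neuronal simple"
    ranked = sorted(c_value(candidate_ngrams(sample)).items(), key=lambda kv: -kv[1])
    for term, score in ranked:
        print(f"{score:6.3f}  {term}")
```

Ranking candidates by such a score is what lets a reviewer inspect the most term-like candidates first, which is the role the abstract assigns to the term-hood step.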

 
