Extracting Terms from Spanish Corpora Based on DC-Value

Cited by: 1

Authors: YU Juan (于娟); YAN Yu-Ling (颜煜铃); JIAN Zi-Wei (简梓炜); ZHANG Chen (张晨) (School of Economics and Management, Fuzhou University, Fuzhou 350108, China)

Affiliation: [1] School of Economics and Management, Fuzhou University, Fuzhou 350108, China

Source: Computer Systems & Applications (《计算机系统应用》), 2021, No. 6, pp. 271-277 (7 pages)

Funding: National Natural Science Foundation of China (71771054).

Abstract: Spanish is the world's second most widely spoken native language after Chinese and one of the six official languages of the United Nations. Its complex inflectional morphology and grammatical rules make the performance of classic term extraction methods such as C-value unreliable, which in turn degrades Spanish text mining. This paper therefore studies term extraction from Spanish text, with the aim of providing a complete lexicon for structured modeling of Spanish text. Given a Spanish text or corpus, the method extracts a term set in three steps: text preprocessing, candidate term extraction, and DC-value term-hood calculation. The candidate set produced by the first two steps can be used directly as a lexicon for text mining; the term-hood scores produced by the third step help judge how likely each candidate is to be a real term, reducing the manual review workload. Experiments show that the automatically extracted Spanish term set reaches an accuracy of 80%, with recall much higher than that of classic methods, so the method can provide an effective lexicon for Spanish text mining.

Keywords: Spanish text; text mining; term extraction; DC-value

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
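
The three-step pipeline outlined in the abstract (preprocessing, candidate extraction, term-hood scoring) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the DC-value formula, so step 3 uses the classic C-value score that the abstract names as the baseline. The stopword list, the function names (`preprocess`, `extract_candidates`, `c_value`), and the sample text are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list is far longer).
STOPWORDS = {"de", "la", "el", "los", "las", "y", "en", "un", "una", "del", "que", "es", "para"}


def preprocess(text):
    """Step 1: lowercase, tokenize, and cut into stopword-free word segments."""
    segments = []
    for chunk in re.split(r"[.,;:!?¡¿()\n]+", text.lower()):
        current = []
        for token in re.findall(r"[a-záéíóúüñ]+", chunk):
            if token in STOPWORDS:
                if current:
                    segments.append(current)
                current = []
            else:
                current.append(token)
        if current:
            segments.append(current)
    return segments


def extract_candidates(segments, max_len=4):
    """Step 2: count every word n-gram of length 2..max_len inside each segment."""
    counts = Counter()
    for seg in segments:
        for n in range(2, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[tuple(seg[i:i + n])] += 1
    return counts


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of the longer `long`."""
    n = len(short)
    return len(long) > n and any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(counts):
    """Step 3 stand-in: classic C-value, NOT the paper's DC-value.

    score(a) = log2(|a|) * (f(a) - mean frequency of longer candidates containing a)
    """
    scores = {}
    for cand, freq in counts.items():
        longer = [counts[other] for other in counts if is_nested(cand, other)]
        penalty = sum(longer) / len(longer) if longer else 0.0
        scores[" ".join(cand)] = math.log2(len(cand)) * (freq - penalty)
    return scores


sample = ("El aprendizaje automático mejora la red neuronal profunda. "
          "La red neuronal profunda usa aprendizaje automático.")
counts = extract_candidates(preprocess(sample))
scores = c_value(counts)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{score:.3f}  {term}")
```

Splitting on stopwords means terms that contain prepositions (for example "minería de textos") are broken apart in this sketch, and no lemmatization is applied; both are simplifications that a method designed for Spanish morphology would need to handle.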
