Authors: YU Juan; YAN Yu-Ling; JIAN Zi-Wei; ZHANG Chen (School of Economics and Management, Fuzhou University, Fuzhou 350108, China)
Source: Computer Systems & Applications, 2021, No. 6, pp. 271-277 (7 pages)
Funding: National Natural Science Foundation of China (71771054).
Abstract: Spanish is the world's second-largest native language after Chinese and one of the six official languages of the United Nations. Its complex morphological changes and grammatical rules mean that classic term extraction methods such as C-value cannot be relied on to perform well, which in turn degrades Spanish text mining. This study therefore proposes a Spanish term extraction method that builds a complete lexicon for the structured modeling of Spanish text. Given a Spanish text or corpus, the method extracts a set of terms in three steps: text preprocessing, candidate term extraction, and DC-value term-hood calculation. The set of candidate terms produced by the first two steps can be used directly as a lexicon for text mining. The term-hood scores produced by the third step help judge how likely each candidate is to be a real term, which reduces the manual workload of reviewing candidates. Experiments show that the term set extracted automatically by the proposed method reaches an accuracy of 80%, with a recall much higher than that of classic methods, so it provides an effective lexicon for Spanish text mining.
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
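
The abstract names C-value as the classic baseline but does not define the DC-value score, so the sketch below illustrates only a simplified form of the standard C-value formula, not the authors' method. A candidate's frequency is discounted by the average frequency of the longer candidates that contain it, then weighted by the log of its length in words. The function names `contains` and `c_value` and the toy frequencies are hypothetical and chosen for illustration.

```python
import math


def contains(longer, shorter):
    """Return True if the word tuple `shorter` occurs contiguously in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freq):
    """Simplified C-value for multi-word candidates.

    freq maps a candidate (tuple of words) to its corpus frequency.
    Assumption: longer candidates are counted with their plain frequency,
    without the incremental bookkeeping of the full algorithm.
    """
    scores = {}
    for a, f_a in freq.items():
        # Longer candidates in which `a` appears as a nested substring.
        nesting = [b for b in freq if len(b) > len(a) and contains(b, a)]
        if nesting:
            adjusted = f_a - sum(freq[b] for b in nesting) / len(nesting)
        else:
            adjusted = f_a
        # log2 of the length in words; single-word candidates score 0.
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Toy frequencies (made up for illustration, not taken from the paper).
toy_freq = {
    ("lengua", "materna"): 5,
    ("segunda", "lengua", "materna"): 3,
}
toy_scores = c_value(toy_freq)
```

With these toy counts, "lengua materna" is discounted by the three occurrences in which it is nested inside "segunda lengua materna". Because the length weight is zero for single words, this formula scores only multi-word candidates.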