Authors: HAO Ming (郝洺) [1], XU Bo (徐博) [2], YIN Xu-Cheng (殷绪成) [1], WANG Fang-Yuan (王方圆) [2]
Affiliations: [1] School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083; [2] Research Center of Digital Content Technology and Service, Institute of Automation, Chinese Academy of Sciences, Beijing 100190
Source: Acta Automatica Sinica (《自动化学报》), 2018, No. 3, pp. 453-460 (8 pages)
Abstract: Identifying the language of short text is an important prerequisite for natural language processing on social media, and it is also a challenging and active research topic. Because of out-of-vocabulary words and interference from words that are shared across languages, the accuracy of traditional n-gram based short-text language identification methods (e.g., Textcat, LIGA, logLIGA) varies widely from one dataset to another, and their robustness is poor. This paper presents an improved language identification method based on n-gram frequency. According to the characteristics of the training data, it automatically determines the weights of language-specific feature words and of words shared between languages, which makes the language identification model more robust across datasets. Experimental results demonstrate the effectiveness of the method.
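To make the kind of n-gram frequency baseline mentioned in the abstract concrete, here is a minimal sketch of a generic character-trigram language identifier. It is not the method proposed in the paper and does not reproduce Textcat, LIGA, or logLIGA; the function names (`char_ngrams`, `train`, `identify`), the smoothing floor, and the toy training sentences are all assumptions made for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build per-language relative n-gram frequencies from (language, text) pairs."""
    counts = {}
    for lang, text in samples:
        counts.setdefault(lang, Counter()).update(char_ngrams(text, n))
    models = {}
    for lang, counter in counts.items():
        total = sum(counter.values())
        models[lang] = {gram: c / total for gram, c in counter.items()}
    return models


def identify(text, models, n=3, floor=1e-6):
    """Return the language whose model gives the highest log-likelihood score.

    N-grams unseen in a language fall back to a small floor probability
    (an assumed, hand-picked constant, not a value from the paper).
    """
    grams = char_ngrams(text, n)

    def score(model):
        return sum(math.log(model.get(g, floor)) for g in grams)

    return max(models, key=lambda lang: score(models[lang]))


# Hypothetical toy training data, for illustration only.
TOY_SAMPLES = [
    ("en", "the weather is nice today and the sky is clear"),
    ("en", "this is a short message about the game"),
    ("de", "das wetter ist heute schön und der himmel ist klar"),
    ("de", "dies ist eine kurze nachricht über das spiel"),
]
MODELS = train(TOY_SAMPLES)
```

In a baseline like this, every n-gram contributes with equal weight, so n-grams shared by several languages and n-grams never seen in training can easily dominate the score on short inputs. That is the weakness the abstract attributes to traditional n-gram methods.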
Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]