Improve Language Identification Method by Means of n-gram Frequency

Cited by: 6


Authors: HAO Ming[1], XU Bo[2], YIN Xu-Cheng[1], WANG Fang-Yuan[2]

Affiliations: [1] School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083; [2] Research Center of Digital Content Technology and Service, Institute of Automation, Chinese Academy of Sciences, Beijing 100190

Source: Acta Automatica Sinica (《自动化学报》), 2018, No. 3, pp. 453-460 (8 pages)

Abstract: Identifying the language of short texts is an important prerequisite for natural language processing in social media, and it is a challenging and actively studied problem. Because of out-of-vocabulary words and interference from words shared by different languages, traditional n-gram-based methods for short-text language identification (e.g., Textcat, LIGA, logLIGA) perform very differently from one dataset to another and are not robust. This paper proposes an improved language identification method based on n-gram frequency. According to the characteristics of the training data, it automatically determines the weights of language-specific feature words and of words common to several languages, which makes the language identification model more robust across datasets. Experimental results demonstrate the effectiveness of the method.

Keywords: language identification; short text; n-gram frequency; robustness

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
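
The abstract describes the approach only at a high level, so the following is a minimal sketch of generic n-gram frequency language identification, not the authors' method. It assumes character trigrams, and the two weights `w_distinct` and `w_shared` are hand-picked placeholders; the paper determines its weights automatically from the training data, by a procedure this record does not give. All function names and the toy training sentences are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of `text`, padded with one space per side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build relative n-gram frequencies for each language.

    `samples` maps a language code to a list of training sentences.
    """
    model = {}
    for lang, sentences in samples.items():
        counts = Counter(g for s in sentences for g in char_ngrams(s, n))
        total = sum(counts.values())
        model[lang] = {g: c / total for g, c in counts.items()}
    return model


def identify(text, model, n=3, w_distinct=2.0, w_shared=1.0):
    """Score every language and return (best_language, scores).

    Assumption: n-grams seen in exactly one language get `w_distinct`,
    n-grams seen in several languages get `w_shared`. Both values are
    illustrative constants, not learned weights.
    """
    scores = dict.fromkeys(model, 0.0)
    for gram in char_ngrams(text, n):
        owners = [lang for lang in model if gram in model[lang]]
        if not owners:
            continue  # out-of-vocabulary n-gram: contributes to no language
        weight = w_distinct if len(owners) == 1 else w_shared
        for lang in owners:
            scores[lang] += weight * model[lang][gram]
    return max(scores, key=scores.get), scores


# Toy training data, far too small for real use.
samples = {
    "en": ["the cat sat on the mat", "this is a short text"],
    "de": ["die katze sitzt auf der matte", "das ist ein kurzer text"],
}
model = train(samples)
best, scores = identify("the cat is short", model)
print(best, scores)
```

Down-weighting shared n-grams relative to distinctive ones is one simple way to reduce the cross-language interference the abstract mentions. Out-of-vocabulary n-grams are skipped here, which is only one of several possible treatments.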

 
