Research on Uyghur Stop Words Extraction Method

Authors: SAIMAITI Maimaitimin; ESMAEL Abdurehim (Chinese Languages School, Xinjiang University, Urumqi 830046, China; Xinjiang Research Center for Chinese-Ethnic Languages Translation, Urumqi 830046, China)

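Affiliations: [1] Chinese Languages School, Xinjiang University, Urumqi 830046 [2] Xinjiang Research Center for Chinese-Ethnic Languages Translation, Urumqi 830046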

Source: Computer Engineering (《计算机工程》), 2019, No. 10, pp. 288-292, 300 (6 pages)

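Funding: National Social Science Foundation of China (17XYY034); Ministry of Education Humanities and Social Sciences Research Youth Project (16XJJC740001)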

Abstract: To improve information-processing efficiency, text information retrieval systems usually filter out stop words as noise, and this filtering affects the quality of text processing. To address this, a stop word extraction method for Uyghur is proposed. Based on an analysis of the characteristics of Uyghur stop words, statistics are gathered over a large corpus using Document Frequency (DF), Term Frequency (TF), and Entropy (EN), and the part-of-speech distribution of the candidate stop words is analyzed. The stop word threshold is determined through text classification experiments. The experimental results show that, after stop words are filtered with the proposed method, the computational complexity of text classification is reduced and the classification precision reaches 80.8%.

Keywords: information retrieval; stop words; Uyghur; text classification; corpus statistics

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
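The abstract names three corpus statistics (DF, TF, and entropy) but does not give their formulas. As an illustration only, the sketch below computes common textbook versions of the three measures over a tokenized corpus and ranks terms as stop word candidates. The function names `term_statistics` and `candidate_stop_words`, the `top_k` cutoff, and the ranking rule (entropy first, then DF, then TF) are assumptions made for this example and are not taken from the paper, which selects its threshold through text classification experiments. Tokenization of Uyghur text is assumed to have been done already.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (df, tf, entropy)} for a list of tokenized documents.

    df      -- fraction of documents that contain the term
    tf      -- term count divided by the total token count of the corpus
    entropy -- Shannon entropy (base 2) of the term's distribution across
               documents; higher means the term is spread more evenly
    These are standard definitions, assumed here for illustration.
    """
    per_doc = [Counter(doc) for doc in docs]
    n_docs = len(docs)
    total_tokens = sum(len(doc) for doc in docs)

    corpus_counts = Counter()
    for counts in per_doc:
        corpus_counts.update(counts)

    stats = {}
    for term, total in corpus_counts.items():
        df = sum(1 for counts in per_doc if term in counts) / n_docs
        tf = total / total_tokens
        entropy = 0.0
        for counts in per_doc:
            if term in counts:
                p = counts[term] / total
                entropy -= p * math.log2(p)
        stats[term] = (df, tf, entropy)
    return stats


def candidate_stop_words(docs, top_k):
    """Rank terms by entropy, then DF, then TF (hypothetical rule)."""
    stats = term_statistics(docs)
    ranked = sorted(
        stats,
        key=lambda t: (stats[t][2], stats[t][0], stats[t][1]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy English tokens stand in for a segmented Uyghur corpus.
docs = [["the", "cat", "the"], ["the", "dog"], ["the", "bird", "sings"]]
print(candidate_stop_words(docs, 1))
```

A term that appears in every document with a roughly even spread scores high on all three measures, which is why such terms are natural stop word candidates. In practice the cutoff would be tuned against a downstream task, as the abstract describes, instead of being fixed in advance.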
