Directed Graph Model of Uyghur Morphological Analysis

Cited by: 22


Authors: Mairehaba Aili (麦热哈巴·艾力)[1,2], Jiang Wenbin (姜文斌)[2,3], Wang Zhiyang (王志洋)[2,3], Turgun Ibrahim (吐尔根·依布拉音)[1], Liu Qun (刘群)[2]

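Affiliations: [1] School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046; [2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [3] Graduate University of Chinese Academy of Sciences, Beijing 100049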

Source: Journal of Software (《软件学报》), 2012, No. 12, pp. 3115-3129 (15 pages)

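Funding: National Natural Science Foundation of China (61063026); National Social Science Foundation of China (10AYY006); Electronic Development Fund of the Ministry of Industry and Information Technology (工信部财(2009)553); Research Training Fund for Young Teachers in Xinjiang Universities (XJEDU2010S07); Xinjiang University Outstanding Doctoral Innovation Project Fund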

Abstract: Uyghur is a typical agglutinative language with strong derivational capacity and rich morphological variation. It also obeys phonetic harmony rules, and word formation triggers sound changes such as vowel weakening, epenthesis, and elision. These properties determine the main difficulties of Uyghur morphological analysis: stemming, restoring letters altered by sound change, and tagging. This paper introduces the hierarchical structure of Uyghur words into morphological analysis and proposes a directed graph model. The model describes morphological analysis as a directed graph in which nodes represent stems, affixes, and their corresponding tags, and edges carry the transition or generation probabilities between nodes; these probabilities are the basis for choosing the best candidate analysis. To handle the sound changes that occur during morphological variation, the paper also proposes an automatic restoration model based on an intra-word letter alignment algorithm. Under the assumption that a sound change can be generalized to each individual letter, the model turns restoration into a problem similar to part-of-speech tagging (a sequence labeling problem) and solves it with statistical methods. In experiments on the manually annotated "Mega-words Corpus of Morphological Analysis of Uyghur" from the Xinjiang Key Laboratory of Multilingual Information Technology, stemming accuracy reaches 94.7%, and the F-score for segmenting and tagging stems and affixes reaches 92.6%.
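The directed graph idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the romanized example word `kitablarni` (kitab "book" + plural -lar + accusative -ni), the tag names, the `TRANSITION` and `GENERATION` tables, and all probability values are assumptions invented for illustration. Each node is a (morpheme, tag) pair, each edge is scored by a transition probability times a generation probability, and the best candidate is the highest-scoring path that covers the whole word.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
TRANSITION = {  # P(next_tag | previous_tag); "<s>" marks the word start
    ("<s>", "N"): 0.6,
    ("N", "PL"): 0.5,
    ("N", "ACC"): 0.2,
    ("PL", "ACC"): 0.4,
}
GENERATION = {  # P(morpheme | tag)
    ("kitab", "N"): 0.01,
    ("kitablar", "N"): 0.0001,  # a competing, less likely stem candidate
    ("lar", "PL"): 0.5,
    ("ni", "ACC"): 0.6,
}


def best_analysis(word):
    """Return (log_prob, [(morpheme, tag), ...]) for the best path, or None."""
    n = len(word)
    # best[(i, tag)] holds the best-scoring path that covers word[:i]
    # and whose last node carries `tag`.
    best = {(0, "<s>"): (0.0, [])}
    for i in range(n):
        # Snapshot the states at position i; edges only move forward,
        # so these states are final by the time we expand them.
        states = [(key, val) for key, val in best.items() if key[0] == i]
        for (_, prev_tag), (score, path) in states:
            for (morph, tag), p_gen in GENERATION.items():
                p_trans = TRANSITION.get((prev_tag, tag))
                if p_trans is None or not word.startswith(morph, i):
                    continue
                j = i + len(morph)
                new_score = score + math.log(p_trans) + math.log(p_gen)
                if (j, tag) not in best or new_score > best[(j, tag)][0]:
                    best[(j, tag)] = (new_score, path + [(morph, tag)])
    finals = [val for (i, _), val in best.items() if i == n]
    return max(finals, key=lambda val: val[0]) if finals else None


result = best_analysis("kitablarni")
print(result[1])
```

With these toy numbers, the path stem + plural + accusative outscores the path that treats `kitablar` as a single stem, so the search returns the three-morpheme segmentation together with its tags. A real system would estimate both tables from an annotated corpus.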

Keywords: Uyghur; morphological analysis; word segmentation; part-of-speech tagging; directed graph

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
