检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:麦热哈巴·艾力[1,2] 姜文斌[2,3] 王志洋[2,3] 吐尔根·依布拉音[1] 刘群[2]
机构地区:[1]新疆大学信息科学与工程学院,新疆乌鲁木齐830046 [2]中国科学院计算技术研究所,北京100190 [3]中国科学院研究生院,北京100049
出 处:《软件学报》2012年第12期3115-3129,共15页Journal of Software
基 金:国家自然科学基金(61063026);国家社会科学基金(10AYY006);国家工信部电子发展基金(工信部财(2009)553);新疆高校青年教师科研培养基金(XJEDU2010S07);新疆大学优秀博士创新项目基金
摘 要:维吾尔语是典型的黏着性语言,其派生能力很强,具有丰富的形态变化,同时遵循语音和谐规律,生成过程中会出现弱化、增音、脱落等音变现象.这些特性决定了维吾尔语词法分析的难点,包括词干提取、发生音变字母的还原以及标注.将维吾尔语词的层次结构引入到词法分析研究中,提出了维吾尔语词法分析的有向图模型,该模型将维吾尔语词法分析描述为有向图结构,图中节点表示词干、词缀及其相应标注,其边表示节点之间的转移或生成概率并将此概率作为候选择优的依据.针对维吾尔语在形态变化过程中发生的音变现象,又提出基于词内字母对齐算法的自动还原模型,该模型将音变现象泛化到每个字母上的假设之下,将还原问题转变成类似于词性标注问题,再利用统计方法进行还原.在对新疆多语种信息技术重点实验室手工标注的《维吾尔语百万词词法分析语料库》上进行的实验中,取得了词干提取正确率为94.7%,词干与各词缀切分并标注的F值达到92.6%的好成绩.Uyghur is a typical agglutinative language. It has a strong derivational ability with very a rich morphological structure and follows a harmonious rule. In the formation process, some phenomena may occur such as weakened, increased tone and fallen tone. The specific character of Uyghur language determines the difficulty of the Uyghur morphological analysis, including stemming and restoring the changed letter and POS tagging. This paper employs the hierarchical structure of Uyghur word, and proposes a directed graph model for Uyghur morphological analysis. In this model, words and tags are described as a directed graph. In this graph, nodes represent stems, affixes and their corresponding tags, while edges represent the transition, or general probabilities between nodes. Aimed at providing some light on the phenomenon of morphological sandhi in Uyghur language, this paper also proposes a restore model by changing the word to its original form. With the assumption that one letter can be changed to any letter, this model converts restoring problem into a sequence labeling problem, which could be solved by statistical methods. Experiment results on "Mega-words Corpus of Morphological Analysis of Uyghur", which is manually annotated by Xinjiang multilingual key laboratory shows that the accuracy of stemming reaches 94.7%. and the F score of stem and affix in line with tag reaches 92.6%.
分 类 号:TP391[自动化与计算机技术—计算机应用技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:18.116.241.205