Authors: GAO Dingguo (高定国); YANG Xiaolong (杨晓龙); YANG Yufan (杨宇帆); Quci (取次); GAO Hongmei (高红梅)
Affiliation: [1] School of Information Science and Technology, Tibet University, Lhasa 850000, Tibet, China
Source: Plateau Science Research (《高原科学研究》), 2022, No. 1, pp. 82-89 (8 pages)
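Funding: National Natural Science Foundation of China project (6266038); Key Research Project of the State Language Commission (ZDI135-118); 2021 Autonomous Region First-Class Course Construction Project.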
Abstract: Tibetan word segmentation is a key foundational task in Tibetan information processing and a prerequisite for intelligent information processing tasks such as machine translation, intelligent retrieval, and natural language understanding. Tibetan is one of the evaluation languages of the "Minority Language Word Segmentation Technology Evaluation MLWS2021". Building on MLWS2017, the corpus has been expanded from a single-domain news corpus to a comprehensive multi-domain corpus covering news, law, economics, fiction, and language and writing, and both the quality and the quantity of the training and test corpora have improved considerably. This paper first introduces the composition, collection, and collation of the Tibetan word segmentation evaluation corpus in MLWS2021. Then, based on an analysis of the design ideas behind Tibetan word segmentation evaluation and analysis software, and in view of the diversity of the test corpus, it designs the "Text Comparison" and "Tibetan Evaluation and Analysis" software, builds test corpora for the evaluation software as needed, and verifies the correctness of the software through testing. Finally, without damaging the evaluation corpus, the corpus is preprocessed and tested, the Tibetan word segmentation evaluation results for the participating teams' different models are reported, and the correctness of the results is verified.
Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]