Construction and Research of a Character Relationship Extraction Corpus for Tibetan Texts


Authors: De Jicuo, Anjian Cairang [1,2,3]

Affiliations: [1] School of Computer Science, Qinghai Minzu University, Xining 810007, China; [2] Qinghai Key Laboratory of Tibetan Information Processing and Machine Translation, Xining 810007, China; [3] State Key Laboratory of Intelligent Information Processing and Application of Tibetan Language (jointly established by the province and the ministry), Xining 810007, China

Source: Qinghai Science and Technology (《青海科技》), 2024, No. 1, pp. 81-86, 107 (7 pages)

Abstract: A high-quality, standardized corpus is an important foundation for entity relation extraction research and can improve the precision and recall of the extraction task. At present, Tibetan relation extraction corpora are mostly built by traditional manual annotation and are limited to specific domains, so annotation efficiency is low and person-relationship corpora remain relatively scarce. This paper first constructs a Tibetan person-name entity recognition corpus. Then, based on an analysis of person-relationship features, entity relation categories, and their annotation specifications, it builds a trigger-word dictionary and uses it to back-label the corpus, yielding 15,400 annotated entries for entity recognition and 8,000 annotated entries for Tibetan person relationship extraction. To verify the usability of the corpus, named entity recognition and relation extraction experiments were run on it: entity recognition reached an F1 of 67.2% and relation extraction reached an F1 of 66.2%. These results indicate that the corpus provides a data foundation for subsequent research on Tibetan person relationship extraction.
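The abstract does not give the details of the trigger-word back-labeling step, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes person names have already been recognized, and it proposes a relation label whenever a trigger word appears between two adjacent person mentions. The `TRIGGERS` dictionary, the relation labels, the `RelationCandidate` type, and the `back_label` function are all hypothetical, and English placeholders stand in for Tibetan trigger words.

```python
from dataclasses import dataclass

# Hypothetical trigger dictionary: relation label -> trigger words.
# English placeholders are used here; the paper's Tibetan triggers are not
# listed in the abstract.
TRIGGERS = {
    "parent_child": ["father of", "mother of", "son of", "daughter of"],
    "teacher_student": ["teacher of", "student of"],
    "spouse": ["husband of", "wife of"],
}


@dataclass(frozen=True)
class RelationCandidate:
    head: str      # first person mention in the sentence
    tail: str      # next person mention in the sentence
    relation: str  # relation label proposed by the trigger dictionary
    trigger: str   # trigger word that fired


def back_label(sentence: str, persons: list[str]) -> list[RelationCandidate]:
    """Propose relation labels for adjacent person pairs joined by a trigger word."""
    # Locate each known person name (first occurrence only) and sort by position.
    found = sorted((sentence.find(p), p) for p in persons if p in sentence)
    candidates = []
    for (start_a, a), (start_b, b) in zip(found, found[1:]):
        # Text strictly between the two mentions; empty if they overlap.
        between = sentence[start_a + len(a):start_b]
        for relation, words in TRIGGERS.items():
            for word in words:
                if word in between:
                    candidates.append(RelationCandidate(a, b, relation, word))
    return candidates


if __name__ == "__main__":
    print(back_label("Tashi is the father of Dolma.", ["Tashi", "Dolma"]))
```

Candidates produced this way would still need manual review, since a trigger word between two names does not guarantee that the relation actually holds.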

Keywords: corpus; person relationship extraction; Tibetan text; trigger words

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
