From Sentence Graphs to Discourse Graphs: Designing a Discourse-level Coreference Annotation Framework Based on Abstract Meaning Representation

Authors: Zhang Yi-xuan; Li Bin [1,2]; Xu Zhi-xing [1,2]

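Affiliations: [1] School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China; [2] Center for Language Big Data and Computational Humanities, Nanjing Normal University, Nanjing 210097, China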

Source: Foreign Language Research (《外语学刊》), 2025, No. 1, pp. 19-28 (10 pages)

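Funding: An interim result of the Major Project of the National Social Science Fund of China "Construction of a Knowledge Base of Pre-Qin Philosophers' Classics and Dictionary Compilation" (22&ZD262) and the Ministry of Education Humanities and Social Sciences General Project "Construction of a Classical Chinese Word Sense Knowledge Base Based on Large Language Models" (24A10319028).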

Abstract: Discourse-level coreference is a challenging research problem in both linguistics and computational linguistics. This paper surveys coreference theories and their development trends, then reviews the construction of coreference corpora and methods for automatic coreference resolution. It identifies two main problems in existing coreference corpora: coreference relations are annotated at a coarse granularity, and their connection to sentence-level semantic structure is largely ignored. To address these gaps, the paper designs a discourse-level coreference annotation framework on top of a sentence-level semantic annotation framework, Chinese Abstract Meaning Representation. Guided by the principle of "conceptual identity", the framework distinguishes nine types of discourse-level coreference relations according to whether the word forms are the same or different and how the concept is expressed. Coreference information was annotated for 500 texts. Combined with the 52 intra-sentence semantic relations that were already fully annotated, this yields a bank of discourse-level abstract meaning graphs enriched with coreference information. The corpus is drawn from the Chinese Treebank (CTB) news corpus, covers economics, sports, and daily life, and contains 6,237 sentences and 163,227 word tokens. It provides a new framework and data resource for discourse-level semantic analysis.

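Keywords: discourse coreference; Abstract Meaning Representation; conceptual identity; discourse semantic structure; corpus; Chinese information processing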

Classification number: H08 [Language and Writing: Linguistics]

 
