Character-level Dependency Model for Joint Word Segmentation, POS Tagging, and Dependency Parsing in Chinese  Cited by: 14


Authors: Guo Zhen (郭振)[1], Zhang Yujie (张玉洁)[1], Su Chen (苏晨)[1], Xu Jin'an (徐金安)[1]

Affiliation: [1] School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2014, No. 6, pp. 1-8, 17 (9 pages in total)

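Funding: National International Science and Technology Cooperation Program (2014DFA11350); National Natural Science Foundation of China (61370130); Beijing Jiaotong University Talent Fund (KKRC11001532)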

Abstract: Transition-based joint models for Chinese word segmentation, POS tagging, and dependency parsing currently face two main problems. First, the way the tasks are combined leaves room for improvement: character-based word segmentation and word-based dependency parsing are not well integrated in the transition-based framework. Second, model performance is limited by the size of the fully annotated corpus. To address the first problem, we use the internal structure of words to extend the conventional word-based dependency tree into a character-based dependency tree and, adopting a transition-based strategy, build a character-level joint model for the three tasks. Following sequence-labeling approaches to Chinese word segmentation, we redesign transition-based segmentation as four transition actions, Shift_S, Shift_B, Shift_M, and Shift_E, which also allows features from previous word segmentation research to be integrated into the joint model. To address the second problem, we use partially annotated corpora, extracting string-level n-gram features and structure-level dependency subtree features from them and adding these to the joint model, which yields a semi-supervised joint model for the three tasks. Experimental results on the Penn Chinese Treebank show that the model achieves F1 scores of 98.31%, 94.84%, and 81.71% on word segmentation, POS tagging, and dependency parsing, respectively, improvements of 0.92%, 1.77%, and 3.95% over the single-task (pipeline) models. The word segmentation and POS tagging scores are the best among the results published so far.
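The four segmentation actions named in the abstract can be illustrated with a minimal sketch. It assumes the usual sequence-labeling reading of the suffixes (S = single-character word, B = beginning, M = middle, E = end of a multi-character word), which the abstract implies but does not spell out. The function names and the example sentence are hypothetical, and the POS tagging and dependency actions of the full joint model are left out.

```python
from typing import List


def words_to_actions(words: List[str]) -> List[str]:
    """Map a segmentation to one shift action per character.

    Assumed reading of the action suffixes (not confirmed by the abstract):
    S = single-character word, B/M/E = begin/middle/end of a longer word.
    """
    actions: List[str] = []
    for word in words:
        if len(word) == 1:
            actions.append("Shift_S")
        else:
            actions.append("Shift_B")
            actions.extend(["Shift_M"] * (len(word) - 2))
            actions.append("Shift_E")
    return actions


def actions_to_words(chars: str, actions: List[str]) -> List[str]:
    """Replay shift actions over the characters to recover the words."""
    if len(chars) != len(actions):
        raise ValueError("need exactly one action per character")
    words: List[str] = []
    current = ""  # characters of the word currently being built
    for ch, action in zip(chars, actions):
        starts_word = action in ("Shift_S", "Shift_B")
        if starts_word and current:
            raise ValueError(f"{action} while a word is still open")
        if not starts_word and not current:
            raise ValueError(f"{action} with no open word")
        current += ch
        if action in ("Shift_S", "Shift_E"):
            words.append(current)  # the word is complete
            current = ""
    if current:
        raise ValueError("sentence ended inside a word")
    return words


# Hypothetical example sentence, segmented as 我 / 喜欢 / 自然语言.
example_words = ["我", "喜欢", "自然语言"]
example_actions = words_to_actions(example_words)
recovered = actions_to_words("".join(example_words), example_actions)
```

Because every character receives exactly one action, the action sequence is equivalent to an S/B/M/E tag sequence, which is consistent with the abstract's claim that features from sequence-labeling segmenters can be reused in the transition-based model.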

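Keywords: joint model; Chinese word segmentation and POS tagging; dependency parsing; word-internal dependency structure; semi-supervised learning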

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
