Feature Analysis and Recognition of Cited Spans in Literature (文献被引片段特征分析与识别研究)  Cited by: 6

Recognizing and Analyzing Cited Spans in Literature


Authors: Xu Jian (徐健)[1], Li Gang (李纲)[1], Mao Jin (毛进)[1], Ye Guanghui (叶光辉)[2]

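Affiliations: [1] Center for Studies of Information Resources, Wuhan University, Wuhan 430072; [2] School of Information Management, Central China Normal University, Wuhan 430079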

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2017, No. 11, pp. 37-45 (9 pages)

Abstract: [Objective] This paper analyzes the features of cited spans in scientific literature and compares the effectiveness of several recognition methods. [Methods] Using the cited-span annotation data from the CL-SciSumm 2016 shared task, we examined the length, position, and importance of cited spans, and analyzed how they correlate with the length and position of the corresponding citation contexts. We then compared similarity algorithms based on the bag-of-words model, a topic model, and the WordNet semantic dictionary by their performance in recognizing cited spans. [Results] 96% of the annotated cited spans were shorter than three sentences, and they occurred more often in the front part of the paper and in the front part of each section. The mean TextRank weight of cited spans was significantly higher than that of other spans. The length of a cited span was significantly correlated with the length of its citation context, but there was no obvious correlation between their positions. Whether judged by MMR or by matching at the sentence and word levels, the bag-of-words method outperformed the semantic-dictionary method, which in turn clearly outperformed the topic-model method. [Limitations] The discussion of the concept and characteristics of cited spans remains theoretical, and both the feature analysis and the method comparison were carried out only on the CL-SciSumm 2016 annotation data. [Conclusions] Word choice in scientific literature is standardized and rigorous, so lexical features play a key role in recognizing cited spans.
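To make the bag-of-words approach mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes plain token counts and cosine similarity, and it scores every run of one or two consecutive sentences in the cited paper against a citation context. The two-sentence cap reflects the finding that most annotated spans are shorter than three sentences. The function names, the tokenizer, and the `max_len` parameter are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidate_spans(citation_context, sentences, max_len=2):
    """Score each run of 1..max_len consecutive sentences against the context.

    Returns (score, start, end) tuples, best first; `end` is exclusive.
    """
    query = bag_of_words(citation_context)
    scored = []
    for start in range(len(sentences)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(sentences):
                break
            span_text = " ".join(sentences[start:end])
            scored.append((cosine(query, bag_of_words(span_text)), start, end))
    # Highest similarity first; ties broken by earlier, shorter spans.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored


if __name__ == "__main__":
    cited_paper = [
        "We train a parser on treebank data.",
        "The weather was nice.",
        "Results improve over the baseline.",
    ]
    context = "They train a parser on treebank data"
    for score, start, end in rank_candidate_spans(context, cited_paper):
        print(round(score, 3), cited_paper[start:end])
```

A topic-model or WordNet-based variant would swap out `cosine` and `bag_of_words` for a different similarity function while keeping the same candidate enumeration.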

Keywords: cited span; recognition method; citation context; cited object

Classification number: G353.1 [Cultural Sciences: Information Science]

 
