Empirical Study on the Applicability of Booth's Law to the Regularity of Same-Frequency Words in Chinese Text


Authors: Li Xiaochao (李晓超)[1,2,3], Jia Liguo (贾立国)[4], Luo Yan (罗燕)[1,2,3], Chen Min (陈敏)[1,2,3], Liu Mengmeng (柳萌萌)[1,2,3], Zhao Shuliang (赵书良)[1,2,3]

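Affiliations: [1] College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang 050024; [2] Hebei Key Laboratory of Computational Mathematics and Applications, Hebei Normal University, Shijiazhuang 050024; [3] Institute of Mobile Internet of Things, Hebei Normal University, Shijiazhuang 050024; [4] Academic Affairs Office, Hebei Normal University, Shijiazhuang 050024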

Source: Journal of Intelligence (《情报杂志》), 2015, No. 6, pp. 62-67 (6 pages)

Abstract: Booth's law describes the distribution of same-frequency words in English text, but few scholars have studied in depth whether it also applies to Chinese text. To explore the applicability of Booth's law to Chinese text and to reveal the statistical regularities of same-frequency words in Chinese, we carried out a statistical study of same-frequency words over a large number of Chinese texts, paying particular attention to the scale of the experimental data and to the span of text lengths covered. The experiments show that, as text length increases, the ratio of the number of same-frequency words at a given low frequency to the number of distinct words is not constant but gradually decreases, and that the number of same-frequency low-frequency words grows as a power function of the number of distinct words. In addition, as text length increases, the ratio of the number of same-frequency words at a given low frequency to the number of words with frequency 1 is also not constant but gradually increases. These results are inconsistent with Booth's experiments on English text, so we conclude that Booth's law does not apply to Chinese text.
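The two ratios discussed in the abstract can be illustrated with a minimal sketch. It assumes the text has already been segmented into a list of word tokens (Chinese text needs a word segmenter first, which is not shown). The function name `low_frequency_ratios`, the symbols I_n (number of distinct words occurring exactly n times) and D (number of distinct words), and the reference value 2/(n(n+1)), which is the commonly cited form of Booth's law for I_n/I_1, are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def low_frequency_ratios(tokens, max_n=5):
    """Compute I_n, I_n/D and I_n/I_1 for n = 1..max_n.

    tokens : list of already-segmented words (assumed input format).
    I_n    : number of distinct words that occur exactly n times.
    D      : number of distinct words in the text.
    """
    word_freq = Counter(tokens)          # word -> number of occurrences
    distinct = len(word_freq)            # D
    i_n = Counter(word_freq.values())    # frequency n -> I_n
    i_1 = i_n.get(1, 0)

    rows = []
    for n in range(1, max_n + 1):
        count = i_n.get(n, 0)
        rows.append({
            "n": n,
            "I_n": count,
            "I_n/D": count / distinct if distinct else None,
            "I_n/I_1": count / i_1 if i_1 else None,
            # Commonly cited Booth's-law value for I_n/I_1 (assumption,
            # not quoted from the paper).
            "booth": 2 / (n * (n + 1)),
        })
    return rows


if __name__ == "__main__":
    # Toy token list standing in for a segmented text.
    sample = ["a", "b", "a", "c", "d", "d", "e"]
    for row in low_frequency_ratios(sample, max_n=3):
        print(row)
```

Running such a function on prefixes of increasing length from the same corpus would show whether the ratios stay roughly constant or drift with text length, which is the kind of comparison the abstract describes.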

Keywords: same-frequency words; Zipf's law; Booth's law; low-frequency words

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
