The Research on the Threshold of High-Frequency Words Based on the Normal Distribution in Word Frequency Analysis  Cited by: 61

Author: An Xingru [1]

Affiliation: [1] Library of Inner Mongolia University of Science and Technology, Baotou 014010

Source: Journal of Intelligence (《情报杂志》), 2014, No. 10, pp. 129-136 (8 pages)

Abstract: Defining high-frequency keywords or subject terms is an essential basis for information analysis with the word frequency analysis method. Drawing on a statistical review of the literature, this paper first summarizes the four methods currently used to determine high-frequency words: the TOPN method, the WF>=M method, the %WF=P method, and the T formula. These methods are empirical and arbitrary, and they have problems of theoretical foundation and applicability. The paper then verifies empirically that keywords and subject terms in document databases follow a normal distribution and, based on the properties of the normal distribution, proposes the F formula for calculating the high-frequency word threshold. Finally, it compares the F formula with the T formula on multiple data samples and concludes that the F formula, based on the normal distribution, performs well in both theoretical foundation and applicability.

Keywords: word frequency analysis; normal distribution; high-frequency words; Zipf's law

Classification code: G350 [Culture and Science: Information Science]
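
The three empirical cut-off rules named in the abstract (TOPN, WF>=M, %WF=P) can be illustrated with a minimal Python sketch. The function names and the sample counts are hypothetical, and the reading of %WF=P as "keep the most frequent words until they cover a fraction P of all occurrences" is an assumption, since the abstract does not define it. The T and F formulas are not stated in the abstract and are not reproduced here.

```python
from collections import Counter


def top_n(freqs, n):
    """TOPN: keep the n most frequent words."""
    return [word for word, _ in Counter(freqs).most_common(n)]


def min_frequency(freqs, m):
    """WF>=M: keep every word whose frequency is at least m."""
    return [word for word, count in Counter(freqs).most_common() if count >= m]


def cumulative_share(freqs, p):
    """%WF=P (assumed reading): walk words from most to least frequent and
    keep them until their cumulative frequency reaches fraction p of the total."""
    total = sum(freqs.values())
    kept, running = [], 0
    for word, count in Counter(freqs).most_common():
        if running >= p * total:
            break
        kept.append(word)
        running += count
    return kept


# Hypothetical keyword counts, for illustration only (they sum to 100).
counts = {
    "library": 40,
    "retrieval": 25,
    "ontology": 15,
    "zipf": 10,
    "metadata": 6,
    "indexing": 4,
}

print(top_n(counts, 3))
print(min_frequency(counts, 10))
print(cumulative_share(counts, 0.5))
```

Each rule depends on a hand-picked parameter (n, m, or p), which is the empirical, arbitrary character the abstract criticizes.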

 
