Research on Feature Optimization in Latent Semantic Indexing (cited by: 7)


Authors: Ji Duo (季铎)[1], Zheng Wei (郑伟)[1], Cai Dongfeng (蔡东风)[1]

Affiliation: [1] Knowledge Engineering Center, Shenyang Institute of Aeronautical Engineering, Shenyang, Liaoning 110034, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2009, No. 2, pp. 69-76 (8 pages)

Funding: National 863 Program project (2006AA01Z148); Key Project of Science and Technology Research, Ministry of Education (207148)

Abstract: Latent Semantic Indexing (LSI) is widely used in information retrieval, text classification, automatic question answering, and related fields. LSI is a dimensionality reduction method that maps co-occurring features onto the same dimension of the space and non-co-occurring features onto different dimensions. In the LSI semantic space, feature co-occurrences are obtained through feature transfer relations, both within a single document and between different documents. This paper argues that these transfer relations introduce co-occurrences that do not actually exist, which degrades LSI performance, and that the transfer relations should therefore be filtered to eliminate the nonexistent co-occurrence information. The paper uses document frequency to select features over the document collection and runs three experiments with the Complete-Link clustering algorithm on two public corpora. The results show that when document-frequency selection retains 10% to 15%, the F1 value of clustering improves by 6.5770%, 1.9928%, and 3.3614%, respectively.
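The document-frequency selection step described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which end of the document-frequency ranking is retained, so the sketch assumes the highest-frequency terms are kept, and the names `document_frequency`, `select_by_df`, and `keep_ratio` are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        # set() ensures a term is counted once per document.
        df.update(set(tokens))
    return df


def select_by_df(docs, keep_ratio=0.15):
    """Keep the top `keep_ratio` share of terms ranked by document frequency.

    Assumption: high-DF terms are the ones retained; the abstract only
    states that 10% to 15% is kept under a document-frequency criterion.
    """
    df = document_frequency(docs)
    # Descending DF, ties broken alphabetically so the result is deterministic.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    n_keep = max(1, round(len(ranked) * keep_ratio))
    vocab = set(ranked[:n_keep])
    # Drop every term outside the retained vocabulary, preserving order.
    filtered = [[t for t in tokens if t in vocab] for tokens in docs]
    return filtered, vocab
```

Each document is passed as a list of tokens, for example `select_by_df(tokenized_docs, keep_ratio=0.15)`. In the setting the abstract describes, the reduced term set would then feed the LSI decomposition and Complete-Link clustering; those steps are not shown here.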

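Keywords: computer application; Chinese information processing; latent semantic indexing; co-occurring features; singular value decomposition; feature selection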

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
