Clustering Retrieval Results via Fusion Matrix-based Text Similarity Computation  Cited by: 1

A Fusion Matrix-based Study on Text Clustering of Document Retrieval Results


Authors: ZHAO Yueyang (赵悦阳) [1]; CUI Lei (崔雷) [2]

Affiliations: [1] Library of Shengjing Hospital of China Medical University, Shenyang 110004, China; [2] School of Health Management, China Medical University, Shenyang 110122, China

Source: Journal of Medical Informatics (《医学信息学杂志》), 2024, Issue 3, pp. 58-64 (7 pages)

Funding: Liaoning Provincial Social Science Planning Fund project (Project No. L20BTQ003).

Abstract: Purpose/Significance To address shortcomings in the semantic representation of medical texts and to cluster retrieval results from the PubMed database. Method/Process A fusion matrix is constructed using the Jaccard coefficient and TF-IDF. It combines the similarity relations between phrases, between documents, and between phrases and document content. Clustering algorithms are trained on this matrix to group the set of PubMed retrieval results, and category labels are then generated to describe the meaning of each cluster of documents. Result/Conclusion Clustering based on the fusion matrix performs well, and the high-frequency words extracted to describe each category distinguish the category meanings well, so the approach is effective for clustering retrieval-result texts.
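The abstract does not give the exact fusion formula, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that documents are already tokenized into phrases and that the fused document-to-document similarity is a weighted sum of the Jaccard coefficient and the cosine similarity of TF-IDF vectors. The function names and the weight `alpha` are hypothetical, introduced here for illustration.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard coefficient between the term sets of two documents."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) using a smoothed IDF (an assumed variant)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fused_similarity(docs, alpha=0.5):
    """Assumed fusion rule: alpha * Jaccard + (1 - alpha) * TF-IDF cosine."""
    vecs = tfidf_vectors(docs)
    n = len(docs)
    return [
        [
            alpha * jaccard(docs[i], docs[j]) + (1 - alpha) * cosine(vecs[i], vecs[j])
            for j in range(n)
        ]
        for i in range(n)
    ]
```

The resulting square matrix could then be passed to any clustering routine that accepts a precomputed similarity or distance matrix (using one minus the similarity as the distance). Which algorithms the paper actually trains is not stated in the abstract.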

Keywords: literature retrieval; text clustering; fusion matrix; text similarity

Classification code: R-058 [Medicine and Health]

 
