A retrieval model of scientific documents based on mathematical expression features  Cited by: 1


Authors: Tian Xuedong (田学东)[1], Cui Xiaojuan (崔晓娟)

Affiliation: [1] College of Computer Science and Technology, Hebei University, Baoding 071002, Hebei, China

Source: Journal of Hebei University (Natural Science Edition), 2017, No. 6, pp. 652-661 (10 pages)

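Funding: National Natural Science Foundation of China (61375075); Key Science and Technology Research Project of Higher Education Institutions of Hebei Province, Hebei Education Department (ZD2017208)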

Abstract: Most existing full-text retrieval techniques operate on textual information, while retrieval of scientific documents whose main content is mathematical expressions is still at an exploratory stage. To let users conveniently query scientific documents with mathematical formulas, a scientific document retrieval model based on mathematical expression features is proposed. First, each formula is parsed into a binary tree to obtain its subexpressions, and the expressions together with their subexpressions are used to construct retrieval feature vectors. In the indexing phase, the extracted document feature vectors are used to build a hierarchically structured index table. In the matching phase, the document vectors are weighted by tf-idf, and the cosine similarity between the query vector and each document vector is computed to produce a ranked list of documents. The experiments used 5 017 scientific documents drawn from journals in different fields, academic websites, and public data sets, containing 96 362 mathematical formulas. The average retrieval time was 0.428 s, indicating that the model achieves its goal of relatively efficient scientific document retrieval.
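The pipeline outlined in the abstract (binary-tree subexpressions, tf-idf weighting, cosine ranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tuple-based tree encoding, the function names, the smoothed idf formula, and the choice to weight the query vector the same way as the document vectors are all assumptions made for illustration. The hierarchical index table mentioned in the abstract is not modeled here.

```python
import math
from collections import Counter

# Assumed encoding: a binary expression tree is a nested tuple
# (operator, left, right); a leaf is a plain string such as "a".

def subexpressions(tree):
    """Return (text of tree, list of texts of every subtree, including itself)."""
    if isinstance(tree, str):
        return tree, [tree]
    op, left, right = tree
    left_text, left_subs = subexpressions(left)
    right_text, right_subs = subexpressions(right)
    text = f"({left_text}{op}{right_text})"
    return text, left_subs + right_subs + [text]

def features(formulas):
    """Count every subexpression occurring in a list of formula trees."""
    counts = Counter()
    for formula in formulas:
        counts.update(subexpressions(formula)[1])
    return counts

def tfidf(counts, doc_freq, n_docs):
    """Weight raw counts by tf-idf (smoothed idf, an assumption of this sketch)."""
    total = sum(counts.values())
    return {
        term: (count / total)
        * (math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1)
        for term, count in counts.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query_formula, docs):
    """Rank documents (name -> list of formula trees) against a query formula."""
    doc_counts = {name: features(formulas) for name, formulas in docs.items()}
    doc_freq = Counter()
    for counts in doc_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    n_docs = len(docs)
    query_vec = tfidf(features([query_formula]), doc_freq, n_docs)
    scores = {
        name: cosine(query_vec, tfidf(counts, doc_freq, n_docs))
        for name, counts in doc_counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy corpus: d1 contains a*b+c, d2 contains x-y.
docs = {
    "d1": [("+", ("*", "a", "b"), "c")],
    "d2": [("-", "x", "y")],
}
results = rank(("*", "a", "b"), docs)
```

In this toy corpus the query a*b shares the subexpressions a, b, and (a*b) with d1 and nothing with d2, so d1 is ranked first and d2 receives a score of zero. Matching on subexpressions is what lets a query retrieve documents whose formulas only contain the query as a part.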

Keywords: scientific documents; mathematical expressions; retrieval; indexing; matching; binary tree; features

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
