Large Scale Text Preprocessing Based on Self-Encoder Semantic Hashing (基于自编码器语义哈希的大规模文本预处理)

Cited by: 3

Authors: ZHANG Zhong-lin [1], YANG Pu-zhou (Lanzhou Jiaotong University, Lanzhou, Gansu 730000, China)

Source: Computer Simulation (《计算机仿真》), 2019, No. 3, pp. 225-229, 260 (6 pages in total)

Funding: National Natural Science Foundation of China (61662043)

Abstract: This paper presents a deep graphical model that learns text indexes from large-scale text, using an autoencoder as its basic structure. The values the model finally outputs are highly interpretable and represent each document better than latent semantic indexing does. When the deepest layer outputs a small number of binary variables (for example, 32 bits), the graphical model uses semantic hashing to map each document to a memory address, so that semantically similar documents lie at nearby addresses. Documents similar to a query document can then be found by accessing all addresses that differ from the query's address by only a few bits. Looked up by document address in this way, approximate matching on hash codes is much faster than locality-sensitive hashing, and using semantic hashing to filter documents represented by TF-IDF achieves higher accuracy.

Keywords: autoencoder; semantic hashing; latent semantic indexing; text indexing

Classification number: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]
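
The lookup step described in the abstract, retrieving every document whose address differs from the query's address by only a few bits, can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SemanticHashIndex`, the helper `hamming_ball`, and the hand-picked integer codes are all hypothetical, and the codes stand in for the 32-bit output that a trained autoencoder would produce.

```python
from collections import defaultdict
from itertools import combinations


def hamming_ball(code, n_bits=32, radius=2):
    """Yield every n_bits-wide address within `radius` bit flips of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p  # flip bit p
            yield flipped


class SemanticHashIndex:
    """Hypothetical hash table keyed by binary document codes."""

    def __init__(self, n_bits=32):
        self.n_bits = n_bits
        self.table = defaultdict(list)  # address -> list of document ids

    def add(self, doc_id, code):
        self.table[code].append(doc_id)

    def query(self, code, radius=2):
        """Return ids of documents stored within `radius` bits of `code`."""
        hits = []
        for address in hamming_ball(code, self.n_bits, radius):
            hits.extend(self.table.get(address, ()))
        return hits


# Assumed codes: in practice these would come from the autoencoder's
# deepest (binary) layer, not be chosen by hand.
index = SemanticHashIndex()
index.add("doc_a", 0b1011)
index.add("doc_b", 0b1001)           # 1 bit away from doc_a
index.add("doc_c", 0b1011 ^ 0xFF00)  # 8 bits away from doc_a

near = index.query(0b1011, radius=2)
```

With 32-bit codes and a radius of 2, a query probes 1 + 32 + 496 = 529 addresses regardless of how many documents are indexed, which is why this kind of lookup does not slow down as the collection grows. The returned shortlist could then be re-ranked with a finer representation such as TF-IDF, as the abstract suggests.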
