Research on Classification of Commodity Ultra-Short Text Based on Deep Random Forest (cited by: 6)

Authors: NIU Zhendong[1]; SHI Pengfei; ZHU Yifan; ZHANG Sifan (School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)

Affiliation: [1] School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Source: Transactions of Beijing Institute of Technology, 2021, No. 12, pp. 1277-1285 (9 pages)

Funding: National Natural Science Foundation of China (61370137); Ministry of Education-China Mobile Research Fund (2016/27); National "973" Program (2012CB720700).

Abstract: In recent years, with the development of mobile communication and information technology, a growing volume of ultra-short text, no more than 20 characters long and carrying no auxiliary tag information, needs to be processed both online and in practical applications. Ultra-short text is inherently polysemous, has extremely sparse features, clearly lacks context, and is semantically hard to disambiguate, so classifying it effectively has become a new problem that the text classification field urgently needs to solve. Traditional short-text classifiers such as KNN and decision trees perform poorly on commodity ultra-short text because so few features are available. To address this, the paper proposes a classification method for commodity ultra-short text based on deep random forest. The method adopts a "diversion" strategy assisted by an external knowledge base: commodity names that have an explicit category in the knowledge base are assigned that category directly, while for names whose category cannot be extracted directly, their descriptions in the external knowledge base are vectorized with Word2vec and the resulting vectors are classified by a deep random forest. The classifier is optimized continually until the training set reaches a preset size threshold. Experimental results show that, compared with KNN and decision trees, the proposed method improves average accuracy by 22.78% and 17.22%, and average recall by 22.85% and 15.23%, respectively.
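The "diversion" routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the knowledge base, the 2-dimensional word vectors, and all function and class names below are hypothetical toy stand-ins. In particular, a nearest-centroid classifier is swapped in where the paper uses a deep random forest, hand-written vectors replace Word2vec, and the iterative re-optimization up to a training-set size threshold is not modeled. The sketch also assumes that the classifier is trained on knowledge-base entries with explicit categories, which the abstract does not state.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical toy knowledge base: name -> (explicit category or None, description tokens).
KNOWLEDGE_BASE: Dict[str, Tuple[Optional[str], List[str]]] = {
    "iPhone 12": ("phone", ["smartphone", "apple"]),
    "Galaxy S21": ("phone", ["smartphone", "android"]),
    "Oolong": ("tea", ["tea", "leaf"]),
    "Longjing": (None, ["green", "tea", "leaf"]),  # no explicit category
}

# Hypothetical 2-d word vectors standing in for Word2vec embeddings.
WORD_VECTORS: Dict[str, List[float]] = {
    "smartphone": [1.0, 0.0], "apple": [0.9, 0.1], "android": [0.9, 0.0],
    "tea": [0.0, 1.0], "leaf": [0.1, 0.9], "green": [0.2, 0.8],
}


def embed(tokens: List[str]) -> List[float]:
    """Average the vectors of known tokens; return a zero vector if none are known."""
    vecs = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


class CentroidClassifier:
    """Nearest-centroid stand-in for the deep random forest (an assumption)."""

    def fit(self, X: List[List[float]], y: List[str]) -> "CentroidClassifier":
        groups: Dict[str, List[List[float]]] = defaultdict(list)
        for vec, label in zip(X, y):
            groups[label].append(vec)
        self.centroids = {
            label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()
        }
        return self

    def predict_one(self, vec: List[float]) -> str:
        def sq_dist(label: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(vec, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


def train_from_knowledge_base() -> CentroidClassifier:
    """Train on entries whose category is explicit (assumed training source)."""
    X, y = [], []
    for category, description in KNOWLEDGE_BASE.values():
        if category is not None:
            X.append(embed(description))
            y.append(category)
    return CentroidClassifier().fit(X, y)


def classify(name: str, clf: CentroidClassifier) -> Optional[str]:
    """Route a commodity name: direct lookup first, vector classification otherwise."""
    entry = KNOWLEDGE_BASE.get(name)
    if entry is None:
        return None  # names absent from the knowledge base are out of scope here
    category, description = entry
    if category is not None:
        return category  # branch 1: explicit category in the knowledge base
    return clf.predict_one(embed(description))  # branch 2: classify the description


clf = train_from_knowledge_base()
print(classify("iPhone 12", clf))
print(classify("Longjing", clf))
```

The point of the split is that the classifier only has to handle names whose category is not already recorded, and it works on the richer knowledge-base description rather than on the ultra-short name itself.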

Keywords: ultra-short text classification; commodity names; deep random forest

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
