Authors: NIU Zhendong (牛振东)[1]; SHI Pengfei (石鹏飞); ZHU Yifan (朱一凡); ZHANG Sifan (张思凡) (School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)
Source: Transactions of Beijing Institute of Technology (《北京理工大学学报》), 2021, No. 12, pp. 1277-1285 (9 pages)
Funding: National Natural Science Foundation of China (61370137); Ministry of Education-China Mobile Research Fund (2016/27); National "973" Program of China (2012CB720700).
Abstract: In recent years, with the development of mobile communication and information technology, a growing volume of ultra-short text data, no more than 20 characters long and carrying no auxiliary tag information, has to be processed both on the web and in practical applications. Ultra-short text is inherently polysemous, has extremely sparse features, visibly lacks context, and is hard to disambiguate semantically, so classifying it effectively has become a pressing new problem in text classification. The traditional short-text classification methods KNN and decision tree perform poorly on commodity ultra-short texts because so few features are available. To address this, the paper proposes a classification method for commodity ultra-short texts based on deep random forest. The method adopts a "diversion" strategy assisted by an external knowledge base: a commodity name that has an explicit category in the knowledge base is assigned that category directly; for a commodity name whose category cannot be extracted directly, its description in the external knowledge base is vectorized with Word2vec and the resulting vector is classified by a deep random forest. The classifier is optimized continually until the training set reaches a preset size threshold. Experimental results show that, compared with KNN and decision tree, the proposed method improves average accuracy by 22.78% and 17.22%, and average recall by 22.85% and 15.23%, respectively.
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
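The "diversion" routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy knowledge base, the function and class names, and the `max_train_size` parameter are all hypothetical; a bag-of-words count vector stands in for Word2vec, and a simple word-overlap classifier stands in for the deep random forest. Adding the classifier's own predictions back into the training set until the size threshold is reached is one possible reading of "optimized continually", assumed here for illustration only.

```python
from collections import Counter

# Hypothetical toy knowledge base: name -> (category or None, description).
# Assumption: every commodity name has an entry with a description.
KNOWLEDGE_BASE = {
    "iphone 12": ("electronics", "smartphone with camera and touchscreen"),
    "oolong tea": ("food", "tea leaves for brewing a hot drink"),
    "galaxy buds": (None, "wireless earphones that pair with a smartphone"),
}


def vectorize(text):
    """Stand-in for Word2vec: a bag-of-words count vector."""
    return Counter(text.lower().split())


def overlap(a, b):
    """Number of word occurrences shared by two count vectors."""
    return sum((a & b).values())


class OverlapClassifier:
    """Stand-in for the deep random forest: returns the label of the
    training example that shares the most words with the input."""

    def __init__(self):
        self.examples = []  # list of (vector, label) pairs

    def add(self, vector, label):
        self.examples.append((vector, label))

    def predict(self, vector):
        best = max(self.examples, key=lambda ex: overlap(ex[0], vector))
        return best[1]


def build_classifier(kb):
    """Train on the descriptions of entries whose category is known."""
    clf = OverlapClassifier()
    for category, description in kb.values():
        if category is not None:
            clf.add(vectorize(description), category)
    return clf


def classify(name, kb, clf, max_train_size):
    """Route a commodity name: direct lookup first, classifier otherwise."""
    category, description = kb[name]
    if category is not None:
        return category, "direct"  # explicit category in the knowledge base
    vector = vectorize(description)
    label = clf.predict(vector)
    # Assumed self-training step: grow the training set until the threshold.
    if len(clf.examples) < max_train_size:
        clf.add(vector, label)
    return label, "classifier"


clf = build_classifier(KNOWLEDGE_BASE)
direct_result = classify("iphone 12", KNOWLEDGE_BASE, clf, max_train_size=3)
fallback_result = classify("galaxy buds", KNOWLEDGE_BASE, clf, max_train_size=3)
```

Here `direct_result` takes the lookup branch, while `fallback_result` goes through the vectorize-and-classify branch and also enlarges the training set by one example. In a fuller version, the stand-ins would be replaced by a trained Word2vec model and a deep random forest.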