Research on Chinese Text Classification Based on Multi-Feature Fusion (基于多特征融合的中文文本分类研究)  Cited by: 6

Chinese Text Classification with Feature Fusion


Authors: Wang Yan (王艳), Wang Huyan (王胡燕), Yu Bengong (余本功) [2,3]

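Affiliations: [1] Economic and Technical College, Anhui Agricultural University, Hefei 231200, China; [2] School of Management, Hefei University of Technology, Hefei 230009, China; [3] Key Laboratory of Process Optimization & Intelligent Decision-Making, Ministry of Education, Hefei University of Technology, Hefei 230009, China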

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2021, Issue 10, pp. 1-14 (14 pages)

Funding: This work is one of the research outcomes of a National Natural Science Foundation of China project (Grant No. 71671057).

Abstract: [Objective] This paper combines Pinyin character features, Chinese character features, word-level semantic features, and part-of-speech features to mitigate the weak structure, spelling errors, and frequent homophones found in Chinese texts, enriching the semantic features and improving the model's classification ability. [Methods] We propose a multi-feature fusion method for text classification. Starting from word-level features, the method adds part-of-speech features, Chinese character features, and Pinyin character features to build a multi-feature semantic representation. This representation is fed into a BiGRU to obtain contextual semantic features and into a multi-channel CNN to obtain local semantic features. Finally, the features are fused and passed to a softmax layer, which predicts the required category label. [Results] On two different datasets, the multi-feature fusion model reached accuracies of 83.3% and 91.1%, at least 7 percentage points higher than the other classification models compared. [Limitations] The amount of experimental data is small, and the model has not been validated on more datasets. [Conclusions] The proposed method improves the model's semantic representation ability, is an effective text classification model, and provides effective support for enterprises that need efficient text classification.
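The pipeline described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not say how the four features are combined or how the BiGRU and CNN are wired, so the sketch assumes that each token's word, part-of-speech, character, and Pinyin embeddings are concatenated (character and Pinyin embeddings averaged per token), that the BiGRU and a single-width CNN run as parallel branches, and that fusion is plain concatenation. All names (`embed_token`, `GRUCell`, `classify`), dimensions, ids, and the random untrained weights are hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy weights are reproducible

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def mean_rows(table, ids):
    # Average the embedding rows selected by ids.
    return [sum(col) / len(ids) for col in zip(*(table[i] for i in ids))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def embed_token(tables, token):
    # Assumed fusion: concatenate word, POS, mean character, mean Pinyin embeddings.
    word_id, pos_id, char_ids, pinyin_ids = token
    return (tables["word"][word_id] + tables["pos"][pos_id]
            + mean_rows(tables["char"], char_ids)
            + mean_rows(tables["pinyin"], pinyin_ids))

class GRUCell:
    """Bias-free GRU cell; each gate has one weight matrix over [x; h]."""
    def __init__(self, in_dim, hid):
        self.hid = hid
        self.wz, self.wr, self.wh = (rand_matrix(hid, in_dim + hid) for _ in range(3))

    def step(self, x, h):
        z = [sigmoid(a) for a in matvec(self.wz, x + h)]   # update gate
        r = [sigmoid(a) for a in matvec(self.wr, x + h)]   # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in matvec(self.wh, x + gated)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def bigru(fwd, bwd, seq):
    # Contextual features: final hidden states of a forward and a backward pass.
    h = [0.0] * fwd.hid
    for x in seq:
        h = fwd.step(x, h)
    g = [0.0] * bwd.hid
    for x in reversed(seq):
        g = bwd.step(x, g)
    return h + g

def conv_maxpool(filters, seq, width):
    # Local features: 1-D convolution over token windows, ReLU, max over time.
    pooled = []
    for f in filters:
        acts = []
        for i in range(len(seq) - width + 1):
            window = [v for tok in seq[i:i + width] for v in tok]
            acts.append(max(0.0, sum(w * x for w, x in zip(f, window))))
        pooled.append(max(acts))
    return pooled

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(model, tokens):
    seq = [embed_token(model["tables"], t) for t in tokens]
    context = bigru(model["fwd"], model["bwd"], seq)
    local = conv_maxpool(model["filters"], seq, model["width"])
    return softmax(matvec(model["out"], context + local))  # fuse by concatenation

EMB, HID, N_FILTERS, WIDTH, N_CLASSES = 4, 3, 2, 2, 3
model = {
    "tables": {n: rand_matrix(10, EMB) for n in ("word", "pos", "char", "pinyin")},
    "fwd": GRUCell(4 * EMB, HID),
    "bwd": GRUCell(4 * EMB, HID),
    "filters": rand_matrix(N_FILTERS, WIDTH * 4 * EMB),
    "width": WIDTH,
    "out": rand_matrix(N_CLASSES, 2 * HID + N_FILTERS),
}
# Three tokens as (word id, POS id, character ids, Pinyin-letter ids); ids are made up.
tokens = [(1, 0, [2, 3], [4, 5, 6]), (7, 1, [8], [1, 2]), (3, 2, [5, 9], [0, 7, 8])]
probs = classify(model, tokens)
print(probs)  # one probability per class
```

Because the weights are random and untrained, the printed probabilities carry no meaning; the sketch only shows how the feature construction, the two branches, and the softmax layer could fit together. In practice a deep learning framework would replace the hand-written cell and convolution.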

Keywords: part-of-speech tagging; word-level features; text classification; Pinyin character features; Chinese character features

Classification code: G350 [Cultural Science: Information Science]

 
