Tibetan News Text Classification Based on Graph Convolutional Networks

Cited by: 6


Authors: Xu Guixian [1]; Zhang Zixin; Yu Shaona [1]; Dong Yushuang; Tian Yuan (School of Information Engineering, Minzu University of China, Beijing 100081, China)

Affiliation: [1] School of Information Engineering, Minzu University of China, Beijing 100081, China

Source: Data Analysis and Knowledge Discovery, 2023, No. 6, pp. 73-85 (13 pages)

Funding: This work is one of the research outcomes of a National Social Science Fund of China project (Grant No. 19BGL241).

Abstract: [Objective] To address the scarcity of pre-trained knowledge for Tibetan, this paper proposes a Tibetan news text classification method based on graph convolutional networks (GCN) that exploits the structural relations between Tibetan syllables and documents. [Methods] A text graph is built over the Tibetan news corpus from syllable-syllable and syllable-document relations, and its nodes are initialized with one-hot representations of syllables and documents. Under the supervision of the category labels of the training-set documents, a GCN jointly learns syllable and document embeddings, which turns text classification into node classification. [Results] The GCN reaches 70.44% accuracy on Tibetan news body text, 8.96 to 20.66 percentage points above the baseline models, and 61.94% accuracy on Tibetan news titles, 6.61 to 26.05 percentage points above the baselines. Its accuracy is also 0.73 to 15.1 percentage points higher than that of SVM and CNN with pre-trained syllable embeddings and of the minority-language pre-trained model CINO, and 15.65 percentage points higher than that of Word2Vec+LSTM on body text. [Limitations] The method still depends on labeled datasets, and supervised Tibetan text is relatively scarce. [Conclusions] Three comparative experiments show that GCN is effective for Tibetan news text classification. It addresses the problem of cluttered information in Tibetan news text and supports data mining of Tibetan news texts by category.
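The pipeline in the Methods part of the abstract (text graph, one-hot initialization, GCN, node classification) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration rather than the authors' implementation: the toy documents use placeholder strings instead of Tibetan syllables, edge weights are raw counts (the abstract does not state the weighting), the sliding-window size is hypothetical, and the weights are random, so only the forward pass is shown.

```python
import numpy as np

def build_text_graph(docs, window=2):
    """Build a symmetric adjacency matrix over syllable and document nodes.

    docs: list of documents, each a list of syllable strings.
    Nodes 0..V-1 are syllables; nodes V..V+D-1 are documents.
    Edge weights are raw counts (an assumption, not the paper's weighting).
    """
    vocab = sorted({s for doc in docs for s in doc})
    index = {s: i for i, s in enumerate(vocab)}
    n_syll, n_docs = len(vocab), len(docs)
    adj = np.zeros((n_syll + n_docs, n_syll + n_docs))
    for d, doc in enumerate(docs):
        for pos, syll in enumerate(doc):
            i = index[syll]
            # Syllable-document edge: occurrences of the syllable in the document.
            adj[i, n_syll + d] += 1.0
            adj[n_syll + d, i] += 1.0
            # Syllable-syllable edge: co-occurrence inside a sliding window.
            for other in doc[pos + 1 : pos + window]:
                j = index[other]
                if i != j:
                    adj[i, j] += 1.0
                    adj[j, i] += 1.0
    return adj, vocab

def normalize(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the usual GCN propagation matrix."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(a_norm, w0, w1):
    """Two-layer GCN on one-hot node features; returns per-node class probabilities."""
    x = np.eye(a_norm.shape[0])                 # one-hot feature per node
    hidden = np.maximum(a_norm @ x @ w0, 0.0)   # first layer with ReLU
    logits = a_norm @ hidden @ w1               # second layer
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical toy corpus: three "documents" of placeholder syllables.
docs = [["a", "b", "c"], ["b", "c", "d"], ["e", "f", "e"]]
adj, vocab = build_text_graph(docs)
a_norm = normalize(adj)

rng = np.random.default_rng(0)
n_nodes, n_hidden, n_classes = adj.shape[0], 8, 2
w0 = rng.normal(scale=0.1, size=(n_nodes, n_hidden))
w1 = rng.normal(scale=0.1, size=(n_hidden, n_classes))

probs = gcn_forward(a_norm, w0, w1)
doc_probs = probs[len(vocab):]   # rows belonging to document nodes
print(doc_probs.shape)
```

In a trained model, `w0` and `w1` would be fitted by minimizing cross-entropy over the labeled training document nodes only, and the predicted class of a test document would be read from its row of `doc_probs`.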

Keywords: graph convolutional network; Tibetan news text classification; text graph; node classification

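Classification codes: TP391 [Automation and Computer Technology: Computer Application Technology]; G35 [Automation and Computer Technology: Computer Science and Technology]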

 
