Authors: Xu Guixian (胥桂仙)[1], Zhang Zixin (张子欣), Yu Shaona (于绍娜)[1], Dong Yushuang (董玉双), Tian Yuan (田媛) (Information Engineer College, Minzu University of China, Beijing 100081, China)
Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2023, Issue 6, pp. 73-85 (13 pages)
Funding: One of the research outputs of a National Social Science Fund of China project (Grant No. 19BGL241).
Abstract: [Objective] To address the lack of pre-trained knowledge for Tibetan, this paper proposes a Tibetan news text classification method based on a Graph Convolutional Network (GCN) that exploits the structural relationships between Tibetan syllables and documents. [Methods] A text graph is built for the Tibetan news corpus from syllable-syllable and syllable-document relations. Syllable and document nodes are initialized with one-hot representations, and the GCN jointly learns syllable and document embeddings under the supervision of the document category labels in the training set. Text classification is thereby turned into node classification. [Results] The GCN reaches 70.44% accuracy on Tibetan news body text, 8.96 to 20.66 percentage points above the baseline models, and 61.94% accuracy on Tibetan news titles, 6.61 to 26.05 percentage points above the baseline models. It also exceeds SVM and CNN with pre-trained syllable embeddings, and the Chinese minority pre-trained language model CINO, by 0.73 to 15.1 percentage points in accuracy. On body text it exceeds Word2Vec+LSTM by 15.65 percentage points. [Limitations] The method still relies on labeled datasets, and supervised Tibetan text is relatively scarce. [Conclusions] Three comparative experiments show that the GCN is effective for Tibetan news text classification. It helps address the problem of cluttered information in Tibetan news text and supports data mining across categories of Tibetan news text.
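Note: the pipeline summarized in the abstract (text graph, one-hot initialization, GCN, node classification) can be illustrated with a minimal sketch. This is not the authors' implementation. The edge weights (raw co-occurrence and occurrence counts), the window size, the hidden size, the random untrained weights, and the placeholder syllables `s1`..`s4` are all assumptions made for illustration; the record does not state how the paper weights its edges, and training is omitted.

```python
import math
import random


def build_text_graph(docs, window=2):
    """Build a symmetric adjacency matrix over syllable nodes and document nodes.

    docs: list of documents, each a list of pre-segmented syllables.
    Edge weights are raw counts (an assumption, not taken from the paper).
    """
    vocab = sorted({s for doc in docs for s in doc})
    idx = {s: i for i, s in enumerate(vocab)}
    n_syl = len(vocab)
    n = n_syl + len(docs)
    # Start from the identity so every node has a self-loop.
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for d_i, doc in enumerate(docs):
        d_node = n_syl + d_i
        for pos, s in enumerate(doc):
            # Syllable-document edge: the syllable occurs in the document.
            A[idx[s]][d_node] += 1.0
            A[d_node][idx[s]] += 1.0
            # Syllable-syllable edge: co-occurrence inside a sliding window.
            for t in doc[pos + 1 : pos + window]:
                if t != s:
                    A[idx[s]][idx[t]] += 1.0
                    A[idx[t]][idx[s]] += 1.0
    return A, vocab


def normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; degrees are > 0 due to self-loops."""
    d = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def gcn_forward(A_hat, W0, W1):
    """Two-layer GCN forward pass with one-hot node features.

    With X = I (one-hot), A_hat @ X @ W0 reduces to A_hat @ W0.
    """
    H = [[max(v, 0.0) for v in row] for row in matmul(A_hat, W0)]  # ReLU
    logits = matmul(A_hat, matmul(H, W1))
    probs = []
    for row in logits:
        m = max(row)
        e = [math.exp(v - m) for v in row]  # numerically stable softmax
        z = sum(e)
        probs.append([v / z for v in e])
    return probs


# Toy corpus: placeholder tokens stand in for Tibetan syllables.
docs = [["s1", "s2", "s3"], ["s2", "s3", "s4"]]
A, vocab = build_text_graph(docs)
A_hat = normalize(A)

rng = random.Random(0)
n_nodes, hidden, n_classes = len(A), 8, 2
W0 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n_nodes)]
W1 = [[rng.gauss(0, 1) for _ in range(n_classes)] for _ in range(hidden)]

probs = gcn_forward(A_hat, W0, W1)
doc_probs = probs[len(vocab):]  # rows for document nodes only
```

In a trained model, the cross-entropy loss would be computed only on the rows of `doc_probs` that belong to labeled training documents, which is what makes document classification a node classification problem on this graph.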
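Classification No.: TP391 [Automation and Computer Technology - Computer Application Technology]; G35 [Automation and Computer Technology - Computer Science and Technology]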