Research on Tibetan Short Text Classification Based on GraphSAGE Network


Authors: JING Rong; YANG Yimin; WAN Fucheng [2]; GUO Qi; YU Hongzhi [2]; MA Ning [2]

Affiliations: [1] Key Laboratory of Linguistic and Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou, Gansu 730030, China; [2] Key Laboratory of China's Ethnic Languages and Intelligent Processing of Gansu Province, Northwest Minzu University, Lanzhou, Gansu 730030, China; [3] Dalian Meteorological Information Center, Dalian Meteorological Bureau, Dalian, Liaoning 116000, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2024, No. 9, pp. 58-65 (8 pages)

Funding: National Natural Science Foundation of China (62366046).

Abstract: Text classification is an important research direction in natural language processing. Progress on Tibetan text classification has been slow because of the scarcity of Tibetan data, the complexity of extracting linguistic features, and the diversity of discourse structures. This paper therefore takes a graph neural network as the base model and improves on it. First, building on "syllable-syllable" and "syllable-document" modeling, it fuses document features and uses a binary classification model to dynamically construct "document-document" edges, so as to fully exploit the global features of short texts; a sliding window is added to reduce the model's computational complexity, and the optimal window size is searched for. Second, to address the syllable sparsity of Tibetan short texts, GraphSAGE is introduced as the base model for the first time, and the performance differences among aggregation methods on Tibetan short text classification are explored. Finally, to capture the heterogeneity of relationships between nodes, the features of neighbor nodes are weighted and then average-pooled to strengthen the model's feature extraction. On the TNCC title dataset, the proposed model reaches a classification accuracy of 62.50%, which is 2.56%, 1% and 2.4% higher than the traditional GCN, the original GraphSAGE and the pre-trained language model CINO, respectively.
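The "weight neighbor features, then average-pool" step mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not say how the weights are obtained, so the sketch simply reuses edge weights; the names `weighted_mean_aggregate` and `sage_layer`, the toy graph, and the feature values are hypothetical; and the learned projection and nonlinearity of a real GraphSAGE layer are left out. It is not the authors' implementation.

```python
from typing import Dict, List, Tuple

Vector = List[float]
# Adjacency list: node -> list of (neighbor, edge_weight).
# Assumption: edge weights double as the per-neighbor feature weights.
Graph = Dict[str, List[Tuple[str, float]]]


def weighted_mean_aggregate(node: str, graph: Graph, feats: Dict[str, Vector]) -> Vector:
    """Scale each neighbor's feature vector by its edge weight, then average over neighbors."""
    dim = len(feats[node])
    neighbors = graph.get(node, [])
    if not neighbors:
        return [0.0] * dim
    total = [0.0] * dim
    for nbr, weight in neighbors:
        for i, value in enumerate(feats[nbr]):
            total[i] += weight * value
    return [t / len(neighbors) for t in total]


def sage_layer(graph: Graph, feats: Dict[str, Vector]) -> Dict[str, Vector]:
    """GraphSAGE-style update: concatenate each node's own features with its
    aggregated neighborhood. The learned weight matrix and activation that a
    trained layer would apply afterwards are omitted in this sketch."""
    return {n: feats[n] + weighted_mean_aggregate(n, graph, feats) for n in feats}


# Hypothetical toy graph: one document node linked to two syllable nodes.
graph: Graph = {
    "doc1": [("syl_a", 0.5), ("syl_b", 1.0)],
    "syl_a": [("doc1", 0.5)],
    "syl_b": [("doc1", 1.0)],
}
feats: Dict[str, Vector] = {
    "doc1": [1.0, 0.0],
    "syl_a": [2.0, 4.0],
    "syl_b": [0.0, 2.0],
}

updated = sage_layer(graph, feats)
print(updated["doc1"])
```

With uniform weights of 1.0 this reduces to a plain mean aggregator, so the weights are the only place where differences between edge types (syllable-syllable, syllable-document, document-document) can enter in this sketch.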

Keywords: graph neural network; Tibetan text classification; TNCC dataset

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
