Authors: JING Rong (敬容); YANG Yimin (杨逸民); WAN Fucheng (万福成)[2]; GUO Qi (国旗); YU Hongzhi (于洪志)[2]; MA Ning (马宁)[2]
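Affiliations: [1] Key Laboratory of Linguistic and Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou, Gansu 730030, China; [2] Key Laboratory of China's Ethnic Languages and Intelligent Processing of Gansu Province, Northwest Minzu University, Lanzhou, Gansu 730030, China; [3] Dalian Meteorological Information Center, Dalian Meteorological Bureau, Dalian, Liaoning 116000, China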
Source: Journal of Chinese Information Processing (《中文信息学报》), 2024, No. 9, pp. 58-65 (8 pages)
Funding: National Natural Science Foundation of China (62366046).
Abstract: Text classification is an important research direction in natural language processing. Progress on Tibetan text classification has been slow because Tibetan data are scarce, linguistic features are complex to extract, and discourse structures are diverse. This paper therefore improves on a graph neural network as the base model. First, on top of "syllable-syllable" and "syllable-document" modeling, it fuses document features and uses a binary classification model to dynamically construct "document-document" edges, so that the global features of short texts are fully exploited. A sliding window is added to reduce the model's computational complexity, and the optimal window size is searched for. Second, to address the syllable sparsity of Tibetan short texts, GraphSAGE is introduced as the base model for the first time, and the performance of different aggregation functions on Tibetan short text classification is compared. Finally, to capture the heterogeneity of relationships between nodes, the features of neighbor nodes are weighted before average pooling, which strengthens the model's feature extraction. On the TNCC title dataset, the model reaches a classification accuracy of 62.50%, outperforming the traditional GCN, the original GraphSAGE, and the pre-trained language model CINO by 2.56%, 1%, and 2.4%, respectively.
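The two graph operations named in the abstract (sliding-window edge construction and weighted average pooling over neighbors) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the weighting scheme, the window size, or the edge definition, so the function names `window_edges` and `weighted_mean_aggregate`, the co-occurrence counting, and the example weights below are all assumptions chosen for illustration. The learned linear layer and nonlinearity that normally follow a GraphSAGE aggregation step are omitted.

```python
from collections import defaultdict


def window_edges(syllables, window=3):
    """Count how often two distinct syllables share a sliding window.

    Assumption: co-occurrence counts stand in for "syllable-syllable"
    edge weights; the paper's actual edge weighting is not stated.
    """
    counts = defaultdict(int)
    # A text shorter than the window is treated as a single window.
    for start in range(max(1, len(syllables) - window + 1)):
        span = syllables[start:start + window]
        for i in range(len(span)):
            for j in range(i + 1, len(span)):
                if span[i] != span[j]:
                    pair = tuple(sorted((span[i], span[j])))
                    counts[pair] += 1
    return dict(counts)


def weighted_mean_aggregate(features, neighbors, weights, node):
    """Weight each neighbor's feature vector, then average.

    Returns the node's own features concatenated with the weighted
    mean of its neighbors' features (the usual GraphSAGE input to the
    learned layer, which is left out here).
    """
    dim = len(features[node])
    total = [0.0] * dim
    weight_sum = 0.0
    for nb in neighbors[node]:
        w = weights[(node, nb)]  # hypothetical per-edge weight
        weight_sum += w
        for k in range(dim):
            total[k] += w * features[nb][k]
    if weight_sum == 0.0:
        pooled = [0.0] * dim  # isolated node: no neighbor signal
    else:
        pooled = [value / weight_sum for value in total]
    return list(features[node]) + pooled


# Placeholder tokens stand in for Tibetan syllables.
edges = window_edges(["a", "b", "c", "d"], window=3)

features = {"doc": [1.0, 0.0], "x": [2.0, 0.0], "y": [0.0, 4.0]}
neighbors = {"doc": ["x", "y"]}
weights = {("doc", "x"): 3.0, ("doc", "y"): 1.0}
combined = weighted_mean_aggregate(features, neighbors, weights, "doc")
```

A smaller window produces fewer syllable pairs and therefore a sparser graph, which is the usual reason a window size is tuned rather than fixed.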
CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]