Low-resource language text classification based on morpheme segmentation

Authors: Sardar Parhat; Mutallip Sattar; Alimjan Yasin; Abdurahman Kadir (School of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, China)

Affiliation: [1] School of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, Xinjiang, China

Source: Computer Engineering and Design, 2025, No. 2, pp. 530-536 (7 pages)

Funding: National Natural Science Foundation of China (61662073, 62241208); National Social Science Fund of China (23XMZ060); Xinjiang University of Finance and Economics research fund projects (2022XGC022, 2022XGC049).

Abstract: To address the problems of a sharply growing feature-space dimension and low feature-extraction efficiency in text classification for derivational low-resource languages such as Uyghur, Kazakh and Kirghiz, a Bi-LSTM_CRF-based morpheme segmentation method and a Bi-LSTM_Attention-based text classification method are proposed. Morpheme segmentation and stemming are performed on the experimental texts to effectively reduce the feature-space dimension, and BERT embedding vectors are used to better preserve the semantic information of the text. A text classification model is built by combining Bi-LSTM with an attention mechanism, which effectively captures long-distance dependency features between text stems and thereby improves Uyghur, Kazakh and Kirghiz text classification, yielding classification accuracies of 96.68%, 96.72% and 96.54%, respectively. Experimental results show that efficient morpheme segmentation and stem-embedding vector representation can improve text classification performance for low-resource languages such as Uyghur, Kazakh and Kirghiz.
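The abstract outlines the classification side of the method: a Bi-LSTM combined with an attention mechanism reading BERT-based stem embedding vectors. The paper's own code is not reproduced here; the following is a minimal PyTorch sketch of such a classifier, in which the embedding dimension, hidden size, class count, and the assumption that morpheme segmentation, stemming and BERT embedding happen upstream are illustrative choices, not details taken from the paper.

# Minimal sketch (not the authors' implementation): a Bi-LSTM + attention
# classifier over pre-computed stem embedding vectors. Morpheme segmentation,
# stemming and BERT embedding are assumed to be done upstream.
import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=256, num_classes=10):
        super().__init__()
        # Bi-LSTM reads the stem-embedding sequence in both directions.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Additive attention: one score per stem position.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, stem_embeddings):
        # stem_embeddings: (batch, seq_len, embed_dim), one vector per stem
        outputs, _ = self.bilstm(stem_embeddings)           # (batch, seq_len, 2*hidden_dim)
        weights = torch.softmax(self.attn(outputs), dim=1)  # (batch, seq_len, 1)
        context = (weights * outputs).sum(dim=1)            # (batch, 2*hidden_dim)
        return self.classifier(context)                     # (batch, num_classes)

In this sketch the attention weights form a distribution over stem positions, and the weighted sum of Bi-LSTM states is passed to a linear layer; the model would be trained with a standard cross-entropy loss over the class labels.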

Keywords: Uyghur-Kazakh-Kirghiz languages; morpheme segmentation; stemming; stem embedding vector; feature representation; neural network; text classification

Classification code: TP391 [Automation and Computer Technology - Computer Application Technology]

 
