Chinese Word Segmentation Method Based on a Bidirectional Long Short-Term Memory Model  Cited by: 12


Authors: ZHANG Hong-gang (张洪刚)[1], LI Huan (李焕)[1]

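Affiliation: [1] School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China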

Source: Journal of South China University of Technology (Natural Science Edition), 2017, No. 3, pp. 61-67 (7 pages)

Funding: Supported by the Young Scientists Fund of the National Natural Science Foundation of China (61601042)

Abstract: Chinese word segmentation is one of the key foundational technologies of Chinese natural language processing. Conventional segmentation algorithms rely on feature engineering, and verifying the effectiveness of the features requires a great deal of work. The rise of neural-network-based deep learning makes it possible for a model to learn features automatically. This paper studies Chinese word segmentation using a bidirectional long short-term memory (BLSTM) neural network. Semantic embedding vectors of Chinese characters are first learned from a large-scale corpus, and these character vectors are then fed to a BLSTM model to perform segmentation. Experiments were carried out on simplified Chinese datasets (PKU, MSRA, CTB) and a traditional Chinese dataset (HKCityU). The results show that, without relying on feature engineering, the BLSTM-based method still achieves good segmentation performance.
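Neural segmenters of this kind are commonly framed as per-character sequence labelling: the network assigns each character a position tag, and words are recovered from the tag sequence. The abstract does not state which tag scheme the authors use, so the B/M/E/S scheme below (begin, middle, end, single) is an assumption. The function names `words_to_tags` and `tags_to_words` are hypothetical and only illustrate the conversion between segmented text and per-character labels, not the BLSTM itself.

```python
def words_to_tags(words):
    """Convert a segmented sentence (list of words) into per-character B/M/E/S tags.

    The B/M/E/S scheme is an assumed convention, not taken from the paper.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first char = B, last char = E, everything between = M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted B/M/E/S tags."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            # A new word starts while one is still open: flush it.
            # This tolerates malformed tag sequences from a model.
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Sentence ended inside an unfinished word.
        words.append(current)
    return words


gold = ["中文", "分词", "是", "基础", "技术"]
tags = words_to_tags(gold)
restored = tags_to_words("".join(gold), tags)
```

In a setup like this, a tagger (for example a BLSTM over character embeddings) would predict `tags` for a raw sentence, and `tags_to_words` would turn that prediction into the final segmentation.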

Keywords: deep learning; neural network; bidirectional long short-term memory; Chinese word segmentation

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
