Domain Classification in Spoken Dialogue Texts Based on Word Embedded Extension


Authors: Yang Mengmeng [1], Huang Hao [1]

Affiliation: [1] College of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046, China

Source: Journal of Xinjiang University (Natural Science Edition), 2016, No. 2, pp. 209-214, 220 (7 pages)

Funding: National Natural Science Foundation of China (61365005; 60965002)

Abstract: Traditional domain classification methods for spoken dialogue systems, such as SVM, require large amounts of manual annotation. To address this, the LDA (Latent Dirichlet Allocation) model is applied to domain classification in spoken dialogue systems. Because spoken dialogue utterances are short, contain few words, and yield sparse data, a domain classification method based on word-embedding text expansion is proposed on top of the LDA model. The method has two main features: 1) the word embedding method word2vec is used to semantically expand spoken dialogue text obtained from speech recognition, which resembles short text, converting it into longer text so that the LDA topic model can estimate its latent topics more effectively; 2) the unsupervised probabilistic generative model LDA is used to model and classify the expanded text, reducing the cost of manual annotation. Experimental results show that, compared with applying LDA directly, word2vec text expansion with a suitably chosen expansion length improves average accuracy, average recall, and average F1 by 26.1%, 25.5%, and 27.2% respectively (the journal's English abstract gives 25.5% for F1), and that the method shows a degree of robustness.
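The expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors, the function names, and the choice of two neighbors per word are all assumptions made for illustration. In practice the vectors would come from a trained word2vec model, and the expanded token lists would then be passed to an LDA implementation.

```python
import math

# Hypothetical toy embeddings standing in for vectors learned by word2vec.
TOY_VECTORS = {
    "flight": [0.9, 0.1, 0.0],
    "plane": [0.85, 0.15, 0.05],
    "ticket": [0.7, 0.3, 0.1],
    "weather": [0.1, 0.9, 0.1],
    "rain": [0.05, 0.95, 0.2],
    "song": [0.0, 0.1, 0.95],
    "music": [0.05, 0.05, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(word, vectors, top_n):
    """Return the top_n most similar vocabulary words, excluding the word itself."""
    if word not in vectors:
        return []  # out-of-vocabulary words cannot be expanded
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_n]]


def expand_utterance(tokens, vectors, top_n=2):
    """Lengthen a short utterance by appending each token's nearest neighbors."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(nearest_words(token, vectors, top_n))
    return expanded
```

With these toy vectors, `expand_utterance(["flight"], TOY_VECTORS)` returns `["flight", "plane", "ticket"]`. The `top_n` parameter controls how much each utterance grows, which corresponds loosely to the expansion length the abstract says must be chosen appropriately.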

Keywords: spoken dialogue system; spoken language understanding; latent Dirichlet allocation; topic model; text expansion

Classification number: TP302.7 [Automation and Computer Technology: Computer System Architecture]

 
