Self-supervised speech representation learning based on positive-sample contrast and masked reconstruction    (Cited by: 1)


Authors: ZHANG Wenlin [1]; LIU Xuepeng; NIU Tong [1]; CHEN Qi [1]; QU Dan [1] (College of Information System Engineering, Information Engineering University, Zhengzhou 450001, China)

Affiliation: [1] College of Information System Engineering, Information Engineering University, Zhengzhou 450001, Henan, China

Source: Journal on Communications, 2022, No. 7, pp. 163-171 (9 pages)

Funding: National Natural Science Foundation of China (No. 61673395, No. 62171470).

Abstract: Existing contrastive-prediction-based self-supervised speech representation learning methods must construct large numbers of negative samples, and their performance depends on large training batches, which consumes substantial computing resources. To address this problem, a speech contrastive learning method that uses only positive samples was proposed and combined with a masked reconstruction task, yielding a multi-task self-supervised speech representation learning method that improves representation quality while lowering training cost. Borrowing the idea of the SimSiam method from image self-supervised representation learning, the positive-pair contrastive task adopts a siamese network architecture: two random augmentations of the raw speech signal are processed by the same encoder, a feed-forward (predictor) network is applied to one branch, a stop-gradient operation is applied to the other, and the model parameters are adjusted to maximize the similarity of the two branches' outputs. Since no negative samples need to be constructed, training can use small batches, greatly improving learning efficiency. Representation models were pre-trained on the LibriSpeech corpus and fine-tuned on a variety of downstream tasks; comparative experiments show that the proposed model matches or exceeds the performance of existing mainstream speech representation learning models on multiple tasks.
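The record gives no implementation details, but the positive-pair objective it describes follows the published SimSiam recipe (Chen & He, 2021). The PyTorch sketch below is a minimal illustration under assumed placeholders: the class name PositivePairModel, the feed-forward encoder, and all layer sizes are illustrative, not the paper's actual architecture. It shows the core mechanics named in the abstract: a shared encoder over two augmented views, a predictor on one branch, stop-gradient on the other, and a loss that maximizes the cosine similarity of the two outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositivePairModel(nn.Module):
    """SimSiam-style positive-pair model (illustrative sketch only)."""

    def __init__(self, feat_dim=80, emb_dim=256):
        super().__init__()
        # Shared encoder applied to both augmented views. A small
        # feed-forward stack stands in for the paper's speech encoder.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )
        # Predictor ("feed-forward network" in the abstract), applied to
        # one branch at a time; its bottleneck width is an assumption.
        self.predictor = nn.Sequential(
            nn.Linear(emb_dim, emb_dim // 2), nn.ReLU(),
            nn.Linear(emb_dim // 2, emb_dim),
        )

    def forward(self, view1, view2):
        z1, z2 = self.encoder(view1), self.encoder(view2)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Stop-gradient (detach) on the target branch is what prevents
        # representational collapse without any negative samples.
        loss = -0.5 * (
            F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
            + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()
        )
        return loss

# Example: one training step on random stand-in "augmented views".
model = PositivePairModel()
v1, v2 = torch.randn(8, 80), torch.randn(8, 80)
loss = model(v1, v2)
loss.backward()
```

In the paper's multi-task setup this contrastive term would be combined with a masked-reconstruction loss; the reconstruction target and the weighting between the two objectives are not specified in this record.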

Keywords: speech representation; self-supervised learning; unsupervised learning; siamese network

Classification: TN912.34 (Electronics and Telecommunications: Communication and Information Systems)

 
