Self-supervised speech representation learning based on positive-sample contrast and masked reconstruction    (Cited by: 1)


Authors: ZHANG Wenlin [1]; LIU Xuepeng; NIU Tong [1]; CHEN Qi [1]; QU Dan [1] (College of Information System Engineering, Information Engineering University, Zhengzhou 450001, China)

Affiliation: [1] College of Information System Engineering, Information Engineering University, Zhengzhou 450001, Henan, China

Source: Journal on Communications, 2022, No. 7, pp. 163-171 (9 pages)

Funding: National Natural Science Foundation of China (No. 61673395, No. 62171470).

Abstract: Existing contrastive-prediction-based self-supervised speech representation learning methods must construct large numbers of negative samples, and their performance depends on large training batches, which consumes substantial computing resources. To address this problem, a speech contrastive learning method that uses only positive samples was proposed and combined with a masked reconstruction task, yielding a multi-task self-supervised speech representation learning method that improves representation quality while lowering training cost. Borrowing the idea of the SimSiam method from image self-supervised representation learning, the positive-pair contrastive task adopts a siamese network architecture: two random augmentations of the raw speech signal are processed by the same encoder, a feed-forward (predictor) network is applied to one branch, a stop-gradient operation is applied to the other, and the model parameters are adjusted to maximize the similarity of the two branches' outputs. Since no negative samples need to be constructed, training can use small batches, greatly improving learning efficiency. Representation models were pre-trained on the LibriSpeech corpus and fine-tuned on a variety of downstream tasks; comparative experiments show that the proposed model matches or exceeds the performance of existing mainstream speech representation learning models on multiple tasks.
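The record gives no implementation details, but the positive-pair objective it describes follows the published SimSiam recipe (Chen & He, 2021). The PyTorch sketch below is a minimal illustration under assumed placeholders: the class name PositivePairModel, the feed-forward encoder, and all layer sizes are illustrative, not the paper's actual architecture. It shows the core mechanics named in the abstract: a shared encoder over two augmented views, a predictor on one branch, stop-gradient on the other, and a loss that maximizes the cosine similarity of the two outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositivePairModel(nn.Module):
    """SimSiam-style positive-pair model (illustrative sketch only)."""

    def __init__(self, feat_dim=80, emb_dim=256):
        super().__init__()
        # Shared encoder applied to both augmented views. A small
        # feed-forward stack stands in for the paper's speech encoder.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )
        # Predictor ("feed-forward network" in the abstract), applied to
        # one branch at a time; its bottleneck width is an assumption.
        self.predictor = nn.Sequential(
            nn.Linear(emb_dim, emb_dim // 2), nn.ReLU(),
            nn.Linear(emb_dim // 2, emb_dim),
        )

    def forward(self, view1, view2):
        z1, z2 = self.encoder(view1), self.encoder(view2)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Stop-gradient (detach) on the target branch is what prevents
        # representational collapse without any negative samples.
        loss = -0.5 * (
            F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
            + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()
        )
        return loss

# Example: one training step on random stand-in "augmented views".
model = PositivePairModel()
v1, v2 = torch.randn(8, 80), torch.randn(8, 80)
loss = model(v1, v2)
loss.backward()
```

In the paper's multi-task setup this contrastive term would be combined with a masked-reconstruction loss; the reconstruction target and the weighting between the two objectives are not specified in this record.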

Keywords: speech representation; self-supervised learning; unsupervised learning; siamese network

Classification: TN912.34 (Electronics and Telecommunications: Communication and Information Systems)

 
