Authors: ZHANG Wenlin [1]; LIU Xuepeng; NIU Tong [1]; CHEN Qi [1]; QU Dan [1] (College of Information System Engineering, Information Engineering University, Zhengzhou 450001, China)
Affiliation: [1] College of Information System Engineering, Information Engineering University, Zhengzhou 450001, Henan, China
Source: Journal on Communications, 2022, Issue 7, pp. 163-171 (9 pages)
Funding: National Natural Science Foundation of China (No. 61673395, No. 62171470)
Abstract: Existing contrastive-prediction-based self-supervised speech representation learning methods must construct a large number of negative samples, so their performance depends on large training batches and consumes substantial computing resources. To address this, a speech representation learning method based on contrastive learning using only positive samples was proposed, and it was combined with a masked reconstruction task to form a multi-task self-supervised speech representation learning method that improves representation quality while reducing training cost. The positive-only contrastive task draws on the SimSiam method from image self-supervised representation learning: using a siamese network architecture, two random augmentations of the input speech signal are processed by the same encoder; a feed-forward (predictor) network is then applied to one branch while a stop-gradient operation is applied to the other, and the model parameters are adjusted to maximize the similarity between the two branch outputs. Because no negative samples need to be constructed during training, small batch sizes can be used, substantially improving training efficiency. Self-supervised representation learning was performed on the LibriSpeech corpus, and the resulting model was fine-tuned on a variety of downstream tasks. Comparative experiments show that the model obtained by the proposed method matches or exceeds the performance of existing mainstream speech representation learning models on multiple tasks.
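To make the positive-only objective concrete, the sketch below shows the SimSiam-style training step the abstract describes: two augmented views, a shared encoder, a predictor on one branch, and a stop-gradient on the other, with negative cosine similarity as the loss. This is a minimal illustration under assumed shapes and module choices; the encoder, predictor sizes, and augmentation pipeline here are placeholders, not the authors' actual architecture, and the paper's masked reconstruction task is omitted.

```python
# Minimal sketch of positive-only contrastive learning for speech,
# in the spirit of SimSiam. All module names, dimensions, and the toy
# encoder are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseSpeechModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Placeholder speech encoder over raw waveforms; a single 1-D conv
        # stack with global pooling stands in for the paper's encoder.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        # The "feed-forward network" (predictor) applied to one branch only.
        self.predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, view1, view2):
        z1, z2 = self.encoder(view1), self.encoder(view2)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Stop-gradient (detach) on the target branch is what lets training
        # proceed without negative samples and without collapse; the loss is
        # applied symmetrically across the two views.
        loss = -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
                 + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2
        return loss

# Usage: in practice view1/view2 would be two random augmentations of the
# same waveform batch; random tensors stand in for them here.
model = SiameseSpeechModel()
wav_a = torch.randn(4, 1, 16000)  # stand-in for augmented view 1
wav_b = torch.randn(4, 1, 16000)  # stand-in for augmented view 2
loss = model(wav_a, wav_b)
loss.backward()
```

Because the loss involves only positive pairs, it has no batch-size-dependent negative-sampling term, which is why small batches suffice, unlike InfoNCE-style contrastive objectives.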
CLC Number: TN912.34 [Electronics and Telecommunications - Communication and Information Systems]