Automated audio captioning system based on contrastive learning and transfer learning


Authors: PAN Chaofan; TONG Xiao; PENG Tao; LI Shengchen; ZHU Chenyang; SHAO Xi[1] (School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China; School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou 215123, Jiangsu, China)

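Affiliations: [1] School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003 [2] School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou 215123, Jiangsu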

Source: Intelligent Computer and Applications, 2025, No. 3, pp. 1-6 (6 pages)

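Funding: National Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0106200); National Natural Science Foundation of China (61936005, 62001038); Gusu Leading Talent Program, Young Talent Innovation Project (ZXL2022472).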

Abstract: Automated audio captioning is a cross-modal translation task that aims to describe the content of an audio clip in natural language. The task has received widespread attention both in China and abroad in recent years. Existing automated audio captioning systems are typically built on an encoder-decoder structure, and data scarcity remains a major challenge in training them. To address this issue, this paper proposes a new model architecture called the pre-encoder-encoder-decoder model. In the pre-encoder stage, contrastive learning is used to extract self-supervised signals from raw audio and paired text data, while transfer learning is employed to accelerate training and to provide initialization parameters for the encoder. Experimental results on the Clotho dataset show that the proposed system significantly outperforms the baseline system.
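The abstract does not state which contrastive objective the pre-encoder uses. As an illustration only, the sketch below shows a symmetric InfoNCE loss, a common choice for pulling paired audio and text embeddings together while pushing mismatched pairs apart. The function names, the temperature value of 0.07, and the plain-list embedding format are all assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    Assumption: audio_emb[i] and text_emb[i] describe the same clip, and
    every other pairing in the batch is treated as a negative.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    text = [l2_normalize(t) for t in text_emb]
    # logits[i][j] = cosine similarity of audio i and caption j, scaled by temperature
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        for a in audio
    ]
    # Transpose to score the caption-to-audio direction as well
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    # Toy 2-D embeddings: matched pairs should give a much lower loss
    audio = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(audio, matched_text))
    print(symmetric_contrastive_loss(audio, swapped_text))
```

In a real system the embeddings would come from trainable audio and text encoders, and the audio encoder weights learned this way could then serve as the initialization that the abstract attributes to transfer learning.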

Keywords: automated audio captioning; cross-modal translation; contrastive learning; transfer learning; audio clip

Classification code: TP391 [Automation and Computer Technology - Computer Application Technology]

 
