Modality Fusion Strategy Research Based on Multimodal Video Classification Task

Cited by: 1

Authors: WANG Yifan (王一帆); ZHANG Xuefang (张雪芳) (Wuhan Research Institute of Posts and Telecommunications, Wuhan 430070, China)

Affiliation: [1] Wuhan Research Institute of Posts and Telecommunications, Wuhan 430070

Source: Computer Science (《计算机科学》), 2024, Issue S01, pp. 489-493 (5 pages)

Funding: National Key Research and Development Program of China (2019YFB1803600).

Abstract: Although artificial intelligence techniques have succeeded in many fields, they usually simulate only one human perceptual ability and are therefore limited to processing information from a single modality. Extracting features from multiple modalities and fusing them effectively is important for moving from weak (narrow) AI toward strong (general) AI. Based on an encoder-decoder architecture, this study compares multimodal fusion strategies on a video classification task: early feature fusion of the modality feature encodings, late decision fusion of the per-modality prediction results, and a combination of the two. It also compares two ways for the audio modality to take part in fusion: encoding the audio directly into features, or converting it to text through speech-to-text so that it participates as text. Experiments show that, under the experimental approach of this study, decision fusion of the separate predictions of the text and audio modalities with the prediction from the fused features of the other two modalities further improves classification accuracy. In addition, converting speech into text through automatic speech recognition (ASR) makes fuller use of the semantic information the speech contains.
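The three fusion strategies compared in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the two non-text, non-audio modalities, so `mod_a_feat` and `mod_b_feat` are hypothetical placeholders, and the toy linear classifier, the hand-picked weights, and the equal-weight averaging in `late_fusion` are all assumptions made only to show where feature-level and decision-level fusion happen.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features: Sequence[float],
             weights: Sequence[Sequence[float]]) -> List[float]:
    """Toy linear classifier (assumption): one weight row per class, no bias."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(logits)


def early_fusion(feature_sets: Sequence[Sequence[float]]) -> List[float]:
    """Feature-level fusion: concatenate the per-modality encodings."""
    return [value for features in feature_sets for value in features]


def late_fusion(prob_sets: Sequence[Sequence[float]],
                branch_weights: Optional[Sequence[float]] = None) -> List[float]:
    """Decision-level fusion: weighted average of per-branch probabilities."""
    if branch_weights is None:
        branch_weights = [1.0] * len(prob_sets)  # assumption: equal weights
    total = sum(branch_weights)
    n_classes = len(prob_sets[0])
    return [
        sum(w * probs[c] for w, probs in zip(branch_weights, prob_sets)) / total
        for c in range(n_classes)
    ]


# Hypothetical 2-dimensional encodings for four modalities, two classes.
text_feat = [0.9, 0.1]
audio_feat = [0.2, 0.4]
mod_a_feat = [0.7, 0.3]   # placeholder: modality not named in the abstract
mod_b_feat = [0.6, 0.5]   # placeholder: modality not named in the abstract

# Hand-picked illustrative weights, not trained parameters.
W2 = [[1.0, -1.0], [-1.0, 1.0]]
W4 = [[1.0, -1.0, 1.0, -1.0], [-1.0, 1.0, -1.0, 1.0]]

# Single-modality predictions for text and audio.
text_probs = classify(text_feat, W2)
audio_probs = classify(audio_feat, W2)

# Early fusion: one prediction from the concatenated features.
fused_probs = classify(early_fusion([mod_a_feat, mod_b_feat]), W4)

# Combined strategy: decision fusion over the three prediction branches.
hybrid_probs = late_fusion([text_probs, audio_probs, fused_probs])
predicted_class = max(range(len(hybrid_probs)), key=hybrid_probs.__getitem__)
```

In this sketch the combined strategy is simply `late_fusion` applied to two single-modality branches plus one early-fused branch; how the paper actually weights or learns the decision fusion is not stated in the abstract.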

Keywords: multimodal; modality fusion; speech recognition; video classification

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
