An Audio-aided Video Compression Method for Video Conferencing


Authors: XU Shengpeng; QIN Haojun; SONG Xiaodan[1,2]; ZUO Xuguang; GAO Dahua; XIE Xuemei[1]; SHI Guangming[1]

Affiliations: [1] Xidian University, Xi'an 710071, China; [2] Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China; [3] NETINT Technologies, Shanghai 200120, China

Source: Mobile Communications (《移动通信》), 2024, No. 2, pp. 77-82 (6 pages)

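Funding: National Key R&D Program of China, "Research on Semantic Communication System Architecture and Key Technologies for Multimodal Services" (2022YFB2902900); National Natural Science Foundation of China, "Research on Semantics-Based Image Coding Methods" (62101398); Guangzhou Basic and Applied Basic Research Project, "Research on Low-Bandwidth, Decode-as-Understanding Image Coding Technology for Image Understanding Applications" (202201011390); National Natural Science Foundation of China Major Program sub-project, "Theory and Methods of Elastic Encoding and Decoding of Semantic Information" (62293483); Guangzhou Science and Technology Program, Basic Research Program, "Guangzhou Key Laboratory of Scene Understanding and Intelligent Interaction" (20220100001).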

Abstract: During video communication, bandwidth is often limited by network fluctuations or harsh environments, and the user experience relies heavily on the compression efficiency of video and audio. Although video compression efficiency has improved significantly, reconstructed video still suffers from severe distortion, blurring, or blocking artifacts at low bitrates. The video and audio in video conferencing are usually compressed separately using traditional coding standards. From a semantic point of view, however, audio and video are strongly correlated, since both represent what the speaker intends to convey. Compressing them separately is therefore sub-optimal.

To address these problems, and inspired by work on audio-driven talking face generation, an audio-aided video coding framework is proposed. The idea is that the temporal information within the video can be inferred from the audio and can thus be removed from transmission. Specifically, the framework samples the video temporally at the encoder (usually keeping the first frame) and compresses the sampled key frame using an image encoder to provide the necessary textures. At the same time, the input audio is encoded for transmission. At the decoder, the image and audio are reconstructed from the stream. The audio is then decoupled into emotional and textual features. After that, a key point sequence is generated from these features and an offline key point reference, modeling the temporal correlation as the motion of key points. Since a mismatch may exist between the key points of the current video and the offline ones, a linear transform with scale and offset factors is introduced for alignment. Next, the transformed key points are connected to obtain an edge map of the face region. To obtain more realistic background content, the edges in the key frame are extracted, and those within the face region are replaced with the ones from the generated key point sequence. Finally, the reconstructed video is generated from the edge map and the reference image.

Experimental results show that, compared with Versatile Video Coding (VVC), the framework achieves a BD-rate of -89.81% under the DISTS metric.
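The key point alignment step mentioned in the abstract (a linear transform with scale and offset factors) can be illustrated with a small sketch. The abstract does not say how the two factors are estimated, so the per-axis least-squares fit below is an assumption, and the function names `fit_scale_offset` and `align_keypoints` as well as the sample coordinates are hypothetical.

```python
from statistics import fmean


def fit_scale_offset(src, dst):
    """Least-squares fit of dst ~ scale * src + offset along one axis.

    Assumption: the abstract only states that scale and offset factors
    are used; estimating them by least squares is our choice.
    """
    mean_src, mean_dst = fmean(src), fmean(dst)
    var = sum((s - mean_src) ** 2 for s in src)
    if var == 0:
        # Degenerate axis: all source coordinates coincide, so only shift.
        return 1.0, mean_dst - mean_src
    cov = sum((s - mean_src) * (d - mean_dst) for s, d in zip(src, dst))
    scale = cov / var
    return scale, mean_dst - scale * mean_src


def align_keypoints(offline_pts, target_pts):
    """Map offline (x, y) key points into the frame of target_pts."""
    sx, bx = fit_scale_offset([p[0] for p in offline_pts],
                              [p[0] for p in target_pts])
    sy, by = fit_scale_offset([p[1] for p in offline_pts],
                              [p[1] for p in target_pts])
    aligned = [(sx * x + bx, sy * y + by) for x, y in offline_pts]
    return aligned, (sx, sy), (bx, by)


# Hypothetical data: offline reference key points and the same landmarks
# detected in the decoded key frame (an exact 2x scale plus a shift).
offline = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
key_frame = [(10.0, 20.0), (12.0, 20.0), (10.0, 22.0), (12.0, 22.0)]
aligned, scale, offset = align_keypoints(offline, key_frame)
```

In a pipeline like the one described, the fitted factors would presumably be estimated once against the key frame and then applied to every generated key point frame before the points are connected into the face edge map.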

Keywords: multimodal source coding; audio-aided video coding; video conferencing; low bitrate; semantic fidelity

Classification code: TN762 [Electronics and Telecommunications: Circuits and Systems]

 
