Authors: Caiwei Yang, Xinting Yang, Kaijie Zhu, Chao Zhou
Affiliations: [1] National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China; [2] Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; [3] National Engineering Laboratory for Agri-product Quality Traceability, Beijing 100097, China; [4] College of Computer and Information Engineering, Tianjin Agricultural University, Tianjin 300384, China
Source: Journal of Beijing Institute of Technology (English Edition), 2023, No. 3, pp. 285-297 (13 pages)
Funding: Supported by the Beijing Natural Science Foundation (No. 6212007), the National Key Technology R&D Program of China (No. 2022YFD2001701), and the Youth Research Fund of Beijing Academy of Agricultural and Forestry Sciences (No. QNJJ202014).
Abstract: Real-time analysis of fish feeding behavior is the premise of, and the key to, accurate feeding guidance. Identifying fish behavior from a single source of information is susceptible to various interfering factors. To overcome this problem, this paper proposes adaptive deep modular co-attention unified multi-modal transformers (DMCA-UMT). By fusing video, audio, and water quality parameters, the whole process of fish feeding behavior can be identified. First, features are extracted from the input video, audio, and water quality parameter information to obtain feature vectors for the different modalities. Second, deep modular co-attention (DMCA) is introduced on top of the original cross-modal encoder, and adaptive learnable weights are added, so that the joint video-audio feature vector is learned automatically according to each modality's contribution to the fusion. Finally, the fused visual-audio information and the text features are used to generate clip-level moment queries. The query decoder decodes the input features, and a prediction head produces the final joint moment retrieval result, namely the start and end times of fish feeding. The results show that the mAP Avg of the proposed algorithm reaches 75.3%, which is 37.8% higher than that of the unified multi-modal transformers (UMT) algorithm.
Keywords: aquaculture; multi-modal fusion; deep modular co-attention (DMCA); unified multimodal transformers (UMT); video moment retrieval
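
The abstract mentions adaptive learnable weights that set how much the video and audio features each contribute to the joint representation, but it does not give the formula. The following is only a minimal illustrative sketch of one common way such a fusion can be written: a softmax over two scalar logits followed by a weighted sum. The function names `softmax` and `fuse`, the parameter `weight_logits`, and the softmax normalization itself are assumptions made for illustration, not the authors' DMCA-UMT implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(video_feat, audio_feat, weight_logits):
    """Weighted sum of two equal-length feature vectors.

    weight_logits is a hypothetical pair of scalars standing in for the
    learnable weights; in a real model they would be updated by gradient
    descent during training (not shown here).
    """
    if len(video_feat) != len(audio_feat):
        raise ValueError("feature vectors must have the same length")
    # Normalize so the two modality weights are positive and sum to 1.
    w_video, w_audio = softmax(weight_logits)
    return [w_video * v + w_audio * a for v, a in zip(video_feat, audio_feat)]


if __name__ == "__main__":
    # Equal logits give equal weights, so the result is the plain average.
    print(fuse([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))
```

With equal logits the two modalities are averaged; raising one logit shifts the fused vector toward that modality, which is the sense in which the weighting is adaptive under these assumptions.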