Visual Automatic Localization Method Based on Multi-level Video Transformer

Authors: ZOU Qiping [1]; LI Botao; CHEN Saian; GUO Xi; ZHANG Taohong [2,3]

Affiliations: [1] Key Laboratory of AI and Information Processing (Hechi University), Education Department of Guangxi Zhuang Autonomous Region, Hechi 546300, China; [2] School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China; [3] Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing 100083, China

Source: Advanced Engineering Sciences (《工程科学与技术》), 2024, No. 6, pp. 34-43 (10 pages)

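Funding: Science and Technology Innovation 2030 Major Project of the Ministry of Science and Technology (2020AAA0108703); Fund of the Key Laboratory of AI and Information Processing (Hechi University) (2022GXZDSY007).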

Abstract: In automated industrial production lines, the detection of equipment anomalies directly determines machining quality, and a vision system consisting of a robotic arm with an industrial camera mounted at its end can monitor such anomalies effectively. In this work, a six-axis robotic arm carrying an industrial camera images the workpiece surface and captures a video sequence that goes from blurry to sharp and back to blurry. The sharpest frame is selected as the distance reference for automated machining steps that require focus, so that focus anomalies can be corrected and automatic localization achieved. A video classification model based on a multi-level video Transformer, the multi-level video Transformer (MLVT), is proposed for video representation learning at a high semantic level and is used to select the sharpest frame in the sequence. First, a token partitioning method with multiple receptive fields, multi-level tokens (MLT), is proposed. It partitions the raw video data into token sequences at four levels (2D image patches, 3D image patches, frames, and clips), which are fed, after positional encoding is added, into the multi-level encoder (MLE) for attention computation. To mitigate the computational cost and slow convergence caused by the multi-level tokens, MLE introduces a layer-wise deformable attention mechanism (LWLA), which replaces global attention with a learnable scheme for computing feature similarity. Finally, the three versions of the model achieve classification accuracies of 87.2%, 88.6%, and 88.9%, respectively, on the video dataset of this work. In experimental comparisons with mainstream video Transformers of the same parameter scale, they all show the best performance. The method effectively completes the task of selecting the sharpest frame from a video sequence and provides a strong guarantee for the performance of downstream vision tasks.

Objective This study investigates the application of a six-axis robotic arm equipped with a high-resolution industrial camera to capture precise images of workpiece surfaces. The setup is designed to acquire a video sequence in which image clarity changes from blurry to optimally clear and then back to blurry. The primary goal is to select the clearest frame from this sequence, which is critical in determining the precise focusing distance required for automated machining processes. The industrial camera is mounted on the robotic arm, which controls the camera's downward trajectory so that images of varying quality are captured. As the camera descends, it records the shifting focus on the workpiece surface, from out-of-focus (blurry) to in-focus (clear) and back to out-of-focus. This fluctuation matters because blurry images can significantly impair the performance of subsequent tasks, particularly those involving the deep learning-based intelligent recognition systems used in modern manufacturing. Blurry images may result in inaccurate feature recognition, adversely affecting the quality and precision of automated operations. An effective and precise video processing methodology is used to address these challenges. This approach incorporates advanced image processing techniques to analyze the video sequences captured by the industrial camera. Sophisticated algorithms enable the system to identify the frame with optimal clarity and sharpness. This frame serves as critical feedback for adjusting the robotic arm, ensuring that the camera aligns precisely with the position where the focal length is accurately calibrated to the workpiece's surface. This process guarantees the high quality of captured images and boosts the overall efficiency of the machining process. The system significantly reduces human error and enhances the consistency of output, which is crucial in high-precision manufacturing environments, by automating the focus adjustment based on the clearest
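The four-level token partition described in the abstract (2D patches, 3D patches, frames, clips) can be illustrated with a minimal NumPy sketch. This is not the authors' MLT implementation: the function name `multi_level_tokens` and the sizes `patch`, `tube`, and `clip` are hypothetical, and the learned projection, positional encoding, and attention stages are omitted. It only shows how one video tensor yields four token sequences with different receptive fields.

```python
import numpy as np

def multi_level_tokens(video, patch=16, tube=2, clip=4):
    """Partition a video of shape (T, H, W, C) into four token sequences.

    patch, tube, and clip are illustrative sizes (assumptions), not values
    taken from the paper. Each token is a flattened block of raw pixels.
    """
    T, H, W, C = video.shape
    assert H % patch == 0 and W % patch == 0, "frame size must divide by patch"
    assert T % tube == 0 and T % clip == 0, "frame count must divide by tube and clip"
    hp, wp = H // patch, W // patch

    # Level 1, 2D patches: one token per (frame, patch row, patch column).
    p2d = video.reshape(T, hp, patch, wp, patch, C)
    p2d = p2d.transpose(0, 1, 3, 2, 4, 5).reshape(T * hp * wp, -1)

    # Level 2, 3D patches: each token spans `tube` consecutive frames.
    p3d = video.reshape(T // tube, tube, hp, patch, wp, patch, C)
    p3d = p3d.transpose(0, 2, 4, 1, 3, 5, 6).reshape((T // tube) * hp * wp, -1)

    # Level 3, frames: one token per whole frame.
    frames = video.reshape(T, -1)

    # Level 4, clips: one token per group of `clip` consecutive frames.
    clips = video.reshape(T // clip, -1)

    return {"patch2d": p2d, "patch3d": p3d, "frame": frames, "clip": clips}


if __name__ == "__main__":
    demo = np.zeros((8, 32, 32, 3), dtype=np.float32)
    for level, tokens in multi_level_tokens(demo).items():
        print(level, tokens.shape)
```

For an 8-frame 32×32 RGB input with these sizes, the four levels contain 32, 16, 8, and 2 tokens respectively, so coarser levels add few tokens while covering much larger spatio-temporal regions.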

Keywords: video Transformer; video classification; visual automatic localization; deformable attention

Classification code: TP391.4 [Automation and Computer Technology: Computer Application Technology]
