Authors: 杨璐 YANG Lu [1,2]; 王一权 WANG Yiquan; 刘佳琦 LIU Jiaqi [1,2]; 段玉林 DUAN Yulin; 张荣辉 ZHANG Ronghui
Affiliations: [1] Tianjin Key Laboratory for Advanced Mechatronic System Design and Intelligent Control, School of Mechanical Engineering, Tianjin University of Technology, Tianjin 300384, China; [2] National Demonstration Center for Experimental Mechanical and Electrical Engineering Education, Tianjin University of Technology, Tianjin 300384, China; [3] Institute of Agricultural Resources and Regional Planning, Chinese Academy of Agricultural Sciences, Beijing 100081, China; [4] Guangdong Provincial Key Laboratory of Intelligent Transport System, Sun Yat-sen University, Guangzhou 510275, China
Source: Journal of Transport Information and Safety (《交通信息与安全》), 2022, No. 1, pp. 144-152 (9 pages)
Funding: Supported by the International Agricultural Science Plan Project of the Chinese Academy of Agricultural Sciences (CAAS-ZDRW202107); the National Natural Science Foundation of China (52172350, 51775565); and the Tianjin Research Innovation Project for Postgraduate Students (2020YJSZXS05).
Abstract: Reinforcement-learning-based driving decision methods suffer from low learning efficiency and non-smooth action changes. To address these issues, an end-to-end decision-making method for autonomous driving is proposed: the Twin Delayed Deep Deterministic Policy Gradient with Discrete actions (TD3WD) algorithm, which fuses networks with different action spaces. An additional Q network that outputs discrete actions is added to the network of the basic Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm to assist exploration during training. The output actions of the TD3 network and the additional Q network are fused by weighting, and the fused action interacts with the environment, so that the environment is explored more thoroughly and exploration efficiency is improved. When the Critic network is updated, the output of the additional network is merged into the target action as noise, which encourages the agent to explore the environment and makes the action-value estimates more accurate. Image features obtained from a pre-trained network, rather than the raw images, are used as the state input to reduce the computational cost of training. The proposed method is validated in autonomous driving scenarios simulated on the Carla platform. The results show that, in the training scenarios, the proposed method learns more efficiently and converges about 30% faster than baseline algorithms such as TD3 and Deep Deterministic Policy Gradient (DDPG); in the test scenarios, the converged policy performs better, with the average lane-crossing rate and the variation of the steering-wheel angle reduced by 74.4% and 56.4%, respectively.
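The weighted action fusion described in the abstract can be illustrated with a short PyTorch sketch. The feature dimension, the discrete steer/throttle candidates, the fusion weight w, and all class and function names below are assumptions made only for illustration; the paper does not publish code, and the abstract does not specify these details.

    # Minimal sketch of the TD3WD action-fusion step (assumed details, not the authors' implementation).
    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """TD3 actor: maps image features to a continuous [steer, throttle] action."""
        def __init__(self, feat_dim, act_dim=2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, 256), nn.ReLU(),
                nn.Linear(256, act_dim), nn.Tanh())
        def forward(self, feat):
            return self.net(feat)

    class DiscreteQ(nn.Module):
        """Additional Q network over a small set of discrete candidate actions."""
        def __init__(self, feat_dim, n_discrete):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, 256), nn.ReLU(),
                nn.Linear(256, n_discrete))
        def forward(self, feat):
            return self.net(feat)

    # Hypothetical coarse grid of (steer, throttle) pairs for the discrete branch.
    DISCRETE_ACTIONS = torch.tensor([
        [-0.5, 0.5], [0.0, 0.5], [0.5, 0.5],
        [-0.5, 1.0], [0.0, 1.0], [0.5, 1.0]])

    def fused_action(actor, q_net, feat, w=0.7):
        """Weighted fusion of the continuous TD3 action and the greedy discrete
        action; the fused action is what interacts with the simulator."""
        with torch.no_grad():
            a_cont = actor(feat)                      # continuous action from TD3 actor
            idx = q_net(feat).argmax(dim=-1)          # greedy index from the additional Q network
            a_disc = DISCRETE_ACTIONS[idx]            # its (steer, throttle) pair
            return w * a_cont + (1.0 - w) * a_disc    # weighted fusion (w is an assumed value)

    # Example usage with random image features (feat_dim = 512 assumed):
    feat = torch.randn(1, 512)
    actor, q_net = Actor(512), DiscreteQ(512, len(DISCRETE_ACTIONS))
    print(fused_action(actor, q_net, feat))

The same greedy discrete action could also be blended into the target action as noise during the Critic update, which is the second use of the additional network that the abstract describes; the weighting scheme above is only one plausible reading of "weighted fusion".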
Classification Codes: U463.6 [Mechanical Engineering - Vehicle Engineering]; TP181 [Transportation Engineering - Vehicle Operation Engineering]