Affiliations: [1] State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; [2] North Open-Pit Coal Mine, Inner Mongolia Power Investment Energy Co., Ltd. (内蒙古电投能源股份有限公司), Tongliao 029200, Inner Mongolia, China
Source: Transport Research (《交通运输研究》), 2023, Issue 1, pp. 31-39, 85 (10 pages in total)
Funding: Key-Area Research and Development Program of Guangdong Province (2020B0909050001); National Natural Science Foundation of China (U1909204).
Abstract: In autonomous driving decision-making scenarios, to address the poor safety and low learning efficiency of reinforcement learning algorithms, a method is proposed that adds value-based safety constraints and virtual rewards during the training phase of the algorithm. First, the state-action value function and safety judgment rules are used to impose a value-based safety constraint on the agent's actions, so that an action that is both high-value and safe is selected. Second, predicted trajectory data containing virtual rewards is added to the replay buffer, supplementing the trial-and-error action information and the corresponding state and reward information that the constraint prevents the agent from collecting. Finally, to conduct experiments on speed-change and lane-change decisions, a 3-lane highway scenario was built on a modified version of the highway-env simulation environment, and, on top of the Deep Q Network (DQN) algorithm, three variants were trained and tested: without safety constraints, with rule-based safety constraints, and with value-based safety constraints. The results show that, when all 5 actions (accelerate, decelerate, keep speed and lane, change to the left lane, change to the right lane) are considered, the value-based safety-constrained algorithm achieves a success rate more than 3 times that of the unconstrained algorithm and a 28% higher average return; when only the 3 lane-related actions (change to the left lane, change to the right lane, keep lane) are considered, its success rate is 0.11 higher and its average return 6% higher than those of the rule-based safety-constrained algorithm; when both use the value-based safety constraint, the 5-action algorithm has a 0.06 lower success rate but a 0.26 m/s higher average driving speed than the 3-action algorithm, i.e., the former achieves a balance between safety and speed. These results indicate that the value-based safety constraint improves the safety and training efficiency of reinforcement learning more than the rule-based one, and that an action space containing more decision actions enables more skilled driving and keeps the algorithm from becoming overly conservative.
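The abstract describes two mechanisms applied at training time: (1) among the actions ranked by the DQN's Q-values, only those passing a safety judgment are eligible for execution, and (2) for actions blocked by the constraint, predicted transitions with a virtual (penalty) reward are written into the replay buffer so the blocked trial-and-error information is not lost. The sketch below illustrates this idea only; the paper's exact safety rules, prediction model, and virtual-reward values are not given here, so `is_safe`, `predict_next`, and `VIRTUAL_PENALTY` are assumed placeholders rather than the authors' implementation.

```python
import numpy as np
from collections import deque

replay_buffer = deque(maxlen=10_000)   # experience replay pool shared with DQN training
VIRTUAL_PENALTY = -1.0                 # assumed virtual reward assigned to a blocked unsafe action


def safe_action_selection(q_values, state, is_safe, predict_next):
    """Value-based safety constraint (illustrative sketch).

    q_values     : 1-D array of Q(s, a) for all discrete actions, from the DQN
    state        : current observation
    is_safe      : predicate is_safe(state, action) -> bool   (assumed safety judgment rule)
    predict_next : model predict_next(state, action) -> next_state   (assumed one-step predictor)

    Picks the highest-value action that the safety judgment accepts; for each
    blocked unsafe action, pushes a predicted transition with a virtual reward
    into the replay buffer so the agent still receives the missing feedback.
    """
    order = np.argsort(q_values)[::-1]                 # actions sorted by value, best first
    safe_actions = [int(a) for a in order if is_safe(state, int(a))]
    chosen = safe_actions[0] if safe_actions else int(order[0])

    # Virtual-reward trajectories for the actions the constraint filtered out.
    for a in map(int, order):
        if a != chosen and not is_safe(state, a):
            next_state = predict_next(state, a)
            replay_buffer.append((state, a, VIRTUAL_PENALTY, next_state, True))
    return chosen


# Toy usage with a dummy Q-vector and a trivial safety rule (both hypothetical):
state = np.zeros(4)
q = np.random.randn(5)                                  # 5 actions: accelerate / decelerate / idle / left / right
action = safe_action_selection(
    q, state,
    is_safe=lambda s, a: a not in (3, 4),               # e.g. forbid lane changes in this state
    predict_next=lambda s, a: s,                        # dummy predictor; a learned model would go here
)
```

In this reading, the rule-based baseline would use only the `is_safe` mask, while the value-based variant combines the mask with the Q-value ranking and the virtual-reward replay entries.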
Keywords: deep reinforcement learning; autonomous driving; decision-making; safety constraint; training efficiency
Classification: U495 [Transportation Engineering—Transportation Planning and Management]; TP181 [Transportation Engineering—Road and Railway Engineering]