Authors: Jiantao Qiu, Yu Wang
Source: Tsinghua Science and Technology, 2023, No. 1, pp. 155-166 (12 pages); Journal of Tsinghua University (Natural Science Edition) (English Edition)
Abstract: Reinforcement Learning (RL) algorithms work well with well-defined rewards, but they fail with sparse or deceptive rewards and require additional exploration strategies. This work introduces a deep exploration method based on an Upper Confidence Bound (UCB) bonus. The proposed method can be plugged into actor-critic algorithms that use a deep neural network as the critic. Building on the regret bound derived under the linear Markov decision process approximation, we use the feature matrix to compute the UCB bonus for deep exploration. The proposed method reduces to count-based exploration in special cases and is suitable for general settings. Our method uses the last d-dimensional feature vector of the critic network and is easy to deploy. We design a simple task, "swim", to demonstrate how the proposed method achieves exploration in sparse/deceptive reward environments. We then perform an empirical evaluation on sparse/deceptive-reward versions of Gym environments and on Ackermann robot control tasks. The evaluation results verify that the proposed algorithm performs effective deep exploration in sparse/deceptive reward tasks.
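The abstract does not give the exact form of the bonus, but under the linear Markov decision process approximation it refers to, a UCB bonus is typically computed from the critic's last-layer feature vector phi(s, a) as beta * sqrt(phi^T Lambda^{-1} phi), where Lambda is the regularized covariance of previously observed features. Below is a minimal sketch of that computation, not the authors' implementation; the names `phi`, `Lambda`, `beta`, and the hyperparameter values are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): UCB exploration bonus computed
# from a d-dimensional feature vector, following the linear-MDP form
#   b(s, a) = beta * sqrt(phi(s, a)^T Lambda^{-1} phi(s, a)).
import numpy as np

d = 8        # dimension of the critic's last-layer feature vector (illustrative)
beta = 1.0   # bonus scale hyperparameter (illustrative)
lam = 1.0    # ridge regularizer (illustrative)

def update_covariance(Lambda, phi):
    """Accumulate phi phi^T into the covariance after each visited (s, a)."""
    return Lambda + np.outer(phi, phi)

def ucb_bonus(Lambda, phi, beta=beta):
    """UCB bonus: large for feature directions that have rarely been visited."""
    return beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))

# Example usage with random features standing in for critic outputs.
rng = np.random.default_rng(0)
Lambda = lam * np.eye(d)
for _ in range(100):
    Lambda = update_covariance(Lambda, rng.normal(size=d))

phi_new = rng.normal(size=d)
print("exploration bonus:", ucb_bonus(Lambda, phi_new))
```

If phi is a one-hot indicator of a discrete state-action pair, then phi^T Lambda^{-1} phi is approximately 1/(N(s, a) + lam), so the bonus reduces to the familiar count-based form beta/sqrt(N(s, a)); this is consistent with the abstract's claim that the method is equivalent to count-based exploration in special cases.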
Keywords: Reinforcement Learning (RL); Neural Network (NN); Upper Confidence Bound (UCB)
Classification Code: TN9 [Electronics and Telecommunications / Information and Communication Engineering]