Topic: Quantile-Based Deep Reinforcement Learning using Two-Timescale Policy Gradient Algorithms
Speaker: Yijie Peng (彭一杰), Associate Professor and Doctoral Supervisor
Time: January 18, 2024, 10:00–11:30
Venue: Meeting Room 110, Chongren Building (崇仁楼)
Abstract: Classical reinforcement learning (RL) aims to optimize the expected cumulative reward. In this work, we consider the RL setting where the goal is to optimize the quantile of the cumulative reward. We parameterize the policy controlling actions by neural networks, and propose a novel policy gradient algorithm called Quantile-Based Policy Optimization (QPO) and its variant Quantile-Based Proximal Policy Optimization (QPPO) for solving deep RL problems with quantile objectives. QPO uses two coupled iterations running at different timescales for simultaneously updating quantiles and policy parameters, whereas QPPO is an off-policy version of QPO that allows multiple updates of parameters during one simulation episode, leading to improved algorithm efficiency. Our numerical results indicate that the proposed algorithms outperform the existing baseline algorithms under the quantile criterion.
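To make the two-timescale idea in the abstract concrete, below is a minimal, illustrative sketch, not the speakers' QPO/QPPO algorithms: a fast stochastic-approximation iterate tracks the alpha-quantile of the episode return, while a slower likelihood-ratio (REINFORCE-style) iterate updates the policy parameter using an indicator coupled to the current quantile estimate. The one-dimensional Gaussian "policy" and the toy return function are hypothetical placeholders standing in for the neural-network policy and simulation environment.

```python
# Sketch of a two-timescale quantile-based policy gradient iteration.
# Assumptions (not from the talk): a 1-D parameter theta, action a ~ N(theta, 1),
# and a toy noisy return peaked at a = 2. Step sizes satisfy gamma_k << beta_k,
# so the quantile iterate runs on the faster timescale.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5      # target quantile level of the cumulative reward
theta = 0.0      # policy parameter (placeholder for neural-network weights)
q = 0.0          # running alpha-quantile estimate (fast iterate)

def rollout(theta):
    """Toy episode: sample an action, observe a noisy return G, and compute
    the score d/dtheta log p(a | theta) = (a - theta) for the Gaussian policy."""
    a = theta + rng.normal()
    G = -(a - 2.0) ** 2 + rng.normal(scale=0.1)
    return G, a - theta

for k in range(1, 20001):
    G, score = rollout(theta)
    beta = 1.0 / k ** 0.6     # fast step size (quantile tracking)
    gamma = 0.5 / k ** 0.8    # slow step size (policy update)
    ind = float(G <= q)       # indicator shared by both coupled iterates
    # Fast iterate: stochastic approximation for the alpha-quantile of G.
    q += beta * (alpha - ind)
    # Slow iterate: likelihood-ratio ascent signal for the quantile objective,
    # coupled to the current quantile estimate through the indicator.
    theta += gamma * (alpha - ind) * score

print(f"theta approx {theta:.2f}, estimated {alpha}-quantile approx {q:.2f}")
```

Because both iterates share the same simulated episodes, the quantile estimate and the policy parameters are updated simultaneously, which is the coupling the abstract refers to; QPPO additionally reuses episodes off-policy for multiple parameter updates.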
[Speaker Bio] Associate Professor and Doctoral Supervisor at the Guanghua School of Management, Peking University, and adjunct researcher at Peking University's Institute for Artificial Intelligence and National Institute of Health Data Science. He received his bachelor's degree from the School of Mathematics and Statistics at Wuhan University and his Ph.D. from the School of Management at Fudan University, and held a postdoctoral position at the University of Maryland and an assistant professorship at George Mason University in the United States. His main research interests include simulation modeling and optimization, financial engineering and risk management, artificial intelligence, and healthcare. He has led projects funded by the NSFC Excellent Young Scientists Fund, the Original Exploration Program, and the National Science Fund for Distinguished Young Scholars, among others. He has published in high-quality journals such as Operations Research, INFORMS Journal on Computing, and IEEE Transactions on Automatic Control, and received the INFORMS Outstanding Simulation Publication Award. He currently serves as an Associate Editor of the Asia-Pacific Journal of Operational Research, a Department Editor of the Journal of Systems & Management (《系统管理学报》), Vice Chair of the FinTech and Big Data Branch of the China Association for Industrial Statistics Teaching and Research, Deputy Secretary-General of the Beijing Operations Research Society, and a council member of the Management Science and Engineering Association.