TY - GEN
T1 - Constrained differentiable cross-entropy method for safe model-based reinforcement learning
AU - Mottahedi, Sam
AU - Pavlak, Gregory S.
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/11/9
Y1 - 2022/11/9
N2 - Reinforcement learning agents must explore their environments to learn optimal policies through trial and error. Because the complexities of the real world are difficult to simulate, there is a growing trend of training reinforcement learning (RL) agents directly in the real world rather than mostly or entirely in simulation. Safety concerns are paramount when training RL agents directly in the real world. This paper proposes MPC-CDCEM, a model-based RL algorithm that allows the agent to interact with the environment and explore safely without additional assumptions on the system dynamics. The algorithm uses a Model Predictive Control (MPC) framework with a differentiable cross-entropy optimizer, which induces a differentiable policy that respects the constraints while addressing the objective mismatch problem in model-based RL algorithms. We evaluate our algorithm in Safety Gym environments and on a practical building energy optimization problem. In both experiments, our algorithm incurs the fewest constraint violations and achieves rewards comparable to baseline constrained RL algorithms.
AB - Reinforcement learning agents must explore their environments to learn optimal policies through trial and error. Because the complexities of the real world are difficult to simulate, there is a growing trend of training reinforcement learning (RL) agents directly in the real world rather than mostly or entirely in simulation. Safety concerns are paramount when training RL agents directly in the real world. This paper proposes MPC-CDCEM, a model-based RL algorithm that allows the agent to interact with the environment and explore safely without additional assumptions on the system dynamics. The algorithm uses a Model Predictive Control (MPC) framework with a differentiable cross-entropy optimizer, which induces a differentiable policy that respects the constraints while addressing the objective mismatch problem in model-based RL algorithms. We evaluate our algorithm in Safety Gym environments and on a practical building energy optimization problem. In both experiments, our algorithm incurs the fewest constraint violations and achieves rewards comparable to baseline constrained RL algorithms.
UR - http://www.scopus.com/inward/record.url?scp=85144623382&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85144623382&partnerID=8YFLogxK
U2 - 10.1145/3563357.3564055
DO - 10.1145/3563357.3564055
M3 - Conference contribution
AN - SCOPUS:85144623382
T3 - BuildSys 2022 - Proceedings of the 2022 9th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation
SP - 40
EP - 48
BT - BuildSys 2022 - Proceedings of the 2022 9th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation
PB - Association for Computing Machinery, Inc
T2 - 9th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, BuildSys 2022
Y2 - 9 November 2022 through 10 November 2022
ER -