CAREER: Securing Deep Reinforcement Learning

Project: Research project

Project Details


Like many other deep learning techniques, deep reinforcement learning is vulnerable to adversarial attacks. In reinforcement learning, an adversarial attack manipulates an agent's sensory observations, flummoxing it. Recent research has demonstrated that adversarial attacks can be even more practical than this. Instead of implicitly assuming that an attacker has full control over an agent's sensory system, the new type of attack introduces an adversarial agent that manipulates the target agent's environment and thus triggers the target to react in an undesired fashion. Compared with attacks that alter sensory observations, the new attack is more difficult to counteract. First, the methods commonly used for robustifying other deep learning techniques (e.g., adversarial training) are no longer suitable for deep reinforcement learning. Second, given a reinforcement learning agent, there are few technical approaches for scrutinizing the agent and unveiling its flaws.

This project intends to address these two significant problems by integrating and expanding upon a series of technical approaches used in explainable AI, adversarial training, and formal verification in conjunction with program synthesis. The basic idea is first to learn an adversarial agent informed by explainable AI. Using this learned agent, we then unveil the weaknesses of target agents and adversarially train them accordingly. We evaluate the enhanced agents through a robustness check. If a strengthened agent fails the adversary-resistance check, we fall back on formal verification and program synthesis techniques. Using this unified solution, reinforcement learning model developers could identify the policy flaws of reinforcement learning agents and effectively remediate their weaknesses. This project will provide a stack of technical solutions for scrutinizing and robustifying deep reinforcement learning. If successful, the project will significantly advance the field of AI security (for adversarial training and adversarial policy learning) and contribute to the field of machine learning (for explainable AI and verified AI). The project also has the potential to significantly improve the security of reinforcement learning applications.
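The workflow described above can be read as a simple control flow. The following is a minimal sketch of that flow only, not the project's implementation: the function name `harden_agent`, the `HardeningResult` record, and all five callables passed in are hypothetical placeholders, since the abstract does not specify how any individual step is carried out.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class HardeningResult:
    agent: Any           # the agent returned at the end of the workflow
    weaknesses: Any      # whatever the weakness-discovery step reported
    passed_check: bool   # outcome of the adversary-resistance check
    used_fallback: bool  # True if verification/synthesis was needed


def harden_agent(
    agent: Any,
    learn_adversary: Callable[[Any], Any],          # hypothetical: XAI-informed adversary learning
    find_weaknesses: Callable[[Any, Any], Any],     # hypothetical: expose flaws using the adversary
    adversarial_train: Callable[[Any, Any], Any],   # hypothetical: retrain against those flaws
    passes_check: Callable[[Any, Any], bool],       # hypothetical: adversary-resistance check
    verify_and_repair: Callable[[Any], Any],        # hypothetical: formal verification + synthesis
) -> HardeningResult:
    """One pass through the steps named in the abstract (all steps are stand-ins)."""
    # Step 1: learn an adversarial agent against the target.
    adversary = learn_adversary(agent)
    # Step 2: use the adversary to unveil the target's weaknesses.
    weaknesses = find_weaknesses(agent, adversary)
    # Step 3: adversarially train the target on those weaknesses.
    hardened = adversarial_train(agent, weaknesses)
    # Step 4: robustness check; stop here if the hardened agent passes.
    if passes_check(hardened, adversary):
        return HardeningResult(hardened, weaknesses, True, False)
    # Step 5: otherwise fall back on verification and program synthesis.
    repaired = verify_and_repair(hardened)
    return HardeningResult(repaired, weaknesses, False, True)
```

Passing each step in as a callable keeps the sketch agnostic about the underlying techniques; any concrete learner, checker, or verifier could be substituted without changing the control flow.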

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Effective start/end date: 11/15/21 to 9/30/26


  • National Science Foundation: $213,656.00
  • National Science Foundation: $554,468.00

