**Thursday** at **1:50-2:45 PM** (unless otherwise noted)

**Hybrid format**: either virtually or in **Math Tower Room 154** (see below for details)

For questions, contact Dr. Maria Han Veiga, Dr. Yulong Xing, or Dr. Dongbin Xiu by email: hanveiga dot 1@osu.edu, xing dot 205@osu.edu, or xiu dot 16@osu.edu

| Date and Time | Location | Speaker | Title |
| --- | --- | --- | --- |
| Thursday, August 31, 1:50 pm | In person, Math Tower 154 | François Ged (EPFL) | Matryoshka policy gradient for max-entropy reinforcement learning |

Reinforcement Learning (RL) is the area of Machine Learning addressing tasks where an agent interacts with its environment through a sequence of actions chosen according to its policy. The agent's goal is to maximize the rewards collected along the way; in this talk, entropy bonuses are added to the rewards. This regularization technique has become increasingly common, with benefits such as enhanced exploration of the environment, uniqueness and stochasticity of the optimal policy, and greater robustness of the agent to adversarial modifications of the rewards.

Policy gradient algorithms are well suited to large (possibly infinite) state and action spaces, but theoretical guarantees have been lacking or obtained only in rather specific settings; the case of infinite (continuous) state and action spaces remains mostly unsolved. In this talk, I will present a novel algorithm called Matryoshka Policy Gradient (MPG) that is both very intuitive and mathematically tractable. It uses so-called softmax policies and relies on the following idea: by fixing in advance a maximal horizon N, the agent with MPG learns to optimize policies for all smaller horizons simultaneously, that is, from 1 to N, in a nested way (recalling the image of Matryoshka dolls).

Theoretically, under mild assumptions, our most important results can be summarized as follows:

1. Training converges to the unique optimum when the optimum belongs to the parametric space.
2. Training converges to an explicit orthogonal projection of the unique optimum when it does not belong to the parametric space, this projection being optimal within that space.
3. For policies parametrized by a neural network, we provide a simple sufficient criterion at convergence for the global optimality of the limit, in terms of the neural tangent kernel of the network.

Most notably, these convergence guarantees hold for infinite continuous state and action spaces. Numerically, we confirm the potential of our algorithm by successfully training an agent on two standard benchmarks from OpenAI Gym, namely Frozen Lake and Cart Pole. No background in RL is needed to understand the talk. Based on joint work with Prof. Maria Han Veiga.
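As a toy illustration of the entropy bonus mentioned in the abstract (this is not the speaker's MPG implementation, just a standard one-step example): with entropy regularization at temperature `tau`, the optimal policy for a single decision is a softmax over rewards, `pi(a) ∝ exp(r(a)/tau)`, and the regularized value is `tau * log Σ_a exp(r(a)/tau)`. The function names below are illustrative.

```python
import math

def softmax_policy(rewards, tau):
    """Optimal max-entropy policy for a one-step problem:
    maximizing E[r(a)] + tau * H(pi) gives pi(a) proportional to exp(r(a)/tau)."""
    weights = [math.exp(r / tau) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def soft_value(rewards, tau):
    """Entropy-regularized ("soft") value: tau * log sum_a exp(r(a)/tau)."""
    return tau * math.log(sum(math.exp(r / tau) for r in rewards))

rewards = [1.0, 0.0]  # two actions
for tau in (0.1, 1.0, 10.0):
    pi = softmax_policy(rewards, tau)
    # small tau -> nearly greedy; large tau -> nearly uniform (more exploration)
    print(tau, [round(p, 3) for p in pi], round(soft_value(rewards, tau), 3))
```

The temperature controls the trade-off the abstract describes: as `tau → 0` the policy becomes greedy, while larger `tau` keeps it stochastic and exploratory, and the unique optimum is always a softmax policy.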

